Speed of data manipulations in Julia vs R

Published Sep 28, 2017. Last updated Dec 16, 2017.

This is a living document. It will get updated as new information comes to light.

I have used R's excellent data.table package for a number of years now and I would say the top 3 reasons for using data.table are speed, speed, and speed!

According to the benchmarks on data.table's GitHub wiki page, data.table is the king of speed when it comes to grouping. The package's other speed advantages are discussed in this Kaggle forum thread. No doubt data.table is fast, but can Julia challenge it? The TL;DR version: not yet.

Test - Synthetic Benchmark

I have taken the synthetic benchmark code below from data.table's official benchmark page. I have dialled down N to 2e8 instead of 2e9, as my laptop failed to generate DT due to insufficient RAM.

require(data.table)
N=2e8; K=100
set.seed(1)
system.time(DT <- data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE),                          # large groups (int)
  id5 = sample(K, N, TRUE),                          # large groups (int)
  id6 = sample(N/K, N, TRUE),                        # small groups (int)
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
))
cat("GB =", round(sum(gc()[,2])/1024, 3), "\n")
#system.time(DT[, sum(v1), keyby=id1])  # elapsed is almost doubled
pt <- proc.time()
DT[, sum(v1), keyby=id1]
DT[, sum(v1), keyby="id1,id2"]
timetaken(pt)

The time taken is around 4~5 seconds for the sum by id1 and ~15 seconds for the sum by id1 and id2.

Update: better Julia aggregation code thanks to Shashi Gowda. The Julia version takes approximately the same amount of time, 5~6 seconds, for the first test but takes about 50 seconds for the second.

One thing to note is that JuliaDB.jl uses IndexedTables.jl, which does not have many of the utility functions, such as nrow, that come with a standard data frame (see the small workaround sketch after the code below).

using Distributions
using PooledArrays
#using DataFrames

N=Int64(2e8); K=100;

pool = [@sprintf "id%03d" k for k in 1:K]
pool1 = [@sprintf "id%010d" k for k in 1:K]

# Build a pooled (factor-like) string array whose integer references are drawn at random from `pool`
function randstrarray(pool, N)
    PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(length(pool)), N)), pool)
end

using JuliaDB
@time DT = IndexedTable(Columns([1:N;]), Columns(
  id1 = randstrarray(pool, N),
  id2 = randstrarray(pool, N),
  id3 = randstrarray(pool1, N),
  id4 = rand(1:K, N),                          # large groups (int)
  id5 = rand(1:K, N),                          # large groups (int)
  id6 = rand(1:(N÷K), N),                      # small groups (int); integer division avoids a float range
  v1 =  rand(1:5, N),                          # int in range [1,5]
  v2 =  rand(1:5, N),                          # int in range [1,5]
  v3 =  rand(round.(rand(Uniform(0,100),100),4), N) # numeric e.g. 23.5749
 ))
 
@time JuliaDB.aggregate(+, DT, by=(:id1,), with=:v1) # 5-6 seconds
@elapsed JuliaDB.aggregate(+, DT, by=(:id1,:id2), with=:v1) #~50 seconds
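
Because IndexedTables lacks nrow, here is a minimal workaround sketch. It assumes that calling length on an IndexedTable of that era returns the number of rows; nrows is a hypothetical helper name, not part of any package.

# Hypothetical helper: row count for an IndexedTable.
# Assumes length(DT) counts rows for the IndexedTables version bundled with JuliaDB at the time.
nrows(t) = length(t)
nrows(DT) # should equal N if the assumption holds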

A side note about time taken to generate synthetic testing data

The time it took to generate the synthetic test dataset in R is

   user  system elapsed 
 110.56    6.59  117.68 

and the equivalent code in Julia took only about 40 seconds, thanks to Shashi Gowda's comment.

A test using real data

Synthetic benchmarks are great, but it's also good to benchmark using real data available in the wild to get a real sense of working with the language. To do this, I tried to find the largest dataset on Kaggle that uses tabular, structured data as the primary data source. The GE Flight Quest is the competition with the largest dataset that I could find.

I have downloaded the InitialTrainingSet_rev1 data and unzipped it into a folder. Let's test reading the data in and appending all the data frames together. Here is the R code

library(pipeR)
library(data.table)
library(future)
plan(multiprocess)
pt <- proc.time()
aa1 <- dir("D:/data/InitialTrainingSet_rev1/", full.names = T) %>>% 
  sapply(function(x) file.path(x,"ASDI","asdifpwaypoint.csv")) %>>% 
  future_lapply(fread)
aaa <- rbindlist(aa1);
data.table::timetaken(pt) # 1:40
rm(aa1); gc()
system.time(feather::write_feather(aaa,"d:/data/aaa.feather"))

pt <- proc.time()
aa2 <- dir("D:/data/InitialTrainingSet_rev1/", full.names = T) %>>% 
  sapply(function(x) file.path(x,"ASDI","asdifpwaypoint.csv")) %>>% 
  lapply(fread)
data.table::timetaken(pt) # 5:53

In the above I tested future's future_lapply vs vanilla lapply, and future_lapply is roughly 3 times faster (1:40 vs 5:53). So using future is a no-brainer given how painless its syntax is.

Next up is Julia

path = "D:/data/InitialTrainingSet_rev1/"
files = path .* readdir("D:/data/InitialTrainingSet_rev1/") .* "/ASDI/asdifpwaypoint.csv"

using CSV;

function readallcsv()
    CSV.read.(files);
end;

@time df = readallcsv();

The above took 585 seconds, which is more than even the plain lapply approach in R, and it doesn't include row-appending the datasets together. The clear winner here is data.table's fread. Hopefully DataStreams.jl can help change this in the future.
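
As noted, the 585 seconds does not include appending the data frames; a minimal sketch of how that could be done is below, assuming CSV.read returns DataFrames as it did at the time (this step was not timed in the post).

using DataFrames

# Row-bind all the data frames returned by readallcsv(),
# analogous to data.table::rbindlist in the R code above
dfs = readallcsv();
aaa = vcat(dfs...);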

Another promising approach in Julia is JuliaDB.jl (not to be confused with the JuliaDB organization on Github, which is a collection of database connectors). However, the package is still under development and is not yet mature.

path = "D:/data/InitialTrainingSet_rev1/"
files = path .* readdir("D:/data/InitialTrainingSet_rev1/") .* "/ASDI/asdifpwaypoint.csv"

addprocs()
using JuliaDB
@time @everywhere using Dagger, IndexedTables, JuliaDB # 0.5 second
@time df = JuliaDB.loadfiles(files, indexcols = ["asdiflightplanid", "ordinal"])

On the first run it loads in 125 seconds, but it also builds a cache so that the next load is faster; indeed, the second load took about 13 seconds. I can also save the files into a more efficient binary format using the JuliaDB.save function

@time JuliaDB.save(df,"d:/gcflight")
@time df = JuliaDB.load("d:/gcflight")

which took 83 seconds to save and 0.3 seconds to load, the load being essentially just a memory-mapping of the file (Update: thanks again to Shashi's comment). However, I can't get the aggregation code to work at this stage due to some bugs, so we will need to wait for JuliaDB.jl's bugs to be ironed out first.
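
For reference, once those bugs are ironed out the aggregation would presumably mirror the synthetic benchmark above; the sketch below uses :asdiflightplanid (one of the index columns) and a made-up value column :groundspeed purely for illustration.

# Hypothetical aggregation on the loaded flight data, mirroring the synthetic benchmark;
# :groundspeed is an illustrative column name, not verified against the actual CSV headers
@time JuliaDB.aggregate(+, df, by=(:asdiflightplanid,), with=:groundspeed)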

Github

The code in this post can be found at https://github.com/xiaodaigh/data_manipulation_benchmarks

Want to see more tests?

Leave a comment below or like this post!

The Test Rig Setup

I am doing the tests on my Core i7-7700HQ laptop running 64-bit Windows 10 Pro with 32 GB of RAM. The Julia Pro 0.6.0.1 distribution was used vs R 3.4.2 running data.table 1.10.4.
