Speed of data manipulations in Julia vs R
This is a living document. It will get updated as new information comes to light.
I have used R's excellent data.table package for a number of years now, and I would say the top 3 reasons for using data.table are speed, speed, and speed!
According to the benchmarks on data.table's GitHub wiki page, data.table is the king of speed when it comes to grouping. The other speed advantages of the data.table package are discussed in this Kaggle forum thread. No doubt data.table is fast, but can Julia challenge it? The TL;DR version is: not yet.
Test - Synthetic Benchmark
I have taken the synthetic benchmark code below from data.table's official benchmark page. I have dialled it down to N = 2e8 instead of 2e9, as my laptop failed to generate DT at the larger size due to insufficient RAM.
```r
require(data.table)
N = 2e8; K = 100
set.seed(1)
system.time(DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),        # large groups (char)
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),        # large groups (char)
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),   # small groups (char)
  id4 = sample(K, N, TRUE),                             # large groups (int)
  id5 = sample(K, N, TRUE),                             # large groups (int)
  id6 = sample(N/K, N, TRUE),                           # small groups (int)
  v1 = sample(5, N, TRUE),                              # int in range [1,5]
  v2 = sample(5, N, TRUE),                              # int in range [1,5]
  v3 = sample(round(runif(100, max = 100), 4), N, TRUE) # numeric e.g. 23.5749
))
cat("GB =", round(sum(gc()[, 2]) / 1024, 3), "\n")

# system.time(DT[, sum(v1), keyby = id1]) # elapsed is almost doubled
pt <- proc.time()
DT[, sum(v1), keyby = id1]
DT[, sum(v1), keyby = "id1,id2"]
timetaken(pt)
```
The time taken is around 4~5 seconds for the sum by id1 and ~15 seconds for the sum by id1 and id2.
Update: better Julia aggregation code, thanks to Shashi Gowda. The Julia version takes approximately the same amount of time, 5~6 seconds, for the first test, but takes about 50 seconds for the second.
One thing to note is that JuliaDB.jl uses IndexedTables.jl, which lacks many of the utility functions, such as nrow, that come with a standard dataframe.
```julia
using Distributions
using PooledArrays
# using DataFrames

N = Int64(2e8); K = 100
pool  = [@sprintf("id%03d", k)  for k in 1:K]
pool1 = [@sprintf("id%010d", k) for k in 1:K]

function randstrarray(pool, N)
    PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(100), N)), pool)
end

using JuliaDB
DT = IndexedTable(Columns([1:N;]), Columns(
    id1 = randstrarray(pool, N),
    id2 = randstrarray(pool, N),
    id3 = randstrarray(pool1, N),
    id4 = rand(1:K, N),           # large groups (int)
    id5 = rand(1:K, N),           # large groups (int)
    id6 = rand(1:(N ÷ K), N),     # small groups (int)
    v1 = rand(1:5, N),            # int in range [1,5]
    v2 = rand(1:5, N),            # int in range [1,5]
    v3 = rand(round.(rand(Uniform(0, 100), 100), 4), N) # numeric e.g. 23.5749
))

JuliaDB.aggregate(+, DT, by = (:id1,), with = :v1)      # 5~6 seconds
JuliaDB.aggregate(+, DT, by = (:id1, :id2), with = :v1) # ~50 seconds
```
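Conceptually, the grouped sum that JuliaDB.aggregate(+, DT, by = (:id1,), with = :v1) computes can be sketched in a few lines of base Julia. This is an illustration of the operation only, not JuliaDB's implementation; groupsum, ids, and vals are illustrative names that are not part of the benchmark.

```julia
# Grouped sum: accumulate each value into a per-key running total.
function groupsum(ids, vals)
    sums = Dict{eltype(ids), eltype(vals)}()
    for (k, v) in zip(ids, vals)
        sums[k] = get(sums, k, zero(v)) + v
    end
    return sums
end

ids  = ["id001", "id002", "id001", "id003", "id002"]
vals = [1, 2, 3, 4, 5]
groupsum(ids, vals)  # id001 => 4, id002 => 7, id003 => 4
```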
A side note about time taken to generate synthetic testing data
The time it took to generate the synthetic test dataset in R is

```
   user  system elapsed
 110.56    6.59  117.68
```

and the equivalent code in Julia took only about 40 seconds, thanks to Shashi Gowda's comment.
A test using real data
Synthetic benchmarks are great, but it's also good to benchmark using real data available in the wild to get a real sense of working with the language. To do this, I tried to find the largest dataset on Kaggle that uses tabular, structured data as the primary data source. The GE Flight Quest is the competition with the largest dataset that I could find.
I have downloaded the InitialTrainingSet_rev1 data and unzipped it into a folder. Let's test reading the data in and appending all the dataframes together. Here is the R code:
```r
library(pipeR)
library(data.table)
library(future)
plan(multiprocess)

pt <- proc.time()
aa1 <- dir("D:/data/InitialTrainingSet_rev1/", full.names = T) %>>%
  sapply(function(x) file.path(x, "ASDI", "asdifpwaypoint.csv")) %>>%
  future_lapply(fread)
aaa <- rbindlist(aa1)
data.table::timetaken(pt) # 1:40

rm(aa1); gc()
system.time(feather::write_feather(aaa, "d:/data/aaa.feather"))

pt <- proc.time()
aa2 <- dir("D:/data/InitialTrainingSet_rev1/", full.names = T) %>>%
  sapply(function(x) file.path(x, "ASDI", "asdifpwaypoint.csv")) %>>%
  lapply(fread)
data.table::timetaken(pt) # 5:53
```
In the above I tested future's future_lapply against vanilla lapply, and future_lapply is roughly 3 times faster. So using future is a no-brainer given how painless its syntax is.
Next up is Julia
```julia
path = "D:/data/InitialTrainingSet_rev1/"
files = path .* readdir(path) .* "/ASDI/asdifpwaypoint.csv"

using CSV
function readallcsv()
    CSV.read.(files)
end
df = readallcsv()
```
The above took 585 seconds. That's slower than even the plain lapply approach in R, and it didn't include row-appending the datasets together. The clear winner here is data.table's fread. Hopefully DataStreams.jl can help change this in the future.
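For completeness, the row-append step missing from the Julia timing could be sketched as follows. This is a minimal base-Julia version that reads each file, skips its header line, and collects the remaining rows as strings; a real pipeline would keep typed columns (e.g. via CSV.jl) rather than string vectors, and readappend is an illustrative name, not part of the benchmark.

```julia
# Read every file, drop its per-file header row, and stack the remaining
# rows into a single vector of fields.
function readappend(files)
    rows = Vector{Vector{String}}()
    for f in files
        for (i, line) in enumerate(eachline(f))
            i == 1 && continue                   # skip the per-file header
            push!(rows, String.(split(line, ',')))
        end
    end
    return rows
end
```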
Another promising approach in Julia is JuliaDB.jl (not to be confused with the JuliaDB organization on GitHub, which is a collection of database connectors). However, the package is still under development and is not mature yet.
```julia
path = "D:/data/InitialTrainingSet_rev1/"
files = path .* readdir(path) .* "/ASDI/asdifpwaypoint.csv"

addprocs()
using JuliaDB                        # took about 40 seconds to load due to no precompilation
using Dagger, IndexedTables, JuliaDB # 0.5 second
df = JuliaDB.loadfiles(files, indexcols = ["asdiflightplanid", "ordinal"])
```
On the first run it loaded in 125 seconds, but it also built a cache so that subsequent loads are faster; indeed, the second load took about 13 seconds. I can also save the data into a more efficient binary format:

```julia
JuliaDB.save(df, "d:/gcflight")
df = JuliaDB.load("d:/gcflight")
```

This took 83 seconds to save and 0.3 seconds to load, the load being near-instant because it just memory-maps the file (Update: thanks again to Shashi's comment). However, I can't get the aggregation code to work at this stage due to some bugs, so we will need to wait for JuliaDB.jl's bugs to be ironed out first.
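To illustrate why a memory-mapped load returns almost instantly, here is a small sketch using Julia's Mmap standard library. This shows the mechanism only and is not JuliaDB's actual implementation; the temporary file stands in for the saved binary data.

```julia
using Mmap

# Write a small stand-in file; in the post this would be the saved
# binary JuliaDB data.
path, io = mktemp()
write(io, "JuliaDB binary data would live here")
close(io)

# mmap maps the file into the address space without reading its bytes
# up front; pages are faulted in lazily on first access, which is why a
# memory-mapped load appears near-instant regardless of file size.
bytes = open(Mmap.mmap, path)   # Vector{UInt8} view of the file
String(copy(bytes))
```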
The code in this post can be found at https://github.com/xiaodaigh/data_manipulation_benchmarks
Want to see more tests?
Leave a comment below or like this post!
The Test Rig Setup
I am doing the tests on my Core i7-7700HQ laptop running 64-bit Windows 10 Pro with 32GB of RAM. The Julia Pro 0.6.0.1 distribution was used vs R 3.4.2 running data.table 1.10.4.