Speed of data manipulations in Julia vs R
This is a living document. It will get updated as new information comes to light.
** 2019-04-06 update **
I have used R's excellent data.table package for a number of years now, and I would say the top three reasons for using data.table are speed, speed, and speed! According to these benchmarks, funded by H2O (the employer of Matt Dowle, the author of data.table), data.table is at or near the top of the pile. In general, I find the data.table package to be pretty speedy for most data manipulation tasks. The question is: can Julia challenge it? The TL;DR version is: not yet.
Test - Synthetic Benchmark
I have taken the synthetic benchmark code below from data.table's official benchmark page. I have dialled N down to 2e8 from 2e9, as my laptop failed to generate DT due to insufficient RAM.
require(data.table)
N=2e8; K=100
set.seed(1)
system.time(DT <- data.table(
id1 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
id2 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
id4 = sample(K, N, TRUE), # large groups (int)
id5 = sample(K, N, TRUE), # large groups (int)
id6 = sample(N/K, N, TRUE), # small groups (int)
v1 = sample(5, N, TRUE), # int in range [1,5]
v2 = sample(5, N, TRUE), # int in range [1,5]
v3 = sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
))
cat("GB =", round(sum(gc()[,2])/1024, 3), "\n")
#system.time(DT[, sum(v1), keyby=id1]) # elapsed is almost doubled
pt <- proc.time()
DT[, sum(v1), keyby=id1]
DT[, sum(v1), keyby="id1,id2"]
timetaken(pt)
The time taken is around 4~5 seconds for the sum by id1 and ~15 seconds for the sum by id1 and id2.
Update: better Julia aggregation code, thanks to Shashi Gowda. The Julia version takes approximately the same amount of time, 5~6 seconds, for the first test, but takes about 50 seconds for the second.
One thing to note is that JuliaDB.jl uses IndexedTables.jl, which lacks many of the utility functions, such as nrow, that come with a standard dataframe.
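As a workaround, Base functions can stand in for some of these. A minimal sketch, assuming the 0.6-era IndexedTables constructor used in the benchmark below, and assuming that length of a table returns its row count:

using IndexedTables

t = IndexedTable(Columns([1, 2, 3]), Columns(v = [10, 20, 30]))
nrows = length(t)  # no nrow here; assumption: length gives the number of rows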
using Distributions
using PooledArrays
#using DataFrames
N=Int64(2e8); K=100;
pool = [@sprintf "id%03d" k for k in 1:K]
pool1 = [@sprintf "id%010d" k for k in 1:K]
function randstrarray(pool, N)
PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(100), N)), pool)
end
using JuliaDB
@time DT = IndexedTable(Columns([1:N;]), Columns(
id1 = randstrarray(pool, N),
id2 = randstrarray(pool, N),
id3 = randstrarray(pool1, N),
id4 = rand(1:K, N), # large groups (int)
id5 = rand(1:K, N), # large groups (int)
id6 = rand(1:div(N, K), N), # small groups (int); div keeps the range integer
v1 = rand(1:5, N), # int in range [1,5]
v2 = rand(1:5, N), # int in range [1,5]
v3 = rand(round.(rand(Uniform(0,100),100),4), N) # numeric e.g. 23.5749
))
@time JuliaDB.aggregate(+, DT, by=(:id1,), with=:v1) # 5-6 seconds
@elapsed JuliaDB.aggregate(+, DT, by=(:id1,:id2), with=:v1) #~50 seconds
A side note about time taken to generate synthetic testing data
The time it took to generate the synthetic test dataset in R is
user system elapsed
110.56 6.59 117.68
and the equivalent code in Julia took only about 40 seconds, thanks to Shashi Gowda's comment.
A test using real data
Synthetic benchmarks are great, but it's also good to benchmark using real data available in the wild to get a real sense of working with the language. To do this, I tried to find the largest dataset on Kaggle that uses tabular, structured data as the primary data source. The GE Flight Quest is the competition with the largest dataset that I could find.
I have downloaded the InitialTrainingSet_rev1 data and unzipped it into a folder. Let's test reading the data in and appending all the dataframes together. Here is the R code:
library(pipeR)
library(data.table)
library(future)
plan(multiprocess)
pt <- proc.time()
aa1 <- dir("D:/data/InitialTrainingSet_rev1/", full.names = T) %>>%
sapply(function(x) file.path(x,"ASDI","asdifpwaypoint.csv")) %>>%
future_lapply(fread)
aaa <- rbindlist(aa1);
data.table::timetaken(pt)#1:40
rm(aa1); gc()
system.time(feather::write_feather(aaa,"d:/data/aaa.feather"))
pt <- proc.time()
aa2 <- dir("D:/data/InitialTrainingSet_rev1/", full.names = T) %>>%
sapply(function(x) file.path(x,"ASDI","asdifpwaypoint.csv")) %>>%
lapply(fread)
data.table::timetaken(pt) # 5:53
In the above I tested future's future_lapply vs vanilla lapply, and future_lapply is roughly 3 times faster. So using future is a no-brainer given how painless its syntax is.
Next up is Julia
path = "D:/data/InitialTrainingSet_rev1/"
files = path .* readdir("D:/data/InitialTrainingSet_rev1/") .* "/ASDI/asdifpwaypoint.csv"
using CSV;
function readallcsv()
CSV.read.(files);
end;
@time df = readallcsv();
The above took 585 seconds. That's more than the lapply approach in R, and it didn't even include row-appending the datasets together. The clear winner here is data.table's fread. Hopefully DataStreams.jl can help change this in the future.
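For completeness, the missing row-append step could be sketched as follows. This is an untimed sketch, assuming the DataFrames vcat of the time accepts splatted frames, which makes it the rough analogue of R's rbindlist:

using CSV, DataFrames

# files is the vector of CSV paths built above
dfs = CSV.read.(files)  # one DataFrame per file
aaa = vcat(dfs...)      # row-append into a single DataFrame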
Another promising approach in Julia is JuliaDB.jl (not to be confused with the JuliaDB organization on Github which is a collection of database connectors). However the package is still under development and is not mature yet.
path = "D:/data/InitialTrainingSet_rev1/"
files = path .* readdir("D:/data/InitialTrainingSet_rev1/") .* "/ASDI/asdifpwaypoint.csv"
addprocs()
using JuliaDB
@time @everywhere using Dagger, IndexedTables, JuliaDB # 0.5 second
@time df = JuliaDB.loadfiles(files, indexcols = ["asdiflightplanid", "ordinal"])
In the first run it loads in 125 seconds, but it also builds a cache so that the next load is faster; indeed, the second time it loaded in about 13 seconds. I can also save the files into a more efficient binary format using the JuliaDB.save function:
@time JuliaDB.save(df,"d:/gcflight")
@time df = JuliaDB.load("d:/gcflight")
which took 83 seconds to save and 0.3 seconds to load, which is just memory-mapping the file (Update: thanks again to Shashi's comment); however, I can't get the aggregation code to work at this stage due to some bugs, so we will need to wait for JuliaDB.jl's bugs to be ironed out first.
Github
The code in this post can be found at https://github.com/xiaodaigh/data_manipulation_benchmarks
Want to see more tests?
Leave a comment below or like this post!
The Test Rig Setup
I am doing the tests on my Core i7-7700HQ laptop running 64-bit Windows 10 Pro with 32GB of RAM. The Julia Pro 0.6.0.1 distribution was used vs R 3.4.2 running data.table 1.10.4.
Comments
Here's the benchmark script that uses JuliaDB:
https://gist.github.com/shashi/683d7234668cad67513d98251d6dcce1
table creation:
aggregate(+, T, by=(:id1,), with=:v1)
I am using an i7-6600U CPU @ 2.60GHz.
I've used PooledArray for constructing the table; this is what will happen if you read from a CSV file. It's also a reason for the performance. I wouldn't be surprised if data.table is doing something like this under the hood.
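The idea can be sketched in a few lines, mirroring the randstrarray function in the benchmark above: a PooledArray stores small integer references into a pool of unique strings, so grouping and comparison work on the integer codes instead of the strings themselves.

using PooledArrays

pool = ["id001", "id002", "id003"]       # the unique string values
refs = UInt8[1, 3, 2, 1]                 # integer codes into the pool
pa   = PooledArray(PooledArrays.RefArray(refs), pool)
# pa behaves like ["id001", "id003", "id002", "id001"],
# but only the UInt8 refs are stored per element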
We’re working on making JuliaDB more accessible and finished. We hope to be at a much better place in a month or so.
This is because it loads by memory-mapping the data, you’re right, you’ll need to run an operation to bring it to memory. Since I didn’t have enough memory to distribute the whole table, I saved and loaded a table with just the id1 and v1 columns. The first aggregation run which also brings it to memory runs in 5.5 seconds, subsequent runs are in the 4.3-4.5 range.
Cheers
What version of Julia are you using? Are you using master branch?
I can't compile JuliaDB on Julia v0.7-dev using the master branch, and I can't get aggregate to work on Julia v0.6.
I’m using 0.6, yes, JuliaDB won’t work on 0.7-dev as of now. I’m using the latest release versions: JuliaDB v0.3.2 and IndexedTables v0.3.3.
The JuliaDB command should be:
aggregate(+, dt, by=(:id1,), with=:v1)
What was the bug you encountered?
People in the Julia community know this is an issue. It’s kind of odd to benchmark it now given that the Nullable changes are right around the corner…
It's a living document. I will update it when Union{T, Null} becomes a reality. But as you can see, the benchmarks mostly deal with data without nulls/NAs, hence it's a fair comparison between R's and Julia's capabilities now.
Yeah, it’s a fair comparison now. But when I saw the link, I was excited that someone was benchmarking how the new Union{T,Null} on v0.7 is doing.
Will do once the packages actually compile! Last time I checked 0.7-Dev wasn’t ready.
I thought DataFrames master does compile on v0.7? You might want to ask in #data in the Slack. It would be nice to have someone willing to run benchmarks!