Speed of data manipulations in Julia vs R
This is a living document. It will get updated as new information comes to light.
** 2019-04-06 update **
I have used R's excellent data.table package for a number of years now, and I would say the top three reasons for using data.table are speed, speed, and speed! According to these benchmarks, funded by H2O (the employer of Matt Dowle, the author of data.table), data.table is at or near the top of the pile. In general, I find the data.table package to be pretty speedy for most data manipulation tasks. The question is: can Julia challenge it? The TL;DR version is: not yet.
Test - Synthetic Benchmark
I have taken the synthetic benchmark code below from data.table's official benchmark page. I have dialled N down to 2e8 from 2e9, as my laptop failed to generate DT due to insufficient RAM.
require(data.table)
N=2e8; K=100
set.seed(1)
system.time(DT <- data.table(
id1 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
id2 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
id4 = sample(K, N, TRUE), # large groups (int)
id5 = sample(K, N, TRUE), # large groups (int)
id6 = sample(N/K, N, TRUE), # small groups (int)
v1 = sample(5, N, TRUE), # int in range [1,5]
v2 = sample(5, N, TRUE), # int in range [1,5]
v3 = sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
))
cat("GB =", round(sum(gc()[,2])/1024, 3), "\n")
#system.time(DT[, sum(v1), keyby=id1]) # elapsed is almost doubled
pt <- proc.time()
DT[, sum(v1), keyby=id1]
DT[, sum(v1), keyby="id1,id2"]
timetaken(pt)
The time taken is around 4~5 seconds for the sum by id1 and ~15 seconds for the sum by id1 and id2.
Update: better Julia aggregation code, thanks to Shashi Gowda. The Julia version takes approximately the same amount of time, 5~6 seconds, for the first test, but takes about 50 seconds for the second.
One thing to note is that JuliaDB.jl uses IndexedTables.jl, which lacks many of the utility functions, such as nrow, that come with a standard dataframe.
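As a workaround, Base functions can stand in for some of these. A minimal sketch, assuming the 0.6-era IndexedTables constructor used in the benchmark below, and assuming that length of a table returns its row count:

using IndexedTables

t = IndexedTable(Columns([1, 2, 3]), Columns(v = [10, 20, 30]))
nrows = length(t)  # no nrow here; assumption: length gives the number of rows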
using Distributions
using PooledArrays
#using DataFrames
N=Int64(2e8); K=100;
pool = [@sprintf "id%03d" k for k in 1:K]
pool1 = [@sprintf "id%010d" k for k in 1:K]
function randstrarray(pool, N)
PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(100), N)), pool)
end
using JuliaDB
@time DT = IndexedTable(Columns([1:N;]), Columns(
id1 = randstrarray(pool, N),
id2 = randstrarray(pool, N),
id3 = randstrarray(pool1, N),
id4 = rand(1:K, N), # large groups (int)
id5 = rand(1:K, N), # large groups (int)
id6 = rand(1:div(N, K), N), # small groups (int); div keeps the range integer
v1 = rand(1:5, N), # int in range [1,5]
v2 = rand(1:5, N), # int in range [1,5]
v3 = rand(round.(rand(Uniform(0,100),100),4), N) # numeric e.g. 23.5749
))
@time JuliaDB.aggregate(+, DT, by=(:id1,), with=:v1) # 5-6 seconds
@elapsed JuliaDB.aggregate(+, DT, by=(:id1,:id2), with=:v1) #~50 seconds
A side note about time taken to generate synthetic testing data
The time it took to generate the synthetic test dataset in R is
user system elapsed
110.56 6.59 117.68
and the equivalent code in Julia took only about 40 seconds, thanks to Shashi Gowda's comment.
A test using real data
Synthetic benchmarks are great, but it's also good to benchmark using real data available in the wild to get a real sense of working with the language. To do this, I tried to find the largest dataset on Kaggle that uses tabular, structured data as the primary data source. The GE Flight Quest is the competition with the largest dataset that I could find.
I have downloaded the InitialTrainingSet_rev1 data and unzipped it into a folder. Let's test reading the data in and appending all the dataframes together. Here is the R code:
library(pipeR)
library(data.table)
library(future)
plan(multiprocess)
pt <- proc.time()
aa1 <- dir("D:/data/InitialTrainingSet_rev1/", full.names = T) %>>%
sapply(function(x) file.path(x,"ASDI","asdifpwaypoint.csv")) %>>%
future_lapply(fread)
aaa <- rbindlist(aa1);
data.table::timetaken(pt)#1:40
rm(aa1); gc()
system.time(feather::write_feather(aaa,"d:/data/aaa.feather"))
pt <- proc.time()
aa2 <- dir("D:/data/InitialTrainingSet_rev1/", full.names = T) %>>%
sapply(function(x) file.path(x,"ASDI","asdifpwaypoint.csv")) %>>%
lapply(fread)
data.table::timetaken(pt) # 5:53
In the above I tested future's future_lapply vs vanilla lapply, and future_lapply is roughly 3 times faster. So using future is a no-brainer given how painless its syntax is.
Next up is Julia
path = "D:/data/InitialTrainingSet_rev1/"
files = path .* readdir("D:/data/InitialTrainingSet_rev1/") .* "/ASDI/asdifpwaypoint.csv"
using CSV;
function readallcsv()
CSV.read.(files);
end;
@time df = readallcsv();
The above took 585 seconds. That's more than the lapply approach in R, and it didn't even include row-appending the datasets together. The clear winner here is data.table's fread. Hopefully DataStreams.jl can help change this in the future.
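For completeness, the missing row-append step could be sketched as follows. This is an untimed sketch, assuming the DataFrames vcat of the time accepts splatted frames, which makes it the rough analogue of R's rbindlist:

using CSV, DataFrames

# files is the vector of CSV paths built above
dfs = CSV.read.(files)  # one DataFrame per file
aaa = vcat(dfs...)      # row-append into a single DataFrame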
Another promising approach in Julia is JuliaDB.jl (not to be confused with the JuliaDB organization on Github which is a collection of database connectors). However the package is still under development and is not mature yet.
path = "D:/data/InitialTrainingSet_rev1/"
files = path .* readdir("D:/data/InitialTrainingSet_rev1/") .* "/ASDI/asdifpwaypoint.csv"
addprocs()
using JuliaDB
@time @everywhere using Dagger, IndexedTables, JuliaDB # 0.5 second
@time df = JuliaDB.loadfiles(files, indexcols = ["asdiflightplanid", "ordinal"])
In the first run it loads in 125 seconds, but it also builds a cache so that the next load is faster; indeed, the second time it loaded in about 13 seconds. I can also save the files into a more efficient binary format using the JuliaDB.save function:
@time JuliaDB.save(df,"d:/gcflight")
@time df = JuliaDB.load("d:/gcflight")
which took 83 seconds to save and 0.3 seconds to load, which is just memory-mapping the file (Update: thanks again to Shashi's comment); however, I can't get the aggregation code to work at this stage due to some bugs, so we will need to wait for JuliaDB.jl's bugs to be ironed out first.
Github
The code in this post can be found at https://github.com/xiaodaigh/data_manipulation_benchmarks
Want to see more tests?
Leave a comment below or like this post!
The Test Rig Setup
I am doing the tests on my Core i7-7700HQ laptop running 64-bit Windows 10 Pro with 32GB of RAM. The Julia Pro 0.6.0.1 distribution was used vs R 3.4.2 running data.table 1.10.4.
Comments
Here's the benchmark script that uses JuliaDB:
https://gist.github.com/shashi/683d7234668cad67513d98251d6dcce1
table creation:
aggregate(+, T, by=(:id1,), with=:v1)
I am using an i7-6600U CPU @ 2.60GHz.
I've used PooledArray for constructing the table; this is what will happen if you read from a CSV file. It's also a reason for the performance. I wouldn't be surprised if data.table is doing something like this under the hood.
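The idea can be sketched in a few lines, mirroring the randstrarray function in the benchmark above: a PooledArray stores small integer references into a pool of unique strings, so grouping and comparison work on the integer codes instead of the strings themselves.

using PooledArrays

pool = ["id001", "id002", "id003"]       # the unique string values
refs = UInt8[1, 3, 2, 1]                 # integer codes into the pool
pa   = PooledArray(PooledArrays.RefArray(refs), pool)
# pa behaves like ["id001", "id003", "id002", "id001"],
# but only the UInt8 refs are stored per element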
We’re working on making JuliaDB more accessible and finished. We hope to be at a much better place in a month or so.
This is because it loads by memory-mapping the data, you’re right, you’ll need to run an operation to bring it to memory. Since I didn’t have enough memory to distribute the whole table, I saved and loaded a table with just the id1 and v1 columns. The first aggregation run which also brings it to memory runs in 5.5 seconds, subsequent runs are in the 4.3-4.5 range.
Cheers
What version of Julia are you using? Are you using master branch?
I can't compile JuliaDB on Julia v0.7-dev using the master branch, and I can't get aggregate to work on Julia v0.6.
I’m using 0.6, yes, JuliaDB won’t work on 0.7-dev as of now. I’m using the latest release versions: JuliaDB v0.3.2 and IndexedTables v0.3.3.
The JuliaDB command should be:
aggregate(+, dt, by=(:id1,), with=:v1)
What was the bug you encountered?
People in the Julia community know this is an issue. It’s kind of odd to benchmark it now given that the Nullable changes are right around the corner…
It's a living document. I will update it when Union{T, Null} becomes a reality. But as you can see, the benchmarks mostly deal with data without nulls/NAs, hence it's a fair comparison between R's and Julia's capabilities now.
Yeah, it’s a fair comparison now. But when I saw the link, I was excited that someone was benchmarking how the new Union{T,Null} on v0.7 is doing.
Will do once the packages actually compile! Last time I checked 0.7-Dev wasn’t ready.
I thought DataFrames master does compile on v0.7? You might want to ask in #data in the Slack. It would be nice to have someone willing to run benchmarks!