Short Post: The `future` is `fst`: loading and saving Fannie Mae Loan Acquisition data in 10 lines of code
About me
I am an independent consultant with ten years of experience in credit risk modeling, data science, and analytics-related roles.
How to load data efficiently
I wanted to demonstrate how to conveniently load single-digit gigabyte-sized data into R with speed and ease.
The data I have chosen is the Fannie Mae loan acquisition data, which you can download from this page. Simply download the zip file "Acquisition_All.zip" and unzip it into a folder.
The next step is to load the data into R efficiently. To do this, I have chosen the ever-reliable data.table package and two rising superstars of the R world: fst and future.
Here's the code:
library(fst)
library(data.table)
library(future)
library(future.apply)  # future_lapply now lives in future.apply
plan(multisession)     # multiprocess is deprecated in recent versions of future
acq_folder = "D:/data/Acquisition_All"
out_acq_fst = "acq_data.fst"
# Note: system.time won't give you accurate timings
# as the multithreading codes in data.table and fst
# seem to throw it off
pt = proc.time()
all_acq_csvs = dir(acq_folder, full.names = T)
# read the data in parallel using future_lapply
acq_data = rbindlist(future_lapply(all_acq_csvs, fread))
# save as fst for quicker and easier reload later
# use compress = 100 for maximum compression
write_fst(acq_data, out_acq_fst, compress = 100)
# took about 3 minutes on my 4-core Core i7 laptop equipped with an SSD
print(timetaken(pt))
# simply use the line below to reload the data in future sessions
# acq_data = read_fst(out_acq_fst, as.data.table = TRUE)
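Reloading with read_fst is not all-or-nothing: the fst format supports random access, so you can pull back a subset of columns (and even a row range) without scanning the whole file. Here is a minimal, self-contained sketch using made-up data; the column names are illustrative and not the Fannie Mae schema.

```r
library(fst)
library(data.table)

# throwaway table standing in for the real acquisition data
dt <- data.table(loan_id = 1:5,
                 balance = c(2e5, 3e5, 1.5e5, 4e5, 2.5e5),
                 rate    = c(3.9, 4.1, 4.0, 3.8, 4.2))
f <- tempfile(fileext = ".fst")
write_fst(dt, f)

# read only two columns and rows 2 to 4 -- no full-file scan required
sub <- read_fst(f, columns = c("loan_id", "balance"),
                from = 2, to = 4, as.data.table = TRUE)
print(sub)
```

This is particularly handy when the full table is larger than you want to hold in memory at once.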
What makes the above efficient?
Firstly, data.table's fread is one of the fastest CSV readers found in any programming language. Using it to load CSVs is a no-brainer: it is incredibly performant, and it is often more "tolerant" of slightly malformed CSVs.
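To see that convenience in isolation, here is a small sketch on a throwaway delimited file (the values are made up): fread detects the separator and column types on its own, so a pipe-delimited, header-less file needs almost no configuration.

```r
library(data.table)

# write a tiny pipe-delimited file to stand in for an acquisition file
tmp <- tempfile(fileext = ".txt")
writeLines(c("100001|250000|3.875",
             "100002|410000|4.125"), tmp)

# fread auto-detects the "|" separator and the column types;
# header = FALSE is passed explicitly since this file has no header row
dt <- fread(tmp, header = FALSE)
print(dt)
```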
Secondly, future's (soon to be future.apply's) future_lapply makes it dead easy to run multiple R processes in the background, so you can utilize your multi-core CPU to its full potential and load multiple files simultaneously.
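The pattern is easy to see on a toy example. The sketch below creates two small throwaway CSVs, has each one read by a separate background R session, and stacks the results, which is exactly the shape of the main snippet above (note that in current releases future_lapply is provided by the future.apply package).

```r
library(future)
library(future.apply)  # future_lapply's current home
library(data.table)

plan(multisession, workers = 2)  # two background R sessions

# two small throwaway CSVs standing in for the acquisition files
files <- replicate(2, tempfile(fileext = ".csv"))
for (f in files) fwrite(data.table(id = 1:3, x = runif(3)), f)

# each file is read by a separate background process, then stacked
combined <- rbindlist(future_lapply(files, fread))
print(combined)
```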
Last but not least (luckily we have three packages here, so the "least" is anonymized), the fst package makes available to mere mortals a table storage format that is both fast and frugal with hard-drive space. The fst package uses modern compression techniques to minimize the number of bytes required to store the data. Even today the bottleneck for many data tasks is the speed of the hard drive, so the less you store on disk, the faster you can read the data back; although heavy compression can tax the CPU and result in slower read speeds. The compress parameter of write_fst takes values from 0 to 100, where 100 is the highest level of compression. You may want to experiment with the compress parameter to find the sweet spot that achieves the best trade-off between smaller file size and read/write performance.
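A quick way to feel out that trade-off is to write the same table at two compression levels and compare the on-disk footprint. The sketch below uses made-up data; the exact sizes and timings you see will vary with your data and machine.

```r
library(fst)
library(data.table)

# a million-row throwaway table: sequential ids and a repeated factor
# compress well, while the random doubles compress poorly
dt <- data.table(id  = 1:1e6,
                 grp = sample(letters, 1e6, replace = TRUE),
                 x   = rnorm(1e6))

f_raw <- tempfile(fileext = ".fst")
f_max <- tempfile(fileext = ".fst")

write_fst(dt, f_raw, compress = 0)    # no compression
write_fst(dt, f_max, compress = 100)  # maximum compression

# compare the on-disk footprint of the two files
cat("compress = 0:  ", file.size(f_raw), "bytes\n")
cat("compress = 100:", file.size(f_max), "bytes\n")
```

Wrapping each write_fst call in proc.time() deltas, as in the main snippet, also lets you compare write times across compression levels.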
Want to learn R & Data Science in person and live in Australia?
We at evalparse.io are passionate about R and data science training. Sign up today to get notified of upcoming R training in your city.