Short Post: The `future` is `fst`: loading and saving Fannie Mae Loan Acquisition data in 10 lines of code
I am an independent consultant with ten years of experience in credit risk modeling, data science, and analytics-related roles.
How to load data efficiently
I wanted to demonstrate how to conveniently load single-digit gigabyte-sized data into R with speed and ease.
The data I have chosen is the Fannie Mae loan acquisition data, which you can download from this page. Simply download the zip file "Acquisition_All.zip" and unzip it into a folder.
The next step is to load the data into R efficiently. To do this, I have chosen to use the ever-reliable data.table package and two rising superstar packages of the R world: fst and future.

Here's the code:
library(fst)
library(data.table)
library(future)
library(stringr)

plan(multiprocess)

acq_folder = "D:/data/Acquisition_All"
out_acq_fst = "acq_data.fst"

# Note: system.time won't give you accurate timings
# as the multithreaded code in data.table and fst
# seems to throw it off
pt = proc.time()

all_acq_csvs = dir(acq_folder, full.names = TRUE)

# read the data in parallel using future_lapply
acq_data = rbindlist(future_lapply(all_acq_csvs, fread))

# save as fst for a quicker and easier reload later;
# use compress = 100 for maximum compression
write_fst(acq_data, out_acq_fst, compress = 100)

# took about 3 minutes on my 4-core Core i7 laptop equipped with an SSD
print(timetaken(pt))

# simply use the below to reload the data again in future
# acq_data = read_fst(out_acq_fst, as.data.table = TRUE)
What makes the above efficient?
data.table's fread is one of the fastest CSV readers found in any programming language. Using it to load CSVs is a no-brainer: it is incredibly performant, and it is often more "tolerant" of slightly malformed CSVs than other readers.
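To see that tolerance in action, here is a minimal, self-contained sketch (the file and column names are made up for illustration): fread auto-detects the separator, so pipe-delimited files like the Fannie Mae ones need no extra options.

```r
library(data.table)

# write a tiny pipe-delimited file with no header row,
# similar in shape to the Fannie Mae acquisition files
tmp = tempfile(fileext = ".txt")
writeLines(c("100001|2024-01|350000",
             "100002|2024-02|275000"), tmp)

# fread detects the "|" separator automatically;
# col.names supplies headers the file lacks (hypothetical names)
dt = fread(tmp, col.names = c("loan_id", "period", "upb"))
print(dt)
```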
future's (soon to be future.apply's) future_lapply makes it dead easy to run multiple R processes in the background, so you can utilize your multi-core CPU to its full potential and load multiple files simultaneously.
Last but not least (luckily we have three packages here, so the "least" remains anonymous), the fst package makes available to mere mortals a table storage format that is both fast and frugal with hard-drive space. The fst package uses modern compression techniques to minimize the number of bytes required to store the data. Even today the bottleneck for many data tasks is the speed of the hard drive, so the less you store on the hard drive, the faster it is to read the data back; although heavy compression can tax the CPU and result in slower read speeds.

The compress parameter in write_fst takes values from 1 to 100, where 100 is the highest level of compression. You may want to experiment with the compress parameter to find the sweet spot that gives you the best trade-off between smaller file size and read/write performance.
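Here is a small, hypothetical sketch of such an experiment (the toy data and file names are made up; sizes and timings on your real data will differ):

```r
library(fst)
library(data.table)

# toy data with repeated values so compression has something to work with
dt = data.table(id = 1:100000, grp = rep(letters, length.out = 100000))

f_low  = tempfile(fileext = ".fst")
f_high = tempfile(fileext = ".fst")

write_fst(dt, f_low,  compress = 1)    # fast, light compression
write_fst(dt, f_high, compress = 100)  # slower, maximum compression

# compare on-disk sizes; higher compress usually means a smaller file
print(file.size(f_low))
print(file.size(f_high))

# round-trip check: the data survives intact either way
dt2 = read_fst(f_high, as.data.table = TRUE)
```

Wrapping each write_fst/read_fst call in system.time (or proc.time, as in the post) then lets you weigh the size savings against the extra CPU cost.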
Want to learn R & Data Science in person and live in Australia?
We at evalparse.io are passionate about R and data science training. Sign up today to get notified of upcoming R training in your city.