Machine Learning Data Scientist at Team Lead level

What I would like to see in R 5.0

Published Sep 20, 2017. Last updated Apr 29, 2020.

R has been my go-to language since 2008, and it has grown into an amazingly productive environment since then with the introduction of RStudio, Shiny, and dplyr into the ecosystem. One standout change for me was the introduction of the data.table package, which lets me manipulate data at what feels like the speed of light. Also, as the amount of RAM that fits into a desktop or laptop computer increases, it is becoming more and more viable to do large-ish data manipulation in R; hence the days of SAS look increasingly numbered to me.

However, there are things I would like to see improved in the next major-major version of R, 5.0. Here is my list.

Multi-threading as a first class citizen

R introduced the excellent parallel package into the base distribution years ago, and I have used it to speed up some programs. Later, when I discovered the future package, the world became even more amazing, as I can write code that works like a goroutine in that it is non-blocking (though unfortunately not as lightweight). However, on Windows many of the functions in these packages rely on starting another R session in the background to perform calculations in parallel. This is not very efficient, as starting up a background R session is expensive.
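As a rough illustration of the background-session approach, here is a minimal sketch using only the base parallel package. The worker count of 2 and the squaring task are arbitrary choices for the example; the startup time it measures will vary by machine, so no figure is claimed here.

```r
library(parallel)

# Start two background R sessions (a socket cluster). This is the kind of
# cluster that works on Windows, where forking is not available.
# system.time() records how long the startup itself takes.
startup <- system.time(cl <- makeCluster(2))

# Farm a trivial task out to the workers: square each input.
squares <- parLapply(cl, 1:4, function(x) x^2)

# Always shut the background sessions down when finished.
stopCluster(cl)

print(startup[["elapsed"]])  # seconds spent just launching the workers
print(unlist(squares))
```

For a task this small, the time spent launching the sessions is likely to dominate the computation, which is the inefficiency described above.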

Native support for lightweight multithreading will bring R into the post-Moore’s-law world of multi-core CPUs, and it is simply a must for R to continue to be relevant.

Introduce the pipe operator |> into the language

Currently, in order to declare an infix operation we have to do something like

`%infix_fn%` <- function(left, right) {
  # …some code
}

and then use the function like so

d <- a %infix_fn% b

However, some symbols are special in that they can be used as infix operators without the surrounding % signs; examples include

+, -, /, *.

Yes a+b is just `+`(a,b) in disguise.
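To make this concrete, here is a small runnable sketch. The operator name `%+%` and its string-pasting behaviour are made up for illustration; they are not part of base R.

```r
# A user-defined infix operator: the name must be wrapped in % signs,
# and back-ticks are needed when assigning to such a name.
`%+%` <- function(left, right) paste0(left, right)

greeting <- "data" %+% ".table"

# Built-in operators are ordinary functions too; back-ticks let us
# call them in prefix form.
three <- `+`(1, 2)

print(greeting)
print(three)
```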

There are two notable uses of infix operators in the wild. One is data.table’s use of the := operator, which is actually a properly recognized symbol in the R source code and so needs no % delimiters. The data.table package has cleverly exploited this fact to perform in-place assignments with :=.

Another example is the excellent magrittr package, which uses %>% to pipe the output of one function into the first argument of the next. Given the popularity (and awesomeness) of pipes, it makes sense to introduce |> into the language, perhaps implemented as a first-class native function.
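The idea of piping can be sketched in base R with a user-defined operator. This is a deliberately simplified stand-in, not how magrittr works: the hypothetical `%|>%` below only accepts a bare function on its right-hand side, whereas magrittr's %>% also handles calls with extra arguments.

```r
# A toy pipe: pass the left-hand value to the function on the right.
`%|>%` <- function(lhs, rhs) rhs(lhs)

# Reads left to right: take the vector, square-root it, then sum it.
piped <- c(1, 4, 9) %|>% sqrt %|>% sum

# The equivalent nested call reads inside-out.
nested <- sum(sqrt(c(1, 4, 9)))

print(piped)
print(nested)
```

Both forms compute the same value; the piped version simply lists the steps in the order they happen.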

Run faster

Ruby’s BDFL wanted Ruby 3.0 to be 3 times as fast as Ruby 2.x. So R 5.0 should be 5 times as fast as R 4.x.

Allow more than `2^31-1` rows in `data.frame`s

Currently there is a limit on how many rows a data.frame can contain, which is 2^31-1, or approximately 2 billion rows. Some might argue that if you have that much data you should be using Spark, Dask, HiFrames, JuliaDB, fst, disk.frame or a database. But as RAM gets cheaper and larger, it will only be several years before analysing 2 billion rows of data becomes viable on a humble laptop. Using 3D XPoint storage as RAM is another exciting development that might open the door to cheap (albeit a bit slower) billion-row-scale data manipulation on laptops. Hence R needs to modernize its internals in order to compete for these types of workloads; ALTREP is a positive step in that direction.
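The figure 2^31-1 coincides with the largest value R's 32-bit integer type can hold, which can be checked directly. This is only a small sketch of that arithmetic; it does not show how data.frame stores its row count internally.

```r
# The largest representable integer in R.
max_int <- .Machine$integer.max

# It equals 2^31 - 1 (the right-hand side is computed as a double).
matches_limit <- max_int == 2^31 - 1

# Going one past the limit overflows: integer arithmetic yields NA
# (with a warning, suppressed here to keep the output clean).
overflowed <- suppressWarnings(max_int + 1L)

print(max_int)
print(matches_limit)
print(is.na(overflowed))
```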

An R Alternative — Julia

At least two of the features that I have mentioned are already available in Julia today: multi-threading and |> are both well supported. The third point, about speed, is a bit more complicated. No one doubts that Julia is fast at compute-type tasks, but it is not as fully developed when it comes to data manipulation. So Julia still has some catching up to do in some areas, but I believe it will be HUGE one day, just like R and Python are today.


I am a Machine Learning developer and Data Scientist Team Lead. I use R, Python, Julia, and Scala daily and have extensive experience with machine learning at both a user level and a developer level. I develop a number of p...
