# Julia vs R vs Python: string-sort performance + an unfinished journey to optimizing Julia's performance

## String-sorting is now faster in Julia

A *much* faster new string sorting algorithm, `radixsort`, has been published as part of *SortingLab.jl*. It can sort strings up to 3x faster! It is especially effective for fixed-length strings.

### Example usage

```
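# run once to install SortingLab
# using Pkg; Pkg.add("SortingLab")
using SortingLab
# generate a vector of 100 million strings drawn from 1 million unique IDs
# (`string(i, pad=10)` replaces `dec(i, 10)`, which was removed in Julia 1.0)
idstrvec = rand("id" .* string.(1:1_000_000, pad=10), 100_000_000)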
# sort the string vector
idstrvec_radixsorted = radixsort(idstrvec)
issorted(idstrvec_radixsorted) #true
# sort the original vector in-place
radixsort!(idstrvec)
issorted(idstrvec) #true
```

## Benchmarks vs R vs Python

The performance of string sorting is a nuanced topic. It appears that Julia is the fastest (as is shown in the cover photo). However, the story can change if we look at a couple more synthetic examples.

Julia is the fastest when the number of unique strings is close to the number of strings. Python is second using Numpy's sort, while R is a distant third.

R is the fastest if there are lots of duplicate values (put another way, if the ratio of unique strings to strings is small, e.g. 1:100) and there are a large number of elements. Julia can sometimes beat R even with lots of duplicates if the number of elements to sort is small (e.g. 10 million); this is shown in the benchmarks below.

Why is R so fast? It is using a form of string interning, which is discussed in more detail later in the blog post. Theoretically, this approach requires more set-up time. Julia doesn't have interned strings by default and hence is not able to perform the optimization that R uses out-of-the-box.

Side note about alternative data types: factor/categorical |
---|
If the number of unique strings is small, one can use factor/categorical types in Julia/R/Python to represent the string vector instead of using strings. These can yield significant speed-ups in sorting performance with optimized algorithms. |

Update 10 May 2018 |
---|
Jeremy Metz in the comments showed me that Numpy has a fast sort. Python is not as slow as in the original post but is still slower than the best Julia implementation. See also this SO post. |

### Benchmarking code

The benchmarking code for Julia, R, and Python is here. The R and Python code was adapted from the data.table group-by benchmarks page.

**Sorting performance on string vector of length 10 million**

## The journey towards faster string-sort in Julia

If you are interested in the journey leading up to the implementation of string radix sort, read on. You might find these points interesting:

- How to load underlying bytes from strings?
- Some pointers about pointer arithmetic in Julia

## Motivation and previous state of String sorting in Julia

Being able to sort strings fast is a key pillar of modern data manipulation. It's often acknowledged that when we **sort** a vector of strings, what we actually want is to **group** them; even so, it is still valuable to be able to sort strings fast.

However, an initial investigation revealed that string sorting in Julia is slow compared to R when sorting strings with lots of duplicated values. This is probably comparing Julia to C via R, but users, to put it bluntly, probably don't care. A 3x performance drop is not a fantastic story for Julia. It's meant to be fast, right? Python is also slow (see benchmarks above), and hopefully pandas2 can help address that.

Side note about "slowness" of data-tasks in Julia |
---|
The apparent slowness of Julia is a general theme in my posts, see this (about data manipulation) and this (about group-by). I think it's to do with the immaturity of the Julia data-ecosystem and not a true reflection of Julia's slowness. |

## Towards faster string sorts

With that in mind, I wanted to investigate if Julia can become fast at string-sorting as well; at least get close to R's performance in string sorting. After some research I found that R uses radix sort to sort strings, so a natural starting point is a Julia implementation of string radix sort.

Most of my research points towards some variant of Most Significant Digit (MSD) radix sort for strings, see 1 and 2. Also, there is a Least Significant Digit (LSD) radix sort for some bits types (but not strings) already implemented in *SortingAlgorithms.jl*. So I have implemented radix sort algorithms of both the MSD and LSD varieties.

### About radix sort

I found these lecture notes to be a good introduction to string radix sort. Even though the source code is in C, one can easily translate it line-by-line to Julia.

Below are a couple of issues that I encountered while developing the radix string sort algorithms.

### Problem 1: access to underlying bytes

To perform a radix sort, one needs access to the underlying bytes. One way to load the `n`th byte (code unit) of a string is via `codeunit(s, n)`, e.g.

```
# return the n-th byte (code unit) of `s`, skipping the bounds check
charAt(s::String, n) = @inbounds codeunit(s, n)
```

I timed the above, and according to my calculations, this will be too slow to match R's performance.

After much experimentation, I found that loading 8 bytes at a time is almost as fast as loading just 1 byte, so that became my preferred approach. E.g. see below

```
# `pointer` returns a pointer to `UInt8` (i.e. a byte) that points to the first byte of a string
# `Ptr{UInt64}` converts the pointer to a pointer of `UInt64` and so `unsafe_load`
# will load exactly 8 bytes (64 bits) starting from the location pointed at by the pointer
load_8bytes(s) = s |> pointer |> Ptr{UInt64} |> unsafe_load
```

There are two sub-issues with this approach as well:

**Sub-issue 1: string shorter than 8 bytes**

One has to be careful about how to handle the shorter case. One approach is to load 8 bytes anyway and set the unneeded bits to 0; however, this can end up accessing memory not available to the program and cause a crash. This was pointed out to me during the code review process. From my testing, loading 8 bytes for shorter strings is fine most of the time, but we still have to be careful not to crash the program. The current way to address it is to test whether the string is shorter than 8 bytes and, if so, use a slower loader.

There are more optimization possibilities. For example, most of the time only a small fraction of strings are stored in locations where loading 8 bytes will cause an issue. To understand this better we need some understanding of how memory is organized; below is my summarised understanding:
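
As a rough illustration of the "test the length, then fall back to a slower loader" idea, here is a minimal sketch. The name `load_8bytes_safe` is hypothetical and this is not SortingLab's actual code. It also assumes we want the first byte of the string in the most significant position of the result (hence `ntoh`), so that comparing the integers agrees with comparing the strings byte by byte.

```julia
# Hypothetical helper (not SortingLab's actual code): build a UInt64 key from
# the first 8 bytes of `s`, with the first byte most significant and
# zero-padding for strings shorter than 8 bytes.
function load_8bytes_safe(s::String)
    n = sizeof(s)
    if n >= 8
        # Fast path: at least 8 valid bytes exist, so the raw load stays in bounds.
        # `ntoh` byte-swaps on little-endian machines and is a no-op on big-endian ones.
        return GC.@preserve s ntoh(unsafe_load(Ptr{UInt64}(pointer(s))))
    end
    # Slow path: read only the bytes that exist, one at a time.
    key = zero(UInt64)
    for i in 1:n
        key |= UInt64(codeunit(s, i)) << (8 * (8 - i))
    end
    return key
end
```

The slow path never touches memory past the end of the string, at the cost of a loop of up to 7 iterations for short strings.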

- data is loaded into memory in pages of a certain size (on most 64-bit machines the size will be at least 4 KB)
- when byte-loading you can load from anywhere within the same page, but loading across page boundaries *may* crash the program
- therefore only those strings that are 8 bytes or less from a page boundary will cause an issue
- as pointed out by Julia Discourse member @stevengj, one can check if a string `s` is near the boundary using `(UInt(pointer(s)) & 0xfff) > 0xff8`
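
To make the boundary check concrete, here is a small sketch that wraps the expression above in a function. The name `near_page_end` is made up for illustration, and the constants assume 4 KB pages. An 8-byte load starting at page offset `o` touches offsets `o` through `o + 7`, so it stays inside the page exactly when `o <= 0xff8`.

```julia
# Assumes 4 KB (0x1000-byte) pages; `near_page_end` is a hypothetical name.
const PAGE_MASK = UInt(0xfff)

# true if an 8-byte load starting at address `addr` would spill into the next page
near_page_end(addr::UInt) = (addr & PAGE_MASK) > 0xff8

# convenience method: apply the same check to the first byte of a string
near_page_end(s::String) = GC.@preserve s near_page_end(UInt(pointer(s)))
```

A loader could use this check to take the fast 8-byte path for most short strings and fall back to a byte-by-byte loader only for the few that sit at the end of a page.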

If you want to learn more about computer memory, I recommend brilliant.org's Computer Memory course.

Finally, it's worthwhile to remember that the slowest part of the sorting algorithm is not in the loading of bytes, it is the actual sorting.

**Sub-issue 2: string longer than 8 bytes**

If the string is longer than 8 bytes, I can sort the string vector iteratively, 8 bytes at a time. There are well-known methods for doing this in both the MSD and LSD variants of radix sort, which I shall not repeat here.
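
For readers who want a feel for the LSD flavour of this, below is a minimal sketch that uses only Base Julia. It is not the SortingLab implementation: the helper names `chunk_key` and `chunked_lsd_sort` are hypothetical, a stable `MergeSort` stands in for the radix pass, and it assumes the strings contain no NUL bytes (otherwise zero-padding makes `"ab"` and `"ab\0"` indistinguishable).

```julia
# Hypothetical helper: the k-th 8-byte chunk of `s` (k = 1, 2, ...) as a UInt64
# with the first byte most significant, zero-padded past the end of the string.
function chunk_key(s::String, k::Integer)
    key = zero(UInt64)
    offset = 8 * (k - 1)
    for i in 1:8
        pos = offset + i
        pos > sizeof(s) && break
        key |= UInt64(codeunit(s, pos)) << (8 * (8 - i))
    end
    return key
end

# LSD over chunks: stable-sort by the last chunk first and the first chunk last.
function chunked_lsd_sort(v::Vector{String})
    isempty(v) && return String[]
    out = copy(v)
    nchunks = cld(maximum(sizeof, out), 8)
    for k in nchunks:-1:1
        # MergeSort is stable, so the order from later chunks survives ties
        sort!(out; by = s -> chunk_key(s, k), alg = MergeSort)
    end
    return out
end
```

Because each pass is stable, ties on an earlier chunk keep the order established by the later chunks, which is what makes the least-significant-first order of passes work.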

### Problem 2: Permuting strings while sorting radix

Once I have loaded the underlying bytes into a bytes-vector, I can sort the bytes-vector using radix sort, which is quite fast. However, I also need to permute the original string-vector at the same time. To do this I have coded up a `sorttwo!(bytesvec, stringvec)` function, which sorts the bytes-vector `bytesvec` and permutes the string vector in the same way `bytesvec` is permuted in the sorting process. The `sorttwo!` function is a simple adaptation of the existing radix sort function in *SortingAlgorithms.jl*.

The other application of `sorttwo!` is in the implementation of a faster `sortperm` for strings. For R users, `sortperm` is the equivalent of R's `order`.
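
The idea of sorting one vector while permuting a companion vector can be sketched as follows. This is an illustrative LSD radix sort written from scratch under my own assumptions (one byte per pass, `UInt64` keys); the name `sorttwo_sketch!` is hypothetical and the code is not the actual `sorttwo!` from SortingLab or the routine in *SortingAlgorithms.jl*.

```julia
# Hypothetical sketch: LSD radix sort of `bytesvec`, one byte per pass,
# applying every move to `stringvec` as well.
function sorttwo_sketch!(bytesvec::Vector{UInt64}, stringvec::Vector)
    n = length(bytesvec)
    @assert n == length(stringvec)
    keys_buf = similar(bytesvec)
    vals_buf = similar(stringvec)
    for shift in 0:8:56
        # histogram of the current byte
        counts = zeros(Int, 256)
        for k in bytesvec
            counts[Int((k >> shift) & 0xff) + 1] += 1
        end
        # turn counts into the starting position of each bucket
        pos = 1
        for b in 1:256
            c = counts[b]
            counts[b] = pos
            pos += c
        end
        # stable scatter of keys and companions into the buffers
        for i in 1:n
            b = Int((bytesvec[i] >> shift) & 0xff) + 1
            j = counts[b]
            keys_buf[j] = bytesvec[i]
            vals_buf[j] = stringvec[i]
            counts[b] = j + 1
        end
        copyto!(bytesvec, keys_buf)
        copyto!(stringvec, vals_buf)
    end
    return bytesvec, stringvec
end
```

Passing `collect(1:n)` as the second vector instead of the strings returns the sorting permutation, which is the `sortperm` use mentioned above.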

### Implementation of MSD and LSD algorithms

I have implemented an MSD and an LSD variant. From my research, it is often the case that MSD algorithms work better for variable-length strings and LSD algorithms work best for fixed-length strings.

Some even claim that LSD doesn't work on vectors of variable-length strings. I think this is not true, as you can represent an empty byte with 0 (even though that's technically a `null`). For example, if I load the string "abc" using the 8-byte loader and arrange the bytes most-significant first, it becomes, in hexadecimal form, `0x6162630000000000`, where `61`, `62`, and `63` are the hexadecimal representations of the ASCII codes of "a", "b", and "c". I can then sort that using radix sort along with other strings. However, whether this is the most efficient approach is the real question, to which I do not have an answer.

My implementation of MSD radixsort is based on radix 3-way quicksort which is well-known and documented in 1 and 2 already.
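
For reference, the textbook form of radix 3-way quicksort can be sketched in a few lines of Julia. This is a plain transcription of the well-known algorithm, not my tuned implementation; the names `byte_at` and `three_way_radix_quicksort!` are hypothetical.

```julia
# Byte at position `d`, or -1 past the end so that shorter strings sort first.
byte_at(s::String, d::Int) = d <= sizeof(s) ? Int(codeunit(s, d)) : -1

# Sort v[lo:hi] in place, assuming all those strings agree on their first d-1 bytes.
function three_way_radix_quicksort!(v::Vector{String}, lo::Int = 1,
                                    hi::Int = length(v), d::Int = 1)
    hi <= lo && return v
    lt, gt = lo, hi
    pivot = byte_at(v[lo], d)
    i = lo + 1
    while i <= gt
        b = byte_at(v[i], d)
        if b < pivot
            v[lt], v[i] = v[i], v[lt]
            lt += 1
            i += 1
        elseif b > pivot
            v[i], v[gt] = v[gt], v[i]
            gt -= 1
        else
            i += 1
        end
    end
    # v[lo:lt-1] < pivot, v[lt:gt] == pivot, v[gt+1:hi] > pivot at byte d
    three_way_radix_quicksort!(v, lo, lt - 1, d)
    pivot >= 0 && three_way_radix_quicksort!(v, lt, gt, d + 1)
    three_way_radix_quicksort!(v, gt + 1, hi, d)
    return v
end
```

Only the middle partition, whose strings share the pivot byte, moves on to the next byte position; the other two partitions are re-partitioned on the same byte.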

From my benchmarking, my implementation of MSD is not as performant as the LSD algorithm, even for variable-length strings. This seems odd, as most of my research points toward MSD being more performant than LSD. This could be an indication that my implementation of MSD radix sort is suboptimal.

### Actually, why is R so fast? Opening windows to other approaches

A number of individuals pointed out that R uses a form of string interning to store its strings. My understanding of how it works is as follows. Consider `a = c("abcdefghi", "abcdefghi")`, a vector of two strings with the same content. Here `a[1]` and `a[2]` both point to a single storage space for "abcdefghi" instead of storing two copies of the same string.

Below is an excerpt from R's `?sort` documentation:

The [radix sort] implementation is orders of magnitude faster than shell sort for character vectors, in part thanks to clever use of the internal CHARSXP table.

This led me to the documentation on CHARSXP, which states:

There is a global cache for CHARSXPs created by mkChar — the cache ensures that most CHARSXPs with the same contents share storage.

If the same string is only stored once, this can lead to space efficiencies. Also, more importantly, one may be able to exploit that (like R has) to make more performant algorithms.

Furthermore, this has the potential of simplifying group-by operations. If the user knows that all strings with the same content have the same pointer, then we can simply group-by the pointer which is of fixed size and is numeric and hence quicker to sort and group.
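
To illustrate the group-by idea without relying on R's internals or on any package, here is a minimal sketch in which a `Dict`-based pool hands out one integer ID per distinct string. The integer ID plays the role that the shared pointer would play in R; `intern_ids` and `group_counts` are hypothetical names.

```julia
# Hypothetical intern pool: map each distinct string to a small integer ID.
function intern_ids(v::Vector{String})
    pool = Dict{String,Int}()
    ids = Vector{Int}(undef, length(v))
    for (i, s) in enumerate(v)
        # first occurrence gets the next free ID; later ones reuse it
        ids[i] = get!(pool, s, length(pool) + 1)
    end
    return ids, pool
end

# Group sizes computed from the fixed-size IDs alone, with no string comparisons.
function group_counts(ids::Vector{Int}, ngroups::Int)
    counts = zeros(Int, ngroups)
    for id in ids
        counts[id] += 1
    end
    return counts
end
```

Building the pool is the extra set-up cost; once it exists, grouping only touches integers.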

However, Julia doesn't have interned strings by default (although there is a package, *InternedStrings.jl*), so these types of optimization are not readily available, which is why it may be hard for Julia to match R's string sorting performance in all cases. But this does open up an alternative way of looking at things: R will take longer to load these strings, as it also needs to load them into the global cache. This longer loading time buys faster sorting. It may therefore be possible in Julia to create a data structure that mimics R's behavior and results in more performant sorting. So comparing R's sorting speed to Julia's is currently not the complete story, even though on the surface R appears faster and, from a user's perspective, R is still the king of speed once the data is loaded.

### Future works

An obvious way to speed things up is to adopt parallel techniques. The paper "Engineering Parallel String Sorting for Multi-Core Systems" is the top Google result for "parallel string sort". Its findings show that "multi-key quicksort" (a multi-pivot variant of the MSD radix (quick)sort I have ported) is the fastest sequential sorting algorithm they found implemented in C/C++. This is worth investigating. They also point towards their parallel variant called *Super Scalar String Sample Sort* ($S^5$), which is performant on multi-core systems.

As discussed, my implementation of MSD string radix sort might be sub-optimal. This paper points towards Rantala's C/C++ implementations of string radix sort as high quality, so that could be a good benchmark to try and match.

I have experimented with converting Strings to InternedStrings via the *InternedStrings.jl* package, and I found the performance to be too slow. There are optimizations possible for the conversion process, so I will check back once those optimizations are in to see if we can leverage InternedStrings to create even faster string sorts.