Use STL containers! ...unless you want to go faster

Published Jul 22, 2019Last updated Sep 21, 2020

We all know to use the STL whenever possible, not to reinvent the wheel and to move forward as a community.
However, a few shortcomings of it and delays in the integration process are often reasons to roll your own containers/algorithms (mozilla::vector, eastl::list, etc). We're going to see a real-life example of this.

In this case we have an 2D image (matrix of values) that we need to serialize into a long 1D vector (put the rows in a line).
Each row is contiguous but they aren't necessarily contiguous to each other.

Mind you, in this real-life example the image is huge and we'll need maximum performance, we only wish to pay for what we need.

We'll use the following implementation of image:

const size_t width = 2000000, height = 5; // Small sizes for example sake
std::vector<std::vector<double>> image (height);
image[0].resize(width);
image[1].resize(width);
image[2].resize(width);
image[3].resize(width);
image[4].resize(width);
std::iota(image[0].begin(), image[0].end(), 1); // Giving non-trivial values
std::iota(image[1].begin(), image[1].end(), 1);
std::iota(image[2].begin(), image[2].end(), 1);
std::iota(image[3].begin(), image[3].end(), 1);
std::iota(image[4].begin(), image[4].end(), 1);

We're using a contiguous std::vector for the image but that's just to simplify the access to the rows (subscript operator []) in this example.

So we want to create an std::vector and fill it with the values!
But that's where we get stuck: every option incurs a downside...

std::vector<double> res1 (image.size()*width);
for (size_t i = 0; i < image.size(); ++i)
    std::copy(image[i].begin(), image[i].end(), res1.begin()+i*width);

That example can nicely be parallelized, but the creation of the vector (res1(image.size()*width);) initializes its elements, which is a costly operation when the image gets really large.

Moving on to the famous "reserve & push_back" solution:

std::vector<double> res2;
res2.reserve(width*height);
for (size_t i = 0; i < height; ++i)
    for (size_t j = 0; j < width; ++j)
        res2.push_back(image[i][j]);

Two downsides here: push_back is not parallelizable and each one costs a size check/increment at each step. Same goes for the assign solution.
We are limited by the fact that for more security, the size member of std::vector has no public setter, so we cannot "reserve, set size & copy data", which is essentially what we want to do.

So, here's an idea to have a fast copy:

template <typename T>
class serializedImage
{
public:
    serializedImage(const std::vector<std::vector<T>>& vectors) :
        data   (nullptr),
        width  (vectors[0].size()),
        height (vectors.size()),
        size   (vectors[0].size()*vectors.size())
    {
        data = new T[size]; // 'new T[size]()' would initialize
#pragma omp parallel for
        for (size_t i = 0; i < height; ++i)
            std::copy(vectors[i].begin(), vectors[i].end(), data+i*width);
    }

    ~serializedImage() { delete[] data; data = nullptr; }

private:
    T* data; size_t width; size_t height; size_t size;
};

For readability sake, the code has been reduced to its bare minimum, without the security checks (empty input, operators, namespace, etc).

So here we are! From a large series of big vectors we have a way to efficiently align them, with no superfluous cost and in parallel fashion. And, we respect RAII with a very simple construction!

serializedImage<double> res3 (image);

alt
Source: Giphy

By the way, I tried to make the ctor a Fold Expression that uses a variable number of input vectors (like size = (vectors.size()+...);) but didn't manage so far, if anyone can it'd be fun to see.