The problem
For a long time I had the impression that using a nested std::vector<std::vector...>
for simulating an N-dimensional array is in general bad, since the memory is not guarantee to be contiguous, and one may have cache misses. I thought it's better to use a flat vector and map from multiple dimensions to 1D and vice versa. So, I decided to test it (code listed at the end). It is pretty straightforward, I timed reading/writing to a nested 3D vector vs my own 3D wrapper of an 1D vector. I compiled the code with both g++
and clang++
, with -O3
optimization turned on. For each run I changed the dimensions, so I can get a pretty good idea about the behaviour. To my surprise, these are the results I obtained on my machine MacBook Pro (Retina, 13-inch, Late 2012), 2.5GHz i5, 8GB RAM, OS X 10.10.5:
g++ 5.2
dimensions nested flat
X Y Z (ms) (ms)
100 100 100 -> 16 24
150 150 150 -> 58 98
200 200 200 -> 136 308
250 250 250 -> 264 746
300 300 300 -> 440 1537
clang++ (LLVM 7.0.0)
dimensions nested flat
X Y Z (ms) (ms)
100 100 100 -> 16 18
150 150 150 -> 53 61
200 200 200 -> 135 137
250 250 250 -> 255 271
300 300 300 -> 423 477
As you can see, the "flatten" wrapper is never beating the nested version. Moreover, g++'s libstdc++ implementation performs quite badly compared to libc++ implementation, for example for 300 x 300 x 300
the flatten version is almost 4 times slower than the nested version. libc++ seems to have equal performance.
My questions:
- Why isn't the flatten version faster? Shouldn't it be? Am I missing something in the testing code?
- Moreover, why does g++'s libstdc++ performs so badly when using flatten vectors? Again, shouldn't it perform better?
The code I used:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <random>
#include <vector>
// Thin wrapper around flatten vector
template<typename T>
class Array3D
{
std::size_t _X, _Y, _Z;
std::vector<T> _vec;
public:
Array3D(std::size_t X, std::size_t Y, std::size_t Z):
_X(X), _Y(Y), _Z(Z), _vec(_X * _Y * _Z) {}
T& operator()(std::size_t x, std::size_t y, std::size_t z)
{
return _vec[z * (_X * _Y) + y * _X + x];
}
const T& operator()(std::size_t x, std::size_t y, std::size_t z) const
{
return _vec[z * (_X * _Y) + y * _X + x];
}
};
int main(int argc, char** argv)
{
std::random_device rd{};
std::mt19937 rng{rd()};
std::uniform_real_distribution<double> urd(-1, 1);
const std::size_t X = std::stol(argv[1]);
const std::size_t Y = std::stol(argv[2]);
const std::size_t Z = std::stol(argv[3]);
// Standard library nested vector
std::vector<std::vector<std::vector<double>>>
vec3D(X, std::vector<std::vector<double>>(Y, std::vector<double>(Z)));
// 3D wrapper around a 1D flat vector
Array3D<double> vec1D(X, Y, Z);
// TIMING nested vectors
std::cout << "Timing nested vectors...
";
auto start = std::chrono::steady_clock::now();
volatile double tmp1 = 0;
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
vec3D[x][y][z] = urd(rng);
tmp1 += vec3D[x][y][z];
}
}
}
std::cout << "Sum: " << tmp1 << std::endl; // we make sure the loops are not optimized out
auto end = std::chrono::steady_clock::now();
std::cout << "Took: ";
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << ms << " milliseconds
";
// TIMING flatten vector
std::cout << "Timing flatten vector...
";
start = std::chrono::steady_clock::now();
volatile double tmp2 = 0;
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
vec1D(x, y, z) = urd(rng);
tmp2 += vec1D(x, y, z);
}
}
}
std::cout << "Sum: " << tmp2 << std::endl; // we make sure the loops are not optimized out
end = std::chrono::steady_clock::now();
std::cout << "Took: ";
ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << ms << " milliseconds
";
}
EDIT
Changing the Array3D<T>::operator()
return to
return _vec[(x * _Y + y) * _Z + z];
as per @1201ProgramAlarm's suggestion does indeed get rid of the "weird" behaviour of g++, in the sense that the flat and nested versions take now roughly the same time. However it's still intriguing. I thought the nested one will be much worse due to cache issues. May I just be lucky and have all the memory contiguously allocated?
See Question&Answers more detail:os