c++ - Performance degradation due to default initialisation of elements in standard containers

Question

Ask a Question

Welcome To Ask or Share your Answers For Others

c++ - Performance degradation due to default initialisation of elements in standard containers

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

(Yes, I know there is a question with almost the same title, but the answer was not satisfactory, see below)

EDIT Sorry, the original question didn't use compiler optimization. This is now fixed, but to avoid trivial optimization and to come closer to my actual use case, the test has been split into two compilation units.

The fact that the constructor of std::vector<> has linear complexity is a nuisance when it comes to performance-critical applications. Consider this simple code

// compilation unit 1:
void set_v0(type*x, size_t n)
{
  for(size_t i=0; i<n; ++i)
    x[i] = simple_function(i);
}

// compilation unit 2:
std::vector<type> x(n);                     // default initialisation is wasteful
set_v0(x.data(),n);                         // over-writes initial values

when a significant amount of time is wasted by constructing x. The conventional way around this, as explored by this question, seems to be to merely reserve the storage and use push_back() to fill in the data:

// compilation unit 1:
void set_v1(std::vector<type>&x, size_t n)
{
  x.reserve(n);
  for(size_t i=0; i<n; ++i)
    x.push_back(simple_function(i));
}

// compilation unit 2:
std::vector<type> x(); x.reserve(n);        // no initialisation
set_v1(x,n);                                // using push_back()

However, as indicated by my comment, the push_back() is inherently slow, making this second approach actually slower than the first one for sufficiently simply constructible objects, such as size_ts, when for

simple_function = [](size_t i) { return i; };

I get the following timings (using gcc 4.8 with -O3; clang 3.2 produced ~10% slower code)

timing vector::vector(n) + set_v0();
n=10000 time: 3.9e-05 sec
n=100000 time: 0.00037 sec
n=1000000 time: 0.003678 sec
n=10000000 time: 0.03565 sec
n=100000000 time: 0.373275 sec

timing vector::vector() + vector::reserve(n) + set_v1();
n=10000 time: 1.9e-05 sec
n=100000 time: 0.00018 sec
n=1000000 time: 0.00177 sec
n=10000000 time: 0.020829 sec
n=100000000 time: 0.435393 sec

The speed-up actually possible if one could elide the default construction of elements can be estimated by the following cheating version

// compilation unit 2
std::vector<type> x; x.reserve(n);          // no initialisation
set_v0(x,n);                                // error: write beyond end of vector
                                            // note: vector::size() == 0

when we get

timing vector::vector + vector::reserve(n) + set_v0();          (CHEATING)
n=10000 time: 8e-06 sec
n=100000 time: 7.2e-05 sec
n=1000000 time: 0.000776 sec
n=10000000 time: 0.01119 sec
n=100000000 time: 0.298024 sec

So, my first question: Is there any legal way to use a standard library container which would give these latter timings? Or do I have to resort to manage the memory myself?

Now, what I really want, is to use multi-threading to fill in the container. The naive code (using openMP in this example for simplicity, which excludes clang for the moment)

// compilation unit 1
void set_v0(type*x, size_t n)
{
#pragma omp for                       // only difference to set_v0() from above 
  for(size_t i=0; i<n; ++i)
    x[i] = simple_function(i);
}

// compilation unit 2:
std::vector<type> x(n);               // default initialisation not mutli-threaded
#pragma omp parallel
set_v0(x,n);                          // over-writes initial values in parallel

now suffers from the fact that the default initialization of all elements is not multi-threaded, resulting in an potentially serious performance degradation. Here are the timings for set_omp_v0() and a equivalent cheating method (using my macbook's intel i7 chip with 4 cores, 8 hyperthreads):

timing std::vector::vector(n) + omp parallel set_v0()
n=10000 time: 0.000389 sec
n=100000 time: 0.000226 sec
n=1000000 time: 0.001406 sec
n=10000000 time: 0.019833 sec
n=100000000 time: 0.35531 sec

timing vector::vector + vector::reserve(n) + omp parallel set_v0(); (CHEATING)
n=10000 time: 0.000222 sec
n=100000 time: 0.000243 sec
n=1000000 time: 0.000793 sec
n=10000000 time: 0.008952 sec
n=100000000 time: 0.089619 sec

Note that the cheat version is ~3.3 times faster than the serial cheat version, roughly as expected, but the standard version is not.

So, my second question: Is there any legal way to use a standard library container which would give these latter timings in multi-threaded situations?

PS. I found this question, where std::vector is tricked into avoiding the default initialization by providing it with a uninitialized_allocator. This is no longer standard compliant, but works very well for my test case (see my own answer below and this question for details).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

171 views

1 Answer

深蓝 · Answer 1 · 2021-10-17T02:51:16+0000

With g++ 4.5 I was able to realize an approximate 20% reduction in runtime from v0 (1.0s to 0.8s) and slightly less from 0.95s to 0.8s for v1 by using a generator to construct directly:

struct Generator : public std::iterator<std::forward_iterator_tag, int>
{
    explicit Generator(int start) : value_(start) { }
    void operator++() { ++value_; }
    int operator*() const { return value_; }

    bool operator!=(Generator other) const { return value_ != other.value_; }

    int value_;
};

int main()
{
    const int n = 100000000;
    std::vector<int> v(Generator(0), Generator(n));

    return 0;
}

Categories

c++ - Performance degradation due to default initialisation of elements in standard containers

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags