What is the difference between seq and par/par_unseq?
std::for_each(std::execution::seq, std::begin(v), std::end(v), function_call);
std::execution::seq stands for sequential execution. It is the default if you do not specify an execution policy at all. It forces the implementation to perform all function calls in sequence, and it is guaranteed that everything is executed by the calling thread.
In contrast, std::execution::par and std::execution::par_unseq imply parallel execution. By choosing them, you promise that all invocations of the given function can be safely executed in parallel without violating any data dependencies. The implementation is allowed to use a parallel implementation, though it is not forced to do so.
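For example, if every invocation only touches its own element, there is no shared state and the promise holds trivially. A minimal sketch (the vector and the squaring operation are just illustrative; requires the <algorithm>, <execution> and <vector> headers):
std::vector<int> v = { 1, 2, 3, 4 };
// Each invocation writes only to its own element, so there are no data
// dependencies between iterations and parallel execution is safe.
std::for_each(std::execution::par, std::begin(v), std::end(v), [](int& i) {
    i = i * i;
});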
What is the difference between par and par_unseq?
par_unseq requires stronger guarantees than par, but allows additional optimizations. Specifically, par_unseq requires that the implementation be allowed to interleave the execution of multiple function calls within the same thread.
Let us illustrate the difference with an example. Suppose you want to parallelize this loop:
std::vector<int> v = { 1, 2, 3 };
int sum = 0;
std::for_each(std::execution::seq, std::begin(v), std::end(v), [&](int i) {
sum += i*i;
});
You cannot directly parallelize the code above: every iteration writes to the shared sum variable, so executing the iterations in parallel would introduce a data race. To avoid that, you can introduce a lock:
int sum = 0;
std::mutex m;
std::for_each(std::execution::par, std::begin(v), std::end(v), [&](int i) {
std::lock_guard<std::mutex> lock{m};
sum += i*i;
});
Now all function calls can safely be executed in parallel, and the code will not break when you switch to par. But what happens if you use par_unseq instead, where one thread may execute multiple function calls interleaved rather than in sequence?
It can result in a deadlock, for instance, if the code is reordered like this:
m.lock(); // iteration 1 (constructor of std::lock_guard)
m.lock(); // iteration 2
sum += ...; // iteration 1
sum += ...; // iteration 2
m.unlock(); // iteration 1 (destructor of std::lock_guard)
m.unlock(); // iteration 2
In the standard, the term is vectorization-unsafe. To quote from P0024R2:
A standard library function is vectorization-unsafe if it is specified to synchronize with another function invocation, or another function invocation is specified to synchronize with it, and if it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions may not be invoked by user code called from parallel_vector_execution_policy algorithms.
One way to make the code above vectorization-safe is to replace the mutex with an atomic:
std::atomic<int> sum{0};
std::for_each(std::execution::par_unseq, std::begin(v), std::end(v), [&](int i) {
sum.fetch_add(i*i, std::memory_order_relaxed);
});
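A different way to avoid the shared state altogether is to express the computation as a reduction, for example with std::transform_reduce (a sketch assuming the same vector v; requires the <numeric> and <functional> headers):
// No shared mutable state: each element is squared independently and the
// partial results are combined by the algorithm itself, so par_unseq is safe.
int sum = std::transform_reduce(std::execution::par_unseq,
                                std::begin(v), std::end(v),
                                0,                              // initial value
                                std::plus<>{},                  // reduction
                                [](int i) { return i * i; });   // transform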
What are the advantages of using par_unseq over par?
The additional optimizations that an implementation may use in par_unseq mode include vectorized execution and migration of work across threads (the latter is relevant if task parallelism with a parent-stealing scheduler is used).
If vectorization is allowed, implementations can internally use SIMD (Single Instruction, Multiple Data) parallelism. For example, OpenMP supports it via #pragma omp simd annotations, which can help compilers generate better code.
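As a rough illustration of what such an annotation looks like (this is plain OpenMP, not part of the standard execution policies, and assumes a compiler with OpenMP SIMD support, e.g. GCC or Clang with -fopenmp-simd):
// Asks the compiler to vectorize the loop with SIMD instructions; the loop
// body has no loop-carried dependencies, so iterations may be interleaved.
#pragma omp simd
for (std::size_t i = 0; i < v.size(); ++i) {
    v[i] = v[i] * v[i];
}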
When should I prefer std::execution::seq?
Prefer sequential execution when your main concerns are:
- correctness (avoiding data races)
- avoiding parallel overhead (startup costs and synchronization)
- simplicity (debugging)
It is not uncommon that data dependencies force sequential execution. In other words, use sequential execution if parallel execution would introduce data races.
Rewriting and tuning the code for parallel execution is not always trivial. Unless it is a critical part of your application, you can start with a sequential version and optimize later. You also might want to avoid parallel execution if you are executing the code in a shared environment where you need to be conservative in resource usage.
Parallelism also does not come for free. If the expected total execution time of the loop is very low, sequential execution will most likely be the best even from a pure performance perspective. The bigger the data and the more costly each computation step is, the less important the synchronization overhead will be.
For instance, using parallelism in the example above would not make sense, as the vector only contains three elements and the operations are very cheap. Also note that the original version (before the introduction of the mutex or the atomic) contained no synchronization overhead. A common mistake in measuring the speedup of a parallel algorithm is to use a parallel version running on one CPU as the baseline. Instead, you should always compare against an optimized sequential implementation without the synchronization overhead.
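A minimal sketch of such a comparison with std::chrono (data and work are placeholders for your real container and per-element computation; requires the <chrono> header):
auto time_it = [](auto&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
};

// Baseline: sequential execution without any added synchronization.
double t_seq = time_it([&] {
    std::for_each(std::execution::seq, std::begin(data), std::end(data), work);
});

// Candidate: parallel execution, compared against the sequential baseline.
double t_par = time_it([&] {
    std::for_each(std::execution::par, std::begin(data), std::end(data), work);
});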
When should I prefer std::execution::par_unseq?
First, make sure that it does not sacrifice correctness:
- If there are data races when the steps are executed in parallel by different threads, par_unseq is not an option.
- If the code is vectorization-unsafe, for instance because it acquires a lock, par_unseq is not an option (but par might be).
Otherwise, use par_unseq if it is a performance-critical part and par_unseq improves the performance over seq.
When should I prefer std::execution::par?
If the steps can be executed safely in parallel, but you cannot use par_unseq because the code is vectorization-unsafe, it is a candidate for par.
As with par_unseq, verify that it is a performance-critical part and that par is a performance improvement over seq.