You haven't included the definition, just the documentation. I think you're asking for help understanding why it even exists, rather than the definition.
It stops compilers from CSEing and hoisting work out of repeat-loops, so you can repeat the same work enough times to be measurable. e.g. put something short in a loop that runs 1 billion times, and then you can measure the time for the whole loop easily (a second or so). See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of doing this by hand in asm. If you want compiler-generated code like that, you need a function / macro like `DoNotOptimizeAway`.
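As a minimal sketch of what such a repeat-loop looks like (the `DoNotOptimizeAway` and `repeat_add` names here are illustrative, not from any particular library; GNU C/C++ only):

```cpp
#include <cstdint>

// Hypothetical escape: tell the compiler the value is "used", so it
// can't delete the work that produced it, or hoist it out of the loop.
// The empty asm template compiles to zero instructions.
template <typename T>
inline void DoNotOptimizeAway(T const& value) {
    asm volatile("" : : "r"(value));
}

// A measurable repeat-loop: without the escape, a compiler could
// compute x + y once (or not at all) and delete the loop entirely.
uint64_t repeat_add(uint64_t x, uint64_t y, long iters) {
    uint64_t result = 0;
    for (long i = 0; i < iters; ++i) {
        result = x + y;            // the short operation under test
        DoNotOptimizeAway(result); // force it to happen every iteration
    }
    return result;
}
```

Timing a call like `repeat_add(2, 3, 1000000000)` then gives you roughly a second of work to measure.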
Compiling the whole program with optimization disabled would be useless: storing/reloading everything between C++ statements gives very different bottlenecks (usually store-forwarding latency). See Adding a redundant assignment speeds up code when compiled without optimization.

See also Idiomatic way of performance evaluation? for general microbenchmarking pitfalls.
Perhaps looking at the actual definition can also help. This Q&A (Optimization barrier for microbenchmarks in MSVC: tell the optimizer you clobber memory?) describes how one implementation of a `DoNotOptimize` macro works (and asks how to port it from GNU C++ to MSVC).
The `escape` macro is from Chandler Carruth's CppCon2015 talk, "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!". That talk also goes into detail about exactly why it's needed when writing microbenchmarks: to stop whole loops from optimizing away when you compile with optimization enabled.
(Stopping the compiler from hoisting things out of loops, instead of computing them repeatedly, is harder to get right, if that's a problem for your benchmark. Making a function `__attribute__((noinline))` can help if it's big enough that it didn't need to inline. Check the compiler's asm output to see how much setup it hoisted.)
And BTW, a good definition for GNU C / C++ normally has zero extra cost: `asm volatile("" :: "r"(my_var));` compiles to zero asm instructions, but requires the compiler to have the value of `my_var` in a register of its choice. (And because of `asm volatile`, it has to "run" that many times in the C++ abstract machine.)
This will only impact optimization if the compiler could have transformed the calculation it was part of into something else. (e.g. using this on a loop counter would stop the compiler from using just pointer-increments and compare against an end-pointer to do the right number of iterations of `for(i=0;i<n;i++) sum+=a[i];`.)
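For example, a sketch of keeping a loop counter "live" so the compiler can't strength-reduce it into a pure pointer walk (function name is illustrative):

```cpp
#include <cstddef>

// The empty asm costs zero instructions, but i must actually exist
// in a register each iteration, so the compiler can't rewrite the
// loop to use only an incrementing pointer and an end-pointer compare.
long sum_keep_counter(const int* a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        asm volatile("" : : "r"(i)); // keep the counter materialized
        sum += a[i];
    }
    return sum;
}
```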
Using a read-modify-write operand like `asm volatile("" :"+r"(my_var));` would force the compiler to forget all range-restriction or constant-propagation info it knows about the value (e.g. that it's `42`, or that it's non-negative), and treat it like an incoming function arg. This could impact optimization more.
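A sketch of that stronger form (the `launder_value` name is hypothetical, chosen here for illustration):

```cpp
// Read-modify-write operand: the same register is both input and
// output, so after this the compiler must treat x as an unknown
// value, forgetting any constants or value ranges it had proven.
inline int launder_value(int x) {
    asm volatile("" : "+r"(x));
    return x;
}

// The compiler can no longer constant-fold this to 84 at compile
// time; it has to actually do the multiply at run time.
int twice_unknown() {
    int v = 42;           // known constant before the barrier
    v = launder_value(v); // now treated like an incoming function arg
    return v * 2;
}
```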
When they say the "overhead is cancelled out in comparisons", they're hopefully not talking about explicitly subtracting anything from a single timing result, or about benchmarking `DoNotOptimizeAway` on its own.
That wouldn't work. Performance analysis for modern CPUs does not work by adding up the costs of each instruction. Out-of-order pipelined execution means that an extra asm instruction can easily have zero extra cost if the front-end (total instruction throughput) wasn't the bottleneck, and if the execution unit it needs wasn't either.
If their portable definition is something like `volatile T sink = input;`, the extra asm store would only have a cost if your code bottlenecked on store throughput to cache.
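A minimal sketch of that volatile-sink approach, assuming a trivially copyable `T` (the function name is illustrative):

```cpp
// Portable (non-asm) sink: the compiler must emit a real store,
// because stores to a volatile object are observable behavior.
// Unlike the zero-instruction asm version, this always costs a
// store (and here a reload), which matters only if the code is
// bottlenecked on store throughput.
template <typename T>
T sink_and_read(T value) {
    volatile T sink = value; // store can't be optimized away
    return sink;             // volatile load reads it back
}
```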
So that claim about cancelling out sounds a bit optimistic: as explained above, the cost depends on the surrounding context and on which optimizations the compiler could otherwise have done. It's possible for a `DoNotOptimizeAway` to be literally free in one benchmark but have a real cost in another, so the overheads don't simply cancel between comparisons.
Related Q&As about the same functions: