What volatile does:
- Prevents the compiler from optimizing out any access. Every read/write will result in a read/write instruction.
- Prevents the compiler from reordering the access with other volatile accesses.
What volatile does not do:
- Make the access atomic.
- Prevent the compiler from reordering with non-volatile accesses.
- Make changes from one thread visible in another thread.
Some non-portable behaviors that shouldn't be relied on in cross-platform C++:
- VC++ has extended volatile to prevent any reordering with other instructions. Other compilers don't, because it negatively affects optimization.
- x86 makes aligned read/write of pointer-sized and smaller variables atomic, and immediately visible to other threads. Other architectures don't.
Most of the time, what people really want are fences (also called barriers) and atomic instructions, which are usable if you've got a C++11 compiler, or via compiler- and architecture-dependent functions otherwise.
Fences ensure that, at the point of use, all the previous reads/writes will be completed. In C++11, fences are controlled at various points using the std::memory_order enumeration. In VC++ you can use _ReadBarrier(), _WriteBarrier(), and _ReadWriteBarrier() to do this. I'm not sure about other compilers.
On some architectures like x86, a fence is merely a way to prevent the compiler from reordering instructions. On others they might actually emit an instruction to prevent the CPU itself from reordering things.
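As a rough sketch of what a standalone C++11 fence looks like, the snippet below pairs std::atomic_thread_fence with relaxed operations on a flag. The names payload, ready, producer, consumer and run_fence_demo are invented for this illustration; only the std:: calls come from the standard library.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag, accessed with relaxed operations only

void producer()
{
    payload = 42;
    // Release fence: the write to payload cannot move below this point.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer()
{
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();  // spin until the flag is set
    }
    // Acquire fence: the read of payload cannot move above this point.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Hypothetical driver: runs one producer and one consumer and returns
// the value the consumer observed.
int run_fence_demo()
{
    payload = 0;
    ready.store(false);

    int seen = 0;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();

    assert(seen == 42);
    return seen;
}
```

The two fences pair up through the relaxed flag: once the consumer has read true and passed its acquire fence, the write to payload made before the release fence is guaranteed to be visible.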
Here's an example of improper use:
int res1, res2;
volatile bool finished;

void work_thread(int a, int b)
{
    res1 = a + b;
    res2 = a - b;
    finished = true;
}

void spinning_thread()
{
    while(!finished); // spin wait for res to be set.
}
Here, the write to finished is allowed to be reordered to before either res variable is set! Well, volatile prevents reordering with other volatile accesses, right? Let's try making each res volatile too:
volatile int res1, res2;
volatile bool finished;

void work_thread(int a, int b)
{
    res1 = a + b;
    res2 = a - b;
    finished = true;
}

void spinning_thread()
{
    while(!finished); // spin wait for res to be set.
}
This trivial example will actually work on x86, but it is going to be inefficient. For one, this forces res1 to be set before res2, even though we don't really care about that... we just want both of them set before finished is. Forcing this ordering between res1 and res2 will only prevent valid optimizations, eating away at performance.
For more complex problems, you'll have to make every write volatile. This would bloat your code, be very error prone, and become slow, as it prevents a lot more reordering than you really wanted.
It's not realistic. So we use fences and atomics. They allow full optimization, and only guarantee that the memory access will complete at the point of the fence:
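#include <atomic>

int res1, res2;
std::atomic<bool> finished{false};

void work_thread(int a, int b)
{
    res1 = a + b;
    res2 = a - b;
    // release: both writes above are visible once the flag reads true
    finished.store(true, std::memory_order_release);
}

void spinning_thread()
{
    // acquire: pairs with the release store in work_thread
    while(!finished.load(std::memory_order_acquire));
}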
This will work for all architectures. The res1 and res2 operations can be reordered with each other as the compiler sees fit. Performing an atomic release ensures that all preceding writes, including non-atomic ones, complete and become visible to any thread that performs an atomic acquire on the same variable and observes the stored value.
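To try the release/acquire version end to end, a small harness can start both threads and read the results once the spin loop exits. This is only a sketch: run_demo is a hypothetical driver, and spinning_thread has been changed here to return the two results so they can be checked.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

int res1, res2;
std::atomic<bool> finished{false};

void work_thread(int a, int b)
{
    res1 = a + b;
    res2 = a - b;
    finished.store(true, std::memory_order_release);
}

// Variant of spinning_thread that hands back what it observed.
std::pair<int, int> spinning_thread()
{
    while (!finished.load(std::memory_order_acquire)) {
        std::this_thread::yield();  // spin until the worker publishes
    }
    // The acquire load saw true, so both plain writes are visible here.
    return {res1, res2};
}

// Hypothetical driver: runs one worker and one spinner, returns the results
// as seen by the spinning thread.
std::pair<int, int> run_demo(int a, int b)
{
    finished.store(false);

    std::pair<int, int> seen{0, 0};
    std::thread spinner([&seen] { seen = spinning_thread(); });
    std::thread worker(work_thread, a, b);
    worker.join();
    spinner.join();

    assert(seen.first == a + b && seen.second == a - b);
    return seen;
}
```

For example, run_demo(5, 3) returns the pair (8, 2) as observed by the spinning thread, without res1 or res2 needing to be volatile or atomic.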