I might be wrong about some things in this answer (proof-reading welcome from people who know this stuff!). It's based on reading the docs and Jeff Preshing's blog, not actual recent experience or testing.
Linus Torvalds strongly recommends against trying to invent your own locking, because it's so easy to get it wrong. It's more of an issue when writing portable code for the Linux kernel, rather than something that's x86-only, so I feel brave enough to try to sort things out for x86.
The normal way to use NT stores is to do a bunch of them in a row, e.g. as part of a memset or memcpy, then an `SFENCE`, then a normal release-store to a shared flag variable: `done_flag.store(1, std::memory_order_release)`.
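A minimal sketch of that pattern (hedged: `buf`, `done_flag`, and the loop bounds are hypothetical, not from the question):

```cpp
#include <atomic>
#include <immintrin.h>

std::atomic<int> done_flag{0};       // hypothetical flag that a consumer polls
alignas(16) int buf[1024];           // hypothetical shared buffer

void fill_and_publish() {
    const __m128i zero = _mm_setzero_si128();
    for (int i = 0; i < 1024; i += 4)
        _mm_stream_si128((__m128i*)&buf[i], zero);   // NT stores into the buffer
    _mm_sfence();                                    // NT stores become globally visible before...
    done_flag.store(1, std::memory_order_release);   // ...the normal release-store that publishes
}
```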
Using a `movnti` store to the synchronization variable will hurt performance. You might want to use NT stores into the `Foo` it points to, but evicting the pointer itself from cache is perverse. (`movnt` stores evict the cache line if it was in cache to start with; see vol 1 ch 10.4.6.2, Caching of Temporal vs. Non-Temporal Data.)
The whole point of NT stores is for use with Non-Temporal data, which won't be used again (by any thread) for a long time if ever. The locks that control access to shared buffers, or the flags that producers/consumers use to mark data as read, are expected to be read by other cores.
Your function names also don't really reflect what you're doing.
x86 hardware is extremely heavily optimized for doing normal (not NT) release-stores, because every normal store is a release-store. The hardware has to be good at it for x86 to run fast.
Using normal stores/loads only requires a trip to L3 cache, not to DRAM, for communication between threads on Intel CPUs. Intel's large inclusive L3 cache works as a backstop for cache-coherency traffic. Probing the L3 tags on a miss from one core will detect the fact that another core has the cache line in the Modified or Exclusive state. NT stores would require synchronization variables to go all the way out to DRAM and back for another core to see it.
## Memory ordering for NT streaming stores
`movnt` stores can be reordered with other stores, but not with older reads.
Intel's x86 manual vol 3, chapter 8.2.2 (*Memory Ordering in P6 and More Recent Processor Families*):

> - Reads are not reordered with other reads.
> - Writes are not reordered with older reads. (Note the lack of exceptions.)
> - Writes to memory are not reordered with other writes, with the following exceptions:
>   - ... stuff about `clflushopt` and the fence instructions
**update:** There's also a note (in 8.1.2.2 *Software Controlled Bus Locking*) that says:

> Do not implement semaphores using the WC memory type. Do not perform non-temporal stores to a cache line containing a location used to implement a semaphore.
This may just be a performance suggestion; they don't explain whether it can cause a correctness problem. Note that NT stores are not cache-coherent, though (data can sit in the line fill buffer even if conflicting data for the same line is present somewhere else in the system, or in memory). Maybe you could safely use NT stores as a release-store that synchronizes with regular loads, but would run into problems with atomic RMW ops like `lock add dword [mem], 1`.
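For reference, that's the kind of RMW you'd get from a plain `fetch_add` whose result is unused (the variable name here is made up):

```cpp
#include <atomic>

std::atomic<int> sem{0};   // hypothetical semaphore-style counter

void post() {
    sem.fetch_add(1);      // compiles to  lock add dword [sem], 1  on x86
}
```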
As Jeff Preshing puts it:

> Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.
To block reordering with earlier stores, we need an `SFENCE` instruction, which is a StoreStore barrier even for NT stores. (It's also a barrier to some kinds of compile-time reordering, but I'm not sure whether it blocks earlier loads from crossing the barrier.) Normal stores don't need any kind of barrier instruction to be release-stores, so you only need `SFENCE` when using NT stores.
For loads: the x86 memory model for WB (write-back, i.e. "normal") memory already prevents LoadStore reordering even for weakly-ordered stores, so we don't need an `LFENCE` for its LoadStore barrier effect, only a LoadStore compiler barrier before the NT store.
In gcc's implementation at least, `std::atomic_signal_fence(std::memory_order_release)` is a compiler barrier even for non-atomic loads/stores, but `atomic_thread_fence` is only a barrier for `atomic<>` loads/stores (including `mo_relaxed`). Using an `atomic_thread_fence` still allows the compiler more freedom to reorder loads/stores to non-shared variables. See this Q&A for more.
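A sketch of that difference, relying on gcc's behavior as described above (all names hypothetical):

```cpp
#include <atomic>

int plain_data;                 // non-atomic shared data
std::atomic<int> ready{0};

void publish_with_signal_fence() {
    plain_data = 42;
    // In gcc this is a compiler barrier even for the non-atomic store above;
    // atomic_thread_fence would only be guaranteed to order the atomic<> ops.
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}
```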
```cpp
#include <atomic>
#include <cstdint>
#include <immintrin.h>

struct Foo;
std::atomic<Foo*> gFoo;   // assuming this declaration, as in the question

// The function can't be called release_store unless it actually is one (i.e. includes all necessary barriers)
// Your original function should be called relaxed_store
void NT_release_store(const Foo* f) {
   // _mm_lfence();  // make sure all reads from the locked region are already globally visible.  Not needed: this is already guaranteed
   std::atomic_thread_fence(std::memory_order_release);  // no insns emitted on x86 (since it assumes no NT stores), but still a compiler barrier for earlier atomic<> ops
   _mm_sfence();  // make sure all writes to the locked region are already globally visible, and don't reorder with the NT store
   _mm_stream_si64((long long int*)&gFoo, (int64_t)f);
}
```
This stores to the atomic variable (note the lack of dereferencing of `&gFoo`). Your function stores to the `Foo` it points to, which is super weird; IDK what the point of that was. Also note that it compiles as valid C++11 code.
When thinking about what a release-store means, think about it as the store that releases the lock on a shared data structure. In your case, when the release-store becomes globally visible, any thread that sees it should be able to safely dereference it.
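For example, a producer might look like this (`make_foo` is a hypothetical function that allocates and fills a `Foo`, possibly itself using NT stores):

```cpp
Foo* make_foo();            // hypothetical: allocate and initialize a Foo

void produce() {
    Foo* f = make_foo();
    NT_release_store(f);    // once this is globally visible, any thread that
                            // sees gFoo != nullptr can safely dereference it
}
```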
To do an acquire-load, just tell the compiler you want one.
x86 doesn't need any barrier instructions, but specifying `mo_acquire` instead of `mo_relaxed` gives you the necessary compiler barrier. As a bonus, this function is portable: you'll get any and all necessary barriers on other architectures:
```cpp
Foo* acquire_load() {
    return gFoo.load(std::memory_order_acquire);
}
```
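And a hypothetical consumer to pair with it (`use_foo` is a placeholder):

```cpp
#include <immintrin.h>

void use_foo(Foo*);          // hypothetical: whatever the consumer does with it

void consume() {
    Foo* p;
    while ((p = acquire_load()) == nullptr)
        _mm_pause();         // spin-wait hint to the CPU
    use_foo(p);              // the acquire-load makes dereferencing p safe
}
```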
You didn't say anything about storing `gFoo` in weakly-ordered WC (uncacheable write-combining) memory. It's probably really hard to arrange for your program's data segment to be mapped into WC memory... It would be a lot easier for `gFoo` to simply point to WC memory, after you mmap some WC video RAM or something. But if you want acquire-loads from WC memory, you probably do need `LFENCE`. IDK. Ask another question about that, because this answer mostly assumes you're using WB memory.
Note that using a pointer instead of a flag creates a data dependency. I think you should be able to use `gFoo.load(std::memory_order_consume)`, which doesn't require barriers even on weakly-ordered CPUs (other than Alpha). Once compilers are sufficiently advanced to make sure they don't break the data dependency, they can actually make better code (instead of promoting `mo_consume` to `mo_acquire`). Read up on this before using `mo_consume` in production code, and especially note that testing it properly is impossible, because future compilers are expected to give weaker guarantees in practice than current compilers do.
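With those caveats, a sketch of the consume version (current compilers will just promote it to `mo_acquire`, which is still correct):

```cpp
Foo* consume_load() {
    // The data dependency through the returned pointer is what mo_consume
    // relies on; today's compilers treat this as an acquire-load anyway.
    return gFoo.load(std::memory_order_consume);
}
```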
Initially I was thinking that we did need LFENCE to get a LoadStore barrier. ("Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions", which in turn prevents them from passing (becoming globally visible before) reads that come before the LFENCE.)
Note that LFENCE + SFENCE is still weaker than a full MFENCE, because it's not a StoreLoad barrier. SFENCE's own documentation says it's ordered wrt. LFENCE, but that table of the x86 memory model from Intel manual vol 3 doesn't mention it. If SFENCE can't execute until after an LFENCE, then `sfence`/`lfence` might actually be a slower equivalent to `mfence`, but `lfence`/`sfence`/`movnti` would give release semantics without a full barrier. (Note that the NT store could become globally visible after some following loads/stores, unlike a normal strongly-ordered x86 store.)
## Related: NT loads
In x86, every load has acquire semantics, except for loads from weakly-ordered WC memory.