memory_order_acq_rel - which is again no-op on x86
Prevents compiler and hardware from reordering loads with load part of operation and stores with store part of operation.
Operation with memory_order_acq_rel flag means
atomic_thread_fence(memory_order_release);
operation();
atomic_thread_fence(memory_order_acquire);
A
atomic_thread_fence(memory_order_acq_rel) on x86 is just a signal to the compiler not to reorder instructions across it, since any following loads already have an acquire fence, and preceding stores have a release fence.
memory_order_acq_rel is also a no-op when applied to atomic RMW operations on x86. Some RMW instructions (such as XCHG) automatically have full
memory_order_seq_cst semantics. Others (such as
CMPXCHG) need the
LOCK prefix to make them atomic --- without the
LOCK prefix they are just a load followed by a store rather than atomic RMW op --- but then they too have full
memory_order_seq_cst semantics.
Now, on page 18 of above mentioned article Bartosz says that Peterson Lock is an example of algorithm, which will not work without fences. However, Peterson's lock uses only acquire, release and acq-rel.
http://www.justsoftwaresolutions.co.uk/threading/petersons_lock_with_C++0x_atomics.html
If those operations are no-op on x86 then Peterson Lock has to work on x86 without any fence. Am I wrong somewhere?
Peterson's lock can be implemented either with an explicit fence, or by using an RMW operation with
memory_order_acq_rel instead of a plain store to
turn --- in Dmitriy's implementation on that page, the result of the call to
turn.exchange() is not used; the exchange is done purely for the ordering properties of the RMW op. If a plain store is used, or the RMW operation is done on the wrong variable, then the guarantees required by the algorithm are not provided (as explained in the analysis on that page). You could use an explicit fence instead (which probably needs to be a
memory_order_seq_cst fence, but I haven't worked it all through.)
For Dekker's algorithm it's clear, it uses sequentially consistent memory fence, so it requires fencing on x86 too, but what about Peterson lock?
though such a fence can also be achieved with a LOCKed RMW instruction, such as XCHG.
On Visual Studio MemoryBarrier() function is implemented as a not locked RMW instruction.
http://msdn.microsoft.com/en-us/library/ms684208(v=vs.85).aspx
LONG Barrier;
__asm {
xchg Barrier, eax
}
Is it a typeo in MSDN?
No, it's not a typo:
XCHG and
LOCK XCHG are equivalent.
Plain loads and stores on x86 give you acquire (for loads) and release (for stores) ordering. This is sufficient for many algorithms, but not for others. LFENCE and SFENCE are primarily of use with non-temporal stores such as MOVNTI, which don't obey the normal cache coherency rules (that's what "non-temporal" means, in effect --- they don't occur at a particular time).
So it means that there is no analogs for sfence and lfence in C++1x, is it right?
That's right. A call to
atomic_thread_fence(memory_order_seq_cst) will provide the required ordering, because it's a full fence, but you'd have to code your non-temporal stores in assembly anyway. The same applies to WC (Write Combining) memory pages --- loads from and stores to such pages are always relaxed, and you need an explicit fence to force an ordering, but you'd need assembly language or a special OS call to create such memory pages.