Author Topic: sequential consistency for std::atomic on x86 (Read 118368 times)

pebler · « **on:** December 05, 2009, 08:11:52 AM »

Hello,

I was just reading slides from a talk that are available here:

http://www.nwcpp.org/Downloads/2008/Memory_Fences.pdf

On page 16, this person asserts that on x86, the memory model enables the relaxed std::atomic conditions shown on that slide to be implemented as nothing, so there would be no performance penalty for that code (assuming the atomic types had atomic store instructions), only that the compiler be required not to reverse the orders.

My questions:
1. Is this person correct?
2. Would your library for Visual Studio indeed implement as no-ops?
3. How does your library force the compiler not to reorder the stores?

Thanks

Anthony Williams · « **Reply #1 on:** December 07, 2009, 12:52:49 PM »

Quote from: pebler on December 05, 2009, 08:11:52 AM

I was just reading slides from a talk that are available here:

http://www.nwcpp.org/Downloads/2008/Memory_Fences.pdf

On page 16, this person asserts that on x86, the memory model enables the relaxed std::atomic conditions shown on that slide to be implemented as nothing, so there would be no performance penalty for that code (assuming the atomic types had atomic store instructions), only that the compiler be required not to reverse the orders.

My questions:
1. Is this person correct?

Yes, plain loads have acquire semantics and plain stores have release semantics on x86. Note that actually, the ordering on the data variable are superfluous and could be replaced with std::memory_order_relaxed: the operations on ready are sufficient to guarantee the ordering.

Quote

2. Would your library for Visual Studio indeed implement as no-ops?

some_atomic.load(std::memory_order_acquire) does just drop through to a simple load instruction, and some_atomic.store(std::memory_order_release) drops through to a simple store instruction. However, this instruction is wrapped in compiler directives to ensure that the compiler doesn't reorder it in a way that could break the desired semantics.

Quote

3. How does your library force the compiler not to reorder the stores?

By using appropriate compiler directives in the code. For Visual Studio 2008, _ReadWriteBarrier is used.

Anthony

TA · « **Reply #2 on:** January 27, 2011, 08:17:53 AM »

Quote from: Anthony Williams on December 07, 2009, 12:52:49 PM

Yes, plain loads have acquire semantics and plain stores have release semantics on x86. Note that actually, the ordering on the data variable are superfluous and could be replaced with std::memory_order_relaxed: the operations on ready are sufficient to guarantee the ordering.

Hi Anthony.

Does it mean that fences are completely unneccessary on x86? In the article mentioned above I can see this

Quote

There must be algorithms that require fences even on the x86.

and there are mfence, sfence and lfence instructions available on x86. How does your library work if it doesn't use them and where they are used? Could you please explain this in more details?

Thanks.

Anthony Williams · « **Reply #3 on:** January 27, 2011, 08:55:57 AM »

Quote from: TA on January 27, 2011, 08:17:53 AM

Quote from: Anthony Williams on December 07, 2009, 12:52:49 PM
Yes, plain loads have acquire semantics and plain stores have release semantics on x86. Note that actually, the ordering on the data variable are superfluous and could be replaced with std::memory_order_relaxed: the operations on ready are sufficient to guarantee the ordering.
Does it mean that fences are completely unneccessary on x86? In the article mentioned above I can see this
Quote
There must be algorithms that require fences even on the x86.
and there are mfence, sfence and lfence instructions available on x86. How does your library work if it doesn't use them and where they are used? Could you please explain this in more details?

Plain loads and stores on x86 give you acquire (for loads) and release (for stores) ordering. This is sufficient for many algorithms, but not for others. LFENCE and SFENCE are primarily of use with non-temporal stores such as MOVNTI, which don't obey the normal cache coherency rules (that's what "non-temporal" means, in effect --- they don't occur at a particular time).

MFENCE can be used as a full memory barrier, akin to atomic_thread_fence(memory_order_seq_cst), though such a fence can also be achieved with a LOCKed RMW instruction, such as XCHG.

Dekker's algorithm for mutual exclusion requires such ordering. See http://www.justsoftwaresolutions.co.uk/threading/implementing_dekkers_algorithm_with_fences.html for an example implementation, and a description of the required memory ordering constraints.

TA · « **Reply #4 on:** January 27, 2011, 11:03:37 AM »

Thanks for your answer Anthony.

There is still one thing I don't get. To do not ask wrong question based on possible wrong understanding let me at first state what I understand and please fix me if I'm wrong.

We have 4 type of fences in C++1x, they are

memory_order_release - which is no-op on x86
Prevents compiler and hardware from reordering stores, i.e. any store that is done before this should be completed before the fence.
Store to variable with memory_order_release flag means
atomic_thread_fence(memory_order_release);
store();

memory_order_acquire - which is no-op on x86
Prevents compiler and hardware from reordering loads, i.e. any load that is done after this should happen after the fence.
Load from variable with memory_order_acquire flag means
load();
atomic_thread_fence(memory_order_acquire);

memory_order_acq_rel - which is again no-op on x86
Prevents compiler and hardware from reordering loads with load part of operation and stores with store part of operation.
Operation with memory_order_acq_rel flag means
atomic_thread_fence(memory_order_release);
operation();
atomic_thread_fence(memory_order_acquire);

memory_order_seq_cst - which is mfence on x86
Full memory fence, prevents compiler from reordering both loads and stores.

Now, on page 18 of above mentioned article Bartosz says that Peterson Lock is an example of algorithm, which will not work without fences. However, Peterson's lock uses only acquire, release and acq-rel.
http://www.justsoftwaresolutions.co.uk/threading/petersons_lock_with_C++0x_atomics.html
If those operations are no-op on x86 then Peterson Lock has to work on x86 without any fence. Am I wrong somewhere?
For Dekker's algorithm it's clear, it uses sequentially consistent memory fence, so it requires fencing on x86 too, but what about Peterson lock?

Quote

though such a fence can also be achieved with a LOCKed RMW instruction, such as XCHG.

On Visual Studio MemoryBarrier() function is implemented as a not locked RMW instruction.
http://msdn.microsoft.com/en-us/library/ms684208(v=vs.85).aspx

Quote

LONG Barrier;
__asm {
xchg Barrier, eax
}

Is it a typeo in MSDN?

Quote

Plain loads and stores on x86 give you acquire (for loads) and release (for stores) ordering. This is sufficient for many algorithms, but not for others. LFENCE and SFENCE are primarily of use with non-temporal stores such as MOVNTI, which don't obey the normal cache coherency rules (that's what "non-temporal" means, in effect --- they don't occur at a particular time).

So it means that there is no analogs for sfence and lfence in C++1x, is it right?

Thanks a lot.

Anthony Williams · « **Reply #5 on:** January 27, 2011, 12:14:24 PM »

Quote from: TA on January 27, 2011, 11:03:37 AM

memory_order_acq_rel - which is again no-op on x86
Prevents compiler and hardware from reordering loads with load part of operation and stores with store part of operation.
Operation with memory_order_acq_rel flag means
atomic_thread_fence(memory_order_release);
operation();
atomic_thread_fence(memory_order_acquire);

A atomic_thread_fence(memory_order_acq_rel) on x86 is just a signal to the compiler not to reorder instructions across it, since any following loads already have an acquire fence, and preceding stores have a release fence.

memory_order_acq_rel is also a no-op when applied to atomic RMW operations on x86. Some RMW instructions (such as XCHG) automatically have full memory_order_seq_cst semantics. Others (such as CMPXCHG) need the LOCK prefix to make them atomic --- without the LOCK prefix they are just a load followed by a store rather than atomic RMW op --- but then they too have full memory_order_seq_cst semantics.

Quote from: TA on January 27, 2011, 11:03:37 AM

Now, on page 18 of above mentioned article Bartosz says that Peterson Lock is an example of algorithm, which will not work without fences. However, Peterson's lock uses only acquire, release and acq-rel.
http://www.justsoftwaresolutions.co.uk/threading/petersons_lock_with_C++0x_atomics.html
If those operations are no-op on x86 then Peterson Lock has to work on x86 without any fence. Am I wrong somewhere?

Peterson's lock can be implemented either with an explicit fence, or by using an RMW operation with memory_order_acq_rel instead of a plain store to turn --- in Dmitriy's implementation on that page, the result of the call to turn.exchange() is not used; the exchange is done purely for the ordering properties of the RMW op. If a plain store is used, or the RMW operation is done on the wrong variable, then the guarantees required by the algorithm are not provided (as explained in the analysis on that page). You could use an explicit fence instead (which probably needs to be a memory_order_seq_cst fence, but I haven't worked it all through.)

Quote from: TA on January 27, 2011, 11:03:37 AM

For Dekker's algorithm it's clear, it uses sequentially consistent memory fence, so it requires fencing on x86 too, but what about Peterson lock?

Quote
though such a fence can also be achieved with a LOCKed RMW instruction, such as XCHG.
On Visual Studio MemoryBarrier() function is implemented as a not locked RMW instruction.
http://msdn.microsoft.com/en-us/library/ms684208(v=vs.85).aspx
Quote
LONG Barrier;
__asm {
xchg Barrier, eax
}
Is it a typeo in MSDN?

No, it's not a typo: XCHG and LOCK XCHG are equivalent.

Quote from: TA on January 27, 2011, 11:03:37 AM

Quote
Plain loads and stores on x86 give you acquire (for loads) and release (for stores) ordering. This is sufficient for many algorithms, but not for others. LFENCE and SFENCE are primarily of use with non-temporal stores such as MOVNTI, which don't obey the normal cache coherency rules (that's what "non-temporal" means, in effect --- they don't occur at a particular time).
So it means that there is no analogs for sfence and lfence in C++1x, is it right?

That's right. A call to atomic_thread_fence(memory_order_seq_cst) will provide the required ordering, because it's a full fence, but you'd have to code your non-temporal stores in assembly anyway. The same applies to WC (Write Combining) memory pages --- loads from and stores to such pages are always relaxed, and you need an explicit fence to force an ordering, but you'd need assembly language or a special OS call to create such memory pages.

TA · « **Reply #6 on:** January 27, 2011, 12:27:06 PM »

Thanks a lot!

News:

Author Topic: sequential consistency for std::atomic on x86 (Read 118368 times)

pebler

sequential consistency for std::atomic on x86

Anthony Williams

Re: sequential consistency for std::atomic on x86

TA

Re: sequential consistency for std::atomic on x86

Anthony Williams

Re: sequential consistency for std::atomic on x86

TA

Re: sequential consistency for std::atomic on x86

Anthony Williams

Re: sequential consistency for std::atomic on x86

TA

Re: sequential consistency for std::atomic on x86