Who ordered memory fences on an x86? (2008) (bartoszmilewski.com)
86 points by luu on Nov 7, 2014 | 23 comments


By the way, in case anyone was curious about the comment in the article:

> Since fences are so expensive, couldn’t you add a dummy assembly instruction that prevents the X86 from re-ordering? So the pseudo assembly for thread 0 might become:

    > store(zeroWants, 1)
    > store(victim, 0)
    > r0 = load(victim) ; NEW DUMMY INSTRUCTION
    > r0 = load(oneWants)
    > r1 = load(victim)
> Since the dummy instruction is a load, the X86 can’t reorder the other loads before it. Also, since it loads from “victim”, the X86 can’t move it before the store to “victim”.

> If you do this to both threads, does that solve the problem?

This doesn't work. Intel specifically calls out these sorts of attempts to get a fake memory barrier: "The memory-ordering model allows concurrent stores by two processors to be seen in different orders by those two processors; specifically, each processor may perceive its own store occurring before that of the other." This is true in practice as well as in theory.

A related trick that does work in practice (though it is also banned by Intel) is to write to the low-order bytes of a word, read the entire word, and get the high-order bytes. It's sort of a single-word StoreLoad barrier. There's a paper from Sun that documents it further: http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLoc... .
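
A minimal sketch of that trick, assuming x86 (little-endian) and that uint8_t is a character type for aliasing purposes. Because a 1-byte store can't be forwarded to a 4-byte load from the store buffer, the load stalls until the store becomes globally visible, which is what makes it act like a single-word StoreLoad barrier. As noted above, Intel's manuals don't guarantee this behavior, so treat it as illustrative only:

    #include <cstdint>

    // A flag packed into the low-order byte of a 32-bit word.
    volatile uint32_t flag_word = 0;

    inline uint32_t store_flag_then_load_word(uint8_t v) {
        // Store only the low-order byte of the word...
        *reinterpret_cast<volatile uint8_t*>(&flag_word) = v;
        // ...then load the whole word. The size mismatch defeats
        // store-to-load forwarding, so this read reflects globally
        // ordered memory rather than the local store buffer.
        return flag_word;
    }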


Reminds me of this interesting paper (from two years later), which found that at least one of the stated x86 memory-ordering guarantees did not hold in practice: http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf


For an overview of papers for x86 memory ordering of the same group, see http://www.cl.cam.ac.uk/~pes20/weakmemory/index3.html


This article is pretty old, and I suspect if you asked Bartosz now he'd explain it slightly differently. You certainly need to use instructions that impose additional ordering guarantees on x86 in these kinds of situations, but you don't need an mfence, and in general it will be slower than an appropriate locked instruction. The appropriate uses of mfence are actually much more limited; it's only really needed with special memory types (e.g. write-combining) or when you need ordering on certain non-temporal loads and stores (mostly vector instructions), AFAIK.

In regular code you should never require the hammer of mfence for correct synchronization. You can implement C++11 atomics without it.
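
A minimal sketch of that point, assuming x86-64 codegen: a sequentially consistent C++11 store can be implemented with a locked xchg instead of mov + mfence, and a seq_cst load is just a plain mov (the exact instruction selection is up to the compiler):

    #include <atomic>

    std::atomic<int> flag{0};

    void publish() {
        // Typically lowered to `xchg` (an implicitly locked RMW and hence a
        // full barrier) by clang, or to `mov` + `mfence` by g++.
        flag.store(1, std::memory_order_seq_cst);
    }

    int consume() {
        // A plain `mov` suffices for a seq_cst load on x86, because every
        // seq_cst store already acted as a full barrier.
        return flag.load(std::memory_order_seq_cst);
    }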


The relative performance of a lock'd instruction and an mfence will vary across processors and workloads. On my machine, simple Peterson locking is ~10% faster when using an mfence as opposed to an xchgl. Maybe if the non-locked code benefited more from pipelining, the results would be different.

As an aside, g++ uses mfence for its C++11 atomics while clang uses xchgl; does anyone know if Intel/AMD have ever guaranteed that all the usual ordering guarantees apply when mixing these? (Of course, given their implementations it almost certainly does, but it would be nice to know this for a fact.)
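
For reference, here is a hedged C++11 rendering of the two-thread Peterson lock being benchmarked (not my exact code). The seq_cst stores in lock() are where the StoreLoad barrier comes from, so whether the compiler emits mov + mfence or xchg for them is exactly the difference being measured; the seq_cst loads compile to plain mov on x86:

    #include <atomic>

    struct PetersonLock {
        std::atomic<bool> wants[2] = {{false}, {false}};
        std::atomic<int>  victim{0};

        void lock(int me) {                 // me is 0 or 1
            const int other = 1 - me;
            wants[me].store(true, std::memory_order_seq_cst);
            victim.store(me, std::memory_order_seq_cst);
            // Wait until the other thread either doesn't want the lock or
            // has volunteered to be the victim.
            while (wants[other].load(std::memory_order_seq_cst) &&
                   victim.load(std::memory_order_seq_cst) == me) {
                // spin
            }
        }

        void unlock(int me) {
            wants[me].store(false, std::memory_order_release);
        }
    };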


>The appropriate uses of mfence are actually much more limited

This is true for regular race-free multithreaded programs. However, mfences are used fairly commonly in lock-free concurrent data structures (e.g. http://concurrencykit.org/).


If you see code compile to a full mfence in CK, please tell us. In my experience, there is no concurrency-related reason to use mfence on x86. In fact, you'll see that we have code that checks for x86oids and uses an atomic RMW instead of a non-atomic store/RMW + fence.


> The appropriate uses of mfence are actually much more limited; it's only really needed with special memory types (e.g. write-combining) or when you need ordering on certain non-temporal loads and stores (mostly vector instructions), AFAIK.

You may be thinking of sfence here, not mfence.

> In regular code you should never require the hammer of mfence for correct synchronization. You can implement C++11 atomics without it.

That's true, but I think that, in many circumstances, mfence is cheaper than atomic RMW instructions.

EDIT: Maybe I'm wrong. lock addl to the stack is faster than mfence on my laptop, but not that much faster. This is strange, because in most cases you can replace mfence with a locked access to an unrelated memory location. sfence is much faster than either.
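
A sketch of that "lock addl to the stack" idiom, for the curious; the inline asm is an illustration of the idea, not the exact code any particular compiler emits. A LOCK-prefixed RMW on a dummy stack location leaves the value unchanged but acts as a full barrier, like mfence:

    #include <atomic>

    inline void full_barrier() {
    #if defined(__x86_64__) || defined(__i386__)
        unsigned dummy = 0;
        // Add 0 under LOCK: only the fencing effect of the locked RMW matters.
        __asm__ __volatile__("lock addl $0, %0" : "+m"(dummy) : : "memory");
    #else
        std::atomic_thread_fence(std::memory_order_seq_cst);
    #endif
    }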


re sfence: standard x86 memory ordering forbids store/store reordering anyway.


sfence is only useful for WC memory and non-temporal stores. I'm not sure what lfence is for other than use in conjunction with rdtsc and possibly IO.


It doesn't look like the LOCK prefix applies to MOV (from a quick google)? So how does it address write buffering or OOE for stores in a TSO memory model?

[edit] "never require the hammer of mfence for correct synchronization", maybe you're confining this to correct synch. and not recovering sequential consistency (or some other semantic property).


It's been a little while since I looked closely at this, and it's easy to get this stuff wrong, but here's what I concluded when digging into it in the context of a codebase I was maintaining that made heavy use of lock-free techniques:

- You don't need mfence (or sfence / lfence) on x64 (which is actually what I care about rather than x86, though they're essentially the same in this respect) to correctly implement C++11 atomics with acquire / release semantics. You can get all the guarantees you need with locked instructions.

- C++11 atomics correctly implemented with locked instructions will be as fast as or faster than implementations that use explicit fence instructions.

- The obvious way to implement standalone fences on x64 is with fence instructions but you rarely if ever need standalone fences. Generally you are better off using atomic operations with explicit acquire / release semantics.

In the codebase I was maintaining, the pre-C++11 atomic library was based around explicit fences rather than C++11 style atomic operations with acquire / release semantics attached to the operations themselves. This code was primarily written for Gen 3 consoles (PS3 / Xbox 360) and so was optimized for PowerPC. On x64 (Gen 4 consoles!) there was measurable performance overhead due to unnecessary/redundant standalone fences.

In the end we decided it was too risky to try to rewrite everything in terms of atomic operations with acquire/release semantics and remove the standalone fences, but it seems to me that if you want to write efficient cross-architecture lock-free code, you want to avoid standalone fences and use a C++11-style atomics library where acquire/release semantics are tied to the atomic operations themselves.
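
A minimal sketch, assuming x86-64 codegen, of the contrast described above (names are made up). With acquire/release attached to the operations, the compiler emits plain mov instructions and no fences; a port that conservatively maps old PowerPC-style standalone barriers to full fences pays for an mfence (or locked RMW) on every publish, which is the kind of overhead we measured:

    #include <atomic>

    std::atomic<int>  payload{0};
    std::atomic<bool> ready{false};

    // Ordering attached to the operations (the style recommended above).
    void publish_acq_rel(int v) {
        payload.store(v, std::memory_order_relaxed);
        ready.store(true, std::memory_order_release);      // plain mov on x86
    }

    int consume_acq_rel() {
        while (!ready.load(std::memory_order_acquire)) {}  // plain mov on x86
        return payload.load(std::memory_order_relaxed);
    }

    // Standalone-fence style (the PowerPC-era library): still correct if the
    // reader uses a matching fence, but the conservative full fence costs an
    // mfence or locked RMW that the version above avoids.
    void publish_with_fence(int v) {
        payload.store(v, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // mfence / lock add
        ready.store(true, std::memory_order_relaxed);
    }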


To answer the first part of your question, a locked xchg is the usual way to implement a sequentially consistent store on x86 without an explicit fence. For sequentially consistent loads, a plain mov suffices, provided it synchronizes with a locked store.

http://stackoverflow.com/questions/4972106/sequentially-cons...


Remember that x86 (and SPARC) offer the strongest memory ordering guarantees among modern processors. The POWER and ARM memory models are weaker than x86. This actually leads to correctness issues when virtualizing a multi-core x86 guest on a weaker host (cross-ISA virtualization). Of course, this problem only shows up in truly parallel emulators using multiple threads on the host to emulate a multi-core guest, such as COREMU (http://sourceforge.net/projects/coremu/)


The fun thing about membars on x86 is that, unless you're playing with nontemporal stores or non-standard memory types, LOCKed ops are more efficient fences than mfence.


Missed this somehow before making my comment saying essentially the same thing. Unfortunately the code base I've been maintaining heavily overuses mfences at a measurable performance penalty on x64.


I'm pretty sure the strong ordering of x86 is all in support of backward compatibility (probably to the first multi-core x86). One related example of this is the cache coherent I/O system. If a PCIe card writes to memory, there is really not much that the driver code needs to worry about compared with other processors. Why is this? So the ancient device drivers in MS-DOS will still work with the ancient floppy disk DMA controller.


> Loads are not reordered with other loads.

I wonder why that guarantee is necessary. Loads have no side effects on memory, after all.


They are reordered, but only speculatively.

This requirement is needed for cases like:

T1: rd y; rd x;

T2: wr x 1; wr y 1;

If you allow the loads on T1 to reorder, it's possible for an execution to observe y = 1 but x = 0 on thread one, and that's a sequential consistency violation.
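
A C++ rendering of that litmus test, with relaxed atomics so the compiler and hardware are free to reorder the loads (a single run is only illustrative, of course):

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r_x, r_y;

    void t1() {                                    // T1: rd y; rd x
        r_y = y.load(std::memory_order_relaxed);
        r_x = x.load(std::memory_order_relaxed);
    }

    void t2() {                                    // T2: wr x 1; wr y 1
        x.store(1, std::memory_order_relaxed);
        y.store(1, std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        // Forbidden under SC (and under x86's ordering rules), but allowed
        // with relaxed ordering: r_y == 1 && r_x == 0.
        return (r_y == 1 && r_x == 0) ? 1 : 0;
    }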


Even a simple seqlock (or anything seqlock-like) needs a load-load fence. Without one, verifying the sequence counter doesn't offer any guarantee about the previously read data, i.e. the actual data loads might be reordered past the second load of seq.
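
A hedged sketch of that seqlock pattern with C++11 atomics, single writer assumed, roughly in the style of Boehm's seqlock paper. The acquire fence before the re-check is the load-load fence in question; without it the data loads could drift past the second read of seq and the retry test would prove nothing:

    #include <atomic>
    #include <cstdint>

    struct SeqLockedPair {
        std::atomic<uint32_t> seq{0};
        std::atomic<int> d0{0}, d1{0};

        void write(int a, int b) {                            // single writer
            uint32_t s = seq.load(std::memory_order_relaxed);
            seq.store(s + 1, std::memory_order_relaxed);      // odd: in progress
            std::atomic_thread_fence(std::memory_order_release);
            d0.store(a, std::memory_order_relaxed);
            d1.store(b, std::memory_order_relaxed);
            seq.store(s + 2, std::memory_order_release);      // even: published
        }

        void read(int &a, int &b) {
            uint32_t s0, s1;
            do {
                s0 = seq.load(std::memory_order_acquire);     // first read of seq
                a = d0.load(std::memory_order_relaxed);
                b = d1.load(std::memory_order_relaxed);
                // Keep the data loads before the re-check of seq.
                std::atomic_thread_fence(std::memory_order_acquire);
                s1 = seq.load(std::memory_order_relaxed);     // re-check
            } while (s0 != s1 || (s0 & 1));                   // retry if a write intervened
        }
    };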


To add to the example of 'DSingularity: remember that the memory ordering guarantees specify the behavior of individual processors in a multi-processor system. You have to extrapolate the system behavior from there. The guarantee on reordering of loads with respect to each other is only unnecessary if you assume that no other processor modifies those locations between loads.


Ah yes, I understand it now. Took me some time to get myself thinking on that abstraction level again :) Thanks!


You can apply Peterson's lock to more than 2 threads.



