The fun thing about memory barriers on x86 is that, unless you're playing with non-temporal stores or non-standard memory types (write-combining and the like), a LOCKed read-modify-write op already acts as a full fence and is usually cheaper than mfence.
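A rough sketch of the two options, assuming GCC or Clang inline asm and ordinary write-back memory. The names `fence_mfence`, `fence_locked`, `flag_a`, `flag_b` and `try_enter_a` are made up for illustration, and the non-x86 fallback is only there so the snippet compiles elsewhere:

```c
#include <stdatomic.h>

/* Full barrier using MFENCE. On non-x86 targets this falls back to a
 * seq_cst fence purely so the sketch still builds. */
static inline void fence_mfence(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("mfence" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Full barrier using a LOCKed no-op read-modify-write on the top of the
 * stack. OR-ing 0 leaves the value unchanged; the LOCK prefix is what
 * provides the ordering. Assumes write-back memory and no pending
 * non-temporal stores. */
static inline void fence_locked(void) {
#if defined(__x86_64__)
    __asm__ __volatile__("lock; orl $0, (%%rsp)" ::: "memory", "cc");
#elif defined(__i386__)
    __asm__ __volatile__("lock; orl $0, (%%esp)" ::: "memory", "cc");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Hypothetical Dekker-style flags: the store to flag_a must be globally
 * visible before flag_b is read, which is the store-load ordering that
 * x86 does not give you for free. */
static atomic_int flag_a;
static atomic_int flag_b;

/* Returns 1 if side A may enter, i.e. side B has not raised its flag. */
static int try_enter_a(void) {
    atomic_store_explicit(&flag_a, 1, memory_order_relaxed);
    fence_locked(); /* could be fence_mfence(); same ordering here */
    return atomic_load_explicit(&flag_b, memory_order_relaxed) == 0;
}
```

Which variant is actually faster depends on the microarchitecture, so it's worth benchmarking on the hardware you care about before swapping one for the other.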
Somehow I missed this before posting my own comment saying essentially the same thing. Unfortunately, the code base I've been maintaining heavily overuses mfence, at a measurable performance penalty on x64.