> > On x86 a spinlock release doesn't need a memory barrier (unless you do insane things)...

anarazel • today at 4:57 PM

> > On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).

> Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) a release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.

Hah.

> ...

> But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.
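For concreteness, here is a rough C11 sketch of the lock word layout described in the quote above. The names (`LOCKED`, `HAS_WAITER`, `mark_waiter`, `unlock_needs_wake`) are made up for illustration, and the actual futex syscalls are left out. Note the big caveat: C11 atomics compile to LOCK CMPXCHG on x86, so this only shows the protocol. It says nothing about whether a CMPXCHG without the lock prefix would be safe, which can't be expressed in portable C anyway.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding from the comment: bit 0 = locked, bit 1 = waiter. */
enum { LOCKED = 1u, HAS_WAITER = 2u };

/* Waiter side: if the lock is held, make sure the waiter bit is set.
 * Returns the value a waiter would then pass to futex_wait() (3 here),
 * or 0 if the lock was observed free and there is nothing to wait for. */
static uint32_t mark_waiter(_Atomic uint32_t *word)
{
    uint32_t seen = atomic_load_explicit(word, memory_order_relaxed);
    while (seen & LOCKED) {
        if (seen & HAS_WAITER)
            return seen;
        /* On failure, 'seen' is refreshed with the current value. */
        if (atomic_compare_exchange_weak_explicit(
                word, &seen, seen | HAS_WAITER,
                memory_order_relaxed, memory_order_relaxed))
            return seen | HAS_WAITER;
    }
    return 0;
}

/* Owner side: try to go 1 -> 0. If that fails, the waiter bit was set,
 * so clear the word and tell the caller a futex wake is needed. */
static bool unlock_needs_wake(_Atomic uint32_t *word)
{
    uint32_t expected = LOCKED;
    if (atomic_compare_exchange_strong_explicit(
            word, &expected, 0,
            memory_order_release, memory_order_relaxed))
        return false;
    assert(expected == (LOCKED | HAS_WAITER));
    atomic_store_explicit(word, 0, memory_order_release);
    return true;
}
```

The failure case from the quote would correspond to `unlock_needs_wake` returning `false` even though a waiter had already moved the word to 3. With a locked compare-exchange, as here, that cannot happen.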

I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release just is a store, whereas a futex release has to be a RMW operation.

I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line. Because a somewhat contended simple spinlock is often embedded on the same line as the data it protects, it's common for the line to not be in modified ownership anymore at release.
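To make the store-vs-RMW distinction concrete, a minimal C11 sketch (function names and the bit encoding are invented for illustration, and the futex wake call itself is omitted). On x86 the release store compiles to an ordinary MOV, while the atomic exchange compiles to XCHG, which is implicitly locked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: bit 0 = locked, bit 1 = a waiter may be asleep. */
enum { LOCKED_BIT = 1u, WAITER_BIT = 2u };

/* Plain spinlock release: a single release store. Nothing is read back,
 * so the releasing thread does not learn anything about waiters. */
static void spin_unlock(_Atomic uint32_t *word)
{
    atomic_store_explicit(word, 0, memory_order_release);
}

/* Futex-style release: a read-modify-write, because the releasing thread
 * has to see the old value to know whether a wake is needed.
 * Returns true if the caller should issue a futex wake. */
static bool futex_style_unlock(_Atomic uint32_t *word)
{
    uint32_t old = atomic_exchange_explicit(word, 0, memory_order_release);
    assert(old & LOCKED_BIT); /* caller must actually hold the lock */
    return (old & WAITER_BIT) != 0;
}
```

The only difference between the two release paths is whether the old value is needed, and that is what forces the second one to be an RMW.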

Replies

amluto • today at 6:22 PM

> I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release just is a store, whereas a futex release has to be a RMW operation.

> I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line.

I don’t think so. The CPU is pretty good about hiding that kind of latency — reading a contended cache line and doing a correctly predicted branch shouldn’t stall anything after it.

But LOCK and MFENCE are quite expensive.
