Hacker News

anarazel · today at 1:22 PM · 3 replies

I really dislike the use of spinlocks in postgres (and have been replacing a lot of uses over time), but it's not always easy to replace them from a performance angle.

On x86 a spinlock release doesn't need a memory barrier / lock prefix (unless you do insane things), but a futex-based lock does (because otherwise you may not realize you need to do a futex wake). It turns out that this increase in memory barriers causes regressions that are nontrivial to avoid.
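To make the asymmetry concrete, here is a minimal sketch (not Postgres's actual code) of the two release paths. The futex wake is stubbed out as a counter so the sketch is self-contained; a real implementation would invoke the futex syscall.

```c
#include <stdatomic.h>
#include <assert.h>

static int wake_calls;                  /* stand-in for futex(FUTEX_WAKE) */
static void futex_wake_stub(void) { wake_calls++; }

/* Spinlock release: on x86 a release store compiles to a plain MOV;
 * no lock-prefixed instruction, no full barrier. */
static void spin_unlock(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}

/* Futex-style release: must atomically read-and-clear the word to learn
 * whether a waiter recorded itself (state 2 in this sketch).  On x86 the
 * exchange is an implicitly locked XCHG, i.e. a full barrier. */
static void futex_unlock(atomic_int *lock)
{
    if (atomic_exchange_explicit(lock, 0, memory_order_release) == 2)
        futex_wake_stub();              /* someone may be sleeping */
}
```

The point of the contrast: the spinlock's plain store cannot tell whether anyone is waiting, which is exactly why the futex variant has to pay for the locked read-modify-write.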

Another difficulty is that most of the remaining spinlocks are just a single bit in a larger 8-byte atomic. Futexes still don't support anything but 4 bytes (we could probably get away with using them on part of the 8-byte atomic, with some reordering), and unfortunately postgres still supports platforms with no 8-byte atomics (which I think is supremely silly); the support for a fallback implementation makes it harder to use futexes.
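The "part of the 8-byte atomic" idea amounts to handing the kernel the address of whichever aligned 4-byte half contains the lock bit, which is endianness-dependent. A hypothetical sketch (`BUF_SPINLOCK_BIT` is an assumed name, not a Postgres identifier):

```c
#include <stdint.h>
#include <stdatomic.h>
#include <assert.h>

/* futex(2) only operates on naturally aligned 32-bit words.  To wait on
 * a lock bit living inside an 8-byte atomic, pick the 4-byte half that
 * actually contains the bit; which half that is depends on byte order. */

#define BUF_SPINLOCK_BIT 0      /* assume the lock bit is bit 0 */

static uint32_t *futex_word_for(_Atomic uint64_t *state)
{
    uint32_t *halves = (uint32_t *) state;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return &halves[BUF_SPINLOCK_BIT < 32 ? 0 : 1];   /* low half first */
#else
    return &halves[BUF_SPINLOCK_BIT < 32 ? 1 : 0];   /* low half second */
#endif
}
```

Even with the addressing solved, the waiter still compares only 32 of the 64 bits inside the kernel, which is the "some reordering" caveat: the bits that change concurrently with the lock bit have to live in the same half, or spurious sleeps/wakes result.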

The spinlock triggering the contention in the report was just stupid and we only recently got around to removing it, because it isn't used during normal operation.

Edit: forgot to add that the spinlock contention is not measurable on much more extreme workloads when using huge pages. A 100GB buffer pool with 4KB pages doesn't make much sense.


Replies

anarazel · today at 2:29 PM

Addendum big enough to warrant a separate post: the fact that the contention is on a spinlock rather than a futex is unrelated to the "regression".

A quick hack shows the contended performance to be nearly indistinguishable with a futex-based lock. Which makes sense: non-PI futexes don't transfer the scheduler slice to the lock owner, because they don't know who the lock owner is. Postgres' spinlocks use randomized exponential backoff, so they don't prevent the lock owner from getting scheduled.
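For readers unfamiliar with the technique: randomized exponential backoff means each failed acquisition attempt waits a random amount up to a ceiling that doubles, so contending waiters spread out and leave CPU time for the holder. A hypothetical sketch (the real logic lives in Postgres's s_lock.c and differs in detail):

```c
#include <stdlib.h>
#include <assert.h>

/* Double the backoff ceiling after each failed attempt, capped so a
 * pathological wait can't grow unbounded.  Units are microseconds. */
static unsigned next_backoff_max(unsigned cur_max_us)
{
    unsigned max = cur_max_us ? cur_max_us * 2 : 1;
    return max > 1000 ? 1000 : max;     /* assumed 1 ms cap */
}

/* Pick a random delay in [1, max_us]; the randomization is what keeps
 * waiters from retrying in lockstep. */
static unsigned backoff_delay_us(unsigned max_us)
{
    return (unsigned) (rand() % max_us) + 1;
}
```

Because every waiter sleeps for most of its wait, the holder keeps getting scheduled, which is why the spinlock and the futex lock degrade similarly under this particular contention.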

Thus the contention is worse with PREEMPT_LAZY, even with non-PI futexes (which is what typical lock implementations are based on), because the lock holder gets scheduled out more often.

Probably worth repeating: This contention is due to an absurd configuration that should never be used in practice.

amluto · today at 4:08 PM

> On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).

Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) a release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient), along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else, except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.

I think that, in common cache coherence protocols, this is kind of straightforward -- the unlock is a store-release, and as long as the cache line ends up being written locally, the hardware or ucode or whatever simply [1] needs to check whether a needs-notification flag is set in the same cacheline. Or the futex-wait operation needs to do a super-heavyweight barrier to synchronize with the releasing thread even though the releasing thread does not otherwise have any barrier that would do the job.

One nasty approach that might work is to use something like membarrier, but I'm guessing that membarrier is so outrageously expensive that this would be a huge performance loss.

But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.
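The protocol in the paragraph above can be sketched in C11 atomics. One important caveat: C11 has no way to express a *non*-LOCK-prefixed CMPXCHG, so `atomic_compare_exchange_strong` (which compiles to LOCK CMPXCHG) stands in below for the plain CMPXCHG whose correctness is the open question; the state machine is otherwise the same.

```c
#include <stdatomic.h>
#include <assert.h>

/* Lock word protocol: bit 0 = locked, bit 1 = waiter recorded. */
enum { UNLOCKED = 0, LOCKED = 1, LOCKED_WAITERS = 3 };

/* Returns nonzero if the caller must futex_wake() a sleeper.  The
 * suspect failure mode with a plain CMPXCHG is the fast-path CAS
 * "succeeding" even though a waiter concurrently moved 1 -> 3,
 * silently losing the wakeup. */
static int unlock_needs_wake(atomic_int *word)
{
    int expected = LOCKED;
    /* Fast path: 1 -> 0, nobody is waiting. */
    if (atomic_compare_exchange_strong(word, &expected, UNLOCKED))
        return 0;
    /* Slow path: the waiter bit is set; clear the word and wake. */
    atomic_store(word, UNLOCKED);
    return 1;
}
```

With a truly locked CAS this is just an ordinary futex mutex unlock; the whole question is whether x86 would let the fast path drop the LOCK prefix without admitting the lost-wakeup race.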

I do expect that it would be nearly impossible to convince an x86 CPU vendor to commit to an answer either way.

(Do other architectures, e.g. the most recent ARM variants, have an RMW release operation that naturally does this? I've tried, and entirely failed AFAICT, to convince x86 HW designers to add lighter weight atomics.)

[0] Visible to the remote thread, but the kernel can easily mediate this, effectively for free.

[1] Famous last words. At least in ossified microarchitectures, nothing is simple.

jcalvinowens · today at 5:28 PM

That 64-bit atomic in the buffer head with flags, a spinlock, and refcounts all jammed into it is nasty. And there are like ten open-coded spin waits around the uses... you certainly have my empathy :)
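For readers who haven't seen the pattern, a hypothetical sketch of such a packed state word (assumed names and bit widths, not Postgres's actual buffer-header definition), including one of those open-coded spin waits:

```c
#include <stdint.h>
#include <stdatomic.h>
#include <assert.h>

/* One atomic word packing a lock bit, flag bits, and a refcount. */
#define ST_LOCKED         ((uint64_t) 1 << 0)
#define ST_FLAGS_MASK     (((uint64_t) 0xFF) << 1)
#define ST_REFCOUNT_SHIFT 9
#define ST_REFCOUNT_ONE   ((uint64_t) 1 << ST_REFCOUNT_SHIFT)

static uint64_t state_refcount(uint64_t st)
{
    return st >> ST_REFCOUNT_SHIFT;
}

/* Open-coded spin wait: loop until the embedded lock bit clears, then
 * bump the refcount with a CAS on the whole word. */
static void pin_buffer(_Atomic uint64_t *state)
{
    uint64_t old = atomic_load(state);
    for (;;) {
        if (old & ST_LOCKED) {            /* bit-lock held: spin */
            old = atomic_load(state);     /* real code would back off */
            continue;
        }
        if (atomic_compare_exchange_weak(state, &old,
                                         old + ST_REFCOUNT_ONE))
            return;
    }
}
```

Every independent consumer of the word ends up reimplementing a loop like this, which is why the 4-byte limit on futexes bites: the kernel can only watch half of the word these loops actually spin on.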

This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?

Somebody tried a long time ago; it got dropped, but I didn't actually see any major objection: https://lore.kernel.org/lkml/[email protected]...
