> That 64-bit atomic in the buffer head with flags, a spinlock, and refcounts all jammed into it is nasty.
Turns out to be pretty crucial for performance though... Not manipulating them with a single atomic leads to way way worse performance.
For quite a while it was a 32-bit atomic, but I recently made it a 64-bit one, to allow the content lock (i.e. the lock protecting the buffer contents, rather than the buffer header) to live in the same atomic var. For one, that's nice for performance: it's e.g. very common to release a pin and a lock at the same time, and there are more fun perf things we can do in the future. But the real motivation was work on adding support for async writes: an exclusive locker might need to consume an IO completion for an in-flight write that is preventing it from acquiring the lock. And that was hard to do with a separate content lock and buffer state...
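A minimal sketch of why one shared word helps, assuming a made-up bit layout (this is not PostgreSQL's actual bit assignment, and `BufferSketch`, `sketch_pin_and_lock` and `sketch_unlock_and_unpin` are hypothetical names). With the pin count and the content lock in the same 64-bit atomic, "release a pin and a lock at the same time" becomes a single `atomic_fetch_sub`:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout for illustration only. */
#define REFCOUNT_ONE   UINT64_C(1)                  /* bits 0..17: pin count */
#define REFCOUNT_MASK  ((UINT64_C(1) << 18) - 1)
#define FLAG_DIRTY     (UINT64_C(1) << 18)          /* an example flag bit */
#define HDR_LOCKED     (UINT64_C(1) << 19)          /* header spinlock bit (unused here) */
#define CONTENT_EXCL   (UINT64_C(1) << 32)          /* content lock held exclusively */

typedef struct { _Atomic uint64_t state; } BufferSketch;

/* Take one pin plus the exclusive content lock with a single CAS.
 * Returns false (changing nothing) if the content lock is already held. */
static bool sketch_pin_and_lock(BufferSketch *buf)
{
    uint64_t old = atomic_load(&buf->state);

    for (;;) {
        if (old & CONTENT_EXCL)
            return false;
        uint64_t desired = (old + REFCOUNT_ONE) | CONTENT_EXCL;
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&buf->state, &old, desired))
            return true;
    }
}

/* Drop the exclusive content lock and one pin in ONE atomic operation.
 * Returns the state as it was immediately after the release. */
static uint64_t sketch_unlock_and_unpin(BufferSketch *buf)
{
    const uint64_t delta = CONTENT_EXCL + REFCOUNT_ONE;
    uint64_t old = atomic_fetch_sub(&buf->state, delta);

    /* Caller must actually hold both the lock and a pin. */
    assert((old & CONTENT_EXCL) != 0 && (old & REFCOUNT_MASK) != 0);
    return old - delta;
}
```

With two separate variables the same release would be two atomic operations, and anyone looking at the buffer in between would see a half-released state.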
> And there are like ten open coded spin waits around the uses... you certainly have my empathy :)
Well, nearly all of those are there to avoid needing to hold a spinlock; spinlocks, as lamented a lot around this issue, don't perform that well when really contended :)
We're on our way to barely ever needing the spinlock for the buffer header, which should then allow us to get rid of many of those loops.
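To illustrate the kind of loop being discussed, here is a hedged sketch of pinning with a CAS retry loop instead of taking the header spinlock. The bit positions and the names `wait_hdr_unlocked` and `pin_without_hdr_lock` are assumptions made for this example, not the real code; a production loop would also add a pause/backoff while spinning.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout for illustration only. */
#define PIN_ONE     UINT64_C(1)                 /* bits 0..17: pin count */
#define PIN_MASK    ((UINT64_C(1) << 18) - 1)
#define HDR_LOCKED  (UINT64_C(1) << 19)         /* header spinlock bit */

/* Spin, without acquiring anything, until the header-lock bit is clear.
 * Returns the first state value observed with the bit clear. */
static uint64_t wait_hdr_unlocked(_Atomic uint64_t *state)
{
    uint64_t s = atomic_load_explicit(state, memory_order_relaxed);

    while (s & HDR_LOCKED)
        s = atomic_load_explicit(state, memory_order_relaxed);
    return s;
}

/* Add one pin. The header lock is never taken here; we only wait for
 * its current holder to finish and then retry the compare-and-swap. */
static void pin_without_hdr_lock(_Atomic uint64_t *state)
{
    uint64_t old = atomic_load_explicit(state, memory_order_relaxed);

    for (;;) {
        if (old & HDR_LOCKED)
            old = wait_hdr_unlocked(state);

        assert((old & PIN_MASK) != PIN_MASK);   /* pin count must not overflow */

        /* On failure, 'old' is refreshed and the loop re-checks the lock bit. */
        if (atomic_compare_exchange_weak_explicit(state, &old, old + PIN_ONE,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            return;
    }
}
```

The loop exists only because some other path may still set `HDR_LOCKED`; if nothing on the hot path needs that bit anymore, the pin collapses to a plain fetch-and-add.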
> This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?
It'd be pretty nice to have. There are a lot of cases where one needs more lock state than one can really encode into a 32-bit lock state.
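For context, a small Linux-only sketch of the existing 32-bit interface. The wrapper names `futex_wait32` and `futex_wake32` are made up for this example; the raw `futex()` syscall and the `FUTEX_WAIT_PRIVATE` / `FUTEX_WAKE_PRIVATE` operations are the standard ones. The kernel compares exactly one 32-bit word before sleeping, which is why lock state wider than 32 bits doesn't fit directly:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep only if *addr still equals 'expected' (a 32-bit comparison done
 * by the kernel). Returns 0 when woken, or -1 with errno set, e.g. EAGAIN
 * if the word had already changed. */
static long futex_wait32(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected,
                   NULL, NULL, 0);
}

/* Wake up to 'nwaiters' threads sleeping on addr.
 * Returns the number of waiters actually woken. */
static long futex_wake32(uint32_t *addr, int nwaiters)
{
    assert(nwaiters > 0);
    return syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, nwaiters,
                   NULL, NULL, 0);
}
```

A waiter passes the value it last observed; if the word changed in the meantime the call returns immediately instead of sleeping, which is what makes the check-then-sleep sequence race-free.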
I'm quite keen to experiment with the rseq time slice extension stuff. Think it'll help with some important locks (which are not spinlocks...).
> Turns out to be pretty crucial for performance though...
I don't doubt it. I just meant nasty with respect to using futex() to sleep instead of spin, I was having some "fun" trying.
I can certainly see how pushing that state into one atomic would simplify things, I didn't really mean to question that.
> We're on our way to barely ever needing the spinlock for the buffer header, which should then allow us to get rid of many of those loops.
I'm cheering you on. I hadn't looked at this code before, and it's been fun looking through some of the recent work on it.
> It'd be pretty nice to have. There are a lot of cases where one needs more lock state than one can really encode into a 32-bit lock state.
I've seen too much open coded spinning around 64-bit CAS in proprietary code, where it was a real demonstrable problem, and similar to here it was often not straightforward to avoid. I confess to some bias because of this experience ("not all spinlocks...") :)
I remember a lot of cases where FUTEX_WAIT64/FUTEX_WAKE64 would have been a drop-in solution; that seems compelling to me.