> [re weak vs strong cas] But, while that’s the guidance you’ll find all over the internet, I don’t actually know which CPUs this would affect. Maybe it’s old news, I dunno. But it does still seem to make a shade of difference in real-world tests
A CAS implemented with LL/SC (ARM, POWER) is weak, as LL/SC can spuriously fail, so it always needs to be retried in a loop. Such a weak CAS might only be lock-free, not wait-free, as it might not provide global progress guarantees; in practice some platforms give stronger progress guarantees, as they might convert an LL/SC sequence to a strong CAS via idiom recognition.
A CAS on x86 (and SPARC, I think) is a single instruction implemented directly in the architecture, and it is typically strong. It also usually gives strong fairness guarantees.
If your algorithm needs to CAS in a loop anyway, you might as well use a weak CAS to avoid a loop-of-loops. Otherwise a strong CAS might generate better code on some architectures.
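To make the distinction concrete, here is a minimal C11 sketch of the two cases. The function names (`saturating_increment`, `try_claim`) and the "0 means unowned" convention are made up for illustration and are not from the article.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Read-modify-write that has to loop anyway: the weak form is enough,
   because a spurious failure only costs one more trip around a loop
   we already need. */
static uint32_t saturating_increment(_Atomic uint32_t *counter, uint32_t max)
{
    uint32_t seen = atomic_load_explicit(counter, memory_order_relaxed);
    while (seen < max &&
           !atomic_compare_exchange_weak_explicit(counter, &seen, seen + 1,
                                                   memory_order_acq_rel,
                                                   memory_order_relaxed)) {
        /* On failure, `seen` was refreshed with the current value. */
    }
    return seen; /* value observed before our increment, or >= max if saturated */
}

/* One-shot attempt with no retry: use the strong form, so a false return
   means the value really differed from what we expected. */
static bool try_claim(_Atomic uint32_t *owner, uint32_t me)
{
    uint32_t expected = 0; /* 0 = unowned (assumed convention) */
    return atomic_compare_exchange_strong_explicit(owner, &expected, me,
                                                   memory_order_acq_rel,
                                                   memory_order_relaxed);
}
```

With a weak CAS in `try_claim`, a spurious failure would be indistinguishable from losing the race, so you would have to wrap it in a loop yourself, which is exactly the loop the strong form generates for you on LL/SC machines.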
> 32 bits is not enough space for the epoch if we are building something general-purpose.
Note that as long as your buffer holds fewer than 2^31 items, 32 bits is usually enough (that's how TCP sequence numbers work, for example): even after overflow you can still order two epochs as before or after each other, unless you can have stale in-flight messages from more than one overflow ago.
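The wraparound-safe comparison I have in mind looks roughly like this. It is a sketch of the usual serial-number trick, not the article's code, and `epoch_before` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if epoch `a` comes before epoch `b`, assuming the two are fewer
   than 2^31 steps apart. The unsigned subtraction wraps modulo 2^32;
   a difference in the upper half of the range means `a` is behind `b`. */
static bool epoch_before(uint32_t a, uint32_t b)
{
    return (uint32_t)(a - b) > UINT32_C(0x7FFFFFFF);
}
```

For example, `epoch_before(0xFFFFFFFF, 0)` is true, because 0 is one step after the counter wraps. The comparison only gives wrong answers once two live epochs are 2^31 or more steps apart.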
> However, the num_cells field and last_slot field are not tagged _Atomic. That’s because these should be set by one thread during initialization, and then never changed. As long as the memory has synced before other threads start to use these fields, we definitely don’t need them to be treated specially. Usually, if we do initialization in a proper function, the call boundary is going to be a memory barrier that makes sure they’re sync’d when other threads start getting a handle on our ring.
Your threading library likely guarantees that anything sequenced before the start of your thread happens-before the first instruction the new thread executes, so you do not need explicit memory barriers. In any case, a function call is at best a compiler barrier, not the full hardware barrier that many architectures would require.
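As a sketch of what that guarantee buys you, assuming POSIX threads: the struct below is a hypothetical stand-in for the article's ring (only the two field names come from the quote), and `worker` and `init_then_spawn` are made-up names.

```c
#include <pthread.h>
#include <stdint.h>

typedef struct {
    uint32_t num_cells; /* plain field: written once, before any thread starts */
    uint32_t last_slot; /* plain field: written once, before any thread starts */
} ring_config_t;        /* hypothetical stand-in for the article's ring */

static void *worker(void *arg)
{
    const ring_config_t *cfg = arg;
    /* Plain reads are fine here: POSIX lists pthread_create() among the
       functions that synchronize memory, so the writes made before the
       call are visible to the new thread. */
    return (void *)(uintptr_t)(cfg->num_cells + cfg->last_slot);
}

static uintptr_t init_then_spawn(ring_config_t *cfg, uint32_t cells)
{
    cfg->num_cells = cells;     /* initialization, single-threaded */
    cfg->last_slot = cells - 1;

    pthread_t t;
    void *result = NULL;
    if (pthread_create(&t, NULL, worker, cfg) != 0)
        return 0;               /* thread could not be started */
    pthread_join(t, &result);
    return (uintptr_t)result;
}
```

This only covers threads started after initialization. If you hand the ring to a thread that is already running, you need to publish the pointer with a release store (and read it with an acquire load), or go through a mutex or similar.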
[sorry, I wasn't really going to do a review, these were my notes when reading the algo].
The algo is quite interesting, a lot of corner cases covered. The biggest issue is that the ticketing system is a single memory location where all producers and consumers attempt to write, so it is never going to scale.
If you really really need a lock-free MPMC queue that guarantees a total order, then it can be useful, but outside some hard-realtime scenarios, I can't really see the benefits. Wouldn't a collection of SPSC queues work for the logging scenario given in the introduction?
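For reference, the kind of SPSC queue I mean is roughly the following. It is a minimal sketch with a fixed capacity and `int` payloads; all names are hypothetical. You would give each producer its own queue and have the logger thread drain them in turn.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SPSC_CAPACITY 8u /* hypothetical size; must be a power of two */
static_assert((SPSC_CAPACITY & (SPSC_CAPACITY - 1u)) == 0,
              "capacity must be a power of two");

typedef struct {
    _Atomic size_t head; /* next slot to read; written by the consumer only */
    _Atomic size_t tail; /* next slot to write; written by the producer only */
    int slots[SPSC_CAPACITY];
} spsc_queue_t;

/* Producer side. Returns false if the queue is full. */
static bool spsc_push(spsc_queue_t *q, int value)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == SPSC_CAPACITY)
        return false;
    q->slots[tail % SPSC_CAPACITY] = value;
    /* Release: the slot write above becomes visible before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the queue is empty. */
static bool spsc_pop(spsc_queue_t *q, int *out)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = q->slots[head % SPSC_CAPACITY];
    /* Release: the slot read above completes before the slot is freed. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

There is no CAS at all and each index has a single writer, so producers never contend with each other. What you give up is the total order across producers, which is the one thing the MPMC design buys you.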
From my reading of the article I think they understood why we'd want these two primitives for CAS, but they weren't clear (whereas your answer is better here) on whether that's a thing we care about today in 2025. ARM vs x86-64 matters for many people today, whereas if we only wanted the other primitive for the M68k, well, sorry Amiga fans, but who cares.
Without immersion in the "Why" of each technological niche it can be hard to judge whether you're reading advice that really hasn't been relevant in decades ("The ASCII character set is not supported on all computers") or advice that's still important to your work today ("The file naming conventions may vary from one system to another").