Why would you want an MPMC queue primitive, instead of just using MPSC per consumer? I don't really see a discussion of the cache contention issues. (There is some mention of contention, but it seems the author is using the word in a different way.)
It looks like both enqueue and dequeue are RMWing the "epochs" variable? This can pretty easily lead to something like O(N^2) runtime if N processors are banging this cache line at the same time.
For me, I’ve got use cases where it’s valuable to keep event data interleaved because it will also get used in flight. It works well enough I also use it for things where it’s not necessary like in memory debug rings (which requires a bit of additional work).
The epoch isn’t CAS’d; it is FAA. The epoch is then used to determine if there is contention due to the tail meeting the head, or due to a wrap-around due to slow writes.
There’s also a back-off scheme to ease contention for a full queue.
Though, I did originally have a variant that adds a fairly straightforward ‘help’ mechanism that makes the algorithm wait-free and reduces the computational complexity.
However, the extra overhead didn’t seem worth it, so I took it out pretty quickly. Iirc, the only place where the ring in the queue wouldn’t out-perform it are on tiny queues with a huge write imbalance.
If you go run the tests in the repo associated w the article, you probably will see that a ring with only 16 entries will tend to start being non-performant at about a 4:1 writer to reader ratio. But iirc that effect goes away before 128 slots in the ring.
There, the ring still fits in a page, and even with a big imbalance I can’t remember seeing less than 1m ops per second on my Mac laptop.
Real world observational data beats worst case analysis, and I’ve never seen an issue for scenarios I consider reasonable.
But, if my unrealistic is realistic to you, let me know and I can break out the wait free version for you.