Why would you want an MPMC queue primitive, instead of just using MPSC per consumer? I don't really see a discussion of the cache contention issues. (There is some mention of contention, but it seems the author is using the word in a different way.)
It looks like both enqueue and dequeue are RMWing the "epochs" variable? This can pretty easily lead to something like O(N^2) runtime if N processors are banging this cache line at the same time.
I wrote my first SPSC circular buffer implementation in TS after reading the previous post on futexes [0] from the same author. It turned out to be more intricate than it seemed; I only realized that once I finally sat down and wrote my own code.
[0]: https://h4x0r.org/futex/ discussion: https://news.ycombinator.com/item?id=44951563
Seems hung up on defining a ring buffer as something non-blocking (dropping). Having used ring buffers in both software and hardware systems for 40 years, we always called them ring buffers without this distinction. We would have called them all ring buffers. One nice thing about them is that it’s very easy to pass buffers back and forth between a single producer and single consumer without any locks or fancy memory atomics. All you need is a guarantee that memory writes occur in order. The AMD LANCE Ethernet controller (and many later derivatives) used this scheme to allow smooth overlapping of software frame processing with hardware transmission and reception way back in the 1980s.
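A minimal sketch of that single-producer/single-consumer scheme in C11 (names are mine, for illustration, not from the article): the only synchronization is on the head/tail indices, where release/acquire is the modern way to spell the "writes occur in order" requirement the parent describes.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 256u            /* power of two so indices wrap cleanly */

    struct spsc_ring {
        _Atomic uint32_t head;        /* written only by the consumer */
        _Atomic uint32_t tail;        /* written only by the producer */
        uintptr_t        cells[RING_SIZE];
    };

    /* Producer: write the payload first, then publish it by advancing tail
     * with a release store, so the payload write is ordered before it. */
    static bool spsc_push(struct spsc_ring *r, uintptr_t v) {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return false;             /* full */
        r->cells[tail % RING_SIZE] = v;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Consumer: the acquire load of tail makes the payload visible before we
     * read it; advancing head with a release store hands the cell back. */
    static bool spsc_pop(struct spsc_ring *r, uintptr_t *out) {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return false;             /* empty */
        *out = r->cells[head % RING_SIZE];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }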
Strange to see a lock-free ring buffer write-up with no mention of LMAX/Martin Thompson's Java Disruptor (https://github.com/LMAX-Exchange/disruptor)
A good write-up, but a preallocated MPMC queue can be built without a CAS on every push, without fixed-size slots, and without multiple rings.
While even SPSC ring buffers are simple, they are not particularly efficient when the consumer is keeping up with the producer (the ideal case for a queue), due to all the cache-line ping-pong. They are not the lowest-latency solution.
This isn't that new (see FASTER: A Concurrent Key-Value Store with In-Place Updates, 2018 ACM SIGMOD International Conference on Management of Data, and related papers).
However, this is well written and very easy to read.
Is there any chance to modify Vyukov's MPMC queue implementation (https://www.1024cores.net/home/lock-free-algorithms/queues/b...) to support a drop handler? That design doesn't need a 128-bit CAS.
I did a lock-free MPMC ring buffer with one 128-bit CAS and one 64-bit CAS for enqueue, and one 64-bit CAS for dequeue. The payload is an unrestricted uintptr_t (64-bit) value, so there is no way to avoid the 128-bit CAS in the enqueue.
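For reference, a rough sketch of what a double-width CAS on an {epoch, value} slot looks like with GCC/Clang builtins (this assumes x86-64 with cmpxchg16b, i.e. building with -mcx16; the names are mine, not the parent's actual code):

    #include <stdbool.h>
    #include <stdint.h>

    /* A slot pairing a full 64-bit payload with a 64-bit epoch.  Because the
     * payload may use all 64 bits, the epoch can't be stolen from spare bits,
     * so publishing both together atomically needs a 128-bit CAS. */
    typedef struct {
        uint64_t  epoch;
        uintptr_t value;
    } __attribute__((aligned(16))) slot_t;

    static bool slot_cas(slot_t *slot, slot_t *expected, slot_t desired) {
        /* Compiles to lock cmpxchg16b when the target supports it. */
        return __atomic_compare_exchange(slot, expected, &desired,
                                         false /* strong */,
                                         __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    }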
> [re weak vs strong cas] But, while that’s the guidance you’ll find all over the internet, I don’t actually know which CPUs this would affect. Maybe it’s old news, I dunno. But it does still seem to make a shade of difference in real-world tests
A CAS implemented with LL/SC (ARM, POWER) is weak because LL/SC can spuriously fail, so it always needs to be retried in a loop. Such a weak CAS might only be lock-free, not wait-free, as it might not provide global progress guarantees; in practice some platforms give stronger progress guarantees because they can convert an LL/SC loop into a strong CAS via idiom recognition.
A strong CAS (x86, SPARC I think) is implemented directly as a single instruction in the architecture and cannot fail spuriously. It also usually gives stronger fairness guarantees.
If your algorithm needs to CAS in a loop anyway, you might as well use a weak CAS to avoid a loop-of-loops. Otherwise, a strong CAS might generate better code on some architectures.
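In C11 terms (a toy illustration, not code from the article):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Already looping: the weak variant is fine, since a spurious failure
     * just costs one extra iteration, and it avoids a nested LL/SC loop. */
    static uint64_t fetch_ticket(_Atomic uint64_t *ctr) {
        uint64_t old = atomic_load_explicit(ctr, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   ctr, &old, old + 1,
                   memory_order_acq_rel, memory_order_relaxed))
            ;   /* 'old' is reloaded on failure; just retry */
        return old;
    }

    /* One-shot attempt with no surrounding loop: use the strong variant so a
     * spurious failure can't be mistaken for "someone else got there first". */
    static bool try_claim(_Atomic uint64_t *flag) {
        uint64_t expected = 0;
        return atomic_compare_exchange_strong_explicit(
                   flag, &expected, 1,
                   memory_order_acq_rel, memory_order_relaxed);
    }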
> 32 bits is not enough space for the epoch if we are building something general-purpose.
Note that as long as your buffer can contain fewer than 2^31 items, 32 bits is usually enough (that's how TCP works, for example): even after the epoch wraps around, you can still tell "before" from "after", unless you can have stale in-flight messages from more than one wraparound ago.
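Concretely, that's the serial-number arithmetic trick TCP sequence numbers use: as long as two live epochs never differ by 2^31 or more, a signed difference still orders them correctly across 32-bit wraparound. A toy version:

    #include <stdbool.h>
    #include <stdint.h>

    /* True if epoch 'a' is older than epoch 'b', even across wraparound,
     * provided the two are less than 2^31 apart. */
    static bool epoch_before(uint32_t a, uint32_t b) {
        return (int32_t)(a - b) < 0;
    }

    /* e.g. epoch_before(0xFFFFFFF0u, 0x00000010u) is true: 0x10 is the
     * later epoch, it just wrapped past zero. */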
>However, the num_cells field and last_slot field are not tagged _Atomic. That’s because these should be set by one thread during initialization, and then never changed. As long as the memory has synced before other threads start to use these fields, we definitely don’t need them to be treated specially. Usually, if we do initialization in a proper function, the call boundary is going to be a memory barrier that makes sure they’re sync’d when other threads start getting a handle on our ring.
Your threading library likely guarantees that anything sequenced before the start of your thread happens-before the first instruction of the new thread is executed. So you do not need explicit memory barriers. In any case, a function call is at best a compiler barrier, not a full barrier as required on many architectures.
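A minimal sketch of that guarantee with C11 threads (the fields here stand in for the article's num_cells / last_slot; pthreads gives the equivalent guarantee):

    #include <threads.h>
    #include <stdint.h>
    #include <stdio.h>

    static struct { uint64_t num_cells; uint64_t last_slot; } g_ring;

    static int worker(void *arg) {
        (void)arg;
        /* Plain, non-atomic reads are fine here: everything sequenced before
         * thrd_create() happens-before the new thread starts executing. */
        printf("%llu cells\n", (unsigned long long)g_ring.num_cells);
        return 0;
    }

    int main(void) {
        g_ring.num_cells = 1024;                 /* ordinary initialization */
        g_ring.last_slot = g_ring.num_cells - 1;

        thrd_t t;
        thrd_create(&t, worker, NULL);           /* the synchronization point */
        thrd_join(t, NULL);
        return 0;
    }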
[sorry, I wasn't really going to do a review, these were my notes when reading the algo].
The algo is quite interesting, with a lot of corner cases covered. The biggest issue is that the ticketing system is a single memory location that all producers and consumers attempt to write, so it is never going to scale.
If you really really need a lock-free MPMC queue that guarantees a total order, then it can be useful, but outside some hard-realtime scenarios, I can't really see the benefits. Wouldn't a collection of SPSC queues work for the logging scenario given in the introduction?
Looks like a good write up, but I'd caution that some of the statements about memory models aren't completely accurate.
The terms relaxed, acquire, and release refer to how an atomic operation is ordered against other accesses to memory.
Counter to what the article states, a relaxed atomic is still atomic, meaning it cannot tear and, for an RMW atomic, no other access to that location can slip in between the read and the write. But a relaxed atomic does not order other memory accesses, which can lead to unintuitive outcomes.
By contrast, once you've observed another thread's release store with an acquire load, you're guaranteed that your subsequent memory accesses "happen after" all of that thread's accesses from before the release store. That is what you'd intuitively expect; it's just that on modern systems (which are really highly distributed systems, even on a single chip) establishing that guarantee has a cost, which is why relaxed atomics let you opt out of it when you know what you're doing.
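A small illustration of the difference (assuming one publisher thread and one consumer thread):

    #include <stdatomic.h>

    static int payload;                 /* plain data, not atomic */
    static _Atomic int ready = 0;

    /* Thread A */
    static void publish(void) {
        payload = 42;                                  /* ordinary store    */
        atomic_store_explicit(&ready, 1,
                              memory_order_release);   /* publishes payload */
    }

    /* Thread B */
    static int consume(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;   /* spin until the release store is observed */
        /* Guaranteed to read 42.  With memory_order_relaxed on both sides,
         * the flag itself would still be atomic (no tearing), but this read
         * of payload would be a data race and could observe a stale value. */
        return payload;
    }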