As far as I know, the last approach is the only way to implement an efficient lock-free ring buffer.
There is one more way that is truly lock-free. Most "lock-free" implementations that rely on atomic compare-and-swap instructions are not really lock-free afaik; the CPU takes a lock on the cache line, so in a way you move from one global lock to many distributed locks.
There is one more mechanism that allows implementing ring buffers without having to compare head and tail at all (and doesn't rely on counters, empty/full flags, etc.): it piggybacks on the cache coherence protocol.
The middle approach is the only one that is not lock-free.
The first approach is lock-free, but as the author says, it wastes an element.
But here's the thing: if your element is a character, your buffer size is, say, 256 bytes, and you are using 8-bit unsigned characters for indices, the one wasted byte is less than one percent of your buffer space (1/256, about 0.4%), and it is compensated for by the simplicity and reduced code size.
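To make that concrete, here is a minimal sketch of what I take the "wastes an element" approach to be, assuming a single producer and a single consumer and C11 atomics. The names `ring_t`, `ring_put` and `ring_get` are made up for illustration, not taken from the article. Because the indices are `uint8_t` and the buffer has 256 slots, wraparound comes for free and there is no modulo or masking anywhere.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-producer/single-consumer byte queue.
 * 256 slots, 8-bit indices, one slot always left empty. */
typedef struct {
    uint8_t data[256];
    _Atomic uint8_t head;  /* next slot to write; stored only by the producer */
    _Atomic uint8_t tail;  /* next slot to read; stored only by the consumer */
} ring_t;

/* Producer side: returns false if the buffer is full. */
static bool ring_put(ring_t *r, uint8_t byte)
{
    uint8_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint8_t next = (uint8_t)(head + 1);  /* wraps 255 -> 0 on its own */

    /* Full when advancing head would make it equal to tail. */
    if (next == atomic_load_explicit(&r->tail, memory_order_acquire))
        return false;

    r->data[head] = byte;
    /* Release: the byte is visible before the new head is. */
    atomic_store_explicit(&r->head, next, memory_order_release);
    return true;
}

/* Consumer side: returns false if the buffer is empty. */
static bool ring_get(ring_t *r, uint8_t *out)
{
    uint8_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

    /* Empty when head == tail. */
    if (tail == atomic_load_explicit(&r->head, memory_order_acquire))
        return false;

    *out = r->data[tail];
    /* Release: the slot is read before it is handed back to the producer. */
    atomic_store_explicit(&r->tail, (uint8_t)(tail + 1), memory_order_release);
    return true;
}
```

A zero-initialized `ring_t` starts out empty and holds at most 255 bytes; the 256th `ring_put` fails until the consumer takes something out. With more than one producer or more than one consumer this sketch is not safe as written.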