There is one more way that is truly lock free. Most lock free implementations relying on atomic compare and swap instructions are not lock free afaik; they have a lock on the cache line in the CPU (in a way you go away from global lock to many distributed locks).
There is one more mechanism that allows implementing ring buffers without having to compare head and tail buffers at all (and doesn’t rely on counters or empty/full flags etc) that piggybacks on the cache consistency protocol
Those hardware-level locks are typically not considered because they work quite differently. A standard software mutex can cause other threads to block indefinitely if, for example, the thread holding the mutex gets preempted for a long time. "Lock free" isn't really about the locks, it's about a guarantee that the system makes progress.
In this sense, the hardware locks used for atomic instructions don't really count, because they're implemented such that they can only be held for a brief, well defined time. There's no equivalent to suspending a thread while it holds a lock, causing all other threads to wait for an arbitrary amount of time.
Interesting! Do you know of an example implementation of this?
That's not how "lock free" is defined/used. If you are considering the MESI M state to be a "lock" then you also have to grant that any write instruction is a "lock".