There are rw lock implementations where waiters (whether readers or writers) don't contend on a shared cache line: they touch it only once to enqueue themselves, not to spin or wait.
These are usually called "scalable locks", and the algorithms for them have been around for decades. They are optimal from a cache-coherence point of view.
The issue with them is that it's impossible to support the same API you're used to with std::shared_mutex, as every thread needs its own cache line.
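
To make the API difference concrete, here is a minimal sketch of an MCS-style queue lock, one of the classic algorithms in this family. It is a simplification: it is an exclusive lock only, not a reader-writer lock, and the names McsNode, McsLock and run_counter_demo are made up for illustration. The 64-byte alignment is an assumed cache line size. The thing to notice is that lock() and unlock() take a per-thread node, which std::shared_mutex has no place for.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Per-waiter queue node. alignas(64) assumes a 64-byte cache line,
// so each waiter spins on a line nobody else is spinning on.
struct alignas(64) McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

// Hypothetical MCS-style exclusive lock (illustrative, not a library API).
class McsLock {
    std::atomic<McsNode*> tail_{nullptr};

public:
    void lock(McsNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // The only access to the shared line on the waiting path: enqueue.
        McsNode* prev = tail_.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            // Link behind the predecessor, then spin on our own node.
            prev->next.store(&me, std::memory_order_release);
            while (me.locked.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock(McsNode& me) {
        McsNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No visible successor: try to swing the tail back to empty.
            McsNode* expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
                return;
            }
            // A successor has enqueued but not linked itself yet; wait for it.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) {
                std::this_thread::yield();
            }
        }
        // Hand the lock over by writing to the successor's own line.
        succ->locked.store(false, std::memory_order_release);
    }
};

// Small demo: each thread brings its own node and bumps a plain counter.
long run_counter_demo(int threads, int iters) {
    McsLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            McsNode node;  // per-thread queue node
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling run_counter_demo(4, 10000) should come back with every increment accounted for. The reader-writer variants keep the same queue idea but also record in each node whether the waiter is a reader or a writer, so the same "caller supplies a node" shape carries over.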