Until you get high memory contention from the rest of the code. Once eviction gets high you get some pretty counterintuitive improvements by fixing things that seem like they shouldn’t need to be fixed.
My best documented case was a 10x speed up from removing a double lookup that was killing caches.
My best improvment was just bit-interleaving both axes of a 2x32bit integer coordinate (aka z-curve). I obtained factor ~100x (yes factor not percent) throughput improvement over locality in only one dimension. All it took was ~10 lines of bit twiddling. The runtime went from a bit above 300ms to slightly less then 3ms.