What pattern in the data shows that's what's being measured? I would expect to see basically 0 latency between adjacent "cores" then since L1 is shared per thread?
Co resident threads might not get any speed up here since coherency instructions are functionally operations on the L2 cache.
Co resident threads might not get any speed up here since coherency instructions are functionally operations on the L2 cache.