Co-resident threads might not get any speedup here, since coherency instructions are functionally operations on the L2 cache.