I haven't had time to see the whole thing yet, but I'm quite surprised this yielded good results. If this works I would have expected CPU implementations to do some optimization around this by default given the memory latency bottleneck of the last 1.5 decades. What am I missing here?
Turning on mirroring does this for the low, low price of doubling your RAM cost.