Are there programming languages or optimizers that simplify this kind of plumbing?
Does it swap in place? x = x XOR y, y = x XOR y, x = x XOR y
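For anyone unfamiliar with the trick, here is a minimal sketch of the three-XOR swap in Python. The helper name `xor_swap` is made up for illustration, and this says nothing about what the article's implementation actually does. It only works on integers, and it needs a guard for the case where both indices refer to the same slot.

```python
def xor_swap(values, i, j):
    """Swap values[i] and values[j] in place using three XORs (integers only).

    Hypothetical helper for illustration; not taken from the article.
    """
    if i == j:
        # XOR-ing a slot with itself would zero it, so skip the aliased case.
        return
    values[i] ^= values[j]  # x = x XOR y
    values[j] ^= values[i]  # y = x XOR y, which now holds the original x
    values[i] ^= values[j]  # x = x XOR y, which now holds the original y


vals = [3, 5]
xor_swap(vals, 0, 1)
print(vals)  # [5, 3]
```

On most modern hardware a plain swap through a temporary is at least as fast, so this is more of a curiosity than an optimization.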
Is transposition a common enough operation that it might be better to avoid it altogether, by having versions of the operations/functions that apply the necessary transpositions to their matrix arguments implicitly?
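This is roughly what the transpose options on BLAS GEMM do: the caller says "treat this operand as transposed" and the routine changes how it indexes rather than moving data. A rough sketch of the idea in plain Python, where `matmul`, `trans_a`, and `trans_b` are hypothetical names chosen for illustration and not any library's API:

```python
def matmul(a, b, trans_a=False, trans_b=False):
    """Multiply list-of-lists matrices, optionally treating a and/or b as transposed.

    trans_a / trans_b are hypothetical flags, loosely modelled on the
    transpose options of BLAS GEMM. No transposed copy is ever built.
    """
    # Element accessors swap the indices instead of moving any data.
    get_a = (lambda i, k: a[k][i]) if trans_a else (lambda i, k: a[i][k])
    get_b = (lambda k, j: b[j][k]) if trans_b else (lambda k, j: b[k][j])

    # Logical shapes after the implicit transpositions.
    rows = len(a[0]) if trans_a else len(a)
    inner = len(a) if trans_a else len(a[0])
    inner_b = len(b[0]) if trans_b else len(b)
    cols = len(b) if trans_b else len(b[0])
    if inner != inner_b:
        raise ValueError("inner dimensions do not match")

    return [[sum(get_a(i, k) * get_b(k, j) for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b))                # [[19, 22], [43, 50]]
print(matmul(a, b, trans_b=True))  # [[17, 23], [39, 53]]
```

The data never moves; only the access pattern changes. Whether that is a win depends on memory layout, since a strided read can be slower than a one-time explicit transpose.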
That last diagram almost looks like an FFT shuffle.