Neat result. The symmetry exploitation here reminds me of recent work connecting neural network training dynamics to renormalization group theory. Charles Martin's SETOL paper https://arxiv.org/abs/2507.17912 shows that well-trained layers converge to something like an RG fixed point: the eigenvalue spectrum of the layer's correlation matrix W^T W develops power-law tails with exponent α ≈ 2, which is the signature of scale invariance. At this fixed point, the "effective correlation space" is low-dimensional: you can truncate the SVD aggressively and recover nearly identical test accuracy.
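To make both measurements concrete, here's a minimal numpy sketch: a crude Hill-estimator fit for the tail exponent α of a layer's ESD, plus the aggressive SVD truncation. `esd_alpha`, `truncated_svd`, and the `k_frac` tail fraction are my own names for illustration, and the fit is far rougher than the power-law fits weightwatcher does.

```python
import numpy as np

def esd_alpha(W, k_frac=0.1):
    """Rough Hill-estimator fit for the PDF tail exponent alpha of the
    empirical spectral density (ESD) of X = W^T W. Much cruder than
    weightwatcher's fits; k_frac sets how much of the upper tail is used."""
    evals = np.sort(np.linalg.svd(W, compute_uv=False) ** 2)[::-1]
    k = max(2, int(k_frac * len(evals)))
    tail = evals[:k]
    # Hill tail index, shifted by 1 to convert the CDF exponent
    # P(lam > x) ~ x^-(alpha-1) into the PDF exponent p(lam) ~ lam^-alpha.
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

def truncated_svd(W, rank):
    """Keep only the top-`rank` singular components of W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

# A pure random matrix has no heavy tail, so the fitted exponent comes
# out large; well-trained layers in the SETOL picture land near 2.
W = np.random.randn(512, 2048) / np.sqrt(2048)
print(esd_alpha(W))
```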
I wonder if there's a connection to your Taylor truncation order. In RG terms, higher-order polynomial interactions are "irrelevant operators": they get suppressed as you flow toward the fixed point. If trained attention heads are sitting near this fixed point, that might explain why modest truncation orders work: the network has already learned to concentrate its computation in the lower-order terms. A testable prediction: layers with α closer to 2 (measurable via weightwatcher https://github.com/CalculatedContent/WeightWatcher) might need fewer Taylor terms for an accurate approximation than layers with α far from 2. If true, you could potentially use the spectral statistics to adaptively choose the truncation order per head.
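On the measurement side, this is about all it takes, assuming weightwatcher's standard API (`pip install weightwatcher`) and a trained PyTorch or Keras model; I haven't run it against your heads specifically:

```python
# Per-layer alpha measurement; a sketch assuming weightwatcher's
# standard analyze() API, which returns one row per analyzable layer.
import weightwatcher as ww

watcher = ww.WeightWatcher(model=model)  # `model` is your trained network
details = watcher.analyze()              # pandas DataFrame of per-layer stats
# `alpha` is the fitted power-law exponent of each layer's ESD;
# values near 2 would mark layers close to the putative fixed point.
print(details[["layer_id", "alpha"]].sort_values("alpha"))
```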
Yes, there must be a connection. While adaptive truncation may prove impractical, it should be possible to measure spectral statistics on sample data and specify a different fixed truncation order per layer, per head, etc. The GitHub repository lists many other possible improvements: https://github.com/glassroom/sata_attention#proof-of-concept
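For instance, something along these lines, where `taylor_order_from_alpha` and the specific mapping are entirely hypothetical and would need to be calibrated per layer against held-out accuracy:

```python
# Hypothetical: pick a fixed per-layer Taylor order from measured alpha.
# The mapping is made up for illustration; only the direction (more
# terms when alpha sits far from 2) follows from the argument above.
def taylor_order_from_alpha(alpha, base_order=4, max_order=8):
    distance = abs(alpha - 2.0)
    return min(max_order, base_order + int(round(2.0 * distance)))

# `details` is the weightwatcher DataFrame from the sketch above.
orders = {row.layer_id: taylor_order_from_alpha(row.alpha)
          for row in details.itertuples()}
```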