thanks :) the braiding approach is super clever too. this was one of those weird moments where you find something and then have to triple-check your results, because how could i accidentally find something better than the algorithm that hasn't been touched in decades...
the part i really like is that it gives us a small improvement on the pclmul version too, since the non-accelerated algorithm doesn't really stand a chance against the accelerated opcode on newer hardware, so the non-accelerated version probably isn't going to see much use in practice. however... i think hardware solutions (e.g. ethernet cards) could possibly benefit