The expression fusion win is huge for cache locality. Since you're using Rayon for the multicore side, I'm curious whether the generated Rust expression tree is 'flat' enough for LLVM to auto-vectorize (SIMD) the loop each core runs, or whether the tree traversal adds enough branching to break that?
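
To make the question concrete, here is a rough sketch of the two shapes I have in mind. The `Expr` enum and the names `eval`, `eval_tree`, and `eval_fused` are all hypothetical, not taken from your code, and the expression `x * 2 + 1` is just a stand-in. I'm assuming each Rayon worker ends up running a loop like one of these over its chunk.

```rust
// Hypothetical expression type, for illustration only.
enum Expr {
    X,
    Const(f32),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Shape 1: walk the tree once per element.
// Every element pays for a `match` and pointer chasing through the boxes.
fn eval(e: &Expr, x: f32) -> f32 {
    match e {
        Expr::X => x,
        Expr::Const(c) => *c,
        Expr::Add(a, b) => eval(a, x) + eval(b, x),
        Expr::Mul(a, b) => eval(a, x) * eval(b, x),
    }
}

fn eval_tree(e: &Expr, input: &[f32], out: &mut [f32]) {
    for (o, &x) in out.iter_mut().zip(input) {
        *o = eval(e, x);
    }
}

// Shape 2: the same expression already flattened into straight-line
// arithmetic, so the loop body has no data-dependent branches.
fn eval_fused(input: &[f32], out: &mut [f32]) {
    for (o, &x) in out.iter_mut().zip(input) {
        *o = x * 2.0 + 1.0;
    }
}
```

My guess is that the second shape is the kind of loop the auto-vectorizer tends to handle well, while the first usually is not, but that depends on the optimizer and build flags, so looking at the emitted assembly for the generated code would be the real test.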