Look into the Eigen library. It uses template metaprogramming (expression templates) to chain linear algebra operations so that the compiler can optimize memory layout and emit kernels that use vector (SIMD) instructions. Might give you some ideas.
Expect very verbose compiler output, though. (I once got 35 pages of compiler output for a single type error.) Probably NBD with LLMs.
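
For a feel of the trick, here is a rough sketch of expression templates in plain C++. This is not Eigen's actual code, and `Expr`, `Sum`, `Vec`, and `sum_three_demo` are names I made up for illustration. The point is that `a + b + c` builds a nested type at compile time instead of computing temporaries, and the whole thing is evaluated in one loop on assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base: tags a type as a vector expression (hypothetical name).
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Lazy node for lhs + rhs. It only stores references; nothing is
// computed until an element is requested via operator[].
template <typename L, typename R>
struct Sum : Expr<Sum<L, R>> {
    const L& lhs;
    const R& rhs;
    Sum(const L& l, const R& r) : lhs(l), rhs(r) {
        assert(l.size() == r.size());
    }
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Concrete vector that owns its storage.
struct Vec : Expr<Vec> {
    std::vector<double> data;

    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}

    // Constructing from any expression evaluates it in a single loop,
    // with no intermediate vectors allocated.
    template <typename E>
    Vec(const Expr<E>& e) : data(e.self().size()) {
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] = e.self()[i];
        }
    }

    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// operator+ returns a lazy Sum node rather than a computed Vec.
template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}

// Small usage example (hypothetical helper).
double sum_three_demo() {
    Vec a(4, 1.0), b(4, 2.0), c(4, 3.0);
    // The right-hand side has type Sum<Sum<Vec, Vec>, Vec>.
    Vec r = a + b + c;
    return r[0];
}
```

The nested `Sum<Sum<Vec, Vec>, Vec>` type is also why the error messages get so long: one type mismatch prints the full expression type at every level of the template instantiation stack. Note that the nodes hold references, so storing an expression in an `auto` variable past the end of the statement can dangle.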