The expression fusion win is huge for cache locality. Since you're using Rayon for the multicore side, I'm curious whether the generated Rust expression tree is 'flat' enough for LLVM to trigger auto-vectorization (SIMD) on the individual cores, or whether the tree traversal adds enough branching to break that?
Do you have benchmarks? Naively I would compare this to Numba, but maybe I am way off the mark here.
For the love of god, don't use these ai generated infographics/diagrams.
If that's your bar for quality, I'll think less of your code. I can't help it.
Also, your SAXPY example is actually DAXPY: in BLAS naming, the s and d prefixes stand for single and double precision, respectively.
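To make the naming point concrete, here's the same a*x + y computation at both precisions in NumPy (the values are made up; the point is only the dtype):

```python
import numpy as np

a = 2.0
x = np.array([1.0, 2.0, 3.0])  # float64 by default
y = np.array([4.0, 5.0, 6.0])

# daxpy: double-precision a*x + y (NumPy's default float64)
d_result = a * x + y

# saxpy would be the same operation on single-precision (float32) inputs
xs = x.astype(np.float32)
ys = y.astype(np.float32)
s_result = np.float32(a) * xs + ys

print(d_result.dtype)  # float64
print(s_result.dtype)  # float32
```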
I think other HNers need to keep an eye on these kinds of projects - a decade ago one of these would have required a team of 3-4 engineers about a quarter to prototype, but now we see one SWE do the same while leveraging Claude Code.
Plenty of people on HN wish to bury their heads in the sand, but this highlights how critical it is becoming to be both a good engineer and adept at using agentic tooling within your development lifecycle.
I built this after watching 7/8 CPU cores idle during a Monte Carlo sim. multiprocessing added 189ms serialization overhead to a 9ms computation.
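If you want to see this imbalance yourself, a rough sketch: time the computation against a pickle round-trip of the input, which is a stand-in for the per-task serialization cost multiprocessing pays (array size and expression here are made up, not the original sim):

```python
import pickle
import time
import numpy as np

x = np.random.rand(1_000_000)

t0 = time.perf_counter()
y = np.sqrt(np.abs(x)) + np.sin(x)          # the actual work
t1 = time.perf_counter()
blob = pickle.dumps(x)                      # what multiprocessing pays
x_back = pickle.loads(blob)                 # ...in each direction per task
t2 = time.perf_counter()

print(f"compute:   {(t1 - t0) * 1e3:.1f} ms")
print(f"serialize: {(t2 - t1) * 1e3:.1f} ms")
```

The ratio depends on the machine and array size, but for short computations the serialization term dominates quickly.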
ironkernel lets you write element-wise expressions with a Python decorator, compiles them to a Rust expression tree at definition time, and executes via rayon on all cores. ~2k lines of Rust, ~500 lines of Python.
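For readers unfamiliar with the pattern: "compiles at definition time" usually means the decorator calls your function once with a symbolic placeholder and records the operations. This is NOT ironkernel's actual API, just a minimal pure-Python sketch of the general technique:

```python
# Sketch of definition-time expression capture via operator overloading.
# Hypothetical names (Expr, kernel, axpy_ish) - not ironkernel's API.
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __add__(self, other):  return Expr("add", self, other)
    def __mul__(self, other):  return Expr("mul", self, other)
    def __rmul__(self, other): return Expr("mul", other, self)

def kernel(fn):
    # Run the function once on a symbolic argument; the returned Expr
    # tree is what a backend (Rust, in ironkernel's case) would compile.
    fn.tree = fn(Expr("arg"))
    return fn

@kernel
def axpy_ish(x):
    return 2.0 * x + x

print(axpy_ish.tree.op)  # "add"
```

The Rust side would then walk this tree once per element inside a Rayon parallel iterator, which is where the fusion comes from.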
The win is expression fusion: NumPy evaluates `where(x > 0, sqrt(abs(x)) + sin(x), 0)` as 5 passes with 4 temporaries. ironkernel fuses into 1 pass, zero temporaries, and skips dead branches (no NaN from sqrt of negatives). 2.25x NumPy on compound expressions at 10M elements. For BLAS ops like SAXPY, NumPy is faster — ironkernel doesn't call BLAS.
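What fusion buys you can be shown in plain Python: the NumPy version materializes a temporary per operation and evaluates both `where` branches for every element, while a fused loop touches each element once and never evaluates the dead branch (illustration only, not ironkernel's generated code):

```python
import numpy as np

x = np.array([-4.0, 0.5, 9.0])

# NumPy: separate passes with temporaries; np.where computes the
# sqrt/sin branch for ALL elements before discarding the x <= 0 lanes.
unfused = np.where(x > 0, np.sqrt(np.abs(x)) + np.sin(x), 0.0)

# Fused: one pass, no temporaries, branch skipped where x <= 0.
fused = np.empty_like(x)
for i, v in enumerate(x):
    fused[i] = np.sqrt(abs(v)) + np.sin(v) if v > 0 else 0.0

print(np.allclose(unfused, fused))  # True
```

A per-element loop like this is catastrophically slow in pure Python, which is why the fusion has to happen in compiled Rust to pay off.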
Early stage: f64 only, 1-D only, expression subset only (intentional — parallel safety guarantee). Numba warm is 3.2x faster (LLVM JIT vs interpreter).