lambench is single-attempt: one shot per problem.
I don't think they understand how LLMs work. To meaningfully benchmark a non-deterministic, probabilistic model, they would need to run each problem something like 45 times. LLMs sample from a distribution and behave accordingly.
The better story is how the models behave on the same problem after 5, 15, and 45 samples.
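The usual way to report that kind of multi-sample result is a pass@k metric: run n attempts per problem, count the c correct ones, and estimate the chance that at least one of k draws succeeds. A minimal sketch of the standard unbiased estimator (the numbers below are made-up examples, not lambench results):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn without replacement from n attempts is correct,
    given that c of the n attempts were correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: 45 attempts on one problem, 9 of them correct.
for k in (5, 15, 45):
    print(f"pass@{k} = {pass_at_k(45, 9, k):.3f}")
```

With 45 attempts per problem you can report pass@5, pass@15, and pass@45 from the same runs, which is exactly the "5, 15, 45 samples" story.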
That said, lambda calculus is a brilliant subject for a benchmark.
The models are reliably incorrect. [0]