Hacker News

philipp-gayret, today at 8:13 AM

Nice work. I ran my own benchmarking tools against it.

On my single NVIDIA Spark I get 173.3 tokens/s on the baseline config and 372.4 tokens/s with added tuning/parallel options. Most notably, time to first token is incredibly low: similar models take ~6000 ms, while Bonsai took 70 ms with flash attention (roughly an 85x reduction).
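For anyone who wants to check the ratios, here is a quick sketch of the arithmetic using only the numbers quoted above. The `speedup` helper is a hypothetical name for illustration, not part of any benchmarking tool, and the ~6000 ms figure is approximate, so treat the second ratio as a ballpark.

```python
def speedup(baseline: float, improved: float) -> float:
    """Return how many times larger `improved` is than `baseline`."""
    return improved / baseline

# Figures quoted in the comment above.
throughput_gain = speedup(173.3, 372.4)  # tuned vs. baseline tokens/s
ttft_reduction = speedup(70.0, 6000.0)   # ~6000 ms vs. 70 ms time to first token

print(f"throughput: {throughput_gain:.2f}x")
print(f"time to first token: {ttft_reduction:.1f}x lower")
```

So the tuned config is a bit over 2x the baseline throughput, and the time-to-first-token gap is a little under two orders of magnitude.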

Having said all that, gemma4-e4b-q4km did much better, specifically in the context of tool use and running agents, and I can achieve 70% of the tokens/s with it on the same machine.