Hacker News

simonw · today at 1:24 AM

I'd be interested in seeing numbers that split out the speed of reading input (aka prefill) and the speed of generating output (aka decode). Those numbers are usually different and I remember from this Exo article that they could be quite radically different on Mac hardware: https://blog.exolabs.net/nvidia-dgx-spark/
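(For anyone wanting to measure that split against their own setup: a minimal sketch, assuming an OpenAI-compatible streaming server such as llama.cpp's llama-server. The endpoint URL, model name, and the "one token per streamed chunk" approximation are all placeholders, not something from the article or the cluster above. Time to first token stands in for prefill; the chunk rate after that stands in for decode.)

```python
# Rough prefill-vs-decode timing sketch against an OpenAI-compatible streaming
# server (e.g. llama.cpp's llama-server). Time to first token approximates
# prefill; the rate of streamed chunks after that approximates decode.
# The URL and model name below are placeholders.
import json
import time

import requests

URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
PROMPT = "Summarize the history of Beowulf clusters. " * 100  # long-ish prompt

payload = {
    "model": "local-model",  # placeholder model name
    "messages": [{"role": "user", "content": PROMPT}],
    "max_tokens": 256,
    "stream": True,
}

start = time.perf_counter()
first_token_at = None
chunks = 0

with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if choices and choices[0].get("delta", {}).get("content"):
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # roughly one token per streamed chunk

end = time.perf_counter()
if first_token_at and chunks > 1:
    print(f"prefill (time to first token): {first_token_at - start:.2f} s")
    print(f"decode: ~{(chunks - 1) / (end - first_token_at):.1f} tokens/s")
```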


Replies

geerlingguy · today at 3:09 AM

See https://github.com/geerlingguy/beowulf-ai-cluster/issues/17 for more data — I didn't save all the prompt processing times (Exo just outputs a time in ms, with no other data for that), but I'll try to take another pass. Maybe I can also convince the Exo team to add a proper benchmarking capability à la `llama-bench` :)
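(For comparison, a sketch of the kind of split `llama-bench` already reports — separate prompt-processing (pp) and token-generation (tg) tokens/sec. This assumes `llama-bench` is on PATH and supports `-o json`; the model path is a placeholder, and the JSON field names may differ between llama.cpp versions.)

```python
# Collect separate prompt-processing (pp) and token-generation (tg) results
# from llama.cpp's llama-bench, assuming it is on PATH and supports `-o json`.
# The model path is a placeholder; JSON field names may vary across versions.
import json
import subprocess

result = subprocess.run(
    [
        "llama-bench",
        "-m", "models/placeholder.gguf",  # placeholder model path
        "-p", "512",   # prompt-processing test with a 512-token prompt
        "-n", "128",   # token-generation test producing 128 tokens
        "-o", "json",
    ],
    capture_output=True,
    text=True,
    check=True,
)

for run in json.loads(result.stdout):
    # Each run is either a pp or a tg test; avg_ts is average tokens/second.
    label = f"pp{run['n_prompt']}" if run.get("n_gen", 0) == 0 else f"tg{run['n_gen']}"
    print(f"{label}: {run['avg_ts']:.1f} t/s")
```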
