I'd be interested in seeing numbers that split out the speed of reading input (aka prefill) and the speed of generating output (aka decode). Those two numbers are usually different, and I remember from this Exo article that they can be radically different on Mac hardware: https://blog.exolabs.net/nvidia-dgx-spark/
See https://github.com/geerlingguy/beowulf-ai-cluster/issues/17 for more data. I didn't save all the prompt processing times (Exo just outputs a time in ms, with no other data for that), but I'll try to make another pass. Maybe I can also convince the Exo team to add a proper benchmarking capability à la `llama-bench` :)
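In the meantime, here's a rough sketch of how the two numbers could be split manually: stream a response from the node's OpenAI-compatible `/v1/chat/completions` endpoint, treat time-to-first-token as (roughly) prefill, and time the remaining chunks as decode. The URL and model name below are placeholders, not confirmed Exo defaults:

```python
# Rough prefill/decode split against a streaming OpenAI-compatible endpoint.
# Assumptions: the node speaks the standard SSE streaming protocol;
# the URL and model name are placeholders for whatever your node exposes.
import json
import time

import requests

URL = "http://localhost:52415/v1/chat/completions"  # placeholder endpoint
PROMPT = "Explain the difference between prefill and decode in one paragraph."

payload = {
    "model": "llama-3.2-3b",  # placeholder model name
    "messages": [{"role": "user", "content": PROMPT}],
    "stream": True,
}

start = time.time()
first_token_at = None
chunks = 0

with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        if delta.get("content"):
            if first_token_at is None:
                first_token_at = time.time()  # end of prefill, roughly
            chunks += 1  # counts stream chunks, not exact tokens

end = time.time()
prefill_s = (first_token_at or end) - start
decode_s = end - (first_token_at or end)
print(f"prefill: {prefill_s:.2f}s")
if decode_s > 0:
    print(f"decode: {chunks} chunks in {decode_s:.2f}s ({chunks / decode_s:.1f} chunks/s)")
```

It's only a rough split (time-to-first-token includes network and scheduling overhead, and chunk counts aren't exactly token counts), but it separates the two phases better than a single ms figure.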