Almost nobody using llama.cpp does batch inference. I wouldn’t be surprised if the change is somewha...

zargon • yesterday at 4:19 PM • 2 replies • view on HN

Almost nobody using llama.cpp does batch inference. I wouldn’t be surprised if the change is somewhat involved to integrate with all of llama.cpp’s other features. Combined with lack of interest and keeping up with code churn, that would probably make it difficult to get included, with the number of PRs the maintainers are flooded with.

Replies

buildxyz • yesterday at 5:16 PM

Any speed up that is 2x is definitely worth fixing. Especially since someone has already figured out the issue and performance testing [1] shows that llamacpp* is lagging behind vLLM by 2x. This is a positive for all running LLMs locally using llamacpp.

Even if llamacpp isnt used for batch inference now, this can allow those to finally run llamacpp for batching and on any hardware since vLLM supports only select hardware. Maybe finally we can stop all this gpu api software fragmentation and cuda moat as llamacpp benchmarks have shown Vulkan to be as or more performant than cuda or sycl.

[1] https://miro.medium.com/v2/resize:fit:1400/format:webp/1*lab...

➕ show 1 reply

tough • yesterday at 4:24 PM

if you open a PR, even if it doesnt get merged, anyone with the same issue can find it, and use your PR/branch/fix if it suits better their needs than master

➕ show 1 reply

alt Hacker News

Replies