Did you open a PR to integrate these changes back into llama.cpp? A 2x speedup would be absolutely wild.
It depends; if the optimization is too hardware-dependent, it might regress performance on other platforms. One would have to find a way to generalize and auto-tune it based on known features of the local hardware architecture.
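To give a rough idea of what "auto-tune it on the local hardware" could look like in practice, here's a minimal, hypothetical C++ sketch (not from llama.cpp) that times a few candidate tile sizes for a blocked matmul on the local machine at startup and picks the fastest, rather than hard-coding one:

```cpp
// Hypothetical sketch: runtime auto-tuning of a tile size by benchmarking
// a few candidates on the local machine instead of hard-coding one.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Toy blocked matmul over n x n matrices, with block size `tile`.
static void matmul_blocked(const std::vector<float>& a, const std::vector<float>& b,
                           std::vector<float>& c, int n, int tile) {
    for (int ii = 0; ii < n; ii += tile)
        for (int kk = 0; kk < n; kk += tile)
            for (int jj = 0; jj < n; jj += tile)
                for (int i = ii; i < std::min(ii + tile, n); ++i)
                    for (int k = kk; k < std::min(kk + tile, n); ++k) {
                        const float aik = a[i * n + k];
                        for (int j = jj; j < std::min(jj + tile, n); ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}

// Time one candidate tile size and return elapsed milliseconds.
static double bench_tile(int n, int tile) {
    std::vector<float> a(n * n, 1.0f), b(n * n, 1.0f), c(n * n, 0.0f);
    const auto t0 = std::chrono::steady_clock::now();
    matmul_blocked(a, b, c, n, tile);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 512;                    // calibration problem size (arbitrary)
    int best_tile = 16;
    double best_ms = 1e30;
    for (int tile : {16, 32, 64, 128}) {  // candidate tile sizes to tune over
        const double ms = bench_tile(n, tile);
        std::printf("tile %3d -> %.2f ms\n", tile, ms);
        if (ms < best_ms) { best_ms = ms; best_tile = tile; }
    }
    std::printf("selected tile size: %d\n", best_tile);
    // A real integration would cache this choice per CPU/GPU and reuse it,
    // so other platforms aren't stuck with one machine's tuning.
    return 0;
}
```

The same idea extends to choosing between whole kernel variants (e.g. per ISA or GPU generation), which is roughly what would be needed to upstream a hardware-specific optimization without regressing everyone else.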
Almost nobody using llama.cpp does batch inference. I wouldn't be surprised if the change is fairly involved to integrate with all of llama.cpp's other features. Combined with the lack of interest, the need to keep up with code churn, and the number of PRs the maintainers are already flooded with, that would probably make it difficult to get included.