Hacker News

jcelerier, yesterday at 3:39 PM

Did you do a PR to integrate these changes back into llama.cpp? A 2x speedup would be absolutely wild.


Replies

zargon, yesterday at 4:19 PM

Almost nobody using llama.cpp does batch inference. I wouldn’t be surprised if the change is somewhat involved to integrate with all of llama.cpp’s other features. Combined with the lack of interest and the ongoing code churn to keep up with, that would probably make it hard to get merged, given the number of PRs the maintainers are already flooded with.
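
For context, batch inference here means decoding several sequences in one `llama_decode` call. A hedged sketch of what that looks like with llama.cpp's C API, written from memory of the headers (`decode_prompts` is a made-up helper, and field names may have drifted across versions):

```cpp
#include <cstdint>
#include <vector>
#include "llama.h"

// Hypothetical helper: pack one tokenized prompt per sequence into a single
// batch and decode them all in one pass. Sketch only; verify against the
// current llama.h before using.
void decode_prompts(llama_context *ctx,
                    const std::vector<std::vector<llama_token>> &prompts) {
    int32_t total = 0;
    for (const auto &p : prompts) total += (int32_t) p.size();

    llama_batch batch = llama_batch_init(total, 0, (int32_t) prompts.size());
    for (int32_t s = 0; s < (int32_t) prompts.size(); ++s) {
        for (int32_t i = 0; i < (int32_t) prompts[s].size(); ++i) {
            const int32_t k = batch.n_tokens++;
            batch.token[k]     = prompts[s][i];
            batch.pos[k]       = i;   // position within its own sequence
            batch.n_seq_id[k]  = 1;
            batch.seq_id[k][0] = s;   // which sequence this token belongs to
            // only request logits for the last token of each prompt
            batch.logits[k] = (i == (int32_t) prompts[s].size() - 1);
        }
    }
    if (llama_decode(ctx, batch) != 0) {
        // handle failure (e.g. KV cache exhausted)
    }
    llama_batch_free(batch);
}
```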

zozbot234, yesterday at 5:18 PM

It depends: if the optimization is too hardware-dependent, it might hurt or regress performance on other platforms. One would have to find ways to generalize and auto-tune it based on known features of the local hardware architecture; a toy sketch of that idea follows.
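
A minimal sketch of that kind of auto-tuning (nothing here is llama.cpp code; all names are made up): time a few candidate tile sizes for a kernel at startup and keep whichever runs fastest on the local machine.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Toy kernel: sum a buffer in blocks of `tile` elements.
static float sum_tiled(const std::vector<float> &v, size_t tile) {
    float total = 0.0f;
    for (size_t i = 0; i < v.size(); i += tile) {
        const size_t end = std::min(i + tile, v.size());
        float partial = 0.0f;
        for (size_t j = i; j < end; ++j) partial += v[j];
        total += partial;
    }
    return total;
}

// Pick whichever tile size runs fastest on this machine. A real tuner
// would average many runs and cache the result per hardware config.
static size_t autotune_tile(const std::vector<float> &sample) {
    const size_t candidates[] = {64, 256, 1024, 4096};
    size_t best = candidates[0];
    double best_ns = 1e300;
    for (size_t tile : candidates) {
        const auto t0 = std::chrono::steady_clock::now();
        volatile float sink = sum_tiled(sample, tile); // keep the call alive
        (void) sink;
        const auto t1 = std::chrono::steady_clock::now();
        const double ns =
            std::chrono::duration<double, std::nano>(t1 - t0).count();
        if (ns < best_ns) { best_ns = ns; best = tile; }
    }
    return best;
}

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    const size_t tile = autotune_tile(data);
    std::printf("chosen tile: %zu\n", tile);
}
```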
