Hacker News

barrkel, today at 7:42 AM

I found it interesting that vLLM was dismissed as slower than llama.cpp.

IME, vLLM is quite a bit faster than llama.cpp, but where it really wipes the floor with it is in batching concurrent load. The downside is that it is dramatically less flexible when it comes to tweaking: it gives you very few options for running quantized weights, and it takes a lot longer to start up because it optimizes the compute graph. So for single-user experimentation on a model that's a bit too big for your box, vLLM is just going to be frustrating.


Replies

chartered_stack, today at 8:39 AM

One could say: vLLM isn't a worse llama.cpp, it's a different tool.