
eurekin today at 8:47 AM

> The model is running so hot, that it shoots past the goal and starts looping

later:

> My latest experiment was setting up vLLM (the gold standard for production and concurrent serving) and even with an NVLink (175GBP) and tensor parallelism turned on, it was 3 tokens/second slower than llama.cpp during generation for an equivalent setup.

In all my tests, getting vLLM to run was worth it. It was the single biggest thing that helped with looping, agents going whack and losing focus on the task, and long context being essentially useless.

With an FP8 model and an unquantized KV cache in vLLM, you get a far better overall experience than with any other stack I tested. Then you can actually focus on using the model for other things and stop tinkering with settings.
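For anyone wanting to try this, here is a rough sketch of how such a launch command might be put together. It is an illustration under assumptions, not my exact setup: the model name is a placeholder, the helper function is hypothetical, and the tensor-parallel size of 2 simply mirrors the two-GPU NVLink setup from the article. `--kv-cache-dtype auto` keeps the cache in the model's own dtype, i.e. unquantized. If the checkpoint is already stored in FP8, the `--quantization` flag is probably unnecessary.

```python
import shlex


def build_vllm_serve_command(model: str, tensor_parallel_size: int = 2) -> list[str]:
    """Build an argument list for `vllm serve` (hypothetical helper).

    Assumptions: FP8 weights, KV cache left unquantized, and the model
    split across `tensor_parallel_size` GPUs.
    """
    return [
        "vllm", "serve", model,
        "--quantization", "fp8",        # FP8 weights
        "--kv-cache-dtype", "auto",     # cache stays in the model dtype
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


# "your-org/your-model" is a placeholder, not a real checkpoint.
command = build_vllm_serve_command("your-org/your-model")
print(shlex.join(command))
```

The printed line can be pasted into a shell on a machine where vLLM is installed. Check `vllm serve --help` for your installed version before relying on any of these flags.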


Replies

trey-jones today at 9:52 AM

I'm really curious about this, not because I disagree, but because I want to avoid agents going whack. Are you running vLLM just for yourself, for a team, or for an application? And do you feel there is a minimum hardware requirement for vLLM to be useful in this way?

My weekend project is going to be building a home inference server (from ancient datacenter parts) and I'm still massaging in my head what the end result will be.
