verdverm • today at 1:57 PM

I have tried llama-cpp; vllm is nicer (ray, handles queueing, doesn't have the cache invalidation bug for qwen/gemma models), and unsloth has toxic employees in their discord.

I've run two qwen/gemma models at 8-bit with full context windows side by side. Right now I have 4 models on my spark (qwen36moe, embedding, reranker, qwen3-1.7B) to support my markdown kb tool.

The setup is not as capable, but it's still good and gets better as models/algos improve. To me, it's more about the freedom to tinker, freedom from token bill anxiety, and a potential right to compute should the government/oligarchy decide it gets to choose who can access which models.