Your point about caliber/quality is fair, but I have been pretty astonished by some of the newe...

xscott • yesterday at 7:19 PM • 1 reply • view on HN

Your point about caliber/quality is fair, but I have been pretty astonished by some of the newer/better models (Gemma 4 variants, GPT-OSS before that).

However, there's not a lot of memory increase to have multiple sessions in parallel with one model. It's an HTTP server, and other than some caching, basically stateless.

Replies

iib • yesterday at 8:21 PM

Doesn't llama.cpp (or similar) have to evict the kv cache for this, so that performance is degraded when running multiple sessions? Or how do you load a model in memory and then use it in multiple sessions? I am still learning this stuff

➕ show 2 replies

alt Hacker News

Replies