Very interesting. The state management is the really insightful find here. I always wondered how t...

glitchc • today at 4:14 AM • 3 replies • view on HN

Very interesting. The state management is the really insightful find here.

I always wondered how these large AI companies managed access for millions of simultaneous users without having to allocate a dedicated LLM instance for each user. Pushing the complete state down to the user after every call makes perfect sense. The LLM itself stays memoryless and ready to respond to an arbitrary prompt. Very nice.

Replies

geocar • today at 4:42 AM

N.B. This is exactly how seaside, vba, and even arc[1] do server-side state generally: by encrypting the blob-representing-state and sending to the client to be sent back on future requests (where it will be decrypted and rehydrated).

It's an old trick that everyone designing protocols should know, since there are lots of applications beyond AI companies.

[1]: As in, pg's lisp: https://arclanguage.github.io/ref/srv.html#:~:text=The%20pre...

b65e8bee43c2ed0 • today at 6:42 AM

the exchange rate between text and its representation in memory is brutal. here's a bit from a recent article:

>An 82 GB footprint in DDR3 on a 2016 Xeon. About 25 GB of weights and 56 GB of KV cache at the full 262K context. The KV cache is larger than the model.

262k tokens is not much at all. with ~5 characters per token, that's only 1.3 MB of plaintext.

➕ show 1 reply

londons_explore • today at 8:12 AM

Except the providers also cache the parsing of the prompt (the KV cache), and that has substantial cost savings (easily an 80% saving on typical coding use cases).

That caching is done server side and not passed to the client. Which in turn means they still need state management on the server side, although it perhaps doesn't need the same level of global replication and availability.

alt Hacker News

Replies