By caching they mean “cached in GPU memory”. That’s a very very scarce resource.
Caching to RAM and disk is a thing, but it’s hard to keep performance up that way, and it’s still early days for deploying that tech anywhere.
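To see why GPU memory gets scarce so fast, here’s a back-of-envelope KV-cache sizing sketch. The model config below (32 layers, 32 KV heads, head dim 128, fp16) is an illustrative assumption roughly matching a 7B-class model, not any specific deployment:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Each layer stores one key and one value vector per token (the factor of 2),
    # each of size num_kv_heads * head_dim, at 2 bytes/element for fp16/bf16.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Assumed 7B-class config: 32 layers, 32 KV heads, head dim 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)                 # 524288 bytes = 0.5 MiB per token
print(per_token * 4096 / 2**30)  # 2.0 -> ~2 GiB of cache for ONE 4k-token sequence
```

So a single long conversation can eat gigabytes of VRAM before you serve a second user, which is exactly the pressure that pushes people toward RAM/disk tiers.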
Disclosure: I work on AI at Microsoft. The above is just common industry info (see the work happening in vLLM, for example).