
7777777phil • today at 8:00 AM

Fitting a 32B model in 19.3GB is really cool imo, and it matters. Memory and cold start are what gate production deployments.
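
For a sense of scale, here is a quick back-of-envelope calculation. It assumes 1 GB = 10^9 bytes and that the whole 19.3GB is weights, neither of which is stated above, and the function names are just for illustration.

```python
def bits_per_param(size_gb: float, params_billions: float) -> float:
    """Average bits stored per parameter, taking 1 GB = 10**9 bytes (assumption)."""
    total_bits = size_gb * 1e9 * 8
    return total_bits / (params_billions * 1e9)


def fp16_size_gb(params_billions: float) -> float:
    """Size of the same model at 2 bytes per parameter (fp16/bf16), in GB."""
    return params_billions * 2.0


if __name__ == "__main__":
    print(f"bits per parameter: {bits_per_param(19.3, 32):.2f}")
    print(f"fp16 baseline: {fp16_size_gb(32):.0f} GB")
```

Under those assumptions that works out to a bit under 5 bits per parameter, against 64GB for the same model at fp16.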

I did a piece (1) on how Netflix and Spotify worked this out a while ago: cheap classical methods handle 90%+ of their recommendation requests, and LLMs only get called when the payoff justifies it.

(1) https://philippdubach.com/posts/bandits-and-agents-netflix-a...
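
A rough sketch of that kind of routing, to make the idea concrete. This is not Netflix's or Spotify's actual system: `Request`, `expected_payoff`, `llm_cost`, and both recommender stubs are hypothetical names, and how the payoff estimate is produced is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Request:
    user_id: str
    # Hypothetical estimate of how much a better answer is worth for this request.
    expected_payoff: float


Recommender = Callable[[Request], List[str]]


def recommend(
    req: Request,
    cheap: Recommender,
    llm: Recommender,
    llm_cost: float = 1.0,
) -> Tuple[str, List[str]]:
    """Use the cheap model by default; call the LLM only when payoff exceeds its cost."""
    if req.expected_payoff > llm_cost:
        return "llm", llm(req)
    return "cheap", cheap(req)


# Stub recommenders standing in for a classical model and an LLM call.
def cheap_stub(req: Request) -> List[str]:
    return ["popular-item-1", "popular-item-2"]


def llm_stub(req: Request) -> List[str]:
    return [f"personalized-item-for-{req.user_id}"]


if __name__ == "__main__":
    for r in (Request("u1", 0.2), Request("u2", 3.5)):
        print(r.user_id, recommend(r, cheap_stub, llm_stub))
```

The point of the design is that the expensive path is opt-in per request, so the LLM's memory and latency costs only apply to the small slice of traffic that clears the threshold.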
