A 32B model in 19.3GB is really cool imo, and it matters: memory and cold start are what gate production deployments.
I did a piece (1) a while ago on how Netflix and Spotify worked this out: cheap classical methods handle 90%+ of their recommendation requests, and LLMs only get called when the payoff justifies it.
(1) https://philippdubach.com/posts/bandits-and-agents-netflix-a...