cubefox • yesterday at 8:31 PM

> Most production AI applications aren't running 405B models. They're running 7B-70B models that need low latency and high throughput.

Really? At least for LLMs, most actual usage is concentrated on huge SOTA models with 1 trillion parameters or more. And LLMs seem to account for the lion's share of AI compute demand.

Replies

wmf • yesterday at 9:12 PM

OpenAI is trying to move as many requests as they can to a "smaller" model (still suspected to be ~200B).
