It's not even a fixed cost per token (even though it's billed that way, and that's still miles better than a fixed-price all-you-can-eat plan). You're incurring a cost proportional to the number of generated tokens times the context each one attends over (plus the prefill cost for any uncached input), so the expense grows quadratically with your average generated context.
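To make the quadratic growth concrete, here's a back-of-the-envelope sketch (all token counts are made up): with full attention, each generated token attends over everything before it, so the total attention work is the sum of the context lengths seen per token.

```python
def attention_cost(prompt_tokens: int, generated_tokens: int) -> int:
    """Sum of context lengths seen by each generated token.

    Token i attends over the prompt plus the i tokens generated so far,
    so the total is n*p + n*(n-1)/2 -- quadratic in generated length.
    """
    return sum(prompt_tokens + i for i in range(generated_tokens))

# Doubling the generated length more than triples the attention work here
# (~3.4x), and the ratio approaches 4x as generation dwarfs the prompt:
print(attention_cost(prompt_tokens=2_000, generated_tokens=10_000))  # ~7.0e7
print(attention_cost(prompt_tokens=2_000, generated_tokens=20_000))  # ~2.4e8
```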
This all becomes extremely visible when trying to do agentic coding with local language models - you quickly realize that controlling context length and model size is just as important as avoiding wasted effort. The real scam is not AI Q&A à la ChatGPT - that's actually quite viable, though marginally less so as conversations grow longer. It's agentic coding with SOTA models and huge contexts.
Using larger contexts often costs more through the APIs or consumes more of your quota, but this is becoming less of a problem as models adopt cleverer attention mechanisms - sliding-window or sparse attention on many layers, say - rather than full attention on all of them.
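As a rough illustration of why that helps (the window size and context length here are arbitrary), a sliding-window layer caps how far back each token attends, turning that layer's cost from quadratic to linear in context length:

```python
def full_attention_cost(context: int) -> int:
    # Each position attends to every prior position: quadratic overall.
    return sum(range(1, context + 1))

def sliding_window_cost(context: int, window: int) -> int:
    # Each position attends to at most `window` prior positions:
    # linear in context once context >> window.
    return sum(min(i, window) for i in range(1, context + 1))

ctx = 100_000
print(full_attention_cost(ctx))                # ~5.0e9
print(sliding_window_cost(ctx, window=4_096))  # ~4.0e8, an order of magnitude less
```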
You can look at https://sebastianraschka.com/llm-architecture-gallery/ to see how much things have changed.