Partially related, is charging by token sustainable for LLM shops? If the compute requirements go up quadratically, doesn't that mean cost should as well?
Typically requests are binned by context length so that they can be batched together. So you might have a 10k bin and a 50k bin and a 500k bin, and then you drop context past 500k. So the costs are fixed per-bin.
Typically requests are binned by context length so that they can be batched together. So you might have a 10k bin and a 50k bin and a 500k bin, and then you drop context past 500k. So the costs are fixed per-bin.