kleton • yesterday at 10:49 PM • 2 replies

Clearly they are batching reasoning inference in a few multiples of 512 tokens as a throughput optimization
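
One way to test a claim like this against your own logs is to measure how often the reported reasoning-token counts land exactly on a multiple of 512. The sketch below is only an illustration: the function name `multiple_of_block_share` is hypothetical and the sample counts are made up, not taken from any real API response.

```python
def multiple_of_block_share(token_counts, block=512):
    """Return the fraction of counts that are exact, non-zero multiples of `block`."""
    if not token_counts:
        return 0.0
    hits = sum(1 for n in token_counts if n > 0 and n % block == 0)
    return hits / len(token_counts)


# Hypothetical reasoning-token counts, invented for illustration only.
sample = [512, 1024, 1024, 1536, 700, 2048]
share = multiple_of_block_share(sample)
print(f"{share:.2%} of responses are exact multiples of 512")
```

If counts were spread evenly, only about 1 in 512 would be an exact multiple, so a share far above that would support the clustering observation. It would not by itself say whether the cause is batching, a reasoning budget, or something else.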

Replies

My first thought is that an adjustment to a reasoning budget parameter (using llama.cpp as my reference) would lead to these results. But there is no way to know precisely without a statement from OpenAI.

It could be a very dishonest way of scaling to demand during peak hours. I know that some people in this thread already scoff at how subjective perceived model performance is, but the model seemed less smart when the US came online (at least in my testing over the month of May).

In my company blog post from a few weeks ago (https://webesque.agency/blog/2026-06-19-llms.html) I felt the need to point this out, because the pattern was perceptibly more consistent during those overlap times. I should have saved the session logs for further analysis.

kbdiaz • yesterday at 11:16 PM

Isn't the standard to use continuous batching? If they are using continuous batching -- I'm curious why generated token length matters, and why the lengths might be clustering. If not -- I'm curious why they aren't, and what the tradeoff is here.
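
To make the question concrete, here is a toy comparison of the two scheduling styles. It is a rough sketch under strong assumptions: one decode step per token, a fixed number of slots, no prefill or memory limits, and made-up sequence lengths. The function names are hypothetical and do not describe how any provider actually schedules requests.

```python
import heapq


def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled at once from the queue."""
    slots = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slots)
    finish = 0
    for n in lengths:
        start = heapq.heappop(slots)  # earliest free slot takes the next request
        end = start + n
        finish = max(finish, end)
        heapq.heappush(slots, end)
    return finish


# Made-up generation lengths (in tokens), mixing short and long outputs.
lengths = [100, 900, 120, 880, 110, 890, 130, 870]
static = static_batch_steps(lengths, batch_size=4)
continuous = continuous_batch_steps(lengths, batch_size=4)
print(static, continuous)  # 1790 1210
```

In this toy model the static scheduler pays for the longest sequence in every batch, which is the kind of waste that would make output length matter. The continuous scheduler does not, which is why clustering at fixed lengths would be surprising if continuous batching is in use.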
