Hacker News

kbdiaz, yesterday at 11:16 PM

Isn't the standard to use continuous batching? If they are using continuous batching, I'm curious why generated token length matters, and why they might be clustering them. If not, I'm curious why they aren't and what the tradeoff is.
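For context on why output length would matter in one case and much less in the other, here is a toy sketch that counts decode steps only. It assumes one token per sequence per step and ignores prefill, KV-cache memory, and scheduling overhead. The function names and the example lengths are hypothetical and say nothing about what any particular provider actually runs.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: a finished sequence frees its slot immediately."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight sequence
    steps = 0
    while queue or active:
        # Refill free slots from the queue before every decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each active sequence emits one token; drop the ones that are done.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lengths = [10, 500, 20, 480, 15, 510, 30, 490]

print(static_batching_steps(lengths, 2))          # mixed lengths per batch
print(static_batching_steps(sorted(lengths), 2))  # clustered by length
print(continuous_batching_steps(lengths, 2))
```

In this toy model, static batching pays for the longest sequence in every batch, so clustering similar lengths together helps a lot. With continuous batching, short sequences give up their slot as soon as they finish, so the incentive to cluster by length is much weaker, which is the crux of the question.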


Replies

ACCount37, today at 12:13 AM

This "~512 batching" makes me think of things like diffusion or prefill.

If they managed to put together some dirty hack that lets them generate about 512 tokens' worth of reasoning in parallel instead of in sequence, that would explain it.