Hacker News

jychang, today at 11:49 AM

Well, DeepSeek batch sizes are something like 8192, so 128 isn't much.

https://arxiv.org/html/2412.19437v1 "the batch size per expert is relatively small (usually within 256 tokens)"
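
For a rough sense of how an overall batch maps to a per-expert batch, here is a back-of-the-envelope sketch. It assumes perfectly uniform routing and the 256 routed experts with 8 active per token described in the DeepSeek-V3 report; the function name and defaults are my own illustration, not anything from the paper.

```python
def tokens_per_expert(batch_tokens: int,
                      num_routed_experts: int = 256,
                      top_k: int = 8) -> float:
    """Average tokens each routed expert sees per MoE layer.

    Assumes uniform routing (an idealization): every token is sent to
    top_k experts, so batch_tokens * top_k assignments are spread
    evenly over num_routed_experts.
    """
    return batch_tokens * top_k / num_routed_experts


# Compare the two batch sizes mentioned in the thread.
for batch in (128, 8192):
    print(f"batch={batch:5d} -> ~{tokens_per_expert(batch):.0f} tokens per expert")
```

Under those assumptions a batch of 8192 tokens works out to about 256 tokens per expert, which lines up with the "usually within 256 tokens" quote, while a batch of 128 leaves each expert with only about 4.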