Hacker News

titanomachy, yesterday at 8:59 PM

The bandwidth argument is compelling. Do we have benchmarks for these models? I'm curious what it translates to in tokens per second.
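As a rough way to turn a bandwidth figure into tokens per second, here is a back-of-the-envelope sketch. It assumes single-stream decoding is limited purely by memory bandwidth and that every active weight is read once per generated token, ignoring KV cache reads, activations, and kernel overhead. The function name `bandwidth_bound_tok_s` is hypothetical, and the inputs (936 GB/s, the published figure for a single RTX 3090, plus 27B dense and 3B active parameters at 4 bits) are illustrative assumptions, not measurements from this thread. The result is an upper bound, not a prediction.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s, active_params_b, bits_per_weight):
    """Upper bound on single-stream decode speed in tokens per second.

    Assumes every active weight is read from memory once per token and
    nothing else (KV cache, activations) costs bandwidth.
    """
    # Billions of parameters * bytes per weight = GB read per token.
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


# Illustrative inputs (assumptions, not benchmark results).
dense_bound = bandwidth_bound_tok_s(936, 27, 4)  # 27B dense model, 4-bit weights
moe_bound = bandwidth_bound_tok_s(936, 3, 4)     # MoE with 3B active parameters

print(f"dense: {dense_bound:.1f} tok/s, MoE: {moe_bound:.1f} tok/s")
```

Real numbers will land below these bounds, and a mixture-of-experts model usually falls further short of its bound than a dense one because routing and other overheads take a larger share of each token's time.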


Replies

mips_avatar, yesterday at 10:40 PM

I benchmarked mine for a deep research workload I was running. Concurrency 1 is the speed you'd get if you're chatting with one agent.

2x3090 (has an NVLink bridge, though it didn't seem to matter much for inference):

Qwen 3.6 27b int4:
Concurrency 1: 68 tok/s output
Concurrency 32: 363 tok/s output
Prompt processing speed: 1520 tok/s

Qwen 3.6 35ba3b int4:
Concurrency 1: 150 tok/s output
Concurrency 32: 1083 tok/s output
Prompt processing speed: 4324 tok/s

MacBook Pro M3, 36 GB RAM:
Qwen 3.6 27b int4:
Concurrency 1: 18 tok/s output
I didn't measure the other metrics, and it was a slightly different benchmark.
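To make the concurrency numbers for the 2x3090 setup easier to compare, here is a small sketch that works out the per-stream speed and the aggregate speedup at concurrency 32. It assumes the concurrency 32 figures are total output throughput summed across all 32 streams, which the post does not state explicitly. The `scaling` helper and the dictionary layout are hypothetical names for illustration.

```python
# Output throughput in tok/s for the 2x3090 setup, copied from the post above.
results = {
    "Qwen 3.6 27b int4": {"c1": 68, "c32": 363},
    "Qwen 3.6 35ba3b int4": {"c1": 150, "c32": 1083},
}


def scaling(c1, c32, streams=32):
    """Per-stream speed and aggregate speedup, assuming c32 is a total across streams."""
    return {
        "per_stream": c32 / streams,  # tok/s each concurrent request sees
        "speedup": c32 / c1,          # aggregate throughput vs. one stream
    }


summary = {model: scaling(**r) for model, r in results.items()}

for model, s in summary.items():
    print(f"{model}: {s['per_stream']:.1f} tok/s per stream, "
          f"{s['speedup']:.1f}x aggregate")
```

Under that assumption, batching trades per-request speed for total throughput: each of the 32 streams runs well below the concurrency 1 speed, while the machine as a whole produces several times more tokens per second.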