titanomachy • yesterday at 8:59 PM

The bandwidth argument is compelling. Do we have benchmarks for these models? I’m curious what it translates to in tokens per second.

Replies

mips_avatar • yesterday at 10:40 PM

I benchmarked mine for a deep research workload I was running. Concurrency 1 is the speed you'd get if you're chatting with one agent.

2x3090 (has an NVLink bridge, though it didn't seem to matter hugely for inference):

Qwen 3.6 27b int4:
Concurrency 1: 68 tok/s output
Concurrency 32: 363 tok/s output
Prompt processing speed: 1520 tok/s

Qwen 3.6 35ba3b int4:
Concurrency 1: 150 tok/s output
Concurrency 32: 1083 tok/s output
Prompt processing speed: 4324 tok/s

MacBook Pro M3, 36 GB RAM:

Qwen 3.6 27b int4:
Concurrency 1: 18 tok/s output
I didn't measure the other metrics, and it was a slightly different benchmark.
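
To make the concurrency numbers easier to compare, here is a rough sketch that turns the 2x3090 figures above into per-stream speed and aggregate speedup. It assumes the concurrency 32 figures are total output across all 32 streams (the post doesn't say so explicitly, but they are larger than the concurrency 1 figures); the `scaling` helper and the dictionary layout are just illustrative names.

```python
# Output throughput in tok/s, copied from the 2x3090 results above.
results = {
    "Qwen 3.6 27b int4": {"c1": 68, "c32": 363},
    "Qwen 3.6 35ba3b int4": {"c1": 150, "c32": 1083},
}


def scaling(c1, c32, streams=32):
    """Return (tok/s per stream, aggregate speedup over a single stream).

    Assumption: c32 is the combined output rate of all `streams` requests.
    """
    per_stream = c32 / streams
    speedup = c32 / c1
    return per_stream, speedup


for name, r in results.items():
    per_stream, speedup = scaling(r["c1"], r["c32"])
    print(f"{name}: {per_stream:.1f} tok/s per stream, {speedup:.1f}x aggregate")
```

Under that assumption, each of the 32 streams runs much slower than a lone chat session, while total throughput rises several-fold, which is the usual trade-off when batching requests.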