Hacker News

LuxBennu today at 1:03 PM

The title is misleading — there's no trained 100B model, just an inference framework that claims to handle one. But the engineering is worth paying attention to. I run quantized 70B models locally (M2 Max 96GB, llama.cpp + LiteLLM), and memory bandwidth is always the bottleneck. The 1.58-bit approach is interesting because ternary weights turn matmuls into additions — a fundamentally different compute profile on commodity CPUs. If 5-7 tok/s on a single CPU for 100B-class models is reproducible, that's a real milestone for on-device inference. Framework is ready. Now we need someone to actually train the model.
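To illustrate the parent's point about ternary weights, here's a toy NumPy sketch (not the BitNet kernel itself): with weights restricted to {-1, 0, +1}, a dot product needs no multiplies at all, only adds and subtracts.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weight matrix, values in {-1, 0, +1}
x = rng.standard_normal(8)             # input activations

# Standard matmul: multiplies + adds
y_matmul = W @ x

# Multiply-free equivalent: add where w == +1, subtract where w == -1, skip w == 0
y_addonly = np.array([x[w == 1].sum() - x[w == -1].sum() for w in W])

assert np.allclose(y_matmul, y_addonly)
```

Same result, but the inner loop is pure accumulation, which is the "fundamentally different compute profile" being claimed.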


Replies

embedding-shape today at 1:22 PM

> Framework is ready. Now we need someone to actually train the model.

If Microslop aren't gonna train the model themselves to prove their own thesis, why would others? They've had 2 years (I think?) to prove BitNet in at least some way, are you really saying they haven't tried so far?

Personally that makes it slightly worrisome to just take what they say at face value, why wouldn't they train and publish a model themselves if this actually led to worthwhile results?

deepsquirrelnet today at 5:09 PM

The title being misleading is important as well, because this has landed on the front page, and the title is the only notable part of this submission.

The "new" banner on Hugging Face is on weights that were uploaded 11 months ago, and it's a 2B-param model. Work on this in the repo is 2 years old.

The amount of publicity compared to the anemic delivery for BitNet is impressive.

wongarsu today at 1:19 PM

I've also always thought that it's an interesting opportunity for custom hardware. Two-bit addition is incredibly cheap in hardware, especially compared to anything involving floating point. You could make huge vector instructions on the cheap, then connect it to the fastest memory you can buy, and you'd have a capable inference chip.

You'd still need full GPUs for training, but for inference the hardware would be orders of magnitude simpler than what Nvidia is making.

DrBazza today at 4:55 PM

> memory bandwidth is always the bottleneck

I'm hoping that today's complaints are tomorrow's innovations. Back when a 1 MB hard drive was $100,000, or when Gates supposedly said 640 KB was enough.

Perhaps someone 'in the (chip) industry' can comment on what RAM manufacturers are doing at the moment: better, faster, larger? Or is there not much headroom left, and it's down to MOBO manufacturers and volume?

riidom today at 4:11 PM

The text is misleading too. 5-7 tok/sec is not reading speed, it's a tad slower. For me, at least, and I am an experienced reader, though not especially schooled in speed-reading.

I happened to "live" on 7.0-7.5 tok/sec output speed for a while, and it is an annoying experience. It is the equivalent of walking behind someone slightly slower on the sidewalk. I dealt with this by deliberately looking away for a minute until output was "buffered" and only then starting to read.

For any local setup I'd try to reach 10 tok/sec. Sacrifice some kv cache and shove a few more layers onto your GPU; it's worth it.
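A rough back-of-envelope check on that, assuming roughly 0.75 English words per token (the ratio varies by tokenizer and text, so treat this as an estimate only):

```python
# Convert generation speed (tok/s) to reading speed (words/min),
# assuming ~0.75 words per token -- an approximation, not a constant.
WORDS_PER_TOKEN = 0.75

def words_per_minute(tokens_per_second):
    return tokens_per_second * WORDS_PER_TOKEN * 60

for tps in (5, 7, 10):
    print(f"{tps} tok/s ~= {words_per_minute(tps):.0f} words/min")
```

5 tok/s works out to about 225 words/min, which sits right around typical silent-reading pace, consistent with the "slightly slower walker" feeling above.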

WithinReason today at 1:30 PM

> a fundamentally different compute profile on commodity CPUs

In what way? On modern processors, a fused multiply-add (FMA) instruction generally has the exact same execution throughput as a basic addition instruction.

rustyhancock today at 1:16 PM

Yes. I had to read it over twice; it does strike me as odd that there wasn't a base model to work with.

But it seems the biggest model available is 10B? Somewhat unusual, and it does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.

cubefox today at 1:22 PM

LLM account

august11 today at 1:33 PM

In their demo they're running a 3B model.

webXL today at 2:16 PM

It comes from (intentionally?) misleading docs: https://github.com/microsoft/BitNet/issues/391

(only suggesting that it's intentional because it's been there so long)

RandomTeaParty today at 3:12 PM

> The 1.58-bit approach

Can we stop already with these decimals and just call it "1 trit", which is exactly what it is?
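For reference, the 1.58 figure is just the information content of one trit, log2(3) bits, so both names describe the same thing:

```python
import math

# One ternary digit ("trit") carries log2(3) bits of information,
# which is where the "1.58-bit" naming comes from.
bits_per_trit = math.log2(3)
print(f"{bits_per_trit:.4f}")  # ~1.5850
```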

cyanydeez today at 2:36 PM

Check out the new QWEN coder model.

Also, aren't there different affinities for 8-bit vs 4-bit inference?

butILoveLife today at 1:24 PM

> I run quantized 70B models locally (M2 Max 96GB, llama.cpp + LiteLLM), and memory bandwidth is always the bottleneck.

I imagine you got 96GB because you thought you'd be running models locally? Did you not know the phrase "Unified Memory" is marketing speak?