ineptech • yesterday at 8:25 PM

Might as well add my own experience since I just set up a local LLM this week. I went with a 32GB card made by Intel called the Arc B70, which is cheaper than a 3090 and has more RAM, at the cost of a slower memory bus. edited to remove something incorrect, thanks diablod3

I went with this because a) the models I wanted to use are a little too big to fit comfortably in 24GB, plus I need room for a few additional small models for autocomplete and speech recognition, and b) I already had a cheap server to use, and dual GPUs would've required upgrading the mobo and power supply and probably the case as well.

It was definitely a little tricky to set up. The Intel line requires a driver package called "Level Zero" to support something called SYCL (Intel's version of CUDA basically, AFAICT), and getting that working took some effort. I am running llama.cpp in a Docker container, which also required some fiddling to get the container to see the card. You also need a kernel from the last few months.
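For anyone debugging the "container can't see the card" part: a rough sanity check, not my exact setup. On Linux the GPU normally shows up as DRM render nodes under /dev/dri, so if the container wasn't given that device directory the list below comes back empty. The function name and the default path argument are just my own choices for illustration.

```python
import glob


def visible_render_nodes(dri_dir="/dev/dri"):
    """Return the DRM render nodes (renderD*) visible to this process, sorted."""
    return sorted(glob.glob(f"{dri_dir}/renderD*"))


if __name__ == "__main__":
    nodes = visible_render_nodes()
    if nodes:
        print("GPU render nodes visible:", ", ".join(nodes))
    else:
        # Assumption: an empty list usually means /dev/dri was not passed
        # through to the container (or the host driver is not loaded).
        print("No render nodes found under /dev/dri")
```

Run it inside the container; seeing a render node only tells you the device is passed through, not that Level Zero or SYCL are working.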

Once I got it working though, the results are very impressive for a $1k investment. Qwen 3.6 35B at q4 quantization takes about 3/4 of the RAM and delivers about 88 tokens/sec. So, if you want a decent-sized model for cheap, this is one way to go.
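Back-of-the-envelope on the memory number, as a sketch only. The 4.5 bits per weight is my assumption for a 4-bit quant once scales and other overhead are counted, not something I measured, and the helper name is made up for illustration.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumption: ~4.5 effective bits per weight for a q4-style quantization.
weights_gb = weight_footprint_gb(35, 4.5)

# "About 3/4 of the RAM" on a 32GB card.
observed_gb = 0.75 * 32

print(f"estimated weights: {weights_gb:.1f} GB")
print(f"observed usage:    {observed_gb:.1f} GB")
print(f"left for context, buffers, etc.: {observed_gb - weights_gb:.1f} GB")
```

Under that assumption the weights come to a bit under 20GB, and the rest of the roughly 24GB would be context cache and runtime buffers.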

Replies

DiabloD3 • yesterday at 8:36 PM

That is incorrect.

They both have GDDR6.

The B70 has a 256-bit bus at a clock speed of 2375 MHz (608 GB/s); the 3090 has a 384-bit bus at a clock speed of 2438 MHz (936 GB/s).

It isn't slower, it just has fewer channels, i.e., it is less wide.
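The figures above can be reproduced with a quick calculation. This is a sketch that assumes 8 data transfers per quoted memory clock per pin, which is the multiplier that makes both quoted bandwidths come out; the function and parameter names are just for illustration.

```python
def bandwidth_gb_s(bus_width_bits, clock_mhz, transfers_per_clock=8):
    """Peak memory bandwidth in GB/s.

    transfers_per_clock=8 is an assumption chosen because it matches the
    quoted 608 GB/s and 936 GB/s figures.
    """
    gbit_per_pin = clock_mhz * transfers_per_clock / 1000  # Gbit/s per pin
    return bus_width_bits / 8 * gbit_per_pin  # bytes wide x per-pin rate


for name, width, clock in [("B70", 256, 2375), ("3090", 384, 2438)]:
    print(f"{name}: {bandwidth_gb_s(width, clock):.0f} GB/s")
```

The per-pin rates are nearly the same (19.0 vs about 19.5 Gbit/s), so almost all of the difference comes from the bus width: 384/256 = 1.5x.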
