Hacker News

dirtikiti · yesterday at 8:26 PM · 3 replies

"Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to run those chips well doesn't."

So figure out how to run it on Vulkan instead of requiring the user to be locked into expensive CUDA cards.


Replies

Aurornis · yesterday at 8:57 PM

So everyone is aware, you can already run Qwen3.5-27B on Vulkan or Apple's hardware. Every major inference engine supports it right now.

This repo is a vibecoded demo implementation of some recent research papers, combined with optimizations that sacrifice quality for speed to get a big number that looks impressive. The 207 tok/s figure they're claiming only appears in the headline; the results they actually show are half that or less, so I already don't trust anything they claim to have accomplished.

If you want to run Qwen3.5-27B, you can do it with a project like llama.cpp on CUDA, Vulkan, Apple, or even CPU.
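As a rough sketch of what that looks like with llama.cpp's Vulkan backend (the model filename and quantization here are illustrative; substitute whatever GGUF quant you actually download):

```shell
# Build llama.cpp with the Vulkan backend enabled (no CUDA toolkit needed).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# Run an interactive session, offloading as many layers as fit to the GPU.
# -m: path to the GGUF model file (name below is a placeholder)
# -ngl: number of layers to offload to the GPU
./build/bin/llama-cli -m models/qwen-27b-q4_k_m.gguf -ngl 99 -p "Hello"
```

The same binary falls back to CPU if no Vulkan device is found, which is why one build covers such a wide range of hardware.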

SwellJoe · yesterday at 10:33 PM

You can run pretty much every model on Vulkan, including the Qwen MoE models. You can also run pretty much every model on ROCm, on Apple Silicon via MLX, and on Intel hardware via OpenVINO. Nvidia got there first, but they're no longer clearly dominant in the self-hosting space, simply because of the high cost.

I think Apple probably has the lead there, due to unified memory allowing big models to run without multiple big dedicated GPUs, but stuff like Strix Halo with 128GB of unified memory is also pretty much sold out everywhere. There's a lower bound on how small a model can be and still be useful.

Anyway, I don't have any Nvidia hardware, and I've got several local models running and/or training at all times.

andsoitis · yesterday at 8:29 PM

Why doesn’t Apple?
