At what point do the OEMs begin to realize they don’t have to follow the current mindset of attaching a GPU to a PC and instead sell what looks like a GPU with a PC built into it?
Not sure what was unexpected about the multi GPU part.
It's well known that most LLM frameworks, including llama.cpp, split models by layers, which creates a sequential dependency, so multi-GPU setups sit mostly stalled unless there are n_gpu users/tasks running in parallel. It's also known that some GPUs are faster at "prompt processing" and others at "token generation", so combining a Radeon and an NVIDIA card sometimes helps. Reportedly the inter-layer transfer sizes are in the kilobyte range, so even PCIe x1 is plenty.
It takes an appropriate backend with "tensor parallel" support, which splits the neural network parallel to the direction of data flow, and which therefore benefits substantially from a good interconnect between GPUs, like PCIe x16, NVLink/Infinity Fabric bridge cables, and/or inter-GPU DMA over PCIe (called GPU P2P or GPUDirect or some lingo like that).
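For concreteness, a minimal sketch of what enabling that looks like in vLLM's Python API, assuming a 2-GPU box; the model name is just a placeholder:

```python
from vllm import LLM, SamplingParams

# Tensor parallel: every layer is sliced across both GPUs, so they compute on
# the same token at the same time -- but they exchange partial results at every
# layer, which is where the fast interconnect (x16 / NVLink / P2P DMA) matters.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
          tensor_parallel_size=2)

# The layer-wise alternative would be pipeline_parallel_size=2: each GPU holds
# a contiguous block of layers and a single request hops GPU -> GPU, so only
# one GPU is busy at a time unless several requests are in flight.

print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```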
Absent those, I've read that people can sometimes watch a GPU utilization spike walk across the GPUs in nvtop-style tools.
Looking for a way to break LLM tasks up so that there are multiple tasks to run concurrently would be interesting, maybe by creating one "manager" and a few "delegated engineer" personalities. Or simulating multiple brain domains, such as the speech center, visual cortex, language center, etc., communicating in tokens might be an interesting way to work around this problem.
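A rough sketch of that "manager + delegated engineers" idea, assuming a local OpenAI-compatible server (llama-server, vLLM, etc.) at a made-up localhost URL; the model name and prompts are placeholders:

```python
import asyncio
from openai import AsyncOpenAI

# Hypothetical local OpenAI-compatible endpoint (llama-server, vLLM, ...).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

ROLES = {
    "manager":  "Break the task into three independent subtasks, one per line.",
    "engineer": "Solve the following subtask concisely.",
}

async def ask(system: str, user: str) -> str:
    resp = await client.chat.completions.create(
        model="local-model",  # placeholder name
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    plan = await ask(ROLES["manager"], "Write a CLI todo app in Python.")
    subtasks = [line for line in plan.splitlines() if line.strip()]
    # The point: several requests in flight at once lets a layer-split
    # multi-GPU server keep all GPUs busy instead of idling n-1 of them.
    results = await asyncio.gather(*(ask(ROLES["engineer"], s) for s in subtasks))
    for s, r in zip(subtasks, results):
        print(f"## {s}\n{r}\n")

asyncio.run(main())
```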
I've been kicking this around in my head for a while. If I want to run LLMs locally, a decent GPU is really the only important thing. At that point the question becomes, roughly, what is the cheapest computer to tack on the side of the GPU? Of course, that assumes everything does in fact work; unlike OP I am barely in a position to understand e.g. BAR problems, let alone fix them, so what I actually did was build a cheap-ish x86 box with a half-decent GPU and call it a day :) But it's still stuck in my brain: there must be a more efficient way to do this, especially if all you need is just enough computer to shuffle data to and from the GPU and serve it over a network connection.
Data points like this really make me reconsider my daily driver. I should be running one of those $300 mini PCs at <20 W. With ~flat CPU performance gains, it would be fine for the next 10 years. I'd just remote into my beefy workstation when I actually need to do real work. Browsing the web, watching videos, even playing some games is easily within their wheelhouse.
So glad someone did this. I've been running big GPUs in eGPU enclosures connected to spare laptops and thinking, why not Pis?
I wish for a hardware + software solution to enable direct PCIe interconnect using lanes independent from the chipset/CPU. A PCIe mesh of sorts.
With the right software support from, say, PyTorch, this could suddenly turn old GPUs and underpowered PCs like the one in TFA into very attractive and competitive solutions for training and inference.
I currently have a £500 laptop hooked up to an egpu box with a £700 gpu. It's not a bad setup.
I'd be interested to see if workloads like Folding@home could be efficiently run this way. I don't think they need a lot of bandwidth.
Of course. Just go to any computer store: most gamer builds on affordable budgets go with the combo "beefy GPU + an i5" instead of an i7 or i9.
I really would have liked to see gaming performance, although I realize it might be difficult to find an AAA game that supports ARM. (Forcing the Pi to emulate x86 with FEX doesn't seem entirely fair.)
Really, why have the PCI/CPU artifice at all? Apple and Nvidia have the right idea: put the MPP on the same die/package as the CPU.
What about constrained decoding (with JSON schemas)? I noticed my vLLM instance pegging one CPU core at 100%.
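For anyone wondering why that pins a core: a toy illustration of the per-token masking that schema-constrained decoding does on the host. This is a conceptual sketch with a made-up "schema", not vLLM's actual implementation:

```python
import numpy as np

VOCAB = ["{", "}", '"name"', ":", '"Ada"', ",", "42", " "]  # toy vocabulary

def allowed_tokens(generated: list[str]) -> set[int]:
    """Toy 'schema': force output to start with '{' and stop after '}'.
    A real JSON-schema grammar/FSM walk does far more work per step."""
    if not generated:
        return {0}                     # must open the object
    if generated[-1] == "}":
        return set()                   # object closed, stop
    return set(range(len(VOCAB)))      # anything goes in between (toy rule)

def constrained_step(logits: np.ndarray, generated: list[str]) -> int:
    # This mask is rebuilt and applied on the host for every single token --
    # the per-step CPU work that shows up as one core pinned at 100%.
    mask = np.full_like(logits, -np.inf)
    for t in allowed_tokens(generated):
        mask[t] = 0.0
    return int(np.argmax(logits + mask))

rng = np.random.default_rng(0)
generated: list[str] = []
for _ in range(6):
    if not allowed_tokens(generated):
        break
    logits = rng.normal(size=len(VOCAB))  # stand-in for a GPU forward pass
    generated.append(VOCAB[constrained_step(logits, generated)])
print("".join(generated))
```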
That's what SHE said.
PCIe 3.0 is the nice, easy, convenient generation where 1 lane ≈ 1 GB/s. Given the overhead, that's pretty close to 10Gb Ethernet speeds (with much lower latency, though).
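Quick arithmetic behind that rule of thumb (illustrative, ignoring protocol overhead above the physical encoding):

```python
# Back-of-envelope for the "PCIe 3.0 x1 is close to 10GbE" comparison.
pcie3_lane = 8e9 * (128 / 130) / 8  # 8 GT/s per lane, 128b/130b encoding -> bytes/s
ten_gbe    = 10e9 / 8               # 10 Gb/s data rate -> bytes/s, before Ethernet/TCP overhead

print(f"PCIe 3.0 x1 : {pcie3_lane / 1e9:.3f} GB/s")  # ~0.985
print(f"10GbE       : {ten_gbe / 1e9:.3f} GB/s")     # 1.250
```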
I do wonder how long the cards are going to need host systems at all. We've already seen GPUs with M.2 SSDs attached; the Radeon Pro SSG dates back to 2016! You'd still need a way to get the model onto the card in the first place and to get work in and out, but 1GbE and a small RISC-V chip (which Nvidia already uses for management cores) could suffice. Maybe even an RPi on the card. https://www.techpowerup.com/224434/amd-announces-the-radeon-...
Given the gobs of memory these cards have, they probably don't even need storage; they just need big pipes. Intel had 100GbE on their Xeon and Xeon Phi parts (10x what we saw here!) in 2016! GPUs that just plug into the switch and talk over 400GbE or Ultra Ethernet or switched CXL, running semi-independently, feel so sensible, so not outlandish. https://www.servethehome.com/next-generation-interconnect-in...
It's far off for now, but flash makers are also looking at radically many-channel flash that can provide absurdly high GB/s: High Bandwidth Flash. And potentially integrating some extremely parallel tensor cores on each channel. Switching from DRAM to flash for AI processing could be a colossal win for fitting large models cost-effectively (and perhaps power-efficiently) while still having ridiculous gobs of bandwidth. With the possible win of doing processing and filtering extremely near the data, too. https://www.tomshardware.com/tech-industry/sandisk-and-sk-hy...
Now compare batched training performance. Or batched inference.
Of course prefill is going to be GPU bound. You only send a few thousand bytes to the card and don't ask for much back. But after prefill is done, unless you use batched mode, you aren't really using your GPU for anything more than its VRAM bandwidth.
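Back-of-envelope for why single-stream decode ends up VRAM-bandwidth bound: every generated token has to stream roughly the whole set of weights through the GPU once. The numbers below are illustrative placeholders, not measurements from the article:

```python
# Rough ceiling on single-stream decode speed:
#   tokens/s <= VRAM bandwidth / model size in bytes
model_bytes    = 8e9    # e.g. an ~8B-parameter model at 8-bit quantization
vram_bandwidth = 448e9  # e.g. a mid-range card with ~448 GB/s

print(f"~{vram_bandwidth / model_bytes:.0f} tokens/s upper bound")  # ~56 tok/s
```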
The most interesting takeaway for me is that PCIe bandwidth really doesn't bottleneck LLM inference for single-user workloads. You're essentially just shuttling the model weights once, then the GPU churns through tokens using its own VRAM.
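Rough numbers on that one-time weight shuttle, with a placeholder 8 GB model (not a figure from the article):

```python
# One-time cost of loading weights over a narrow link (illustrative numbers).
model_bytes = 8e9  # ~8 GB of weights, placeholder
for lanes, name in [(1, "PCIe 3.0 x1"), (4, "PCIe 3.0 x4"), (16, "PCIe 3.0 x16")]:
    bw = lanes * 0.985e9  # ~0.985 GB/s per PCIe 3.0 lane
    print(f"{name:12s}: ~{model_bytes / bw:5.1f} s to load the model once")
```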
This is huge for home lab setups. You can run a Pi 5 with a high-end GPU via external enclosure and get 90% of the performance of a full workstation at a fraction of the power draw and cost.
The multi-GPU results make sense too - without tensor parallelism, you're just doing pipeline parallelism across layers, which is inherently sequential. The GPUs are literally sitting idle waiting for the previous layer's output. Exo and similar frameworks are trying to solve this, but it's still early days.
For anyone considering this: watch out for Resizable BAR requirements. Some older boards won't work at all without it.