Not sure what was unexpected about the multi GPU part.
It's very well known that most LLM frameworks, including llama.cpp, split models across GPUs by layers. Layers have a sequential dependency, so a multi-GPU setup mostly sits idle unless there are roughly n_gpu users/tasks running in parallel. It's also known that some GPUs are faster at "prompt processing" and others at "token generation", which is why combining a Radeon with an NVIDIA card sometimes helps. Reportedly the inter-layer transfers are only in the kilobyte range, so even PCIe x1 is plenty.
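As a rough illustration (a minimal sketch assuming PyTorch and two CUDA devices, not how llama.cpp is actually implemented), layer splitting looks something like this: the hidden state hops from GPU to GPU, so for a single request only one device is busy at any moment:

    # Sketch of layer-split ("pipeline") execution across two GPUs.
    # Illustrative only: assumes PyTorch and two CUDA devices.
    import torch
    import torch.nn as nn

    hidden = 4096
    # First half of the layers on GPU 0, second half on GPU 1.
    block0 = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(16)]).to("cuda:0")
    block1 = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(16)]).to("cuda:1")

    x = torch.randn(1, hidden, device="cuda:0")  # one token's hidden state
    y = block0(x)        # GPU 1 is idle while this runs
    y = y.to("cuda:1")   # the only inter-GPU traffic: hidden_size * dtype bytes per token
    out = block1(y)      # GPU 0 is idle while this runs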
Getting real multi-GPU speedups takes a backend with "tensor parallel" support, which splits the network along the direction of data flow so that every GPU works on the same token at once. That mode obviously benefits substantially from a good interconnect between GPUs, like PCIe x16, NVLink/Infinity Fabric bridges, and/or inter-GPU DMA over PCIe (called GPU P2P, GPUDirect, or similar).
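For contrast, a toy column-parallel split of a single matmul (again just a sketch assuming PyTorch and two CUDA devices): both GPUs compute at the same time, but the partial results have to be gathered at every layer, which is where interconnect bandwidth and latency start to matter:

    # Toy column-parallel split of one weight matrix across two GPUs.
    # Both devices compute simultaneously; results are gathered afterwards.
    import torch

    hidden = 4096
    W = torch.randn(hidden, hidden)
    W0, W1 = W.chunk(2, dim=1)                 # split output columns across devices
    W0, W1 = W0.to("cuda:0"), W1.to("cuda:1")

    x = torch.randn(1, hidden)
    y0 = x.to("cuda:0") @ W0                   # runs on GPU 0
    y1 = x.to("cuda:1") @ W1                   # runs on GPU 1, at the same time
    y = torch.cat([y0.cpu(), y1.cpu()], dim=1) # gather step, every layer, every token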
Without tensor-parallel support, I've read that people sometimes see GPU utilization spikes walking across the GPUs one after another in nvtop-style tools, as each card takes its turn.
Looking for a way to break up LLM tasks so that there are multiple pieces to run concurrently would be interesting, maybe by creating one "manager" and a few "delegated engineer" personalities. Simulating multiple brain regions, such as a speech center, visual cortex, and language center, communicating in tokens might also be an interesting way to work around this problem.
> Looking for a way to break up LLM tasks so that there are multiple pieces to run concurrently would be interesting, maybe by creating one "manager" and a few "delegated engineer" personalities.
This is pretty much what "agents" are for. The manager model constructs prompts and contexts that the delegated models can work on in parallel, returning results when they're done.
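A rough sketch of that fan-out pattern, with a hypothetical ask_model() helper standing in for whatever local endpoint you run; since the subtasks are independent, each worker call can be served from a different GPU:

    # Manager/worker fan-out: independent prompts run concurrently,
    # so each worker request can be served by a different GPU.
    # ask_model() is a hypothetical helper for whatever local API you run.
    from concurrent.futures import ThreadPoolExecutor

    def ask_model(prompt: str) -> str:
        # e.g. POST to a llama.cpp / vLLM server here; stubbed out for the sketch
        return f"result for: {prompt}"

    def manager(task: str) -> str:
        subtasks = [f"{task} -- part {i}" for i in range(4)]  # manager splits the work
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(ask_model, subtasks))     # workers run in parallel
        return ask_model("Combine these results:\n" + "\n".join(results))

    print(manager("Summarize the design doc"))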
> Reportedly the inter-layer transfers are only in the kilobyte range, so even PCIe x1 is plenty.
Not an expert, but napkin math tells me that more often than not this will be on the order of megabytes, not kilobytes, since it scales with sequence length.
Example: Qwen3 30B has a hidden state size of 5120; even quantized to 8 bits, that's 5120 bytes per token, so it crosses 1 MB at a little over 200 tokens. Still not much of an issue when a single PCIe lane is ~2 GB/s.
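Spelling the napkin math out (hidden size from the example above; the 2048-token prompt length and the ~2 GB/s lane speed are my assumptions):

    # Napkin math for the hidden state handed to the next GPU at a layer boundary.
    hidden_size = 5120         # assumed hidden state width (see example above)
    bytes_per_value = 1        # 8-bit activations
    tokens = 2048              # assumed prompt length, processed in one pass

    transfer_bytes = hidden_size * bytes_per_value * tokens
    pcie_x1_bytes_per_s = 2e9  # ~2 GB/s, roughly one PCIe 4.0 lane

    print(f"{transfer_bytes / 1e6:.1f} MB per layer boundary")             # ~10.5 MB
    print(f"{transfer_bytes / pcie_x1_bytes_per_s * 1e3:.2f} ms over x1")  # ~5 ms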
I think device to device latency is more of an issue here, but I don't know enough to assert that with confidence.
> Not sure what was unexpected about the multi GPU part. It's very well known that most LLM frameworks, including llama.cpp, split models across GPUs by layers. Layers have a sequential dependency, so a multi-GPU setup mostly sits idle
Oh, I thought the point of transformers was being able to split the load vertically to avoid sequential dependencies. Is that true only for training, or not at all?
There are some technical implementations that make this more efficient, like EXO [1]. Jeff Geerling recently reviewed a 4x Mac Studio cluster with RDMA support, and you can see that EXO has a noticeable advantage [2].
[1] https://github.com/exo-explore/exo [2] https://www.youtube.com/watch?v=x4_RsUxRjKU