The most interesting takeaway for me is that PCIe bandwidth really doesn't bottleneck LLM inference for single-user workloads, at least as long as the whole model fits in VRAM. You're essentially just shuttling the model weights over the bus once at load time, then the GPU churns through tokens using its own VRAM.
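Here's a rough back-of-the-envelope sketch of why that holds. Every number in it is an assumption I picked for illustration (model size, per-token traffic, rounded link speeds), not a measurement, and the function names are made up:

```python
# Back-of-the-envelope only. All numbers below are illustrative assumptions.

def load_seconds(model_gb, link_gb_per_s):
    """One-time cost of copying the weights into VRAM over the link."""
    return model_gb / link_gb_per_s


def per_token_link_ms(bytes_per_token, link_gb_per_s):
    """Time the link spends on a single generated token, in milliseconds."""
    return bytes_per_token / (link_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    model_gb = 8.0         # assumed: a quantized model that fits in VRAM
    bytes_per_token = 1e6  # assumed, and deliberately generous
    links = {"narrow x1 link": 1.0, "wide x16 link": 32.0}  # GB/s, rounded
    for name, bw in links.items():
        print(f"{name}: load {load_seconds(model_gb, bw):.1f} s, "
              f"per token {per_token_link_ms(bytes_per_token, bw):.3f} ms")
```

The narrow link makes the initial load noticeably slower, but that cost is paid once. The per-token cost over the bus stays small next to the time the GPU spends reading its own VRAM, provided nothing spills out of VRAM.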
This is huge for home lab setups. You can run a Pi 5 with a high-end GPU in an external enclosure and get 90% of the performance of a full workstation at a fraction of the power draw and cost.
The multi-GPU results make sense too - without tensor parallelism, you're just doing pipeline parallelism across layers, which is inherently sequential for a single request. The GPUs are literally sitting idle waiting for the previous stage's output. Exo and similar frameworks are trying to solve this but it's still early days.
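To make the idle-GPU point concrete, here's a toy model of one request flowing through a layer-split pipeline. It's a sketch under simplifying assumptions (one request in flight, no overlap between stages, made-up stage times), not how any particular framework schedules work:

```python
# Toy model: one request, layers split across GPUs, stages run one after another.

def pipeline_token_latency(stage_ms, hop_ms=0.0):
    """Per-token latency: every stage runs in turn, plus one hop between stages."""
    return sum(stage_ms) + hop_ms * (len(stage_ms) - 1)


def per_gpu_utilization(stage_ms, hop_ms=0.0):
    """Fraction of each token's wall time that each GPU spends busy."""
    total = pipeline_token_latency(stage_ms, hop_ms)
    return [s / total for s in stage_ms]


if __name__ == "__main__":
    stages = [10.0, 10.0, 10.0, 10.0]  # assumed per-GPU compute time, in ms
    print(pipeline_token_latency(stages))  # same total as one GPU doing all layers
    print(per_gpu_utilization(stages))     # each GPU is busy a quarter of the time
```

In this model, adding GPUs buys you more VRAM but no speedup for a single user: the latency is the sum of the stages, and any hop cost between GPUs only makes it worse.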
For anyone considering this: watch out for Resizable BAR (ReBAR) requirements. Some older boards won't work at all without it.