SSD bandwidth will ultimately be limited by the number of PCIe lanes you have available (for anything other than the Apple Silicon internal storage), so the approach has inherent limitations. You can of course scale out to multiple systems to get more throughput.
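To put rough numbers on the lane limit, here is a back-of-the-envelope sketch. The per-lane figures are the usual approximate per-direction PCIe rates after encoding overhead; the 4-lanes-per-SSD default and the function names are my own assumptions for illustration, not anything from the original setup.

```python
# Approximate usable PCIe bandwidth per lane, per direction, in GB/s
# (after 128b/130b encoding overhead). Real devices deliver somewhat less.
PCIE_GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}


def link_bandwidth_gbps(gen: int, lanes: int) -> float:
    """Theoretical one-way bandwidth of a PCIe link of the given width."""
    return PCIE_GBPS_PER_LANE[gen] * lanes


def max_ssd_bandwidth_gbps(gen: int, total_lanes: int, lanes_per_ssd: int = 4) -> float:
    """Upper bound on aggregate SSD bandwidth for a given lane budget.

    Assumes every SSD gets a dedicated x4 link (hypothetical default).
    """
    drives = total_lanes // lanes_per_ssd
    return drives * link_bandwidth_gbps(gen, lanes_per_ssd)


if __name__ == "__main__":
    # Example: a platform with 24 spare PCIe 4.0 lanes (an assumed figure).
    print(f"x16 GPU link: {link_bandwidth_gbps(4, 16):.1f} GB/s")
    print(f"SSDs on 24 lanes: {max_ssd_bandwidth_gbps(4, 24):.1f} GB/s")
```

However many drives you add, the total cannot exceed what the lane budget allows, which is the inherent limit being described.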
You can use this approach with Intel Optane, which is wearout-resistant unlike NAND and can thus substitute for RAM. Last I checked, it was available quite cheaply on the secondary market, ~$1/GB as opposed to ~$15/GB or more for DRAM. (Of course that's nowhere near as cheap as NAND, which is ~$0.1/GB but quite wearout-prone under heavy writes.)
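For a sense of scale, a quick sketch of what those per-GB prices mean for a fixed capacity. The prices are the rough figures quoted above; the 512 GB capacity is just an assumed example size.

```python
# Rough per-GB prices quoted in the comment above (USD). These are
# approximate secondary-market figures, not authoritative quotes.
PRICE_PER_GB = {"DRAM": 15.0, "Optane": 1.0, "NAND": 0.1}


def cost_usd(medium: str, capacity_gb: float) -> float:
    """Approximate cost of holding capacity_gb on the given medium."""
    return PRICE_PER_GB[medium] * capacity_gb


if __name__ == "__main__":
    capacity_gb = 512  # hypothetical capacity, chosen only for illustration
    for medium in PRICE_PER_GB:
        print(f"{medium}: ${cost_usd(medium, capacity_gb):,.0f}")
```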
Yeah, PCIe is the bottleneck. The point is that whether the data originates from RAM, NVMe or Optane, it has to cross the same PCIe link to reach the GPU, so you cannot get data to the GPU faster with RAM than with enough SSDs.
Meanwhile PCIe switches exist. So why not build:
1 CPU + memory + ...
N PCIe switches, each with 1 low-memory GPU + 6 NVMe drives (in theory 5 can saturate the GPU)
Each of those would only need to bother the CPU once it has produced some tokens, and each would have plenty of PCIe lanes to get at its data.
Such a setup should be able to get a 6 to 8 times speedup over the solution detailed here, and an increase in model compute should make relatively little difference in performance.
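The "5 drives can saturate the GPU" part of the proposal above can be sanity-checked with a small sketch. The numbers are assumptions on my part: roughly 7 GB/s sequential read per PCIe 4.0 x4 NVMe drive and roughly 31.5 GB/s for a PCIe 4.0 x16 GPU link. The actual speedup over the original solution depends on its baseline, which this does not model.

```python
import math


def drives_to_saturate(drive_gbps: float, gpu_link_gbps: float) -> int:
    """Smallest number of drives whose combined reads fill the GPU link."""
    return math.ceil(gpu_link_gbps / drive_gbps)


def node_bandwidth_gbps(drives: int, drive_gbps: float, gpu_link_gbps: float) -> float:
    """Data rate one switch node can feed its GPU: capped by the GPU link."""
    return min(drives * drive_gbps, gpu_link_gbps)


def aggregate_bandwidth_gbps(nodes: int, drives: int, drive_gbps: float,
                             gpu_link_gbps: float) -> float:
    """Total rate across N independent switch nodes (no CPU in the data path)."""
    return nodes * node_bandwidth_gbps(drives, drive_gbps, gpu_link_gbps)


if __name__ == "__main__":
    DRIVE_GBPS = 7.0      # assumed per-drive sequential read
    GPU_LINK_GBPS = 31.5  # assumed PCIe 4.0 x16 link
    print("drives to saturate:", drives_to_saturate(DRIVE_GBPS, GPU_LINK_GBPS))
    print("per node, 6 drives:", node_bandwidth_gbps(6, DRIVE_GBPS, GPU_LINK_GBPS))
    print("4 nodes total:", aggregate_bandwidth_gbps(4, 6, DRIVE_GBPS, GPU_LINK_GBPS))
```

Under these assumptions the sixth drive adds headroom rather than throughput, and total bandwidth scales with the number of switch nodes as long as each GPU's data stays behind its own switch.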