The problem is there’s a real wall on the vram side. While fused main memory is ok the inference speeds on larger models are impractical. With vram on a GPU the machine class, power requirement, GPU costs, and other factors put them out of most people’s reach. Cloud GPUs require a second job to keep available and hot. What closed providers offer is packing and scale advantages as well as infrastructure. The scaling laws here aren’t the same as Moore’s law - in fact they predict more required hardware and more scale over time. Moore’s laws isn’t keeping up with expanded needs and the ability to fab and produce at scale the specific things that weren’t needed a few years ago are lagging. So it’s not a 6-8 month lag; it’s a lag that will be induced by hardware scarcity and an ever increasing lag until something fundamentally changes with matmul.