The bottleneck in training and inference isn’t matmul, and once a chip is more than a kindergarten toy you don’t go from FPGA to tape-out by clicking a button. For local memory he’s going to have to learn to either stack DRAM (not “3000 lines of Verilog”, and it requires a supply chain OpenAI just destroyed) or diffuse block RAM / SRAM on-die like Groq does, which is astronomically expensive bit for bit and torpedoes yields, compounding the issue. Then comes interconnect.
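To make the "it isn't matmul" point concrete, here's a back-of-the-envelope roofline sketch. The numbers are my own illustrative assumptions (a 70B-parameter model in FP16, roughly H100-class HBM bandwidth), not anything from the comment above — but the shape of the argument holds for any accelerator: during single-batch decoding every generated token has to stream all the weights out of memory, so memory bandwidth, not FLOPs, sets the ceiling.

```python
# Assumed, illustrative numbers:
params = 70e9          # hypothetical 70B-parameter model
bytes_per_param = 2    # FP16 weights
weight_bytes = params * bytes_per_param   # ~140 GB of weights read per token

hbm_bw = 3.35e12       # ~3.35 TB/s, roughly H100-class HBM3 bandwidth

# Bandwidth-bound throughput ceiling for batch-1 decode:
tokens_per_s_ceiling = hbm_bw / weight_bytes

print(f"~{tokens_per_s_ceiling:.0f} tokens/s max, no matter how fast the matmul units are")
```

Under these assumptions the ceiling is ~24 tokens/s, which is why local memory (stacked DRAM or on-die SRAM) dominates the design problem.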
The main point is that Nvidia’s monopoly will not last much longer.