I think it's important to note that there's nothing forbidding LPU-style determinism from being used in training; Groq just didn't make that choice.
Also, Tenstorrent could be a viable challenger in this space. It seems to me that their NoC and their chips could be mostly deterministic as long as you don't start adding in branches.
Would SRAM make weight updates prohibitive vs DRAM?
You're right, but my understanding is that Groq's LPU architecture makes it inference-only in practice.
For example, Groq's chips have only 230MB of SRAM each vs 80GB on an H100, and training is memory-hungry: you need to hold model weights + gradients + optimizer states + intermediate activations.
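Rough back-of-envelope sketch in Python, assuming mixed-precision Adam (fp16 weights/gradients plus fp32 master weights and optimizer moments, the common ~16 bytes/parameter estimate from the ZeRO paper); the 7B model size is just an illustrative assumption, not anything Groq publishes:

```python
# Back-of-envelope training memory under mixed-precision Adam.
# Activations are excluded, so this is a lower bound.

def training_bytes_per_param():
    fp16_weights = 2    # working copy of weights
    fp16_grads = 2      # gradients
    fp32_master = 4     # fp32 master copy of weights
    fp32_momentum = 4   # Adam first moment
    fp32_variance = 4   # Adam second moment
    return fp16_weights + fp16_grads + fp32_master + fp32_momentum + fp32_variance  # = 16

params = 7e9  # hypothetical 7B-parameter model
total_gb = params * training_bytes_per_param() / 1e9
print(f"~{total_gb:.0f} GB of state before activations")  # ~112 GB

groq_sram_gb = 0.23  # 230MB of SRAM per LPU
print(f"~{total_gb / groq_sram_gb:.0f} Groq chips just to hold that state")  # ~487 chips
```

So even ignoring activations, a modest 7B model would need hundreds of LPUs just for training state, versus two H100s.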