It sounds like those workloads are memory bandwidth bound. In my experience with generative models, the compute units end up waiting on VRAM throughput, so throwing more wattage at the cores hits diminishing returns very quickly.
If they were memory bandwidth bound, wouldn't that in itself push the wattage and thermals down comparatively, even on a workload "pegged at 100%"? That's a very clear pattern on CPUs, at least.
I thought so too, but no: these are iterative small-matrix-multiplication kernels on the tensor cores, or pure (generative) compute with a very late reduction and a tiny working set. Nsight Compute says everything stays in L1 or the register file, no spilling, and that I'm compute bound with good ILP. I can't find a way to get more than ~10% more throughput out of the extra 300 W. Hence the question: has anyone done better, how, and how reliable does the hardware stay?
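For what it's worth, one way to frame the trade-off is throughput gained per extra watt across a power-limit sweep (limits set with `nvidia-smi -pl`, throughput measured from your own kernel timings). A minimal sketch, with made-up numbers standing in for real measurements:

```python
def marginal_gain(sweep):
    """Given (watts, relative_throughput) pairs sorted by watts,
    return the throughput gained per extra watt between
    consecutive power-limit settings."""
    out = []
    for (w0, t0), (w1, t1) in zip(sweep, sweep[1:]):
        out.append(((w0, w1), (t1 - t0) / (w1 - w0)))
    return out

# Fabricated sweep: ~10% more throughput for 300 W extra,
# matching the pattern described above.
sweep = [(300, 1.00), (450, 1.06), (600, 1.10)]
for (lo, hi), gain in marginal_gain(sweep):
    print(f"{lo}->{hi} W: {gain * 100:.3f}% throughput per extra watt")
```

If the marginal gain keeps shrinking as you raise the limit, capping the card well below its stock power limit is nearly free, which is the usual argument for undervolting/power-limiting these workloads.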