Hardware isn't even close to being out of steam. There are some breathtakingly obvious premature optimizations that we can undo to get at least 99% power reduction for the same amount of compute.
For example, FPGAs spend a lot of area and power routing signals across the chip. Those long lines have a large capacitance, and since dynamic power scales with the capacitance that gets charged and discharged on every transition, they burn a large amount of power. Shuttling parameters to and from RAM costs power in the same way, where the alternative is to load the values into a vast array of LUTs once and leave them there.
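To put rough numbers on the capacitance argument, here is a minimal sketch using the standard CMOS switching-power relation P = a * C * V^2 * f. Every value in it (wire capacitances, supply voltage, clock rate, activity factor) is a made-up assumption for illustration, not a measurement of any real FPGA.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Average switching power in watts: P = activity * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical values, chosen only to show the scaling.
LONG_WIRE_C = 200e-15   # assumed capacitance of a cross-chip route, in farads
SHORT_WIRE_C = 2e-15    # assumed capacitance of a neighbor-to-neighbor hop
VDD = 0.8               # assumed supply voltage, in volts
FREQ = 500e6            # assumed clock frequency, in hertz
ACTIVITY = 0.1          # assumed fraction of cycles in which the net toggles

p_long = dynamic_power(ACTIVITY, LONG_WIRE_C, VDD, FREQ)
p_short = dynamic_power(ACTIVITY, SHORT_WIRE_C, VDD, FREQ)

# With voltage, frequency, and activity held equal, the power ratio
# is simply the capacitance ratio.
ratio = p_long / p_short

print(f"long route:  {p_long * 1e6:.2f} uW per net")
print(f"short route: {p_short * 1e6:.3f} uW per net")
print(f"ratio:       {ratio:.0f}x")
```

Because power is linear in capacitance, the ratio here is whatever capacitance ratio you assume going in. The sketch only shows why shortening the wires matters; it says nothing about how short they can actually be made.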