I'm not trying to take it literally, but aren't there cost vs. performance tradeoffs? Something like py.toHDL would have (maxSize, maxCost, minThroughput) as free parameters, and those would determine energy usage?
And a GPU is already pretty well optimized for inference, no? Isn't inference mostly a bunch of floating-point multiplies? I don't think HDLs handle that well, either.
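
To make the first question concrete, here's a rough sketch of the tradeoff I have in mind. None of this is a real API: `Design`, `pick_design`, and the candidate numbers are all made up for illustration. The assumption is that a py.toHDL-style tool would treat (maxSize, maxCost, minThroughput) as constraints and then pick whichever feasible design uses the least energy per operation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Design:
    """A hypothetical synthesized design; all units are arbitrary."""
    name: str
    size: float        # area
    cost: float        # money
    throughput: float  # operations per second
    power: float       # watts

    @property
    def energy_per_op(self) -> float:
        # Joules per operation = watts / (operations per second)
        return self.power / self.throughput


def pick_design(candidates: Iterable[Design],
                max_size: float,
                max_cost: float,
                min_throughput: float) -> Optional[Design]:
    """Return the feasible design with the lowest energy per operation."""
    feasible = [
        d for d in candidates
        if d.size <= max_size
        and d.cost <= max_cost
        and d.throughput >= min_throughput
    ]
    # None if no candidate satisfies all three constraints.
    return min(feasible, key=lambda d: d.energy_per_op, default=None)


# Made-up candidates: bigger designs are faster and more efficient per op,
# but cost more and draw more total power.
candidates = [
    Design("small_serial", size=1.0, cost=1.0, throughput=1e6, power=0.5),
    Design("medium_pipelined", size=4.0, cost=3.0, throughput=1e7, power=2.0),
    Design("large_parallel", size=16.0, cost=10.0, throughput=1e8, power=10.0),
]

choice = pick_design(candidates, max_size=5.0, max_cost=5.0, min_throughput=5e6)
print(choice.name if choice else "no feasible design")
```

So the energy number isn't something you set directly; it falls out of whatever the three constraints leave on the table.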