This is very cool.
I have been lamenting for a while that the relationship between memory bandwidth and tokens per second (tps) pretty much held for small models on consumer cards, but not at all on datacenter hardware.
It's great to see that, with proper care in the inference engine implementation, the relationship can be restored.
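
For anyone who wants the relationship spelled out, here is a rough sketch of the usual back-of-the-envelope estimate. It assumes single-stream decoding (batch size 1), that every weight is read from memory once per generated token, and that KV-cache traffic and compute time are negligible. The function names are made up for illustration, and the numbers are round placeholders, not measurements from any real card.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float, n_params: float, bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed, in tokens/s.

    Assumption: each token requires streaming all weights from memory once.
    """
    weight_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes


def bandwidth_utilization(measured_tps: float, bandwidth_gb_s: float,
                          n_params: float, bytes_per_param: float) -> float:
    """Fraction of the bandwidth-bound ceiling that a measured tps reaches."""
    return measured_tps / decode_tps_upper_bound(bandwidth_gb_s, n_params, bytes_per_param)


# Illustrative round numbers only: 1000 GB/s of bandwidth, 7e9 parameters, 2 bytes each.
bound = decode_tps_upper_bound(1000.0, 7e9, 2.0)
print(f"{bound:.1f} tok/s ceiling")  # 71.4 tok/s ceiling
```

When the relationship "works", measured tps sits at a fairly stable fraction of this ceiling across cards; when it breaks, the fraction drops sharply on the hardware with more bandwidth.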