If you want the fastest open source implementation on Blackwell and AMD MI355, check out Modular's MAX nightly. You can pip install it super fast, check it out here: https://www.modular.com/blog/day-zero-launch-fastest-perform...
-Chris Lattner (yes, affiliated with Modular :-)
Faster than TensorRT-LLM on Blackwell? Or do you not consider TensorRT-LLM open source because some dependencies are closed source?