Speedups of 2x-4x are typical when starting from a naive kernel, and we sometimes see gains well over 10x. Every kernel is profiled live on real (serverless) GPUs, so the performance data you get is accurate for the specific architecture.
Before-and-after examples would definitely help, and we’re adding those soon. Thanks for the feedback.