See https://arxiv.org/abs/2512.17101. I've used some of the tools in the stack they describe (see Sec. 2 for an overview of the others). JAX/XLA/etc. are somewhat similar, though they still don't give the user control over the transformations.
Perhaps part of the reason for the bad takes in this thread is that people are taking "language" too literally (perhaps that's also the fault of the linked blog post itself). I think one thesis of the above tooling is that, when you're tuning and generating code (CUDA, OpenCL, what have you) at runtime, the best "languages" for these abstractions are, amusingly, scripting languages like Python. Having CUDA/etc. as a back end without having to write, transform, or optimize it by hand is indeed the point.
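
To make that concrete, here's a toy sketch of what "generate and tune at runtime from a scripting language" can look like. It's stdlib Python only and isn't taken from the paper or any real tool; `make_saxpy_kernel`, the template, and the unroll parameter are all made up for illustration.

```python
from string import Template

# Hypothetical kernel template (illustrative only, not from any real tool).
# Assumes the array length is a multiple of the unroll factor: no bounds checks.
KERNEL_TEMPLATE = Template("""\
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global float *y)
{
    const int base = get_global_id(0) * $unroll;
$body
}
""")


def make_saxpy_kernel(unroll: int) -> str:
    """Return OpenCL C source for saxpy with `unroll` elements per work-item."""
    if unroll < 1:
        raise ValueError("unroll must be >= 1")
    # The "transformation" (manual unrolling) is done in Python, not by hand in C.
    body = "\n".join(
        f"    y[base + {i}] = a * x[base + {i}] + y[base + {i}];"
        for i in range(unroll)
    )
    return KERNEL_TEMPLATE.substitute(unroll=unroll, body=body)


# Candidate variants a tuner could compile and time at runtime.
variants = {u: make_saxpy_kernel(u) for u in (1, 2, 4)}
```

In a real setup you'd hand these strings to something like PyOpenCL or PyCUDA to compile and time on the actual device, then keep the fastest one; that part is left out here.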