It's too hard to define what "works" even means in this case. Look at the example savings output. A lot of it is kubectl output.
Your suggestion to use coding benchmarks doesn't really capture the whole picture. I haven't seen a benchmark that uses kubectl.
> AFAIK, none of the major players do. That's a sign to me these don't work in general.
It's a lose/lose for major players. If it works well, it will lower their revenue. Also, there's a high risk it'll significantly worsen results for some people, even if it improves results for others.