But we are still reliant on the LLM correctly interpreting the choice to pick the right skill. So "known to work" should be understood in the very limited sense of "this sub-function will do what it was designed to do reliably" rather than "if the user asks to use this sub-function, it will do what it was designed to do reliably".
Skills feel like a non-feature to me. It feels more valuable to connect a user to the actual tool and let them familiarize themselves with it (so they don't need the LLM to find it in the future) than to have the tool embedded in the LLM platform. I will carve out a very big exception for accessibility here - I love my home device being an egg timer - it's a wonderful egg timer (when it doesn't randomly play music), and while I could buy an egg timer, having a hands-free one is genuinely valuable to me while cooking. So I believe there is real value in making these features accessible through the LLM in media where the feature would normally be difficult to use.
Choosing the right tool -- there are benchmarks that track the accuracy of this.
"Known to work" -- if the skill contains hardcoded code, it will work 100% of the time - that's the point of Skills. If it's just Markdown, then yes, there will be some probability of failure, and that will keep improving.
This is no different to an MCP, where you rely on the model to use the metadata provided to pick the right tool, and understand how to use it.
Like with MCP, you can provide a deterministic, known-good piece of code to carry out the operation once the LLM decides to use it.
But a skill can evolve from pure Markdown, through inlined shell commands, all the way up to a large application. And if you let it, with Skills the LLM can also inspect the tool and modify it if that would help you.
All the Skills I use now have evolved bit by bit as I've run into new use-cases and told Claude Code to update the script the skill references, or the SKILL.md itself. I can evolve the tooling while I'm using it.
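For anyone who hasn't looked at the format: a skill is basically a folder containing a SKILL.md (YAML frontmatter with a name and description the model uses to decide when to invoke it, followed by Markdown instructions) plus any scripts it references. A minimal sketch - the skill name, script path, and instructions here are illustrative, not from any official example:

```markdown
---
name: csv-report
description: Summarize a CSV file into a short report. Use when the user asks for a quick breakdown of tabular data.
---

# CSV Report

To summarize a CSV, run the bundled script rather than parsing the file yourself:

    python scripts/summarize.py <path-to-csv>

The script prints row counts and per-column statistics. Include its output verbatim in your report.
```

The Markdown body is what the model reads once it picks the skill; the referenced script is the deterministic, "known to work" part discussed upthread.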