Please stop spreading this "AI evals" terminology. "evals" is what providers like OpenAI and Anthropic do with their models. If you wrote a test for a feature that uses an LLM, it's just a test, there's no need to say "evals." Having a separate term only further confuses people who already have no idea what that actually means.
I respectfully disagree. I think there needs to be a common term for the concerns specific to LLM testing, and saying "it's just integration/system tests" doesn't really reach audiences well. Those terms don't capture what's different about it.
Words win when they're used. Just because Agent Skills is merely a pattern for standardization and saving context doesn't mean it wasn't incredibly useful.
Think beyond software developers by trade. Think beyond the people who realized they needed tests, and consider those who thought "the models will just get smarter" and "they told me there are guardrails."