I'm having an hard time getting my mind to see this.
> Users should re-tune their prompts and harnesses accordingly.
I read this in the press release and my mind thought it meant test harness. Then there was a blog post about long running harnesses with a section about testing which lead me to a little more confusion.
Yes, the word 'harness' is consistently used in the context as a wrapper around the LLM model not as 'test harness'.
Some people also call evaluations "tests". There are unexpected things that come along with new models, like the model in a workflow you'd set up suddenly starts calling a tool and never stops or decides to no longer call a particular tool, so running your existing evaluations to catch regressions like this and potentially updating the prompts is considered "testing" your prompts and harnesses.
I understood this concept with this simple equation: Agent = LLM + harness
This field is chock full of people using terms incorrectly, defining new words for things that already had well known names, overloading terms already in use. E.g. shard vs partition. TUI which already meant "telephony user interface ". "Client" to mean "server" in blockchain.