So you would perhaps ask the AI to write a set of unit tests, then to create the implementation, and then ask the AI to evaluate that implementation against the unit tests it wrote, right? But then again, the unit tests it writes this time might be completely different from the previous ones, right?
Or would it help if a different LLM wrote the unit tests than the one writing the implementation? Or should the unit tests perhaps be in an .md file?
I also have a question about using .md files with AI: Why .md, why not .txt?
Not quite unit tests. Evals should be created by humans, as they are measuring the quality of the solution.
Let's take the example of the GitHub PR Slack bot from the blog post. I would expect 2-3 evals out of that.
Starting at the core, the first eval could be that, given a list of Slack messages, the AI correctly identifies the PRs and calls the correct tool to look up the status of each PR. None of this has to be real, and the tool doesn't actually have to be executed, but we can write a test, much like a unit test, that confirms the AI responds correctly in that instance.
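To make that first eval concrete, here is a minimal sketch in Python. Everything named here is an assumption for illustration: the tool name `get_pr_status`, the shape of the tool-call dicts, and the fixture messages are all hypothetical, and `CANNED_TOOL_CALLS` stands in for whatever your model actually returns. The tool is never executed; we only check which calls the model asked for.

```python
import re

# Matches GitHub pull request links such as https://github.com/acme/api/pull/12
PR_URL = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/pull/\d+")

# Hypothetical fixture: nothing here is real Slack data.
MESSAGES = [
    {"ts": "1", "text": "Can someone review https://github.com/acme/api/pull/12 ?"},
    {"ts": "2", "text": "Lunch at noon?"},
    {"ts": "3", "text": "Follow-up: https://github.com/acme/web/pull/7"},
]

# Stand-in for the model's output; in a real eval this comes from the LLM.
CANNED_TOOL_CALLS = [
    {"tool": "get_pr_status", "args": {"url": "https://github.com/acme/api/pull/12"}},
    {"tool": "get_pr_status", "args": {"url": "https://github.com/acme/web/pull/7"}},
]


def expected_pr_urls(messages):
    """Ground truth: every PR link that appears in the fixture messages."""
    return {url for m in messages for url in PR_URL.findall(m["text"])}


def eval_tool_calls(tool_calls, messages, tool_name="get_pr_status"):
    """Compare the requested tool calls against the PRs actually present."""
    expected = expected_pr_urls(messages)
    called = {
        c["args"].get("url") for c in tool_calls if c["tool"] == tool_name
    }
    wrong_tool = [c["tool"] for c in tool_calls if c["tool"] != tool_name]
    missed = expected - called   # PRs the model failed to look up
    extra = called - expected    # lookups for PRs that were never mentioned
    return {
        "missed": missed,
        "extra": extra,
        "wrong_tool": wrong_tool,
        "passed": not missed and not extra and not wrong_tool,
    }


if __name__ == "__main__":
    print(eval_tool_calls(CANNED_TOOL_CALLS, MESSAGES))
```

The ground truth comes from a plain regex over the fixture rather than from the model, so the eval stays deterministic no matter how the prompt changes.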
Next, we can set up another scenario for the AI using effectively mocked history that covers Slack messages with open PRs, Slack messages with merged PRs, and messages with no PR links, and again check: does the AI try to add the correct reaction given our expectations?
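A rough sketch of that second scenario as a table-driven check might look like the following. The emoji names (`eyes`, `white_check_mark`) and the agent signature are assumptions I made up for the example, since the expected reactions depend on your own bot; `stub_agent` is a placeholder for the real LLM-backed call.

```python
from typing import Callable, Optional

# Hypothetical expectations: which reaction we want for each mocked situation.
CASES = [
    {
        "name": "open PR",
        "text": "please review https://github.com/acme/api/pull/12",
        "pr_state": "open",
        "expected": "eyes",
    },
    {
        "name": "merged PR",
        "text": "shipped https://github.com/acme/api/pull/9",
        "pr_state": "merged",
        "expected": "white_check_mark",
    },
    {
        "name": "no PR link",
        "text": "Lunch at noon?",
        "pr_state": None,
        "expected": None,  # the bot should not react at all
    },
]

Agent = Callable[[str, Optional[str]], Optional[str]]


def run_reaction_eval(agent: Agent) -> list:
    """Run every case through the agent and collect readable failures."""
    failures = []
    for case in CASES:
        got = agent(case["text"], case["pr_state"])
        if got != case["expected"]:
            failures.append(
                f"{case['name']}: expected {case['expected']!r}, got {got!r}"
            )
    return failures


def stub_agent(text: str, pr_state: Optional[str]) -> Optional[str]:
    """Placeholder for the real agent; swap in your LLM call here."""
    return {"open": "eyes", "merged": "white_check_mark"}.get(pr_state)


if __name__ == "__main__":
    print(run_reaction_eval(stub_agent) or "all cases passed")
```

Because the cases live in data rather than in the test logic, adding a new situation you care about is a one-entry change.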
These are both deterministic or code-based evals that you could use to iterate on your solutions.
The use case for an LLM-as-a-Judge eval is more nuanced; it is usually there to measure subjective results. Things like: did the LLM make assumptions not present in the context window (hallucinate), or did it respond with something completely out of context? These should be simple yes-or-no questions that would be easy for a human to answer but hard to code up as a deterministic test case.
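As a rough illustration of the yes-or-no shape, here is a sketch of a judge wrapper. The prompt wording and function names are my own assumptions, and `call_judge` is a hypothetical callable you would back with whichever model you use as the judge; no real API is called here.

```python
from typing import Callable

# Hypothetical judge prompt: one narrow question, one-word answer.
JUDGE_PROMPT = """You are grading an assistant's reply.

Context:
{context}

Reply:
{reply}

Question: Does the reply state anything that is not supported by the context?
Answer with exactly one word: YES or NO."""


def parse_verdict(raw: str) -> bool:
    """Map the judge's answer to a bool; reject anything that is not YES/NO."""
    word = raw.strip().rstrip(".").upper()
    if word == "YES":
        return True
    if word == "NO":
        return False
    raise ValueError(f"judge did not answer YES or NO: {raw!r}")


def judge_hallucination(
    context: str, reply: str, call_judge: Callable[[str], str]
) -> bool:
    """Return True if the judge thinks the reply goes beyond the context."""
    prompt = JUDGE_PROMPT.format(context=context, reply=reply)
    return parse_verdict(call_judge(prompt))


if __name__ == "__main__":
    # Fake judge for demonstration; a real one would call an LLM.
    fake_judge = lambda prompt: "NO"
    print(judge_hallucination("PR 12 is open.", "PR 12 is still open.", fake_judge))
```

Raising on an unparseable answer, instead of guessing, keeps a confused judge from silently counting as a pass or a fail.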
Once you have your evals defined, you can begin running them with some regularity, and you're at a point where you can iterate on your prompts with a higher level of confidence than vibes.
Edit: I did want to share that if you can make something deterministic, you probably should. The Slack PR example is something that I'd just write as a simple script that runs on a cron schedule, but it was easy to pull on as an example.