A lot of models have also been overly chat-trained, responding with stuff like “Sure, I can help you with that.” That's just unwanted noise if you're trying to use them as a code building block in an application. So you need to force JSON or similar… which I suspect harms accuracy compared to free-form output.
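For what it's worth, the constrained-output setup looks roughly like this. A minimal sketch assuming the OpenAI Python client; the model name and the one-field sentiment "schema" are just placeholders:

```python
# Sketch: constrain the model to JSON so the response can be consumed
# directly by application code, with no chatty preamble to strip.
# The model name and the tiny sentiment "schema" are placeholders.
import json
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {
            "role": "system",
            "content": 'Respond only with JSON of the form {"sentiment": "positive" | "negative" | "neutral"}.',
        },
        {"role": "user", "content": "The checkout flow keeps timing out."},
    ],
    response_format={"type": "json_object"},  # forces syntactically valid JSON
)

data = json.loads(resp.choices[0].message.content)
print(data["sentiment"])
```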
Writing task-specific evals is pretty important, and lots of people are just going off of vibes right now. If this all seems like too much at once and you don't know where to start, we wrote a jargon-free issue on getting started with system evals.
The basic idea of system evals is to define a qualitative trait you want in the LLM responses using a corpus of examples, rather than trying to pin it down exactly in a prompt. Then, through systematic improvements, you nudge your LLM-driven task to adhere closer and closer to those examples, for some metric of closeness. That way, you can be more confident you're not regressing on LLM responses as you make improvements. This is standard stuff for data scientists, but this way of working can be a little foreign to web engineers (depending on prior experience). It just takes a little adjustment to get up to speed.
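As a concrete (if simplified) illustration, the harness can be as small as this. A minimal sketch: the corpus, the call_llm() stub, and the difflib-based closeness metric are all placeholder choices, not a prescription:

```python
# Sketch of a system eval: score LLM outputs against a corpus of reference
# examples using some closeness metric, and track the aggregate so you can
# tell whether a prompt or model change regressed. The corpus, call_llm(),
# and the string-similarity metric are placeholders.
import difflib

# (input, reference output) pairs that exemplify the qualitative trait you want.
CORPUS = [
    ("Summarize: the meeting moved to 3pm.", "The meeting was moved to 3pm."),
    ("Summarize: invoice #42 is overdue.", "Invoice #42 is overdue."),
]

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM call your application actually makes."""
    raise NotImplementedError

def closeness(candidate: str, reference: str) -> float:
    """One possible metric: normalized string similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, candidate, reference).ratio()

def run_eval() -> float:
    """Average closeness over the corpus; re-run after every change."""
    scores = [closeness(call_llm(inp), ref) for inp, ref in CORPUS]
    return sum(scores) / len(scores)
```

In practice the metric is usually richer (embedding similarity, an LLM judge, task-specific checks), but the shape is the same: a fixed corpus, a score per example, and an aggregate number you watch over time.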
This is a fantastic resource. Super detailed, super practical, thanks for putting this up, Eugene! I learned a few things and love the practical engineering and stats angle on these assessments.
Has anyone seen any good eval techniques for the OpenAI structured output api?
The toxicity example was thought-provoking.
I hope it's uncontroversial to say that there's nothing "toxic" about that continuation by itself. (My expectation from that beginning is that it would continue on with a modest-beginnings story of how the father worked hard, etc.) I guess the idea is that it's the leading portion of a toxic output, and if you prevent that beginning, you'll prevent the problematic continuation? At the cost of many possible non-toxic continuations.
I've never seen an actual labeled example before. Is this the form they usually take, or is this one quoted because it's innocuous and therefore uncontroversial to insert into a document about LLM evals?