Writing task-specific evals is pretty important, and lots of people are just going off of vibes right now. If this all seems like too much at once and you don't know where to start, we wrote a jargon-free issue on getting started with system evals.
The basic idea for system evals is to define a qualitative trait you want in the LLM responses using a corpus of examples, rather than trying to define it exactly in a prompt. Then, through systematic improvements, you nudge your LLM-driven task closer and closer to those examples, as measured by some metric of closeness. That way, you can be more confident you're not regressing on LLM responses as you try to make improvements. This is standard stuff for data scientists, but this way of working can be a little foreign to web engineers (depending on prior experience). It just takes a little adjustment to get up to speed.
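To make that concrete, here's a minimal sketch of what a system eval loop can look like. The `generate()` function is a hypothetical stand-in for your LLM-driven task, and the closeness metric here is just a standard-library string-similarity ratio as a placeholder; in practice you'd likely swap in embedding similarity or an LLM-as-judge score.

```python
# A minimal sketch of a system eval, under stated assumptions:
# `generate()` stands in for your LLM-driven task, and the closeness
# metric is a simple string-similarity ratio from the standard library.
from difflib import SequenceMatcher

# Corpus of examples that defines the qualitative trait you want:
# (input to the task, reference response that exhibits the trait)
corpus = [
    ("Summarize: the meeting ran long.", "The meeting ran over schedule."),
    ("Summarize: revenue grew 10% YoY.", "Revenue was up 10% year over year."),
]

def generate(prompt: str) -> str:
    """Stand-in for your LLM-driven task; replace with a real model call."""
    return prompt  # echoes the input so the sketch runs end to end

def closeness(response: str, reference: str) -> float:
    """Score how close a response is to a reference example (0.0 to 1.0)."""
    return SequenceMatcher(None, response, reference).ratio()

def run_eval() -> float:
    """Average closeness across the corpus; track this as you iterate."""
    scores = [closeness(generate(prompt), ref) for prompt, ref in corpus]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(f"corpus closeness: {run_eval():.3f}")
```

The point isn't the particular metric; it's that the score gives you a single number to watch as you change prompts, models, or parameters, so improvements in one place don't silently regress responses somewhere else.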