How do you all manage regressions with each new model update? A large end-to-end test set of problem solving to see how the models compare?
I use a self-documenting recursive workflow: https://github.com/doubleuuser/rlm-workflow
A mix of evals and vibes.