How do you measure whether it works better day to day without benchmarks?
Manually labeling answers, maybe? There's a lot of infrastructure built around that, it's been heavily used for two decades, and it's relatively cheap.
That's still benchmarking, of course, but not utilizing any of the well-known / public ones.
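For what it's worth, a minimal sketch of what that manual-labeling loop could look like, assuming you already have (prompt, answer) pairs exported from the model you're evaluating; every name here (label_answers, labels.csv, the pass/fail rubric) is an assumption for illustration, not anything from the thread:

    # Minimal manual-labeling eval sketch: a human rates each answer
    # pass/fail, and the script reports an overall pass rate.
    import csv
    import random
    from collections import Counter

    def label_answers(pairs):
        """Show each (prompt, answer) to a human rater, record pass/fail."""
        labels = []
        for prompt, answer in pairs:
            print(f"\nPROMPT: {prompt}\nANSWER: {answer}")
            verdict = input("Good enough? [y/n] ").strip().lower()
            labels.append((prompt, answer, verdict == "y"))
        return labels

    def pass_rate(labels):
        counts = Counter(ok for _, _, ok in labels)
        return counts[True] / max(1, len(labels))

    if __name__ == "__main__":
        # Hypothetical sample; in practice you'd pull prompts from real
        # day-to-day usage logs and label a random subset.
        pairs = [("What does `git rebase -i` do?", "It opens an interactive rebase ..."),
                 ("Summarize this ticket", "The ticket asks for ...")]
        random.shuffle(pairs)
        labeled = label_answers(pairs)
        print(f"Pass rate: {pass_rate(labeled):.0%}")
        # Persist labels so the same prompt set can be re-run against another model.
        with open("labels.csv", "w", newline="") as f:
            csv.writer(f).writerows(labeled)

Re-running the same prompt set against two models and comparing pass rates is the cheapest version of the private eval people are describing here.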
Internal evals. Big AI certainly has good proprietary training and eval data; it's one reason why their models are better.
Subscriptions.