Really appreciate the thoughtful feedback!
1. Agreed, it's important. Fwiw the proxy model doesn't blow this up though: it only incurs a one-time cost when switching models, and we account for that cost when making routing decisions.
2. The agents are model-aware, yes, but they aren't incentivized to optimize too heavily here (in particular, they don't use open-source models even when those would be better). I think that's where this router comes in and brings genuine improvement.
3. Two parts here: (a) continuing to grow our golden dataset over time, and (b) using reward signals from production traffic (on a per-customer basis or, if allowed, across all users).
4. Yes we have these internally, great callout that we should publish! Will do + will link from the repo soon. (Fwiw I think these benchmarks are useful but don't fully capture vibes - you should try it out yourself for that!)
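
On point 1, here's a rough sketch of what "aware of the switch cost when routing" could look like. To be clear, this is illustrative only and not our actual implementation: `Candidate`, `pick_model`, the scores, and the `switch_cost` value are all made-up names and numbers. The idea is just that the one-time cost gets charged only to candidates that differ from the current model, so a switch has to pay for itself over the remaining turns.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    expected_quality: float  # hypothetical per-turn quality score
    cost_per_turn: float     # hypothetical per-turn cost, same units as quality


def pick_model(current: str, candidates: list[Candidate],
               switch_cost: float, remaining_turns: int) -> str:
    """Pick the candidate with the best net value over the remaining turns.

    The one-time switch_cost is subtracted only for candidates other than
    the model currently in use (assumed scoring rule, for illustration).
    """
    def net_value(c: Candidate) -> float:
        penalty = 0.0 if c.name == current else switch_cost
        return remaining_turns * (c.expected_quality - c.cost_per_turn) - penalty

    return max(candidates, key=net_value).name


candidates = [
    Candidate("model-a", expected_quality=0.75, cost_per_turn=0.25),
    Candidate("model-b", expected_quality=1.0, cost_per_turn=0.25),
]

# Short session: the one-time cost outweighs the per-turn gain, so stay put.
short_choice = pick_model("model-a", candidates, switch_cost=2.0, remaining_turns=4)
# Long session: the per-turn gain amortizes the one-time cost, so switch.
long_choice = pick_model("model-a", candidates, switch_cost=2.0, remaining_turns=20)
```

With these toy numbers the router stays on `model-a` for the short session and moves to `model-b` for the long one, which is the behavior I was describing: the switch cost is paid once, so it matters less the longer the session runs.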