For how well the agent is working, we're focusing on the user outcome. Raw usage, number of turns, and function calls are useful operationally, but we think of those as observability rather than the core evaluation target. We do show some of these stats in our conversation view, but we don't aggregate them to compare agents. Longer term, we'll look to add more of these features so that we can, for example, compare quality against cost.