Trust is the hardest part to scale here.
We're building something similar and found that no matter how good the agent loop is, you still need "canonical metrics" that are human-curated. Otherwise non-technical users (marketing, product managers) are playing a guessing game with high-stakes decisions, and they can't verify the SQL themselves.
Our approach:

1. We control the data pipeline and work with a discrete set of data sources where schemas are consistent across customers.
2. We benchmark extensively so the agent uses a verified metric when one exists, falls back to raw SQL when it doesn't, and captures those gaps as "opportunities" for human review.
Over time, most queries hit canonical metrics. The agent becomes less of a SQL generator and more of a smart router from user intent -> verified metric.
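Roughly what that router looks like in practice, as a minimal sketch (all the names here are made up for illustration, not our actual code):

```python
from dataclasses import dataclass

@dataclass
class Answer:
    sql: str
    source: str  # "canonical" (human-verified) or "generated" (agent fallback)

# Human-curated, benchmarked SQL keyed by a normalized intent.
CANONICAL_METRICS = {
    "weekly_active_users": "SELECT COUNT(DISTINCT user_id) FROM events WHERE ...",
}

# Intents with no verified metric, queued as "opportunities" for human review.
GAPS: list[str] = []

def resolve_intent(question: str) -> str:
    """Map a natural-language question to a metric key.
    In practice this is an LLM call; stubbed out here."""
    return question.strip().lower().replace(" ", "_")

def generate_sql(question: str) -> str:
    """Fallback path: let the agent write raw SQL. Also stubbed."""
    return f"-- agent-generated SQL for: {question}"

def route(question: str) -> Answer:
    intent = resolve_intent(question)
    if intent in CANONICAL_METRICS:
        # Verified path: reuse the human-curated SQL, no generation at all.
        return Answer(sql=CANONICAL_METRICS[intent], source="canonical")
    # No verified metric: fall back to generation and record the gap.
    GAPS.append(intent)
    return Answer(sql=generate_sql(question), source="generated")

if __name__ == "__main__":
    print(route("weekly active users").source)  # canonical
    print(route("churn by cohort").source)      # generated
    print(GAPS)                                 # ['churn_by_cohort']
```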
The "Moving fast without breaking trust" section resonates, their eval system with golden SQL is essentially the same insight: you need ground truth to catch drift.
Wrote about the tradeoffs here: https://www.graphed.com/blog/update-2
Yes, I’ve been working on this, and you need a clear semantic layer.
If there are multiple paths (or perceived paths) to an answer, you’ll get two answers. Plus, LLMs like to invent pointless “xyz_index” metrics that are neither standard, clear, nor useful. Yet I see users just go “that sounds right” and run with it.
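Concretely, "clear semantic layer" means one curated registry and a hard rejection of anything outside it, so there is exactly one path to each answer. A minimal sketch (registry contents are hypothetical):

```python
# Single curated registry: one sanctioned definition per metric.
SEMANTIC_LAYER = {
    "revenue": {
        "sql": "SELECT SUM(amount) FROM orders WHERE status = 'paid'",
        "description": "Paid order revenue; the only sanctioned definition.",
    },
}

def lookup_metric(name: str) -> str:
    """One path per answer: either the metric is defined in the layer,
    or the request is rejected outright -- no invented 'xyz_index'
    metrics sneaking through."""
    try:
        return SEMANTIC_LAYER[name]["sql"]
    except KeyError:
        raise ValueError(f"'{name}' is not a defined metric; add it to the layer first")

if __name__ == "__main__":
    print(lookup_metric("revenue"))
    try:
        lookup_metric("engagement_quality_index")  # the kind of metric an LLM makes up
    except ValueError as e:
        print(e)
```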