stephantul • yesterday at 5:32 PM

Hey! Co-author here. The benchmark currently only measures retrieval accuracy.

We’re interested in measuring it end to end, and in optimizing things like the prompt and tools for that setting, but we just haven’t gotten around to it yet.

esafranchik • yesterday at 5:49 PM

Two follow-ups:

1) How do you compare accuracy? By checking whether the answer appears in any of the returned grep/bm25/semble snippets?

2) How do you measure token use without running the agent, prompt, and tools?
