softwaredoug • yesterday at 6:23 PM • 2 replies

It’s just hard to keep them out of the training data. We see this a bit with BrowseComp-Plus and other deep research datasets. That’s not because frontier labs are trying to cheat; it just comes from training on the full web.

You need new datasets perpetually.

Replies

cpard • yesterday at 6:55 PM

That’s true. It also depends heavily on the type of task: not everything is equally represented on the web today, and it remains to be seen whether that will change.

stavros • yesterday at 6:36 PM

Or hidden benchmarks, though it's then harder to get people to trust the results.
