I wonder if it would be possible to improve the benchmark score even further by simply showing Claude the current hardest problems and asking it to improve the prompt without including any specifics from those problems.
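Something like the sketch below is what I have in mind, assuming the Anthropic Python SDK; the model id, function name, and meta-prompt wording are all placeholders, not anything I've actually run against the benchmark.

```python
# Rough sketch: feed Claude the hardest problem statements and ask for a
# revised prompt that stays fully generic. Model id and wording are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def propose_generic_prompt(current_prompt: str, hardest_problems: list[str]) -> str:
    """Ask Claude for an improved prompt without leaking problem specifics."""
    meta_prompt = (
        "Here is the system prompt a smaller model uses:\n\n"
        f"{current_prompt}\n\n"
        "Below are the problems it still fails. Suggest a revised prompt that "
        "would help on problems *like* these, but do NOT reference any detail, "
        "name, value, or phrasing from the problems themselves.\n\n"
        + "\n---\n".join(hardest_problems)
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model id
        max_tokens=2000,
        messages=[{"role": "user", "content": meta_prompt}],
    )
    return response.content[0].text
```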
I think there's a chance we could squeeze out a better benchmark score, although there's a risk of overfitting, which I wanted to avoid.
The simplest test would be to check whether previously “unreachable” tasks start succeeding after obvious prompt tweaks, like reordering instructions or emphasizing key parts.
That said, my methodology intentionally avoided exposing the model to actual tasks. Instead, I focused on the domain as a whole: refining the instructions so a smaller model could understand and act reliably.
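To make that test concrete, here is a hedged sketch, assuming a `run_task(task, prompt) -> bool` harness as a stand-in for whatever the benchmark actually calls. The held-out split is there to catch the overfitting risk mentioned above: if only the shown hard problems improve, that's a warning sign.

```python
# Hypothetical check: compare pass rates for the old and tweaked prompts on the
# tasks Claude was shown versus a held-out set it never saw.
def compare_prompts(shown_tasks, held_out_tasks, run_task, old_prompt, new_prompt):
    def pass_rate(tasks, prompt):
        results = [run_task(t, prompt) for t in tasks]
        return sum(results) / len(results) if results else 0.0

    print(f"shown tasks:    {pass_rate(shown_tasks, old_prompt):.2%} -> "
          f"{pass_rate(shown_tasks, new_prompt):.2%}")
    print(f"held-out tasks: {pass_rate(held_out_tasks, old_prompt):.2%} -> "
          f"{pass_rate(held_out_tasks, new_prompt):.2%}")
```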