We need a benchmark that tests a model's ability to do LLM research.

djfergus • today at 1:16 AM