quinncom • today at 5:00 PM

Don’t presume this study has anything to do with programming. They measured an agent’s ability to search long conversations, not code.

> We evaluate on a 116-question representative subset of the LongMemEval benchmark (Wu et al., 2025), which tests an agent’s ability to answer questions over long conversations spanning multiple sessions.