Curious what kinds of evals you focus on?
We're finding investigation work to be similar-but-different to coding. Probably the closest area to ours with a bigger evals community is AI SRE tasks.
Agreed wrt all these things being contextual. The LLM needs to decide whether to trigger tools like self-planning and todo lists, and, as the talk gives examples of, which kinds of strategies to use with them.
My take is that for SWE-bench-style problems, todo lists don't help, except for enabling more parallelism.