Math proofs are really easy to run with this specific harness. Our next experiments are going to be bigger: think full codebase refactors. We're working on applying RLM to improve context window limits so we can keep more of the actual code in RAM.
Any workloads you want to see? The best ones have a clear way to measure whether the output actually succeeded. We're thinking about recreating the C compiler example Anthropic did, but doing it for less than the $20k in tokens they used.
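For a sense of what "measurable success" means here, a minimal sketch: for a refactor workload, one binary signal is just whether the project's test suite still passes afterward. This assumes a Python project with a pytest suite; the repo path and test command are hypothetical placeholders.

```python
import subprocess

def refactor_succeeded(repo_path: str) -> bool:
    """Return True if the refactored codebase still passes its test suite."""
    result = subprocess.run(
        ["pytest", "-q"],  # assumed test runner; swap in the project's own command
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    # Exit code 0 means every test passed, giving a simple pass/fail success signal
    return result.returncode == 0

if __name__ == "__main__":
    print(refactor_succeeded("./my-refactored-repo"))  # hypothetical path
```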