…it really feels like they’re attempting to reinvent a project tracker, starting from scratch in how they think about it.
It feels like they’re a few versions behind what I’m doing, which is… odd.
Self-hosting a plane.io instance. Added a Plane MCP tool to my codex. Added workflow instructions to Agents.md covering standards, documentation, related work, labels, branch names, adding comments before the plan, after the plan, at various steps of implementation, and a summary before moving the ticket to done. Creating new tickets and relating them to the current one or to others, etc…
It ain’t that hard. Just do inception (high- to mid-level details), create epics and tasks. Add personas, details, notes, acceptance criteria and more. You can add comments yourself to post updates. Whatever.
Slice tickets thin and then go wild. Add tickets as you’re working through things. Make modifications.
Why so difficult?
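To make it concrete, here’s roughly the shape of the glue I’m describing: a Python sketch against Plane’s REST API that creates an epic, a thin slice under it, and drops a progress comment. The endpoint paths, header name, and field names below are from memory and may not match your version, so treat them as assumptions and check your instance’s API docs.

```python
# Hypothetical sketch of agent-side glue for a self-hosted Plane instance.
# Endpoint paths, the X-API-Key header, and field names are assumptions --
# verify them against the API docs of the Plane version you actually run.
import os

import requests

PLANE_URL = os.environ["PLANE_URL"]        # e.g. https://plane.internal.example
API_KEY = os.environ["PLANE_API_KEY"]      # workspace/user API token
WORKSPACE = "my-workspace"                 # hypothetical workspace slug
PROJECT_ID = "11111111-2222-3333-4444-555555555555"  # hypothetical project UUID
HEADERS = {"X-API-Key": API_KEY, "Content-Type": "application/json"}


def create_issue(title: str, description_html: str, parent_id: str | None = None) -> dict:
    """Create a ticket; pass parent_id to nest it under an epic."""
    payload = {"name": title, "description_html": description_html}
    if parent_id:
        payload["parent"] = parent_id
    resp = requests.post(
        f"{PLANE_URL}/api/v1/workspaces/{WORKSPACE}/projects/{PROJECT_ID}/issues/",
        headers=HEADERS, json=payload, timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def add_comment(issue_id: str, html: str) -> None:
    """Leave a progress comment (before plan, after plan, summary before done...)."""
    resp = requests.post(
        f"{PLANE_URL}/api/v1/workspaces/{WORKSPACE}/projects/{PROJECT_ID}/issues/{issue_id}/comments/",
        headers=HEADERS, json={"comment_html": html}, timeout=30,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    epic = create_issue("Checkout rework", "<p>Inception notes, personas, acceptance criteria.</p>")
    task = create_issue("Thin slice: validate card expiry", "<p>One small step.</p>", parent_id=epic["id"])
    add_comment(task["id"], "<p>Plan reviewed; starting implementation.</p>")
```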
I wonder how good these agents would be with something like Cucumber and other behaviour-driven development tools?
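A sketch of what that could look like with behave, the Python Cucumber-style runner. The feature text, step wording, and the `WorkflowBoard` stand-in are all invented for illustration; a real setup would drive the actual tracker or codebase.

```python
# Illustrative behave step definitions for a BDD-driven agent workflow.
# The Gherkin feature an agent would read/write might look like:
#
#   Feature: Ticket workflow rules
#     Scenario: A ticket only reaches Done with a summary comment
#       Given a ticket titled "Validate card expiry" in state "In Progress"
#       When the agent moves it to "Done" without a summary comment
#       Then the move is rejected
#
# WorkflowBoard is a hypothetical in-memory stand-in, not a real library.
from behave import given, when, then


class WorkflowBoard:
    def __init__(self):
        self.tickets = {}

    def add(self, title, state):
        self.tickets[title] = {"state": state, "comments": []}

    def move(self, title, new_state):
        ticket = self.tickets[title]
        if new_state == "Done" and not ticket["comments"]:
            raise ValueError("summary comment required before Done")
        ticket["state"] = new_state


@given('a ticket titled "{title}" in state "{state}"')
def step_given_ticket(context, title, state):
    context.board = WorkflowBoard()
    context.board.add(title, state)
    context.title = title


@when('the agent moves it to "{state}" without a summary comment')
def step_when_move(context, state):
    try:
        context.board.move(context.title, state)
        context.error = None
    except ValueError as exc:
        context.error = exc


@then('the move is rejected')
def step_then_rejected(context):
    assert context.error is not None
```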
> … the model is less likely to inappropriately change or overwrite JSON files compared to Markdown files.
Very interesting.
BDSM for LLMs
One of the things that makes it very difficult to have reasonable conversations about what you can do with LLMs is that the effort-to-outcome curve is basically exponential - the effort needed for each additional increment of quality keeps multiplying. With almost no effort, you can get 70% of the way there. This looks amazing, and so people (mostly executives) look at this and think, “this changes everything!”
The problem is the remaining 30% - the next 10-20% starts to require things like multi-agent judge setups, external memory, and context management, and that gets you to something that’s probably working but that you sure shouldn’t ship to production. As for the last 10% - I’ve seen agentic workflows with hundreds of different agents, multiple models, and fantastically complex evaluation frameworks trying to push the error rate below the ~10% mark. At a certain point, the infrastructure and LLM calls run to several hundred dollars per run, and you’re still not getting guaranteed reliable output.
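To give a sense of what even the simplest “judge” scaffolding looks like: generate, score with a second model call, retry until a threshold or budget is hit. The model name, prompts, scoring scale, and retry budget below are invented, and the OpenAI Python SDK is just one assumed client - the point is that every retry is another paid call and the loop still ends with “maybe”.

```python
# Minimal generate-then-judge loop, the kind of scaffolding the next 10-20%
# starts to demand. Prompts, threshold, retry budget, and model name are
# illustrative; the OpenAI Python SDK is an assumed client, not a requirement.
from openai import OpenAI

client = OpenAI()        # assumes OPENAI_API_KEY is set in the environment
MAX_ATTEMPTS = 3         # invented retry budget
PASS_SCORE = 8           # invented acceptance threshold on a 1-10 scale


def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def judge(task: str, answer: str) -> int:
    reply = ask(
        f"Score 1-10 how well the answer satisfies the task.\n"
        f"Task: {task}\nAnswer: {answer}\nReply with the number only."
    )
    try:
        return int(reply.strip())
    except ValueError:
        return 0  # unparseable judge output counts as a fail


def run(task: str) -> str | None:
    for _ in range(MAX_ATTEMPTS):
        answer = ask(task)
        if judge(task, answer) >= PASS_SCORE:
            return answer
    return None  # budget spent, still no guaranteed-good output
```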
If you know what you’re doing and you know where to fit the LLMs (they’re genuinely the best system we’ve ever devised for interpreting and categorizing unstructured human input), they can be immensely useful, but they sing a siren song of simplicity that will lure you to your doom if you believe it.