Hacker News

827a yesterday at 8:36 PM · 12 replies

1000% agree. I am increasingly hesitant to believe Anthropic's continual war drum of "build for the capabilities of future models, they'll get better".

We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirements file, create a todo list item to determine if the application meets the requirements outlined in that file." In other words: letting the model manage the high-level control flow.

This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it that it needed to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency to its (Opus 4.6 and GPT 5.4, IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5, not as extensively, but they seem to have the same problems.

We ended up creating a super basic deterministic harness around the model: for each test case, trigger the model to test that case, store the result in an array, write results to a file. This has made the system a billion times more reliable. But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.) because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism at the right place.
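A minimal sketch of that kind of harness, assuming a hypothetical `run_agent_on` callable that invokes the model for one requirements file and returns its verdict (the directory name and result schema here are made up for illustration):

```python
import json
from pathlib import Path

def run_qa_harness(requirements_dir: str, run_agent_on) -> list[dict]:
    """Deterministic outer loop: the harness owns control flow, and the
    model only ever judges one requirements file at a time."""
    results = []
    # sorted() makes the iteration order stable across runs
    for req_file in sorted(Path(requirements_dir).glob("*.md")):
        verdict = run_agent_on(req_file.read_text())
        results.append({"file": req_file.name, "verdict": verdict})
    # Persist after the run so results survive even if a later step dies
    Path("qa_results.json").write_text(json.dumps(results, indent=2))
    return results
```

The point is that no file can be skipped or re-tested on a whim: the loop, not the model, decides what happens next.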


Replies

DrewADesign yesterday at 9:00 PM

I used to assume they pushed people into the prompt-only workflows because you’re paying them for the tokens, and not paying them for the scaffolding you built. However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it. I do think it’s going to increase productivity enough to disastrously affect the developer job market and pay scale, but I just don’t think this particular version of this particular technology is going to actually do what they say it will. If they said they were spending this much money bootstrapping a super useful thingy that can reduce a big chunk of the busy work of a human dev team— what most developers really want, and most executives really don’t— a bunch of investors would make them walk the plank.

I also think having granular, tightly controlled steps is much friendlier to implementing smaller, cheaper, more specialized models rather than using some ginormous behemoth of a model that can automate your tests, or crank out 5 novels of CSI fan fic in a snap.

bob1029 today at 12:03 AM

I saw a major uplift in performance after I combined tools like apply_patch with check_compilation & run_unit_tests. I still call the tool "apply_patch", but it now returns additional information about the build & tests if the patch succeeds. The agent went from ~80% success rate to what seems to be deterministic (so far). I don't bother to describe the compilation and unit testing processes in my prompts anymore. All I need to do is return the results of these things after something triggers them to run as a dependency.
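That pattern — one tool whose response piggybacks the build and test outcomes — could be sketched like this. The three injected callables are hypothetical stand-ins for whatever the harness actually shells out to; the commenter's real tool names are `apply_patch`, `check_compilation`, and `run_unit_tests`:

```python
def apply_patch(patch: str, apply, compile_fn, test_fn) -> dict:
    """Apply a patch, then bundle build and test results into the same
    tool response so the agent never has to ask for them separately."""
    ok, detail = apply(patch)
    if not ok:
        return {"applied": False, "error": detail}
    build = compile_fn()           # e.g. invoke the compiler
    result = {"applied": True, "build": build}
    if build.get("success"):       # only run tests on a clean build
        result["tests"] = test_fn()
    return result
```

Because the feedback arrives as a dependency of the edit itself, the prompt no longer needs to describe when to compile or test.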

I feel like I'm falling out of whatever is popular these days. I've been using prepaid tokens and custom harnesses for a long time now. It just seems to work. I can ignore most of the news. Copilot & friends are currently dead to me for the problems I've expressly targeted. For some codebases it's not even in the same room of performance anymore, despite using the exact same GPT5.4 base model.

julianlam yesterday at 11:23 PM

> This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it that it needed to re-test four previous files, for no reason. It was very frustrating.

Sorry, you thought a prompt was a suitable replacement for a testing suite?

woeirua yesterday at 9:21 PM

I have but one upvote, but yes. The only way to make these systems work reliably is to break the problems down into smaller chunks. Any internal consistency checks are just going to show you that LLMs are way less consistent than you’d expect.

rdedev yesterday at 9:51 PM

I had to create a hypothesis-testing agent that gets a query like "is manufacturing parameter x significantly different this month than last month" and follows a flowchart to run a statistical test and return the answer.

At the time I had access to only 4o, and there was no way to guarantee that the agent would follow the flowchart if I just mentioned it in its prompt. What I ended up doing was wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent.
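A rough sketch of that wrapper loop, where the harness advances through the flowchart and the model only executes the current step (`call_model` is a stand-in for whatever LLM API is in use; the step wording is invented):

```python
def run_flowchart(query: str, steps: list[str], call_model) -> str:
    """Feed the agent one flowchart step at a time so it cannot
    skip, reorder, or ignore the statistical procedure."""
    context = f"Question: {query}\n"
    answer = ""
    for i, step in enumerate(steps, 1):
        prompt = f"{context}\nStep {i}: {step}\nCarry out only this step."
        answer = call_model(prompt)
        # Accumulate each step's result so later steps can build on it
        context += f"\nStep {i} result: {answer}"
    return answer  # the final step's output is the answer
```

The model never sees the whole flowchart as an instruction to follow; it only ever sees "do this one step," which sidesteps the compliance problem entirely.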

mmis1000 yesterday at 9:02 PM

> This started breaking down after ~30 files.

Codex's short context and todo-list system combined somehow help here, though. Because of the frequent compaction, the model is forced to recheck which todo-list items aren't done yet and which workflow skill it has to use. I used to leave it for multiple hours to do a big cleanup, and it finished without obvious issues.

awongh yesterday at 10:23 PM

The other part of the question is exactly when the "build for the capabilities of future models" becomes the present.

Looking at the Mythos benchmarks, it doesn't seem like the models are that close to being truly reliable for agentic tasks.

Is it a year away, or five? That's a big difference in deciding what to build today.

sharperguy yesterday at 9:52 PM

So I wonder: could a more powerful agent harness have the agent write and execute its own deterministic code, which, when executed, spawns sub-agents for each of the subtasks?

So far we've seen agents spawn subagents directly, but that still means leaving the final flow control to the non-deterministic orchestrator model, and so your case is a perfect example of where it would probably fail.
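One speculative shape for this, assuming hypothetical `call_model` and `spawn_subagent` callables: the model's only nondeterministic contribution is emitting a plan, and deterministic code fans the subtasks out.

```python
import json

def plan_then_execute(task: str, call_model, spawn_subagent) -> list:
    """The model emits a plan as a JSON array of subtasks; from then on,
    deterministic code owns the control flow."""
    plan_text = call_model(
        "Break this task into subtasks. Reply with a JSON array of "
        f"strings only.\nTask: {task}"
    )
    subtasks = json.loads(plan_text)  # the single nondeterministic step
    # Fan-out is now ordinary code: every subtask runs exactly once
    return [spawn_subagent(sub) for sub in subtasks]
```

This still leaves planning quality up to the model, but it removes the orchestrator's ability to skip, repeat, or reorder subtasks mid-flight.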

crsn yesterday at 10:14 PM

Our team at Agentforce recently open-sourced our solution to this and we've gotten very valuable feedback -- would love to hear from more of you about it: https://github.com/salesforce/agentscript

sroussey yesterday at 8:40 PM

I’m working on a hybrid system of old-school task graphs and AI agents, letting them instantiate each other. I think others will do that eventually.

Joeri yesterday at 9:10 PM

You could have a skill that is the combination of a minimal markdown file and a set of orchestration scripts that do the deterministic work. The agent does not have to “run everything”, it just needs to know how to launch the right script.

pishpash yesterday at 9:08 PM

Can you not have it write your harness for you, or have that be the first step? You can push in your own determinism where you need it, surely.
