Hacker News

cultofmetatron today at 8:05 AM, 21 replies

I seriously don't get all this big hullabaloo about one shot prompting.

By definition, a single prompt won't constitute the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.

I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.

I'd rather see performance in agent loops against human defined objectives where it can be verified to stick to defined guardrails and continue without drift till its objectives are complete.

I'd also like to see it identify bugs and potential performance increases by identifying existing code and suggesting refactors based on context it can pick up about the particular use case you are trying to create.

These are way more valuable metrics than "hey build X"


Replies

ulrikrasmussen today at 11:29 AM

I guess the experiment is interesting to determine if a model can produce something subjectively valued as "good" based on fairly vague and open-ended specifications. The benchmark is not to determine if the output fits the input, but whether the output is internally consistent: it's a game, but does it behave as one would expect that any game behaves? Does it end when you reach the goal, do you die when hitting the spikes, are there weird edge cases in behavior when you move around?

I think however that they should have used the same harness and also repeated the experiment a few times to judge the variance in results.

rdsubhas today at 11:29 AM

IMHO, it's not the oneshotting.

It's the "starting from empty slate" greenfield that's the real problem.

We used to make fun of Engineers who follow a README on a framework, test it on an empty project, and say "this framework is the best for our 10 year running production app". Greenfield mentality is always the solution to all problems and problem to all solutions.

One should still measure oneshotting, it's an important self-measurement metric - but against an established, large codebase.

post-it today at 1:04 PM

The streetlight effect:

> A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is"

All of your suggestions are better but they're hard, so someone casually evaluating an AI isn't going to do them.

hnfong today at 12:54 PM

It's a proxy for what you actually want to measure.

Note that after the model generated a bunch of (intermediary) code, they still have to have it tested and get bugs fixed (via the agent/harness). In this "one shot" you still have agent loops against human defined objectives.

And these toy examples give some insight as to how the model performs. If the test were "here's some code written by $corp, please take these tickets and work on them" it may be a "real" example but nobody would be able to make sense of actually how "hard" it is, or how "well" the model did the job, besides the workers already familiar with the context.

At least everyone knows what a 3D game is.

pu_pe today at 9:10 AM

It's true that no one is trying to one shot anything serious right now, but it's still an important metric. Claude Code and Opus really took off when they improved the harnessing enough that it would self-correct many of its mistakes without needing user input. In fact I think long-term autonomy (in the range of several hours) and self-correcting is going to be where we see most improvements in coming years.

losvedir today at 11:24 AM

One shotting is useful to test but only with a huge prompt (eg, build something according to this spec).

I agree generating millions of tokens from a handful of input tokens doesn't convey anything meaningful to me.

jatora today at 1:14 PM

I also love the term zero-shot in the AI benchmark world. So logical. So intuitive.........

NichoPaolucci today at 10:31 AM

If a model can take a series of increasingly complex instructions and satisfy the requirements without human intervention, we can pretty easily decide how well overall the model does. And, judging better models just means adding more requirements to a task. So, I think it's a useful method (Even if it's not a realistic use case).

Of course, with a software engineer at the helm - the models are going to be able to be guided to produce much better output. (Or worse, depending on the engineer!)

athrowaway3z today at 9:41 AM

I think you're underestimating the elegance of "hey build X". It already captures a lot of what you're interested in.

Additionally, with "Hey build X" nobody is happy with the methodology and people rightfully complain about the set up.

Using your suggestion the methodology would require a lot of presumptions & arguments regarding why you choose it and think it relevant to people.

Either people would not "get" it quickly enough or would disagree with, or not be interested in, the setup because it's not how they use LLMs.

ACCount37 today at 10:05 AM

On one hand, that's sort of true for practical uses - and benchmarks notoriously undercount multi-turn settings.

On another, being able to reliably tackle minor tasks with no handholding is very valuable in itself. Sometimes implementation details are important, but often, the most important thing is to Get It Done.

scwoodal today at 10:04 AM

> I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.

Guardrails/conventions should be enforced in linters, formatters, static analysis tooling; not specs/prompts.

jaapz today at 8:54 AM

When the model produces reasonable results from one prompt, you could assume that it will also return reasonable results through the follow up prompts.

Revanche1367 today at 9:09 AM

The argument is flawed, there is no logical reason to assume a single prompt won’t be sufficient to constitute the complexity of a software project. It may not be practical in many cases but there is too much variability in what is considered a complex software project and in the sufficiency of instruction in a single prompt to make that claim and say it’s “by definition.”

irthomasthomas today at 9:23 AM

Blame Anthropic: they decided to make these types of one-shot examples the primary focus of the Fable 5 release, relegating benchmark scores to the PDF.

halyconWays today at 8:37 AM

"We did multi-shot prompting to try and get these two games into comparable states using these two different models."

"Well obviously you provided better follow-up prompts to the one that came out better."

Also, nothing about human-provided plan files and guardrails precludes the one-shot benchmark test. Heavens, I almost said "real coding," but in "real agentic program creation" you'd obviously be doing multi-turn interaction with the agent, but how can you provide a fair test when the model's output n determines your n+1 response?

miroljub today at 9:50 AM

That's precisely the difference between an engineer and a business guy.

The business guy would say "hey build me this and that" and would get _something_ to show off.

An engineer will have a long conversation with an LLM about the exact requirements, tech stack, tradeoffs. He would understand what is built, how it is built, and refine on the fly until he gets something sensible.

It won't be as fast as "build this", but the result will be much better and more maintainable.

For the engineering workflow, you don't need Fable. Any model better than or equivalent to Sonnet 4.6 would do. Yes, sometimes it will hallucinate, sometimes it'll be wrong, but it's our job as engineers to correct it and have full ownership of the result.

scotty79 today at 11:16 AM

Single prompt performance is interesting because the best agentic results of yesterday turned out to be the best single prompt results of today.

If we stopped developing LLMs, the only reasonable way to benchmark them would be to compare their performance with all the tricks we can build on top of them. Since they are still developing rapidly, any apples-to-apples comparison is worthwhile.

Of course this particular benchmark is not really single prompt but rather "agentic without steering".

LoganDark today at 9:02 AM

The thing with one-shot prompting is that it tests the ability for the model to make good choices on its own, rather than only instruction following.

Instruction following has been down for years, and while there are of course metrics that continue to improve as the frontier advances (for example, the ability to continue following the original instructions even as context grows), you can't really get that much better at performing a list of instructions as written if the instructions are precise enough that there's no wiggle room for interpretation (which seems to be what you are describing).

For example, one of the things that got me the most excited for Fable 5 was its ability to work for over eight hours straight on a single instruction and seemingly faithfully the entire time. That was something I observed personally after trying out the same workflow that runs for maybe two or three hours with Opus and then still needs followups. Fable needed no followups. That's a game changer for me compared to the prior state of the art.

That kind of stuff is going to end up being the most beneficial to people who are touching the edges of their knowledge or even exploring completely new areas. And that type of work is exactly the kind of work that makes agentic coding so powerful, even as much as it gets harder to judge the quality of the work when you lack the skills yourself. It's a good thing that the quality increases across the board, even for skilled practitioners.

For example, even people who know how to write inference engines or how matmul kernels work or how to optimize model architecture can't always predict just the sheer breadth of things agents can try to improve performance, and sometimes you get over some wall and reach a completely different optimum that you just wouldn't have reached in any reasonable amount of time by applying traditional knowledge even if you're an expert in the field.

That kind of stuff is amazing. And that's exactly the kind of stuff that one-shot prompting is testing for. It's kind of like testing for the model's "innovation", as much of an oxymoron as that is.

epolanski today at 8:12 AM

Yet this is how virtually everybody is benchmarking and fine tuning.

Since Opus 4.6 I've seen later Anthropic models being more and more capable on one hand, but also less useful on multi turn open tasks.

It feels like with each model they are more and more prone to go "their own way" and jump into the implementation as soon as they can.

I can't help but blame it on benchmarks and fine tuning around prompt-to-solution work.

alfiedotwtf today at 10:28 AM

I think that’s the point of the Superpowers SKILL