> So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer ...

meander_water • today at 7:42 AM • 4 replies

> So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch

Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.

Most agent usage is collaborative, so you need to test things like reliability (when I delegate a task, does it complete it without, for example, making up test results?) and steerability (does it obey my instructions, or does it just do what it thinks is best?).

Replies

jameswhitford • today at 7:46 AM

Hi, I am the author, and I completely agree! I set out to run a vibe test on this one, not a benchmark; the real benchmarks are listed. My test shows what the models can do when both are tasked with a long-running, technically difficult, one-shot task.

I think the test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.

esperent • today at 7:58 AM

On the other hand, I did just leave my pi agent running GPT 5.5 overnight on a clearly defined, long-running task. It's been running for about 10 hours now and it's mostly done. So this kind of use case is also valid.

Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents which are launched from the main session, using a prompt of its choosing. Those could be considered short versions of these fully autonomous tasks.

ritzaco • today at 7:44 AM

Sure, that's why we look at a mix of formal benchmarks, one longer side-by-side analysis, and various other people we trust in order to form an opinion, all covered in the article. It's not intended to be a formal benchmark; there are enough of those.

unliftedq • today at 8:34 AM

Totally agree, a single one-shot prompt can't prove anything.
