languid-photic • today at 6:50 AM • 3 replies

Naively tested a set of agents on this task.

Each ran the same spec headlessly in their native harness (one shot).

Results:

    Agent                        Cycles     Time
    ─────────────────────────────────────────────
    gpt-5-2                      2,124      16m
    claude-opus-4-5-20251101     4,973      1h 2m
    gpt-5-1-codex-max-xhigh      5,402      34m
    gpt-5-codex                  5,486      7m
    gpt-5-1-codex                12,453     8m
    gpt-5-2-codex                12,905     6m
    gpt-5-1-codex-mini           17,480     7m
    claude-sonnet-4-5-20250929   21,054     10m
    claude-haiku-4-5-20251001    147,734    9m
    gemini-3-pro-preview         147,734    3m
    gpt-5-2-codex-xhigh          147,734    25m
    gpt-5-2-xhigh                147,734    34m

Clearly none beat Anthropic's target, but gpt-5-2 did slightly better, and in much less time, than "Claude Opus 4 after many hours in the test-time compute harness".
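For anyone who wants ratios rather than raw counts, here is a rough sketch that ranks the agents and expresses each result relative to the largest cycle count in the table (147,734). The cycle numbers are copied from the table above; the `speedups` helper name is made up for illustration, and treating the largest count as the reference point is my own choice, not something the spec defines.

```python
# Cycle counts copied verbatim from the results table.
results = {
    "gpt-5-2": 2124,
    "claude-opus-4-5-20251101": 4973,
    "gpt-5-1-codex-max-xhigh": 5402,
    "gpt-5-codex": 5486,
    "gpt-5-1-codex": 12453,
    "gpt-5-2-codex": 12905,
    "gpt-5-1-codex-mini": 17480,
    "claude-sonnet-4-5-20250929": 21054,
    "claude-haiku-4-5-20251001": 147734,
    "gemini-3-pro-preview": 147734,
    "gpt-5-2-codex-xhigh": 147734,
    "gpt-5-2-xhigh": 147734,
}


def speedups(cycles_by_agent):
    """Return (agent, cycles, ratio) tuples sorted by fewest cycles.

    ratio is the largest cycle count in the input divided by the
    agent's own count, so the worst result gets a ratio of 1.0.
    """
    worst = max(cycles_by_agent.values())
    rows = [(agent, cycles, worst / cycles)
            for agent, cycles in cycles_by_agent.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for agent, cycles, ratio in speedups(results):
        print(f"{agent:<28} {cycles:>8,} {ratio:>7.1f}x")
```

The ratios say nothing about wall-clock time, so they should be read alongside the Time column rather than instead of it.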

Replies

ponyous • today at 7:30 AM

Very interesting, thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it finished, it seems like there is a lot more potential.

forgotpwd16 • today at 7:34 AM

Could you make a repo with solutions given by each model inside a dir/branch for comparison?

giancarlostoro • today at 7:32 AM

I do wonder how Grok would compare, specifically their Grok Code Fast model.
