Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview

90 points • by GodelNumbering • today at 12:35 PM • 28 comments

It scored 65.2%, versus Google's official 47.8% and 64.3% for Junie CLI, the existing top closed-source entry.

Since there have been a lot of reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), I would also like to clarify a few things:

1. Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever.

2. The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts).

3. The full TerminalBench run was done using the fully open-source version of the agent; there is no difference between what is on GitHub and what was run.

I was originally going to wait for it to land on the leaderboard, but it has been 8 days and the maintainers unfortunately have not responded (there is a large backlog of pull requests on their HF), so I decided to post anyway.

HF PR: https://huggingface.co/datasets/harborframework/terminal-ben...

It is astounding how much the harness matters, based on this and other experiments I have done.

Comments

mdasen • today at 2:31 PM

It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.

Is there a leaderboard out there comparing harness results using the same models?

GodelNumbering • today at 1:03 PM

Interesting things Dirac does:

1. Uses an optimized version of Hash-Anchored edits for file editing (https://dirac.run/posts/hash-anchors-myers-diff-single-token)

2. Uses the language's AST to decide what to fetch into context, and entirely avoids reading large code files

3. Batches all operations. Does a large number of reads/edits simultaneously (you can see a video demo for deepseek-v4-flash here https://www.reddit.com/r/LocalLLaMA/comments/1suhdki/tested_...)

4. Allows the model to execute code to analyze things on the fly, so the model can simply write a bash/Python/Perl script to accomplish things where appropriate

5. A lot of context curation and opportunistic context updates, i.e. put into context anything that you are certain the model would ask for next
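For readers unfamiliar with hash-anchored edits (point 1) and batched edits (point 3), here is a minimal Python sketch of the general idea. It is an illustration under assumptions, not Dirac's actual implementation: the function names `anchor`, `render`, and `apply_edits`, the 4-character SHA-1 prefix, and the `number:hash|text` display format are all hypothetical choices made for this example.

```python
import hashlib


def anchor(line: str) -> str:
    """Short content hash for one line (the 4-char length is an arbitrary assumption)."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest()[:4]


def render(lines: list[str]) -> list[str]:
    """The view shown to the model: every line prefixed with 'number:hash|'."""
    return [f"{i}:{anchor(text)}|{text}" for i, text in enumerate(lines, start=1)]


def apply_edits(lines: list[str], edits: list[tuple[int, str, str]]) -> list[str]:
    """Apply a batch of (line_no, expected_hash, new_text) edits.

    All anchors are verified against the current content before anything is
    changed. A mismatch means the model's view of the file is stale, so the
    whole batch is rejected instead of being applied to the wrong lines.
    """
    for line_no, expected, _ in edits:
        if not 1 <= line_no <= len(lines):
            raise ValueError(f"line {line_no} is out of range")
        if anchor(lines[line_no - 1]) != expected:
            raise ValueError(f"stale anchor at line {line_no}")
    updated = list(lines)  # leave the caller's list untouched
    for line_no, _, new_text in edits:
        updated[line_no - 1] = new_text
    return updated


# Usage: fix a bug on line 2 by referencing its anchor instead of quoting the old text.
source = ["def add(a, b):", "    return a - b", "", "print(add(1, 2))"]
fixed = apply_edits(source, [(2, anchor(source[1]), "    return a + b")])
```

The point of the anchor is that the model only has to emit a line number and a few hash characters rather than reproduce the original text exactly, and an edit made against an outdated view of the file fails loudly instead of silently corrupting it.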

adyavanapalli • today at 2:03 PM

I haven't tried it, but I'm curious why you decided to implement a whole new harness over just writing extensions in pi. From whatever I've done with pi so far, the extension API is quite extensive. Hash-anchored edits, for example, can definitely be implemented in pi. Anyhow, thank you for showing us your project; I will be checking it out later. Cheers!

bryanhogan • today at 1:32 PM

If I understand correctly, this is a heavily improved Cline fork? Does that mean features such as plan and act mode are also still there?

Mashimo • today at 1:18 PM

Interesting. Would love a comparison to pi.dev (Not Ohmypi)

How does this perform in day-to-day coding tasks, outside of benchmarks?

martinald • today at 1:09 PM

Very interesting! I've often thought static analysis could really help agents (I wrote this last summer: https://martinalderson.com/posts/claude-code-static-analysis...), but despite being hyped for LSPs in Claude Code it turned out to be very underwhelming (for many of the reasons that they can be annoying in a "real" IDE, i.e. static analysis starts firing mid-edit and complaining, and cached analysis gets stuck).

Curious to know if this has been an issue with your AST approach on larger projects?

The hash-based line numbering is very interesting too (though I see far, far fewer editing errors on Opus 4.5+).

I've often thought that even if model progress stopped today, we'd still have _years_ of improvements through harness iteration.

blueTiger33 • today at 1:50 PM

Starred it. Will try it later. One question though, to make it simpler for me: in what tasks does this model shine, and how do you improve the score? I already use some skills to cut down CC costs, like caveman, rtk cli and a few others. Just want to understand.

redrove • today at 2:00 PM

I keep trying to use dirac-cli with codex and it won't work: Error: Codex API error: Codex API request failed: 400.

Any ideas?

nthypes • today at 1:08 PM

Can't OpenCode reach the same result just by developing this as a feature or plug-in, like anchored edits?

neonstatic • today at 2:22 PM

I am a bit confused. What languages does it help with? You mention AST manipulation, so I am assuming it's not universally applicable, e.g. to Rust?

snqb • today at 1:50 PM

How well does it do on frontier models like Opus 4.6?

aetherspawn • today at 1:00 PM

Sorry, I couldn't really figure out if this was a harness, a fine-tuned model, or both. Can we use Qwen with this, for example? Is the performance expected to be better in that case?

nthypes • today at 1:07 PM

No CLI? Only VSCode extension?

tommy29tmar • today at 2:29 PM

[dead]

dk970 • today at 1:31 PM

[dead]

phoebe_builds • today at 1:02 PM

[dead]
