stared, today at 6:46 AM

When dealing with binaries, Gemini 3.1 Pro is in the same tier as Opus 4.6: https://quesma.com/benchmarks/binaryaudit/. Those results are fully end-to-end, without humans in the loop.

For any practical development, you want humans in the loop exactly as much as needed (e.g. to ask the right questions and to keep the work from getting steered off course), but no more.