sheepscreek • yesterday at 10:29 PM • 1 reply

It seems their tests rely on Claude alone. It’s not safe to assume that Codex or Gemini will behave the same way as Claude. I use all three and each has its own idiosyncrasies.

Replies

verdverm • yesterday at 11:36 PM

I've done very similar things with my custom agent that uses Gemini and have gotten very similar results. I'm working on the evals to back that claim up.
