logoalt Hacker News

creddittoday at 4:50 PM1 replyview on HN

Ran some of my internal benchmarks against this and I'm very unimpressed. I don't think this moves them into the OAI v Anthropic v Gemini conversation at all.

Major analytical errors in their response to multiple of my technical questions.


Replies

creddittoday at 5:02 PM

Playing with this some more and it's actively not good. Just basic mathematical errors riddling responses. Did some basic adversarial testing where its responses are analyzed by Gemini and Gemini is finding basic math errors across every relatively (relative to Opus, Gemini or GPT can handle) simple ask I make. Yikes.