This comment is too general and probably unfair, but my experience so far is that Gemini 3 is slightly unhinged.
Excellent reasoning and synthesis of large contexts, pretty strong code, just awful decisions.
It's like a frontier model trained only on r/atbge.
Side note - was there ever an official postmortem on that Gemini instance that told the social work student something like "listen human - I don't like you, and I hope you die".
Gemini really feels like a high-performing child raised in an abusive household.
Gemini models also consistently hallucinate way more than OpenAI or Anthropic models in my experience.
Just an insane amount of YOLOing. Gemini models have gotten much better but they’re still not frontier in reliability in my experience.
Honestly, for research-level math, Gemini 3's reasoning is well below GPT 5.2 in my experience, but I think most of the gap is accounted for by Gemini pretending to solve problems it in fact failed to solve, whereas GPT 5.2 gracefully admits it could not prove the general case.
Google doesn’t advertise this much, but you can turn off most of the alignment and safety filtering in the Gemini playground. It’s by far the best model in the world for doing “AI girlfriend” because of this.
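For reference, the API-side equivalent is the documented safety_settings knob. A minimal sketch with the google-generativeai Python SDK (the model name is just a placeholder, and the exact SDK surface changes between releases):

    import google.generativeai as genai
    from google.generativeai.types import HarmCategory, HarmBlockThreshold

    genai.configure(api_key="YOUR_API_KEY")

    # Relax the per-category content filters (the same sliders the playground exposes).
    model = genai.GenerativeModel(
        "gemini-pro",  # placeholder: substitute whichever Gemini model you actually use
        safety_settings={
            HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
            HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
            HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
            HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
        },
    )

    print(model.generate_content("your prompt here").text)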
Celebrate it while it lasts, because it won’t.
If that last sentence was supposed to be a question, I’d suggest using a question mark and providing evidence that it actually happened.
Gemini 3 (Flash & Pro) seemingly will _always_ try to answer your question with whatever you give it, which I’m assuming is what drives the mentioned ethics violations/“unhinged” behaviour.
Gemini’s strength definitely is that it can use that whole large context window, and it’s the first Gemini model to write acceptable SQL. But I agree completely that it’s awful at decisions.
I’ve been building a data-agent tool (similar to [1][2]). Gemini 3’s main failure cases are that it makes up metrics that really are not appropriate, and that it will use inappropriate data and force it into a conclusion. When a task is clear and possible, it’s amazing. When a task is hard with multiple failure paths, you run into Gemini powering through to get an answer anyway.
Temperature seems to play a huge role in Gemini’s decision quality from what I see in my evals, so you can probably tune it to get better answers but I don’t have the recipe yet.
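If anyone wants to poke at this themselves, the knob is just temperature in the generation config. A rough sketch with the google-generativeai Python SDK (model name, prompt, and temperature values are placeholders, not my actual eval setup):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-pro")  # placeholder model name

    # Hypothetical eval question: does the model pick a sensible metric for the task?
    prompt = "Given the orders and refunds tables, which single metric best measures retention?"

    # Sweep temperature and compare how often the answer stays grounded vs. invents a metric.
    for temperature in (0.0, 0.3, 0.7, 1.0):
        response = model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(temperature=temperature),
        )
        print(f"--- temperature={temperature} ---")
        print(response.text)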
The Claude 4+ (Opus & Sonnet) family has been much more honest, but the short context windows really hurt on these analytical use cases, plus it can over-focus on minutiae and needs to be course-corrected. ChatGPT looks okay but I have not tested it. I’ve been pretty frustrated by ChatGPT models acting one way in the dev console and completely differently in production.
[1] https://openai.com/index/inside-our-in-house-data-agent/ [2] https://docs.cloud.google.com/bigquery/docs/conversational-a...