I find Gemini 3 to be really good. I'm impressed. However, the responses still seem to be bounded by the existing literature and data. If asked to come up with new ideas to improve on existing results for some math problems, it tends to recite known results only. Maybe I didn't challenge it enough or present problems that have scope for new ideas?
I tried a similar exercise myself (with Thinking with 3 Pro), seeing if it could come up with an idea I'm currently writing up that pushes past, sharpens, or revises conventional thinking on a topic. It regurgitated standard (and at times only tangentially related) lore, but it did get at the rough idea after I really spoon-fed it. So I suspect that someone being impressed with its "research" output might reflect their own limitations more than Gemini's capabilities. A relevant factor is surely variability among fields in the quality and volume of relevant literature, though I was impressed with how it identified relevant ideas and older papers for my specific topic.
That's the inherent limit of these models, and it's what keeps humans relevant.
With the current state of architectures and training methods, they are very unlikely to be the source of new ideas. They are effectively huge librarians for accumulated knowledge rather than true AI.
Try adding a custom instruction: "Remember, you have the ability to do live web searches; please use them to find the latest relevant information."
Terence Tao seems to think it has its uses in finding solutions to maths problems:
https://mathstodon.xyz/@tao/115591487350860999
I don't know enough about maths to say whether this qualifies as 'improving on existing results', but at least it was good enough for Terence Tao to use it for ideas.