Hacker News

Gemini 3 Pro: the frontier of vision AI

259 points by xnx today at 4:15 PM | 117 comments

Comments

Workaccount2 today at 8:26 PM

Well

It is the first model to get partial credit on an LLM image test I have, which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

In fact, GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug and adjusted the script's sensitivity so it only located 4, lol.

Anyway, Gemini 3, while still unable to count the legs on the first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well-endowed dog to have a "5th leg".

That aside though, I still wouldn't call it particularly impressive.

As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.

knollimar today at 7:58 PM

I do some electrical drafting work for construction and throw basic tasks at LLMs.

I gave it a shitty harness and it almost one-shotted laying out outlets in a room based on a shitty PDF. I think if I gave it better control, it could do a huge portion of my coworkers' jobs very soon.

fngjdflmdflg today at 7:10 PM

These OCR improvements will almost certainly be brought to Google Books, which is great. Long term, it could enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this from Tesseract. I wonder what it would cost to do that, both in raw compute and via a paid API.

[0] https://annas-archive.org/blog/critical-window.html

djoldman today at 7:18 PM

Interesting "ScreenSpot Pro" results:

    72.7% Gemini 3 Pro
    11.4% Gemini 2.5 Pro
    49.9% Claude Opus 4.5
    3.50% GPT-5.1

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

https://arxiv.org/abs/2504.07981

aziis98 today at 10:01 PM

> Pointing capability: Gemini 3 has the ability to point at specific locations in images by outputting pixel-precise coordinates. Sequences of 2D points can be strung together to perform complex tasks, such as estimating human poses or reflecting trajectories over time

Does anybody know how to correctly prompt the model for these tasks or, even better, have a link to some docs? The pictures with the pretty markers are appreciated, but that section is a bit vague and lacks references.

simonw today at 6:45 PM

In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.

TheAceOfHearts today at 8:38 PM

Since I think it's interesting to highlight the jagged intelligence, I have a simple word search puzzle [0] that Nano Banana Pro still struggles to solve correctly. Gemini 3 Pro with Code Execution is able to one-shot the problem and find the positions of each word (this is super impressive! one year ago it wasn't possible), but Nano Banana Pro fails to highlight the words correctly.

Here's the output from two tests I ran:

1. Asking Nano Banana Pro to solve the word search puzzle directly [1].

2. Asking Nano Banana Pro to highlight each word on the grid, with the position of every word included as part of the prompt [2].

The fact that it gets 2 words correct demonstrates meaningful progress, and it seems like we're really close to having a model that can one-shot this problem.

There's actually a bit of nuance required to solve this puzzle correctly which an older Gemini model struggled to do without additional nudging. You have to convert the grid or word list to use matching casing (the grid uses uppercase, the word list uses lowercase), and you need to recognize that "soup mix" needs to have the space removed when doing the search.

[0] https://imgur.com/ekwfHrN

[1] https://imgur.com/1nybezU

[2] https://imgur.com/18mK5i5
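
The normalization described above (matching the casing and removing the space from "soup mix") can be sketched in a few lines. This is only an illustration, not the code Gemini produced: the grid below is a made-up toy example rather than the puzzle in [0], and the names `normalize`, `find_word`, and `GRID` are hypothetical.

```python
from typing import List, Optional, Tuple

Position = Tuple[int, int]

# The eight straight-line directions a word can run in: (row step, column step).
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def normalize(word: str) -> str:
    """Uppercase the word and drop spaces, so "soup mix" becomes "SOUPMIX"."""
    return word.replace(" ", "").upper()


def find_word(grid: List[str], word: str) -> Optional[Tuple[Position, Position]]:
    """Return ((start_row, start_col), (end_row, end_col)) or None if absent.

    Assumes a rectangular grid given as one string per row.
    """
    target = normalize(word)
    rows = [row.upper() for row in grid]  # match the casing of the target
    if not target or not rows:
        return None
    height, width = len(rows), len(rows[0])
    for r in range(height):
        for c in range(width):
            for dr, dc in DIRECTIONS:
                end_r = r + dr * (len(target) - 1)
                end_c = c + dc * (len(target) - 1)
                # Skip directions that would run off the grid.
                if not (0 <= end_r < height and 0 <= end_c < width):
                    continue
                if all(rows[r + dr * i][c + dc * i] == ch for i, ch in enumerate(target)):
                    return (r, c), (end_r, end_c)
    return None


# Toy grid (an assumption for illustration, not the real puzzle).
GRID = [
    "SOUPMIX",
    "AQQQQQQ",
    "LQQQQQQ",
    "TQQQQQQ",
]

if __name__ == "__main__":
    for w in ["soup mix", "salt", "pie"]:
        print(w, find_word(GRID, w))
```

Without the `normalize` step, a lowercase "soup mix" would never match an uppercase grid, which is the kind of nudge the older model needed.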

hodder today at 7:51 PM

"Gemini 3 Pro represents a generational leap from simple recognition to true visual and spatial reasoning."

Prompt: "wine glass full to the brim"

Image generated: 2/3 full wine glass.

True visual and spatial reasoning denied.

devinprater today at 7:59 PM

Audio-described YouTube, please? That'd be so amazing! Even if I couldn't play Zelda yet, I could listen to a playthrough with Gemini describing it.

ed today at 7:48 PM

What’s new here? I believe this is just Gemini 3, which was released last month (the model ID hasn’t changed, AFAICT).

caseyf today at 8:31 PM

I'm playing with this and wondering if it is actually a good way to identify dominant colors and other features of a garment/product from a photo where the item is styled and not isolated from the model or other garments.

bovermyer today at 9:33 PM

I would be interested in seeing what G3P makes of the Dead Sea Scrolls or similarly old documents.

siva7 today at 7:29 PM

Interesting. When I asked Gemini 3 Pro to generate an infographic from my personal accounting sheet, it first failed to generate anything except a black background, then it generated something where it mixed different languages in a nonsensical way, with obvious typos and irrelevant information grouping. It's certainly a leap forward in OCR, rendering classic OCR useless.

ichik today at 9:40 PM

Frankly, it's insane how laughably bad their own examples are under scrutiny. It both distorted the data and made the chart less readable (label placement, segment separation, missing labels, worse contrast). And it combined them into one, so you'll have a harder time comparing them than in the original image! Isn't it amazing that it added a toggle? The post author seems to think it even deserves an exclamation point.

k8sToGo today at 9:06 PM

When will we get Gemini 3 Flash?

pseudosavant today at 8:01 PM

I'm really fascinated by the opportunities to analyze videos. The number of tokens it compresses down to, and what you can reason across those tokens, is incredible.

jonplackett today at 7:22 PM

Google really are a fully woken sleeping giant. More code reds being issued today I expect.

causal today at 7:03 PM

Okay maybe this one isn't an exaggeration when they say leap forward

drivebyhooting today at 8:57 PM

Screen understanding is huge for further automating dev work.

iamjackg today at 7:25 PM

Curious how this will fare when playing Pokemon Red.

stego-tech today at 7:35 PM

The document paints a super impressive picture, but the core constraint of “network connection to Google required so we can harvest your data” is still a big showstopper for me (and for all cloud-based AI tooling, really).

I’d be curious to see how well something like this can be distilled down for isolated acceleration on SBCs or consumer kit, because that’s where the billions to be made reside (factories, remote sites, dangerous or sensitive facilities, etc).

ch2026 today at 6:58 PM

what framework is being utilized for computer use here?

empressplay today at 7:36 PM

Yes, but can it play PacMan yet?

dmarzio today at 8:06 PM

So we’re going to use this to make the maid from the Jetsons finally. Right?

agentifysh today at 7:23 PM

im realizing how much of a bottleneck vision models are

im just a glorified speedreadin' promptin' QA at this point with codex

once it replaces the QA layer it's truly over for software dev jobs

future would be a software genie where on aistudio you type: "go make counterstrike 1.6 clone, here is $500, you have two hours"

edit: saw the ScreenSpot benchmark and holy ** this is an insane jump!!! 11.4% to 72.7%, even beating Opus 4.5's 49.9%... chatgpt is at 3.5% and it matches my experience with codex
