I do some electrical drafting work for construction and throw basic tasks at LLMs.
I gave it a shitty harness and it almost one-shotted laying out outlets in a room based on a shitty PDF. I think if I gave it better control it could do a huge portion of my coworkers' jobs very soon.
These OCR improvements will almost certainly be brought to google books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this from Tesseract. I wonder what the cost would be, both in raw cost to run, and via a paid API, to do that.
Interesting "ScreenSpot Pro" results:
72.7% Gemini 3 Pro
11.4% Gemini 2.5 Pro
49.9% Claude Opus 4.5
3.5% GPT-5.1
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
> Pointing capability: Gemini 3 has the ability to point at specific locations in images by outputting pixel-precise coordinates. Sequences of 2D points can be strung together to perform complex tasks, such as estimating human poses or reflecting trajectories over time.
Does anybody know how to correctly prompt the model for these tasks, or even better, can someone point to docs? The pictures with the pretty markers are appreciated, but that section is a bit vague and has no references.
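Not official docs, but one pattern from Google's earlier spatial-understanding examples (which may or may not carry over to Gemini 3) is to ask for points as JSON with [y, x] coordinates normalized to 0-1000 and rescale to pixels yourself. The model id, prompt wording, and output contract in this sketch are assumptions to verify against the current documentation:

    import json
    from PIL import Image
    from google import genai

    client = genai.Client()  # reads the API key from the environment
    img = Image.open("screenshot.png")

    prompt = (
        "Point to every 'Save' button in this screenshot. Respond only with JSON: "
        '[{"point": [y, x], "label": "..."}], coordinates normalized to 0-1000.'
    )

    # Model id is an assumption; swap in whatever the docs currently list.
    resp = client.models.generate_content(model="gemini-3-pro-preview", contents=[img, prompt])

    # Strip markdown fences if the model wraps its JSON in them.
    raw = resp.text.strip().removeprefix("```json").removesuffix("```")
    points = json.loads(raw)

    w, h = img.size
    for p in points:
        y, x = p["point"]  # normalized [y, x], per the assumed contract
        print(p["label"], "-> pixel", (round(x / 1000 * w), round(y / 1000 * h)))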
In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.
Since I think it's interesting to highlight the jagged intelligence, I have a simple word search puzzle [0] that Nano Banana Pro still struggles to solve correctly. Gemini 3 Pro with Code Execution is able to one-shot the problem and find the positions of each word (this is super impressive! one year ago it wasn't possible), but Nano Banana Pro fails to highlight the words correctly.
Here's the output from two tests I ran:
1. Asking Nano Banana Pro to solve the word search puzzle directly [1].
2. Asking Nano Banana Pro to highlight each word on the grid, with the position of every word included as part of the prompt [2].
The fact that it gets 2 words correct demonstrates meaningful progress, and it seems like we're really close to having a model that can one-shot this problem soon.
There's actually a bit of nuance required to solve this puzzle correctly which an older Gemini model struggled to do without additional nudging. You have to convert the grid or word list to use matching casing (the grid uses uppercase, the word list uses lowercase), and you need to recognize that "soup mix" needs to have the space removed when doing the search.
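That normalization is trivial to spell out in code; here's a minimal sketch of the approach (with a made-up grid and word list, not the actual puzzle from [0]):

    # Illustrative grid and word list, not the actual puzzle.
    GRID = [
        "XSOUPMIXQ",
        "APPLEZKTR",
        "BREADYOGA",
    ]
    WORDS = ["soup mix", "apple", "bread"]

    # All eight directions: horizontal, vertical, and diagonals, forwards and backwards.
    DIRECTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

    def find_word(grid, word):
        """Return ((row, col), (drow, dcol)) of the first match, or None."""
        rows, cols = len(grid), len(grid[0])
        for r in range(rows):
            for c in range(cols):
                for dr, dc in DIRECTIONS:
                    rr, cc = r, c
                    for ch in word:
                        if not (0 <= rr < rows and 0 <= cc < cols) or grid[rr][cc] != ch:
                            break
                        rr, cc = rr + dr, cc + dc
                    else:
                        return (r, c), (dr, dc)
        return None

    for raw in WORDS:
        word = raw.upper().replace(" ", "")  # match the grid's casing, drop the space in "soup mix"
        print(raw, "->", find_word(GRID, word))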
"Gemini 3 Pro represents a generational leap from simple recognition to true visual and spatial reasoning."
Prompt: "wine glass full to the brim"
Image generated: 2/3 full wine glass.
True visual and spatial reasoning denied.
Audio-described YouTube, please? That'd be so amazing! Even if I couldn't play Zelda yet, I could listen to a playthrough with Gemini describing it.
What's new here? I believe this is just Gemini 3, which was released last month (the model ID hasn't changed, AFAICT).
I'm playing with this and wondering whether it's actually a good way to identify the dominant colors and other features of a garment/product from a photo where the item is styled rather than isolated from the model or other garments.
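For what it's worth, the "just ask the model" version of this is only a few lines with the google-genai SDK. The model id, the garment description, and the JSON keys below are illustrative assumptions, and the output is only as trustworthy as the model's answer:

    from PIL import Image
    from google import genai

    client = genai.Client()  # reads the API key from the environment
    photo = Image.open("styled_lookbook_shot.jpg")

    prompt = (
        "Focus only on the knit sweater the model is wearing; ignore the background "
        "and any other garments. Return JSON with keys: dominant_colors (list of hex "
        "codes), secondary_colors, material_guess, pattern."
    )

    # Model id is an assumption; use whatever the current docs list.
    resp = client.models.generate_content(model="gemini-3-pro-preview", contents=[photo, prompt])
    print(resp.text)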
I would be interested in seeing what G3P makes of the Dead Sea Scrolls or similarly old documents.
Interesting. When I asked Gemini 3 Pro to generate an infographic from my personal accounting sheet, it first failed to generate anything except a black background, then it generated something that mixed different languages nonsensically, with obvious typos and irrelevant information grouping. It's certainly a leap forward in OCR, rendering classic OCR useless.
Frankly, it's insane how laughably bad their own examples are under scrutiny. It both distorted the data and made the chart less readable (label placement, segment separation, missing labels, worse contrast). And it combined them into one, so you'll have a harder time comparing them than with the original image! Isn't it amazing that it added a toggle? The post author seems to think it deserves an exclamation point, even.
When will we get Gemini 3 Flash?
I'm really fascinated by the opportunities to analyze videos. The number of tokens a video compresses down to, and what you can reason about across those tokens, is incredible.
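Rough back-of-the-envelope math, assuming the ~300 tokens per second of video that the Gemini API docs have quoted (1 fps frame sampling plus audio); re-check that figure before relying on it:

    TOKENS_PER_SECOND = 300        # assumption taken from the docs, not measured
    CONTEXT_WINDOW = 1_000_000     # the advertised 1M-token context

    for minutes in (10, 45, 90):
        tokens = minutes * 60 * TOKENS_PER_SECOND
        print(f"{minutes:3d} min of video ~ {tokens:,} tokens "
              f"({tokens / CONTEXT_WINDOW:.0%} of a 1M context)")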
Google really is a sleeping giant that has fully woken up. More code reds being issued today, I expect.
Okay maybe this one isn't an exaggeration when they say leap forward
Screen understanding is huge for further automating dev work.
The document paints a super impressive picture, but the core constraint of "network connection to Google required so we can harvest your data" is still a big showstopper for me (and all cloud-based AI tooling, really).
I’d be curious to see how well something like this can be distilled down for isolated acceleration on SBCs or consumer kit, because that’s where the billions to be made reside (factories, remote sites, dangerous or sensitive facilities, etc).
What framework is being used for computer use here?
Yes, but can it play PacMan yet?
So we’re going to use this to make the maid from the Jetsons finally. Right?
I'm realizing how much of a bottleneck vision models are.
I'm just a glorified speedreadin' promptin' QA at this point with Codex.
Once it replaces the QA layer, it's truly over for software dev jobs.
The future would be a software genie where on AI Studio you type: "go make a Counter-Strike 1.6 clone, here is $500, you have two hours".
Edit: saw the ScreenSpot benchmark and holy ** this is an insane jump!!! 11% to 71%, even beating Opus 4.5's 50%... ChatGPT is at 3.5%, and it matches my experience with Codex.
Well
It is the first model to get partial credit on an LLM image test I have, which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.
In fact, GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug and adjusted the script's sensitivity so it only located 4, lol.
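For the curious, a rough reconstruction of that kind of script with OpenCV (not GPT-5's actual code) might look like the sketch below; the HSV colour ranges and the pixel offset are guesses that would need tuning for a real photo:

    import cv2
    import numpy as np

    img = cv2.imread("five_legged_dog.jpg")
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    fur = cv2.inRange(hsv, (10, 60, 80), (35, 255, 255))    # "golden" range (assumed)
    grass = cv2.inRange(hsv, (36, 40, 40), (85, 255, 255))  # "green" range (assumed)

    # A foot is roughly a spot where fur pixels have grass a few pixels below them.
    contact = cv2.bitwise_and(fur, np.roll(grass, -5, axis=0))
    contact = cv2.morphologyEx(contact, cv2.MORPH_CLOSE, np.ones((9, 9), np.uint8))

    n_labels, _ = cv2.connectedComponents(contact)
    print("leg-ground contact regions:", n_labels - 1)  # minus the background label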
Anyway, Gemini 3, while still being unable to count the legs on the first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well-endowed dog to have a "5th leg".
That aside though, I still wouldn't call it particularly impressive.
As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but it's interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.