
MattRogish, yesterday at 12:40 AM

We have some broad shapes - it’s a finite set of “things that are interesting to us” and the dataset is bounded. It’s not “Google Image Search”. But it is kinda like “we have a giant pile of PDFs, pictures, etc. and the user wishes to run an arbitrary query on them and extract the information they want.” Ex: “I need to know $something about the data embedded in the corpus that look like Excel data with line charts describing some particular class of metric, that are to the left of gray dogs, and that are about $something_else earlier in the document.”

Gemini has a very specific mode where it has been trained on making boxes normalized to a 1000x1000 grid (https://docs.cloud.google.com/gemini-enterprise-agent-platfo...) and in our experience this “just works” AND is very fast on 3.5 and 3.1 models without needing much thinking (so it is not terrifically expensive).

(BTW A+++, gold star, triple thumbs up, give a bonus to whoever did that magic: it basically made this task tractable for us. When we first found it nobody else had anything like it, and it’s worked so well I haven’t felt any need to look elsewhere.)
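To make the 1000x1000 grid concrete, here is a minimal sketch of mapping one normalized box back to pixel coordinates. It assumes the box comes back as [ymin, xmin, ymax, xmax] on a 0-1000 scale, which is my reading of the Gemini docs; the function name `box_2d_to_pixels` is made up for illustration and is not part of any SDK.

```python
def box_2d_to_pixels(box_2d, width, height):
    """Convert a box on a 0-1000 normalized grid to pixel coordinates.

    Assumption: box_2d is [ymin, xmin, ymax, xmax], each value in 0..1000.
    Returns (left, top, right, bottom) in pixels, rounded down.
    """
    ymin, xmin, ymax, xmax = box_2d
    # Integer arithmetic keeps the result exact and avoids float rounding.
    left = xmin * width // 1000
    top = ymin * height // 1000
    right = xmax * width // 1000
    bottom = ymax * height // 1000
    return (left, top, right, bottom)
```

For a 2000x1000 pixel page, `box_2d_to_pixels([100, 200, 500, 800], 2000, 1000)` gives `(400, 100, 1600, 500)`. Note the output order is x-first, which is what most image libraries expect for a crop rectangle.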

So we say, “Hey Gemini, draw box_2d […] around #{things we are interested in}”, and then it is pretty easy to go - OK, if this is here and that is there, let’s slice the image in this particular way, making sure to overlap by some amount because the boxes are fuzzy. Then we send the chunks to a thing that turns them into JSON, and use something like edge detection to reconstruct the whole from the parts. (Squint and it looks like whole genome shotgun sequencing.)
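The "overlap by some amount" step could look roughly like the sketch below: grow each pixel box by a margin before cropping, and clamp it to the image bounds. The name `pad_box` and the 10% default are illustrative assumptions, not what we actually run, and the JSON extraction and reassembly steps are left out.

```python
def pad_box(box, width, height, overlap_pct=10):
    """Grow a pixel box by overlap_pct percent of its size on every side.

    box is (left, top, right, bottom) in pixels. The result is clamped to
    the image bounds, so neighbouring slices overlap without running off
    the page. overlap_pct is a hypothetical tuning knob.
    """
    left, top, right, bottom = box
    # Margin is proportional to the box size, so big boxes get more slack.
    pad_x = (right - left) * overlap_pct // 100
    pad_y = (bottom - top) * overlap_pct // 100
    return (
        max(0, left - pad_x),
        max(0, top - pad_y),
        min(width, right + pad_x),
        min(height, bottom + pad_y),
    )
```

With the 2000x1000 page from before, `pad_box((400, 100, 1600, 500), 2000, 1000)` gives `(280, 60, 1720, 540)`, which can go straight into an image library's crop call. A margin proportional to box size is just one choice; a fixed pixel margin would work too.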