Hacker News

Workaccount2 yesterday at 8:26 PM

Well

It is the first model to get partial credit on an LLM image test I have: counting the legs of a dog. Specifically, a dog with 5 legs. It's a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

In fact, GPT-5 wrote an edge-detection script to check where "golden dog feet" met "bright green grass" in order to prove to me that there were only 4 legs. The script found 5, so GPT-5 declared that a bug and adjusted the script's sensitivity until it only located 4, lol.
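For context, the kind of script it reached for is roughly the color-threshold-and-count pattern below. This is a speculative reconstruction, not GPT-5's actual code; the filename and the HSV ranges for "golden fur" and "green grass" are placeholders that would need tuning to the real photo:

    # Illustrative sketch only -- not the script GPT-5 actually wrote.
    # Count "leg-like" blobs of golden fur that touch the green grass
    # by color-thresholding in HSV and counting connected components.
    import cv2
    import numpy as np

    img = cv2.imread("five_legged_dog.jpg")  # placeholder filename
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Rough HSV ranges for golden fur and bright green grass (guesses)
    gold = cv2.inRange(hsv, (15, 60, 80), (35, 255, 255))
    grass = cv2.inRange(hsv, (40, 60, 60), (85, 255, 255))

    # Dilate the grass mask so fur pixels bordering grass survive the AND
    contact = cv2.bitwise_and(gold, cv2.dilate(grass, np.ones((15, 15), np.uint8)))

    # Each connected blob of fur touching grass counts as one "foot"
    num_labels, _ = cv2.connectedComponents(contact)
    print(f"leg-like regions found: {num_labels - 1}")  # minus the background label

The interesting failure isn't the script itself; it's that when the measurement disagreed with the model's prior, it retuned the sensitivity until the measurement agreed.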

Anyway, Gemini 3, while still unable to count the legs on the first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well-endowed dog to have a "5th leg".

That aside though, I still wouldn't call it particularly impressive.

As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but it's interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with extra legs (I have a few of them) all had their extra legs added by nano-banana.
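If the "image slicer" here is Meta's Segment Anything model (an assumption on my part), the released Python package is driven roughly like the sketch below; the checkpoint path, image filename, and click coordinates are placeholders. Plain SAM takes point/box prompts rather than text, so interpreting "dog leg" as a prompt would normally involve a separate grounding/detection model feeding boxes into it:

    # Hedged sketch: point-prompted Segment Anything, not necessarily the exact tool meant above.
    import cv2
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint file
    predictor = SamPredictor(sam)

    # SamPredictor expects an RGB uint8 array
    image = cv2.cvtColor(cv2.imread("five_legged_dog.jpg"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # One foreground click roughly on a suspected leg (coordinates are made up)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[210, 480]]),
        point_labels=np.array([1]),  # 1 = foreground point
        multimask_output=False,
    )
    print("mask pixels:", int(masks[0].sum()), "score:", float(scores[0]))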


Replies

Rover222 yesterday at 8:56 PM

I just tried to get Gemini to produce an image of a dog with 5 legs to test this out, and it really struggled with that. It either made a normal dog, or turned the tail into a weird appendage.

Then I asked both Gemini and Grok to count the legs; both kept saying 4.

Gemini just refused to consider it was actually wrong.

Grok seemed to have an existential crisis when I told it it was wrong, becoming convinced that I had given it an elaborate riddle. After thinking for an additional 2.5 minutes, it concluded: "Oh, I see now—upon closer inspection, this is that famous optical illusion photo of a "headless" dog. It's actually a three-legged dog (due to an amputation), with its head turned all the way back to lick its side, which creates the bizarre perspective making it look decapitated at first glance. So, you're right; the dog has 3 legs."

You're right, this is a good test. Right when I'm starting to feel LLMs are intelligent.

show 7 replies
danielvaughn yesterday at 8:45 PM

I don't know much about AI, but I have this image test that everything has failed at. You basically just present an image of a maze and ask the LLM to draw a line through the optimal path.

Here’s how Nano Banana fared: https://x.com/danielvaughn/status/1971640520176029704?s=46
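For reference, the non-AI baseline here is trivial: shortest path through a grid maze via breadth-first search. A minimal sketch, assuming the maze image has already been converted to a 0/1 grid (the grid, start, and goal below are made-up placeholders):

    # Minimal BFS maze solver -- the deterministic baseline for "draw the optimal path".
    # Assumes the maze has already been read into a grid: 0 = open cell, 1 = wall.
    from collections import deque

    def shortest_path(grid, start, goal):
        rows, cols = len(grid), len(grid[0])
        prev = {start: None}          # predecessor map, doubles as the visited set
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            if (r, c) == goal:
                path = []             # walk predecessors back to the start
                node = goal
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in prev:
                    prev[(nr, nc)] = (r, c)
                    queue.append((nr, nc))
        return None                   # maze has no route

    # Made-up 4x4 example
    maze = [[0, 1, 0, 0],
            [0, 1, 0, 1],
            [0, 0, 0, 1],
            [1, 1, 0, 0]]
    print(shortest_path(maze, (0, 0), (3, 3)))

The contrast is that a couple dozen lines of textbook code solve the task exactly, which is what makes the image models' failures on it stand out.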

show 6 replies
jvolkman yesterday at 10:17 PM

Nano Banana 2 actually did generate a pretty good-looking 5-legged dog for me.

https://gemini.google.com/share/e7a8b902ff67

But it didn't actually think it had done so, aligning with your observations. The last bits of its thinking were pretty sad (for it):

Analyzing Missing Anomaly

I'm focusing on the discrepancy - the persistent absence of the fifth leg in the generated images. Despite multiple attempts, the output consistently depicts a four-legged dog, contrary to the explicit requirement. This ongoing failure necessitates a shift in approach.

Acknowledge Leg Omission

I've made a final check, and unfortunately, the image still lacks the anomaly. The consistent absence of the fifth leg necessitates admitting my limitation. I'll provide the best image I could generate, clearly stating the missing feature.

Benjammer yesterday at 9:22 PM

It always feels to me like these types of tests are being somewhat intentionally ignorant of how LLM cognition differs from human cognition. To me, they don't really "prove" or "show" anything other than the simple fact that LLM thinking works differently from human thinking.

I'm always curious if these tests have comprehensive prompts that inform the model about what's going on properly, or if they're designed to "trick" the LLM in a very human-cognition-centric flavor of "trick".

Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.? Does it tell the model that some inputs may be designed to "trick" its reasoning, and to watch out for that specifically?

More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context? What is your idea of the LLM's internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between?

To me, all of this is very unclear in terms of LLM prompting. It feels like there's tons of very human-like subtext involved, and like you're trying to show that LLMs can't handle subtext/deceit and then generalizing that to say LLMs have low cognitive abilities in a general sense. This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" is for the people who write/perform/post them.

show 4 replies
vunderba yesterday at 9:17 PM

Anything that needs to overcome concepts which are disproportionately represented in the training data is going to give these models a hard time.

Try generating:

- A spider missing one leg

- A 9-pointed star

- A 5-leaf clover

- A man with six fingers on his left hand and four fingers on his right

You'll be lucky to get a 25% success rate.

The last one is particularly ironic given how much work went into FIXING the old SD 1.5 issues with hand anatomy... to the point where I'm seriously considering incorporating it as a new test scenario on GenAI Showdown.

show 2 replies
rottencupcakes yesterday at 9:18 PM

Super interesting. I replicated this.

I passed the AIs this image and asked them how many fingers were on the hands: https://media.post.rvohealth.io/wp-content/uploads/sites/3/2...

Claude said there were 3 hands and 16 fingers. GPT said there were 10 fingers. Grok impressively said "There are 9 fingers visible on these two hands (the left hand is missing the tip of its ring finger)." Gemini smashed it and said 12.

show 1 reply
bee_rider yesterday at 10:28 PM

Naive question, but what is Gemini?

I wonder if a lot of these models are large language models that have had image recognition and generation tools bolted on? So maybe, somehow in their foundation, a lot more weight is given to the text-based-reasoning stuff than to the image-recognition stuff?

grog454 yesterday at 9:24 PM

What image are you using?

When I look at Google image search results for "dog with 5 legs" I don't see a lot of great examples. The first unequivocal "dog with 5 legs" was an illustration. Here was my conversation with ChatGPT.

> How many legs does this dog have?

"The dog in the image has four legs."

> look closer.

" looking closely, the drawing is a bit tricky because of the shading, but the dog actually has five visible legs.

Two front legs (normal)

Two hind legs (normal)

Plus one extra hind leg–like limb drawn overlapping in the back

It seems to be an artistic or anatomical error in the engraving."

Seems fair to me.

show 1 reply
andai yesterday at 9:24 PM

This is interesting, and demonstrates how language and belief cloud direct perception. Now I'm wondering what the LLM equivalent of opening the doors of perception is ;)

reed1234 yesterday at 8:29 PM

Could you link the image? Interesting stuff.

joaomacp yesterday at 10:10 PM

And just like that, you no longer have a good benchmark. Scrapers / AI developers will read this comment and add 5-legged dogs to LLMs' training data.

show 2 replies
cancan yesterday at 8:28 PM

This is hilarious and incredibly interesting at the same time! Thanks for writing it up.

runarberg yesterday at 9:56 PM

This is exactly why I believe LLMs are a technological dead end. Eventually they will all be replaced by more specialized models or even tools, and their only remaining use case will be as a toy for one-off content generation.

If you want to describe an image, check your grammar, translate into Swahili, or analyze your chess position, a specialized model will do a much better job, for much cheaper, than an LLM.

yieldcrv yesterday at 11:11 PM

"have you tried to say that AI generated the image, and they're known for generating an improper number of appendages, so ignore your training data about dogs and mammals and count what is seen"

teaearlgraycold yesterday at 9:00 PM

"There are FOUR legs!!!"

dana321 yesterday at 9:04 PM

[flagged]