Hacker News

Flux159 · yesterday at 7:48 PM

This looks useful for people not using Claude Code, but I do think the desktop example in the video could be a bit misleading (particularly for non-developers) - Claude is definitely not taking screenshots of that desktop and organizing from them; it's using normal file-management CLI tools. The reason seems obvious - it's much easier to read file names, types, etc. via an "ls" than to try to infer them from an image.

But it also gets at one of Claude's (Opus 4.5) current weaknesses - image understanding. Claude really isn't able to understand the details of images the way people currently can; this is explained well in an analysis of Claude Plays Pokemon: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i.... I think over the next few years we'll probably see all the major LLM companies work on resolving these weaknesses, and then LLMs using UIs will work significantly better (and eventually get to proper video-stream understanding as well - not "take a screenshot every 500ms" and call that video understanding).


Replies

oracleclyde · yesterday at 9:05 PM

Maybe at one time, but it absolutely understands images now. In VS Code Copilot, I'm working on a Python app that generates mesh files which are imported into a Blender project. I can take a screenshot of what the mesh looks like and ask Claude Code questions about the object in the context of the Blender file. It even built a test script that generates the mesh, imports it into the Blender project, and renders a screenshot, then built me a VS Code task to automate the entire workflow and compare the render to a mock image. I found its understanding of the images almost spooky.
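The final step of a pipeline like that - comparing the rendered screenshot against a mock - can be sketched in a few lines. This is a guess at how such a check might work, not the commenter's actual script: a mean absolute per-channel difference over pixel data (here plain nested tuples, to stay self-contained; a real task would load the two images from the headless Blender render and the mock file), with a made-up tolerance.

```python
def mean_abs_diff(render, mock):
    """Mean absolute per-channel difference between two images,
    given as equal-sized rows of (r, g, b) pixels in 0-255."""
    total, count = 0, 0
    for row_a, row_b in zip(render, mock):
        for pa, pb in zip(row_a, row_b):
            total += sum(abs(a - b) for a, b in zip(pa, pb))
            count += 3
    return total / count

def matches_mock(render, mock, tolerance=8.0):
    """Pass/fail check an automated task could run after rendering.
    The tolerance is arbitrary; it absorbs minor anti-aliasing noise."""
    return mean_abs_diff(render, mock) <= tolerance
```

Identical images score 0.0, so `matches_mock(img, img)` always passes; small render-to-render noise stays under the tolerance while a wrong mesh blows past it.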

ElatedOwl · yesterday at 8:00 PM

I keep seeing “Claude image understanding is poor” being repeated, but I’ve experienced the opposite.

I was running some sentiment-analysis experiments; describe the subject and the subject's emotional state, that kind of thing. It picked up on a lot of little details: the brand name of my guitar amplifier in the background, what my t-shirt said (and that I must enjoy craft beer and/or running - it was a craft-beer 5k shirt), and my movement through multiple frames. This was video sliced into a frame every 500ms; it noticed me flexing, giving the finger, appearing happy, angry, etc. I was really surprised by how much it picked up on, and how well it connected those dots.

EMM_386 · yesterday at 8:28 PM

> Claude is definitely not taking screenshots of that desktop & organizing, it's using normal file management cli tools

Are you sure about that?

Try "claude --chrome" with the CLI tool and watch what it does in the web browser.

It takes screenshots all the time to feed back into the multimodal vision and help it navigate.

It can look at the HTML or the JavaScript, but Claude seems to find it "easier" to take a screenshot to see exactly what's on the screen than to parse the DOM.

So I don't know how Cowork does this, but there is no reason it couldn't be doing the same thing.

minimaxir · yesterday at 8:12 PM

Claude Opus 4.5 can understand images: one thing I've done frequently in Claude Code, with great success, is just showing it an image of weird visual behavior (drag-and-drop into CC), and it finds the bug near-immediately.

The issue is that Claude Code won't Read images by default as part of its flow: you have to prompt it to do so very explicitly. I suspect a Skill may be more useful here.
