What do you mean LLMs are blind? All frontier models are multimodal, which means they literally consume images as tokens. They can “see” exactly as well as they can “read”.
Also, GPT-Image-2 is not a diffusion model; it's Transformer-based, like other LLMs.
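To make "consume images as tokens" concrete, here's a minimal NumPy sketch of ViT-style patchification, which is roughly how most multimodal LLMs ingest images. The patch size, embedding dim, and random projection are placeholder assumptions, not any particular model's real weights:

    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.random((224, 224, 3))   # dummy H x W x C image
    P, D = 16, 768                      # patch size / embed dim (assumed)

    # Cut the image into non-overlapping P x P patches, flatten each.
    patches = image.reshape(224 // P, P, 224 // P, P, 3)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)

    # Stand-in for a learned linear projection into the LLM's embedding space.
    W_proj = rng.standard_normal((P * P * 3, D)) * 0.02
    image_tokens = patches @ W_proj

    print(image_tokens.shape)  # (196, 768): 196 "visual tokens" fed to the LLM

The catch, which the replies below get at: each token summarizes an entire 16x16 patch, so pixel-precise structure is already lossy before the model "sees" anything.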
Tokens are not a substitute for a numerical measurement.
Ask an LLM how much time has passed. Watch it hallucinate wildly.
Has anyone noticed that Opus has trouble building ASCII diagrams? It often drops spaces, so the lines end up misaligned.
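For what it's worth, the failure is easy to check mechanically. A contrived Python example (the diagram strings are made up) showing why one dropped space breaks the whole drawing:

    # One missing space in the middle line shifts the right border by
    # one column, which is exactly the misalignment described above.
    good = [
        "+---------+",
        "| bedroom |",    # aligned: right border in the same column
        "+---------+",
    ]
    bad = [
        "+---------+",
        "| bedroom|",     # dropped space: right border drifts left
        "+---------+",
    ]
    for name, lines in (("good", good), ("bad", bad)):
        aligned = len({len(line) for line in lines}) == 1
        print(name, "aligned:", aligned)  # good: True, bad: False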
Claude has been kicking ass at code, but I asked it to "sketch" a second floor with a stairway and bedrooms with large closets, and it made … something that bears no resemblance to anything I asked for.
I guess they do "see", but more like "see an explanation of the image", not "see" as in experience it visually. They're really bad at detail and precision when it comes to images, and they don't understand things like visual hierarchy, affordances, and other fundamental design concepts. Most of them can describe those concepts in words, but they don't seem to actually grasp them when you ask for a UI, even when you mention these things explicitly.
Try 100% vibe-coding with an agent: loosely specify what kind of application you want, and watch the resulting UI and UX come out a complete mess, unless you spell out exactly how they should work in practice.
If they actually had spatial understanding, together with the ability to visually experience images, they'd probably build proper UI/UX from the get-go. But since they can only describe what those things are, you end up with the messes even the current SOTAs produce.