Neat. I wonder if allowing the models to inspect pixels or pixel regions, instead of fully relying on the VLM, would help at all. The spatial reasoning required might be too complex though. In general the VLM seems to be a limiting factor, so I wonder if there's some way to usefully augment it or sidestep its limitations.
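To make the pixel-inspection idea concrete, here's a rough sketch of the kind of tool I mean. It's purely hypothetical: I don't know how the app stores its canvas, so I'm assuming a plain grid of RGB tuples, and `region_summary` is a name I made up, not something from the project's API.

```python
from collections import Counter

def region_summary(canvas, x0, y0, x1, y1):
    """Summarise the pixels in the half-open box [x0, x1) x [y0, y1).

    `canvas` is assumed to be a list of rows, each row a list of (r, g, b) tuples.
    """
    pixels = [canvas[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if not pixels:
        raise ValueError("empty region")
    n = len(pixels)
    # Per-channel integer mean of the region
    mean = tuple(sum(p[c] for p in pixels) // n for c in range(3))
    # Most frequent exact colour and the fraction of the region it covers
    dominant, count = Counter(pixels).most_common(1)[0]
    return {"mean_rgb": mean, "dominant_rgb": dominant, "dominant_share": count / n}

# Toy canvas: 4x4 white with a 2x2 red square in the top-left corner
WHITE, RED = (255, 255, 255), (255, 0, 0)
canvas = [[RED if x < 2 and y < 2 else WHITE for x in range(4)] for y in range(4)]

corner = region_summary(canvas, 0, 0, 2, 2)
whole = region_summary(canvas, 0, 0, 4, 4)
```

The model would get back a few numbers per region rather than a free-text description, which might be easier to act on than whatever the VLM reports. Whether it can do anything useful with that is the open question.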
For example, instead of a pseudo-MS Paint, give them a pseudo-Photoshop with manipulable layers and bounding boxes. They struggle to add an outline to something previously drawn, but that's something that could be done programmatically. The limitations are obviously part of what makes this interesting, but different limitations could be interesting, too. Maybe additional complexity would just result in more uninteresting failures though, I don't know.
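As a rough illustration of the outline case: if each drawn object lived on its own layer, an outline is just the empty pixels that touch the shape. This is a sketch under my own assumptions (a layer stored as a grid of booleans, and an `outline` helper I invented), not how the app actually works.

```python
def outline(mask):
    """Return the set of (x, y) cells just outside the shape in `mask`.

    `mask` is assumed to be a list of rows of booleans, True where the layer
    has paint. A cell is part of the outline if it is empty and has at least
    one painted 4-neighbour (up, down, left, right).
    """
    h, w = len(mask), len(mask[0])
    edge = set()
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                continue  # painted cells are the shape, not its outline
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h and mask[ny][nx]:
                    edge.add((x, y))
                    break
    return edge

# Toy layer: 5x5 grid with a single painted cell in the middle
layer = [[(x, y) == (2, 2) for x in range(5)] for y in range(5)]
ring = outline(layer)
```

The model would only need to say "outline layer N in black" and the tool would do the spatial part. Using 8-neighbours instead of 4 would give a chunkier outline with filled corners.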
I noticed that the feedback/strengths/suggestions outputs are clearly also given the initial image's prompt. It could be useful to additionally have an output that's not given the prompt, so the LLM knows what the VLM sees without bias?
You may enjoy:
* "The last six months in LLMs, illustrated by pelicans on bicycles" https://simonwillison.net/2025/Jun/6/six-months-in-llms/ (https://news.ycombinator.com/item?id=44215352 | 962 points | 11 months ago | 239 comments)
* "Using “underdrawings” for accurate text and numbers" https://samcollins.blog/underdrawings/ (https://news.ycombinator.com/item?id=47977990 | 379 points | 9 days ago | 138 comments)
Good attempt. Compared to diffusion, these paintings look more like they were created by humans.
LLMs can draw (play music, write books), but they imitate, not create.
I've been trying to get some language models to paint one stroke at a time for a few months now. I thought this community would be interested to see the results.
The article runs through my findings, and there's a linked technical rundown of how the app was built. There's also an interactive gallery [0] of my attempts. You can point an agent at the API docs [1], and they might (ymmv) do a painting themselves.
[0] https://www.liamlaverty.com/paint-by-language-model/
[1] https://www.liamlaverty.com/paint-by-language-model/draw/api