Hacker News

Using “underdrawings” for accurate text and numbers

209 points by samcollins last Friday at 6:07 PM | 66 comments

Comments

IdiotSavage today at 8:30 AM

> Transform this image into a photographed claymation diorama of assorted artisan chocolates and candies […] viewed from a low-angle

Side note: whenever I read prompts for image generation, I notice very specific details which the model obviously ignored. Here the chocolates / candies in the last two images look anything but artisanal. They look very "sterile" and mass-produced. The viewing angle is also not accurate.

Why do we even bother writing such elaborate prompts, if the model ignores most of it anyway?

danpalmer today at 2:06 AM

I'm glad that we're making progress towards a deeper understanding of what LLMs are inherently good at and what they're inherently bad at (not to say incapable of doing, but stuff that is less likely to work due to fundamental limitations).

There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.

What I'd really like to see is a better-defined taxonomy of work, and studies on which parts work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.

samcollins last Friday at 6:07 PM

I found a simple technique to get reliable text and numbers in AI generated images.

I’m surprised the image models aren’t already doing this, so I wanted to share since I’m finding it so useful.
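A minimal sketch of the idea, assuming the underdrawing is a code-generated SVG: the code, not the model, places the exact text and numbers, and the image model only has to restyle the result via img2img. The img2img call itself is model-specific and omitted here.

```python
# Build an SVG bar-chart underdrawing whose labels and numbers are
# guaranteed correct by construction. An image model would then be asked
# to restyle this image (img2img) rather than invent the text itself.

def make_underdrawing(values, width=400, height=300):
    """values: list of (label, number) pairs; returns an SVG string."""
    bar_w = width // len(values)
    max_v = max(v for _, v in values)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for i, (label, v) in enumerate(values):
        bar_h = int((height - 40) * v / max_v)
        x = i * bar_w + 5
        y = height - 20 - bar_h
        parts.append(f'<rect x="{x}" y="{y}" width="{bar_w - 10}" height="{bar_h}" fill="#ccc"/>')
        parts.append(f'<text x="{x}" y="{height - 5}">{label}: {v}</text>')
    parts.append('</svg>')
    return '\n'.join(parts)

svg = make_underdrawing([("Mon", 12), ("Tue", 30), ("Wed", 7)])
print(svg)
```

The point is that every character the final image needs already exists, correctly, before the image model ever runs.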

smusamashah today at 3:59 AM

This is just img2img, where the first image, with the correct structure, was generated by code.

Geonode today at 8:30 AM

We've been doing this for a long time now, it's similar to using a depth map or a line drawing to control the silhouette.

xigoi today at 6:30 AM

The standard objection: if the LLM is supposedly intelligent, why can’t it figure out on its own that this two-step process would achieve a better result?

elil17 today at 7:24 AM

I wonder whether this could be used to fine-tune image models to provide better outputs. Something like this:

1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing)

2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.

3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.

4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
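Steps 1 and 2 above can be sketched in a few lines; the shape names and corner positions here are arbitrary examples, and rendering the underdrawing plus the image-model calls (steps 3 and 4) are model-specific and omitted.

```python
import random

# Step 1: randomly place numbered shapes as a structured spec.
# Step 2: emit a matching natural-language description.
# The (spec, description) pair is the seed of one training example.

SHAPES = ["square", "circle", "triangle"]
CORNERS = ["top left", "top right", "bottom left", "bottom right"]

def sample_underdrawing_spec(n_shapes, seed=None):
    rng = random.Random(seed)
    return [
        {
            "shape": rng.choice(SHAPES),
            "number": rng.randint(0, 9),
            "corner": rng.choice(CORNERS),
        }
        for _ in range(n_shapes)
    ]

def describe(spec):
    return " ".join(
        f'There is a {s["shape"]} with the number {s["number"]} in the {s["corner"]} corner.'
        for s in spec
    )

spec = sample_underdrawing_spec(3, seed=42)
print(describe(spec))
```

Because the description is derived from the same spec as the underdrawing, the text and "ground truth" image are consistent by construction, which is exactly what the fine-tuning data needs.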

sparuchuri last Friday at 6:58 PM

This hack definitely falls in the “duh, why didn’t I think of that” category of tricks, but glad to now have it next time imagegen comes up short

dllu today at 7:08 AM

I was thinking about doing the opposite for the common task of "SVG of a pelican riding a bike". Obviously, directly spitting out the SVG is gonna be bad. But image gen can produce a really stunning photorealistic image easily. Probably a good way to get an LLM to produce a decent bike-pelican SVG is to generate an image first and then get the model to trace it into an SVG. After all, few human beings can generate SVG works of art by just typing out numbers into Notepad. At the core of it, we still rely on looking at it and thinking about it as an image.

docheinestages today at 8:42 AM

And what happens if the model can't come up with a good enough SVG to begin with?

nottorp today at 7:00 AM

LLMs are like a box of chocolates...

cheekyant today at 7:48 AM

Has anyone built a platform which has image to image pipelines and lets you use prompt to SVG generation from SOTA LLMs?

nine_k today at 6:34 AM

It's normal to first create a plan, then allow agents to write code. But it seems to be surprising for many to first create a draft / outline of a picture, then go for a final render.

BobbyTables2 today at 3:20 AM

How is it that LLMs aren’t good at rendering the sequence of numbers but can reliably put the supplied pieces all in the right order?

choeger today at 4:41 AM

Transformers are great translators. So, yeah, starting with structured output like SVG is probably the best way to start.

It should be fairly trivial to fix any logic errors in the structured output, too.

wg0 today at 5:40 AM

Has anyone had good luck with making consistent game art and assets?

SomaticPirate today at 6:11 AM

inb4 this technique is subsumed into the next MoE model release

LLMs are evolving so fast I wouldn’t be surprised if this technique was not needed in <6 months

globular-toast today at 7:22 AM

Wait, where did it get the "Sweet Path//Trail of treats" thing from in the SVG? It wasn't about sweets at that point. Something missing here, I think.

tracerbulletx today at 1:52 AM

I've been doing charts for slides like this for a while. I noticed HTML viz was super reliable, and I could then style it with a diffusion model. It's very useful for data viz.
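A sketch of that workflow, under the assumption that the "reliable" step is plain HTML generated by code: the numbers in the markup are exact by construction, and only the later styling pass (screenshot plus diffusion model, not shown) is left to the image model.

```python
# Render tabular data as plain HTML where every number is exact.
# The screenshot-and-restyle step depends on your tooling and is omitted.

def html_table(headers, rows):
    head = "".join(f"<th>{h}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

table = html_table(["Quarter", "Revenue"], [["Q1", 120], ["Q2", 95], ["Q3", 180]])
print(table)
```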

Melamune today at 6:24 AM

I wondered why I was losing all passion for creating. These tips and tricks are part of the answer.

foxes today at 7:28 AM

I feel sorry for the recipient.

jeffrallen today at 4:54 AM

I wish the opposite were true: that when I tell Gemini I want "a diagram of X", it immediately breaks out Python and matplotlib, instead of wasting my time with Nano Banana.

nullc today at 4:05 AM

Inpainting/guiding from a sketch is how I've always used diffusion models. I thought everyone did that, or at least everyone who wasn't just trying to get some arbitrary filler material without much care of what the output looked like.

psychoslave today at 7:07 AM

A few months ago I tried to get Le-chat Mistral to output French poetry in alexandrines (12 syllables per line). Disastrous at first. Then, after adding to the specifications that each line also had to be transcribed in IPA with each syllable counted, it went better.

Still emotionally unrelatable, but it definitely produced something that matched the specifications wherever they were explicit and systematically enforced through deterministic means. For now my takeaway is that LLM limitations are such that they can't seize the ineffable, and they are so untrustworthy that they can only be employed under very clear and inescapable constraints, or they will go awry just as surely as water is wet.
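The deterministic gate described above can be sketched as a naive vowel-group counter. This is only a rough proxy for French syllable counting (real scansion handles mute e, elision, and diaeresis, which this ignores); it just illustrates the kind of mechanical check the comment relies on.

```python
import re

# Count vowel groups per line as a crude syllable estimate, then gate
# on exactly 12 for an alexandrine. Not an accurate meter checker --
# only a deterministic constraint an LLM's output can be tested against.

VOWELS = "aeiouyàâéèêëîïôùûü"

def rough_syllable_count(line):
    return len(re.findall(f"[{VOWELS}]+", line.lower()))

def looks_like_alexandrine(line):
    return rough_syllable_count(line) == 12

print(rough_syllable_count("bonjour"))  # two vowel groups: "o", "ou"
```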

gwern today at 2:57 AM

tldr: do a standard img2img workflow where you lay out a skeleton or low-res version, and then turn it into the final high-quality photorealistic version, instead of trying to zero-shot it purely from a text prompt.