Hacker News

TeMPOraL · 12/11/2024 · 1 reply · view on HN

As long as you're not asking for a zero-shot solution with a single model run three times in a row, this should be entirely doable, though I imagine ensuring a good result would require a complex pipeline (see the sketch after this list) consisting of:

- An LLM to inflate descriptions in the script into very detailed prompts (equivalent to an artist thinking up how the characters will look and how each scene is organized);

- A step to generate a representative drawing of every character via txt2img - or more likely, multiple ones, with a multimodal LLM rating adherence to the prompt;

- A step to generate a lot of variations of every character in different poses, using e.g. ControlNet or whatever is currently the SOTA solution used by the Stable Diffusion community to create consistent variations of a character;

- A step to bake all those character variations into a LoRA;

- Finally, scenes would be generated by another call to txt2img, with prompts computed in step 1 and the appropriate LoRAs active (this can be handled through the prompt, too).
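To make the shape of this concrete, here's a rough Python sketch of the orchestration. Every helper function in it is a hypothetical placeholder - in practice each would map to a ComfyUI subgraph or a model/API call - so treat it as pseudocode with types, not working code:

    # Rough sketch only: expand_script_with_llm, txt2img, rate_adherence,
    # generate_variations and train_lora are hypothetical placeholders,
    # not real library functions.
    from dataclasses import dataclass

    @dataclass
    class Character:
        name: str
        description: str

    @dataclass
    class Scene:
        prompt: str
        characters: list[str]  # names of characters appearing in the scene

    def make_comic(script: str) -> list[bytes]:
        # Step 1: an LLM inflates the script into detailed scene prompts
        # and character descriptions.
        scenes, characters = expand_script_with_llm(script)

        loras = {}
        for char in characters:
            # Step 2: several txt2img candidates per character; a multimodal
            # LLM scores adherence to the description, keep the best one.
            candidates = [txt2img(char.description) for _ in range(8)]
            ref = max(candidates,
                      key=lambda im: rate_adherence(im, char.description))

            # Step 3: consistent pose/angle variations, e.g. via ControlNet.
            variations = generate_variations(ref)

            # Step 4: bake the variations into a per-character LoRA.
            loras[char.name] = train_lora(variations, name=char.name)

        # Step 5: render each scene with the relevant LoRAs active.
        return [txt2img(scene.prompt,
                        loras=[loras[n] for n in scene.characters])
                for scene in scenes]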

Then iterate on that, e.g. maybe an additional img2img pass to force a comic book style (most likely with a different SD derivative), etc.

Point being, every subproblem of the task has many different solutions already developed, with new ones appearing every month - all that's left to have an "AI artist" capable of solving your challenge is to wire the building blocks up. For that, you need just a trivial bit of Python code using existing libraries (e.g. hooking up to ComfyUI), and guess what, GPT-4 and Claude 3.5 Sonnet are quite good at Python.
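For the "hooking up to ComfyUI" part specifically: ComfyUI runs a local HTTP server (default port 8188) that accepts a workflow graph as JSON on its /prompt endpoint, so queueing a pre-built workflow from Python is about this much code (assuming a workflow.json exported from the editor in API format):

    # Queue a pre-built ComfyUI workflow over the local HTTP API.
    # Assumes a ComfyUI server running on the default 127.0.0.1:8188 and a
    # "workflow.json" exported from the editor in API format.
    import json
    import urllib.request

    def queue_workflow(path: str = "workflow.json") -> str:
        with open(path) as f:
            workflow = json.load(f)
        req = urllib.request.Request(
            "http://127.0.0.1:8188/prompt",
            data=json.dumps({"prompt": workflow}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            # The server returns an id you can later poll via /history.
            return json.load(resp)["prompt_id"]

    print("queued:", queue_workflow())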

EDIT: I asked Claude to generate a "pseudocode" diagram of the solution from our two comments:

http://www.plantuml.com/plantuml/img/dLLDQnin4BthLmpn9JaafOR...

Each of the nodes here would be like 3-5 real ComfyUI nodes in practice.


Replies

staticman21 · 12/11/2024

I appreciate the detailed response. I had a feeling the answer was some variation of "well I could get an AI to draw that but I'd have to hack at it for a few hours...". If a human has to work at it for hours, it's more like using Blender than "having an AI draw it" in my mind.

I suspect that if someone went to the trouble of implementing your solution above, they'd find the end result isn't as good as they'd hoped. In practice you'd probably find one or more steps don't work correctly - for example, maybe today's multimodal LLMs can't evaluate prompt adherence acceptably. If the technology were ready, the evidence would be pretty clear: I'd expect to see some very good, very quickly made comic books shown off by AI enthusiasts on Reddit, rather than the clearly limited, not-very-good comic book experiments that have been demonstrated so far.
