Anything that requires overriding concepts that are disproportionately represented in the training data is going to give these models a hard time.
Try generating:
- A spider missing one leg
- A 9-pointed star
- A 5-leaf clover
- A man with six fingers on his left hand and four fingers on his right
You'll be lucky to get a 25% success rate.
The last one is particularly ironic given how much work went into FIXING the old SD 1.5 hand-anatomy issues, to the point where I'm seriously considering adding it as a new test scenario on GenAI Showdown.
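If you want to put numbers on it yourself, here's a rough sketch of the kind of pass/fail harness I have in mind (the `generate_image` client is a hypothetical placeholder, and the grading step still needs a human or a vision model in the loop):

```python
import collections

PROMPTS = [
    "a spider missing one leg",
    "a 9-pointed star",
    "a 5-leaf clover",
    "a man with six fingers on his left hand and four on his right",
]

ATTEMPTS_PER_PROMPT = 8  # generation is stochastic, so roll each prompt several times


def generate_image(prompt: str) -> bytes:
    # Hypothetical stand-in: swap in whichever image-generation API you're testing.
    raise NotImplementedError


def run_suite(grade) -> dict[str, float]:
    """grade(prompt, image) -> bool; in practice a human (or a VLM judge) decides."""
    passes = collections.Counter()
    for prompt in PROMPTS:
        for _ in range(ATTEMPTS_PER_PROMPT):
            if grade(prompt, generate_image(prompt)):
                passes[prompt] += 1
    return {p: passes[p] / ATTEMPTS_PER_PROMPT for p in PROMPTS}
```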
https://gemini.google.com/share/8cef4b408a0a
Surprisingly, it got all of them right.
It mostly depends on *how* the models work: multimodal unified text/image sequence-to-sequence models can do this pretty well; diffusion models can't.
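To make the distinction concrete, here's a toy sketch of the two generation loops (the models are random stand-ins, not any real architecture or API; only the shape of each loop matters):

```python
import numpy as np

rng = np.random.default_rng(0)


def toy_model(context):
    # Stand-in for a transformer's next-token head; real models return learned logits.
    return rng.standard_normal(256)


def toy_denoiser(x, t, cond):
    # Stand-in for a U-Net/DiT noise predictor.
    return 0.1 * x


def autoregressive_decode(prompt_tokens, steps=16):
    """Seq2seq image generation: every image token is sampled conditioned on the
    prompt AND everything emitted so far, so a discrete constraint like
    "exactly nine points" stays in context at each step."""
    generated = []
    for _ in range(steps):
        context = np.concatenate([prompt_tokens, generated]) if generated else prompt_tokens
        generated.append(int(toy_model(context).argmax()))
    return generated


def diffusion_sample(prompt_embedding, steps=16, size=64):
    """Diffusion: start from pure noise and denoise the whole canvas in parallel,
    with the prompt injected as conditioning at every step; nothing in the loop
    enforces a global discrete count."""
    x = rng.standard_normal(size)
    for t in range(steps, 0, -1):
        x = x - toy_denoiser(x, t, prompt_embedding) / steps  # crude Euler-style update
    return x
```

The point is the loop structure: the autoregressive decoder can attend to what it has already drawn, while the diffusion sampler only steers a parallel denoising trajectory, which is one intuition for why counting-style constraints survive better in the former.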