The [demo PDF](https://github.com/Tongyi-MAI/Z-Image/blob/main/assets/Z-Ima...) has ~50 photos of attractive young women sitting/standing alone, and exactly two photos featuring young attractive men on their own.
It's incredibly clear who the devs assume the target market is.
We've come a long way with these image models, and the things you can do with a paltry 6B parameters are super impressive. The community has adopted this model wholesale and left Flux(2) by the wayside. It helps that Z-Image isn't censored, whereas BFL (makers of Flux 2) dedicated something like a fifth of their press release to talking about how "safe" (read: censored and lobotomized) their model is.
My issue with this model is that it keeps producing Chinese people and Chinese text. I have to go out of my way to specify what ethnicity the subject should be.
If I say “A man”, it’s fine. A black man, no problem. It’s when I add context and instructions that it just seems to default to a Chinese man. Which is fine, but I’d like to see it trained on a wider variety of people so it produces more diverse images. For non-people it’s amazingly good.
Z-Image seems to be the first successor to Stable Diffusion 1.5 that delivers better quality, capability, and extensibility across the board in an open model that can feasibly run locally. Excitement is high and an ecosystem is forming fast.
I have been testing this on my Framework Desktop. ComfyUI generally causes an amdgpu kernel fault after about 40 steps (across multiple prompts), so I spent a few hours building a workaround here: https://github.com/comfyanonymous/ComfyUI/pull/11143
Overall it's fun and impressive. Decent results using LoRA. You can achieve good-looking results with as few as 8 inference steps, which takes 15-20 seconds on a Strix Halo. I also created a llama.cpp inference custom node for prompt enhancement, which has been helping with overall output quality.
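The enhancement step is basically just asking a local llama.cpp server to rewrite the prompt before generation. This is only a sketch of the idea, not the actual custom node; the port, endpoint wiring, and system prompt wording are my assumptions (it assumes llama-server is running with its OpenAI-compatible API on the default port):

```python
import requests

# Ask a local llama.cpp server (llama-server, OpenAI-compatible API) to expand
# a terse prompt into a more detailed text-to-image prompt before sending it
# to Z-Image. URL and prompt wording are assumptions, not the real node's code.
LLAMA_URL = "http://localhost:8080/v1/chat/completions"

def enhance_prompt(prompt: str) -> str:
    resp = requests.post(
        LLAMA_URL,
        json={
            "messages": [
                {"role": "system",
                 "content": ("Rewrite the user's text-to-image prompt with concrete details "
                             "about subject, lighting, lens, and composition. "
                             "Reply with the rewritten prompt only.")},
                {"role": "user", "content": prompt},
            ],
            "temperature": 0.7,
            "max_tokens": 200,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

print(enhance_prompt("a lighthouse at dusk"))
```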
It's amazing how much knowledge about the world fits into 16 GiB of the distilled model.
We have vLLM for running text LLMs in production. What is the equivalent for this model?
As an AI outsider with a recent 24GB MacBook, can I follow the quick start[1] steps from the repo and expect decent results? How much time would it take to generate a single medium-quality image?
[1]: https://github.com/Tongyi-MAI/Z-Image?tab=readme-ov-file#-qu...
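For reference, the quick start is roughly this shape when run through diffusers. This is a sketch only, not verified on Apple Silicon; the Hugging Face model ID, the MPS device, the dtype, and the guidance setting are assumptions, so check the linked README for the real steps:

```python
import torch
from diffusers import DiffusionPipeline

# Rough shape of the repo's quick start, adapted for Apple Silicon.
# "Tongyi-MAI/Z-Image-Turbo" is an assumed model ID; see the README for the real one.
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
)
pipe.to("mps")  # PyTorch's Apple Silicon GPU backend

image = pipe(
    prompt="a lighthouse on a rocky coast at dusk, photorealistic",
    num_inference_steps=8,   # the Turbo variant reportedly works with few steps
    guidance_scale=1.0,      # distilled models usually run with little or no CFG (assumption)
).images[0]
image.save("out.png")
```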
Just want to learn: who actually needs or buys generated images?
Very good. Not always perfect with text or with following the prompt exactly, but at 6B... impressive.
Has anyone tested it on a 5090? I saw some 30xx reports and it seemed very fast.
I'm particularly impressed by the fact that they seem to aim for photorealism rather than the semi-realistic AI-look that is common in many text-to-image models.
I wish they had used the WAN VAE.
I've done some preliminary testing with Z-Image Turbo in the past week.
Thoughts:
- It's fast (~3 seconds on my RTX 4090)
- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)
- The prompt adherence is impressive for a 6B parameter model
Some tests (2 / 4 passed):
https://imgpb.com/exMoQ
Personally, I find it works better as a refiner model downstream of Qwen-Image 20B, which has significantly better prompt understanding but an unnatural "smoothness" to its generated images.
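The refiner setup is roughly this if you do it outside ComfyUI. A sketch under the assumption that both models are usable through diffusers' auto pipelines; the model IDs, step counts, and the 0.3 strength are guesses I would tune rather than the exact workflow:

```python
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

prompt = "portrait of an elderly fisherman mending a net, golden hour, 85mm"

# Stage 1: Qwen-Image for prompt understanding and composition.
# "Qwen/Qwen-Image" is an assumed model ID.
base = AutoPipelineForText2Image.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
draft = base(prompt=prompt, num_inference_steps=30).images[0]

# Stage 2: Z-Image Turbo as a refiner. Re-denoise the draft lightly to swap
# the "smooth" Qwen look for more photographic texture while keeping composition.
# "Tongyi-MAI/Z-Image-Turbo" is an assumed model ID.
refiner = AutoPipelineForImage2Image.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
).to("cuda")
final = refiner(
    prompt=prompt,
    image=draft,
    strength=0.3,            # low strength preserves layout, mostly changes texture
    num_inference_steps=8,
).images[0]
final.save("refined.png")
```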