I've done some preliminary testing with Z-Image Turbo in the past week.
Thoughts:
- It's fast (~3 seconds on my RTX 4090)
- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)
- The adherence is impressive for a 6B parameter model
Some tests (2 / 4 passed):
https://imgpb.com/exMoQ
Personally I find it works better as a refiner model downstream of Qwen-Image 20b, which has significantly better prompt understanding but an unnatural "smoothness" to its generated images.
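For anyone curious what that two-stage setup might look like outside ComfyUI, here's a rough diffusers-style sketch of the idea: let Qwen-Image lay out the composition, then lightly re-denoise with Z-Image Turbo. The model ids, AutoPipeline support for these particular models, and the strength/step values are assumptions, not a confirmed recipe.

```python
# Sketch of "Qwen-Image for composition, Z-Image Turbo as refiner".
# Model ids and diffusers support for these models are assumptions.
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

prompt = "a creature that is half porcupine, half pine cone, forest floor, macro photo"

# Stage 1: Qwen-Image handles the prompt (better adherence, but "smooth" output).
base = AutoPipelineForText2Image.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
draft = base(prompt=prompt, num_inference_steps=30, width=1024, height=1024).images[0]

# Stage 2: Z-Image Turbo re-denoises the draft lightly to add natural texture.
refiner = AutoPipelineForImage2Image.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
).to("cuda")
final = refiner(prompt=prompt, image=draft, strength=0.35, num_inference_steps=8).images[0]
final.save("refined.png")
```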
On fal, it often takes less than a second.
https://fal.ai/models/fal-ai/z-image/turbo/api
Couple that with a LoRA and in about 3 seconds you can generate completely personalized images.
The speed alone is a big factor, but put the model side by side with Seedream, Nano Banana, and other models and it's definitely in the top 5. That's a killer combo imho.
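For the curious, calling it through fal's API is only a few lines with their Python client. The model id comes from the URL above; the argument names and response shape follow fal's usual image-model convention and may differ in practice.

```python
# pip install fal-client; needs FAL_KEY set in the environment.
import fal_client

result = fal_client.subscribe(
    "fal-ai/z-image/turbo",  # model id from the fal URL above
    arguments={"prompt": "portrait of a lighthouse keeper, golden hour, 35mm"},
)
print(result["images"][0]["url"])  # URL of the generated image (assumed response shape)
```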
So does this finally replace SDXL?
Is Flux 1/2/Kontext left in the dust by the Z Image and Qwen combo?
Yeah, I've definitely switched largely away from Flux. Much as I do like Flux (for prompt adherence), BFL's baffling licensing structure along with its excessive censorship makes it a no-go.
For reference, the porcupine-cone creature that ZiT couldn't handle by itself in my aforementioned test was easily handled by a Qwen20b + ZiT refiner workflow, and even with two separate models it STILL runs faster than Flux2 [dev].
https://imgur.com/a/5qYP0Vc
SDXL has been outclassed for a while, especially since Flux came out.
Subjective. Most in creative industries regularly still use SDXL.
Once the Z-Image base model comes out and some real tuning can be done, I think it has a chance of replacing SDXL in the role it currently fills.
Source?
The [demo PDF](https://github.com/Tongyi-MAI/Z-Image/blob/main/assets/Z-Ima...) has ~50 photos of attractive young women sitting/standing alone, and exactly two photos featuring young attractive men on their own.
It's incredibly clear who the devs assume the target market is.
The model is uncensored, so it will probably suit that target market admirably.
It's interesting that the handsome guy is literally Tony Leung Chiu-wai (https://www.imdb.com/name/nm0504897/), not even modified.
They're correct. This tech, like much before it, is being driven by the base desires of extremely smart young men.
Not just extremely smart young men, but probably young men who fall closer to the incel category than normal go-on-tinder-and-find-a-date category.
With today's remote social validation for women, and the all-time low value of men due to lower death rates and the disconnect from where food and shelter come from, lonely men make up a huge portion of the population.
I mean spending all that time on dates, and wives, and kids gives you much less time to build AI models.
The people with the time and desire to do something are the ones most likely to do it, this is no brilliant observation.
They may have an RLHF phase, but I mean there is also just the shape of the distribution of images on the internet to consider and, since this is from Alibaba, their part of the internet/social media (Weibo).
Please write what you mean instead of making veiled implications. What is the point of beating around the bush here?
It's not clear to me what you mean either, especially since female models are overwhelmingly more popular in general[1].
[1]: "Female models make up about 70% of the modeling industry workforce worldwide" https://zipdo.co/modeling-industry-statistics/
The ratio of naked female LoRAs to naked male LoRAs, or even non-porn LoRAs, on Civitai is at least 20 to 1. This shouldn't be surprising.
We've come a long way with these image models, and the things you can do with a paltry 6B are super impressive. The community has adopted this model wholesale and left Flux(2) by the wayside. It helps that Z-Image isn't censored, whereas BFL (makers of Flux 2) dedicated about a fifth of their press release to talking about how "safe" (read: censored and lobotomized) their model is.
But this is a CCP model, would it refuse to generate Xi?
You tell me.
https://imgur.com/a/7FR3uT1
Z-Image seems to be the first successor to Stable Diffusion 1.5 that delivers better quality, capability, and extensibility across the board in an open model that can feasibly run locally. Excitement is high and an ecosystem is forming fast.
As an AI outsider with a recent 24GB MacBook, can I follow the quick start[1] steps from the repo and expect decent results? How much time would it take to generate a single medium-quality image?
[1]: https://github.com/Tongyi-MAI/Z-Image?tab=readme-ov-file#-qu...
If you don't know anything about how these models are run, ComfyUI's macOS version is probably the easiest to use. There is already a Z-Image workflow you can grab, and ComfyUI will fetch all the models you need and get everything working together. You can expect decent speed.
I have a 48GB M4 Pro and every inference step takes about 10 seconds on a 1024x1024 image, so six steps and you need a minute. Not terrible, not great.
I'm fine with the quick start steps and I prefer CLI to GUI anyway. But if I try it and find it too complex, I now know what to try instead - thanks.
I'm still curious whether this would run on a MacBook and how long would it take to generate an image. What machine are you using?
I have been testing this on my Framework Desktop. ComfyUI generally causes an amdgpu kernel fault after about 40 steps (across multiple prompts), so I spent a few hours building a workaround here: https://github.com/comfyanonymous/ComfyUI/pull/11143
Overall it's fun and impressive, with decent results using LoRA. You can achieve good-looking results with as few as 8 inference steps, which takes 15-20 seconds on a Strix Halo. I also created a llama.cpp inference custom node for prompt enhancement, which has been helping with overall output quality.
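For anyone who wants the prompt-enhancement trick without a custom node, here's a minimal sketch of the same idea against a local llama.cpp server (started with something like `llama-server -m model.gguf --port 8080`); the system prompt is just an illustration.

```python
# Ask a local llama.cpp server (OpenAI-compatible endpoint) to expand a terse
# prompt into a more detailed image prompt before sending it to the image model.
import json
import urllib.request

def enhance_prompt(short_prompt: str) -> str:
    payload = {
        "messages": [
            {"role": "system",
             "content": "Rewrite the user's idea as a single detailed image-generation prompt."},
            {"role": "user", "content": short_prompt},
        ],
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",  # llama.cpp server default
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

print(enhance_prompt("a cozy cabin in winter"))
```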
It's amazing how much knowledge about the world fits into 16 GiB of the distilled model.
This is early days, too. We're probably going to get better at this across more domains.
Local AI will eventually be booming. It'll be more configurable, adaptable, hackable. "Free". And private.
Crude APIs can only get you so far.
I'm in favor of intelligent models like Nano Banana over ComfyUI messes (the future is the model, not the node graph).
I still think we need the ability to inject control layers and have full access to the model, because we lose too much utility by not having it.
I think we'll eventually get Nano Banana Pro smarts slimmed down and running on a local machine.
Yeah I know right?? Just a few more trillion dollars bro and it'll be good bro
>Local AI will eventually be booming.
With how expensive RAM currently is, I doubt it.
We have vLLM for running text LLMs in production. What is the equivalent for this model?
I would say there isn't an equivalent. Some people will probably tell you ComfyUI - you can expose workflows via API endpoints and parameterize them. This is how e.g. Krita AI Diffusion uses a ComfyUI backend.
For various reasons, I doubt there are any large scale SaaS-style providers operating this in production today.
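For completeness, this is roughly what driving ComfyUI over its HTTP API looks like. The workflow file name and the node id being patched are placeholders for whatever your exported workflow actually contains.

```python
# Minimal sketch of driving ComfyUI over its HTTP API (default port 8188).
# The workflow JSON must be exported from ComfyUI in "API format".
import json
import urllib.request

with open("z_image_workflow_api.json") as f:  # placeholder file name
    workflow = json.load(f)

# Parameterize the graph by overwriting inputs on a known node id (assumed "6" here).
workflow["6"]["inputs"]["text"] = "a lighthouse at dusk, film grain"

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # returns a prompt_id you can poll via /history
```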
Just want to learn - who actually needs or buys up generated images?
I follow an author who publishes online on places like Scribblehub and has a modestly successful Patreon. Over the years he has spent probably tens of thousands of dollars on commissioned art for his stories, and he's still spending heavily on that. But as image models have gotten better, this has increasingly been supplemented with AI images for things that are worth a couple of dollars to get right with AI, but not a couple hundred to have a human artist do them.
Roughly speaking the art seems to have three main functions:
1. promote the story to outsiders: this only works with human-made art
2. enhance the story for existing readers: AI helps here, but is contentious
3. motivate and inspire the author: works great with AI. The ease of exploration and the pseudo-random permutations in the results are very useful properties here that you don't get from regular art.
By now the author even has an agreement with an artist he frequently commissions: he can use the artist's style in AI art in return for a small "royalty" payment for every such image that gets published in one of his stories. A solution driven both by the author's conscience and by the demands of the readers.
Some ideas for your consideration:
- Illustrating blog posts, articles, etc.
- A creativity tool for kids (and adults; consider memes).
- Generating ads. (Consider artisan production and specialized venues.)
- Generating assets for games and similar, such as backdrops and textures.
Like any tool, it takes certain skill to use, and the ability to understand the results.
Except for gaming, that doesn't sound like a huge market worthy of pouring millions into training these high-quality models. And there is a lot of competition too. I suspect there are some other deep-pocketed customers for these images. Probably animations? movies? TV ads?
Propaganda?
During the holiday season I've been noticing AI-generated assets on tons of meatspace ads and cheap, themed products.
Dying businesses like newspapers and local banks, who use it to save the money they used to spend on shutterstock images? That’s where I’ve seen it at least. Replacing one useless filler with another.
Very good. Not always perfect with text or with following the prompt exactly, but at 6B... impressive.
Did anyone test it on a 5090? I saw some 30xx reports and it seemed very fast.
Even on my 4080 it's extremely fast, it takes ~15 seconds per image.
I wish they would have used the WAN vae.
Does it run on apple silicon?
Apparently - https://github.com/ivanfioravanti/z-image-mps
It supports MPS (Metal Performance Shaders). Using something that skips Python entirely, along with an MLX or GGUF-converted model file (if one exists), will likely be even faster.
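As a tiny illustration of what "supports MPS" means in a PyTorch-based runner, device selection is just a backend check; everything else is whatever the repo above wires up.

```python
import torch

# Prefer CUDA on Nvidia GPUs, fall back to MPS on Apple Silicon, then CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Running inference on: {device}")
```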
It's working for me - it does max out my 64GB though.
Wow. I always forget how, unlike autoregressive models, diffusion models are heavier on resources (for the same number of parameters).