If the prompt is the compass and represents a point in space, why walk there? Why not just go to that point in image space directly, and what would be there? And why does the random seed matter if you're aiming at the same point anyway? Don't you end up there regardless? Does the prompt vector not exist in the image manifold, or is there some local sampling done to pick images that are better represented in the training data?
So I’m not an expert, and this post is just based on my understanding, but as I understand it: the prompt embedding space and the latent image space are different “spaces”, so there is no single “point” in the latent image space that represents a given prompt. Instead, there are regions that are more or less consistent with the prompt, and cross-attention between the text embedding and the latent image is what lets the prompt guide the diffusion process in a suitable direction.
So different seeds lead to slightly different end points: each seed gives you a different starting noise, and you’re just moving closer to the “consistent region” at each step, but approaching from a different angle.
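If it helps, here's a toy sketch of that intuition. To be clear, this is not how a real diffusion model works: I'm assuming a 2D "latent space" where the "consistent region" is just the unit circle, and the made-up `denoise` function nudges a seeded random starting point a little closer to that circle at each step. There's no neural network, no prompt and no cross-attention here, it only shows why the seed changes where you end up.

```python
import math
import random


def denoise(seed, steps=50, step_size=0.2):
    """Toy stand-in for diffusion: walk from seeded noise toward the unit circle.

    The unit circle plays the role of the "region consistent with the prompt".
    Everything here is hypothetical and only for illustration.
    """
    rng = random.Random(seed)
    # The starting point is pure Gaussian noise, fully determined by the seed.
    x, y = rng.gauss(0, 1), rng.gauss(0, 1)
    for _ in range(steps):
        # Guard against the (practically impossible) case of starting at the origin.
        r = math.hypot(x, y) or 1e-12
        # Nearest point on the circle to where we currently are.
        tx, ty = x / r, y / r
        # Take a small step toward it, rather than jumping there directly.
        x += step_size * (tx - x)
        y += step_size * (ty - y)
    return x, y


for seed in range(3):
    end_x, end_y = denoise(seed)
    radius = math.hypot(end_x, end_y)
    print(f"seed={seed}: end=({end_x:.3f}, {end_y:.3f}), distance from origin={radius:.4f}")
```

Every seed should land essentially on the circle, but at a different spot on it, because each one started somewhere else and only ever moved toward the nearest part of the region. Running the same seed twice gives the same end point.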