> Under the hood, with the way the text is turned into vector embeddings, it's fairly questionable whether you'd agree that it can even represent such a thing.
The text encoder may not be able to capture complex relationships on its own, but the generative image/video models that are conditioned on those text embeddings absolutely can.
Flux, for example, uses the very old T5 model for text encoding, but image generations from it can (loosely) adhere to all rules and nuances in a multi-paragraph prompt: https://x.com/minimaxir/status/1820512770351411268
> but image generations from it can (loosely) adhere to all rules and nuances in a multi-paragraph prompt
Flux certainly does not consistently do so across an arbitrary collection of multi-paragraph prompts, as anyone who's run more than a few long prompts past it would recognize. The tweet is wrong in the other direction as well: longer language-model-preprocessed prompts for models that use CLIP (like various SD1.5 and SDXL derivatives) are, in fact, a common and useful technique. (You'd kind of think that the fact that the generated prompt here is significantly longer than the 256-token window of T5 would be a clue that the 77-token limit of CLIP might not be as big of a constraint as the tweet was selling it as, too.)
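For anyone who wants to check whether a prompt even fits those windows, here's a minimal sketch using the Hugging Face tokenizers. The checkpoint names and the file path are illustrative assumptions, not something from the thread; the 77/256 caps are the pipeline defaults being discussed, and some Flux pipelines allow more.

```python
# Rough sketch: count how many tokens a long prompt produces under the CLIP
# and T5 tokenizers, to see where each pipeline would truncate it.
# Assumes the `transformers` library; checkpoint names are the commonly used
# ones, not anything specified in this thread.
from transformers import CLIPTokenizer, T5TokenizerFast

prompt = open("long_prompt.txt").read()  # hypothetical multi-paragraph prompt

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
t5_tok = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")

n_clip = len(clip_tok(prompt).input_ids)  # SD1.5/SDXL pipelines truncate at 77
n_t5 = len(t5_tok(prompt).input_ids)      # Flux pipelines typically cap at 256
print(f"CLIP tokens: {n_clip}, T5 tokens: {n_t5}")
```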