XenophileJKO • yesterday at 11:12 PM

It mostly depends on "how" the models work. Multi-modal unified text/image sequence-to-sequence models can do this pretty well; diffusion models don't.
