Hacker News

XenophileJKO, yesterday at 11:12 PM

It mostly depends on "how" the models work. Unified multimodal text/image sequence-to-sequence models can do this pretty well; diffusion models don't.