3PS, today at 4:16 PM

This feels like the opposite to me? The MoT architecture looks like the ideal that the Bitter Lesson alludes to - just take all of your data in all of your formats (audio, image, text, action, video) and dump it all into a single shared latent space. Then let the model sort things out, with just enough structure to handle the different requirements/output formats needed (e.g. autoregressive stuff for sequence modeling/prediction, diffusion stuff for generation).