If you squint, it's a fixed-iteration ODE solver. I'd love to see a generalization of this, and of the Universal Transformer mentioned, re-envisioned as flow-matching/optimal-transport models.
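To make the "fixed-iteration ODE solver" reading concrete, here's a rough sketch of what I mean: a weight-tied residual update `x + h * f(x)` applied a fixed number of times is just forward Euler on `dx/dt = f(x)`. The vector field `f`, the function name `weight_tied_stack`, and the step size `1 / num_layers` are all made up for illustration, not taken from any particular model.

```python
import math

def f(x):
    # Toy stand-in for a shared layer (an assumption): the vector field dx/dt = -x
    return -x

def weight_tied_stack(x, num_layers):
    """Apply the same residual update num_layers times.

    With step size h = 1 / num_layers this is forward Euler integrating
    dx/dt = f(x) from t = 0 to t = 1.
    """
    h = 1.0 / num_layers
    for _ in range(num_layers):
        x = x + h * f(x)  # residual connection == one Euler step
    return x

# Exact solution of dx/dt = -x at t = 1, starting from x(0) = 1
exact = math.exp(-1.0)

shallow = weight_tied_stack(1.0, 4)   # 4 shared "layers"
deep = weight_tied_stack(1.0, 64)     # 64 shared "layers"

print("error with 4 layers: ", abs(shallow - exact))
print("error with 64 layers:", abs(deep - exact))
```

The deeper stack lands closer to the exact solution, which is the sense in which depth behaves like a discretization of time here.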
How would flow matching work? In language we have inputs and outputs but it's not clear what the intermediate points are since it's a discrete space.
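For what it's worth, one possible way around the discreteness, assuming you're willing to work in a continuous embedding space rather than on tokens directly, is to use the usual straight-line flow-matching path between a noise vector and the target token's embedding. Below is a minimal sketch of that assumption only; the tiny vocabulary, the 2-d embeddings, and the helper names are hypothetical, and it doesn't settle whether those intermediate points are meaningful for language.

```python
import random

# Hypothetical 3-token vocabulary with made-up 2-d embeddings
embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fish": [-1.0, -1.0]}

def interpolate(x0, x1, t):
    """Point at time t on the straight line from x0 (noise) to x1 (data)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the straight-line path: constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def nearest_token(x):
    """Map a continuous point back to the closest token embedding."""
    def sq_dist(e):
        return sum((a - b) ** 2 for a, b in zip(x, e))
    return min(embeddings, key=lambda tok: sq_dist(embeddings[tok]))

rng = random.Random(0)
x1 = embeddings["dog"]                  # "output" end of the path
x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample at the other end

t = 0.5
x_t = interpolate(x0, x1, t)            # an intermediate point in embedding space
v = target_velocity(x0, x1)             # what a velocity model would be trained to predict at x_t

print("intermediate point:", x_t)
print("decoded endpoint:", nearest_token(interpolate(x0, x1, 1.0)))
```

So the intermediate points exist, but they live between embeddings rather than between tokens, which is exactly the part that feels unnatural for a discrete space.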
This makes me think it would be nice to see some kind of child of the modern transformer architecture and neural ODEs. There was some really interesting work a few years ago on how neural ODEs/PDEs can be seen as a sort of continuous limit of layer depth. Maybe models could learn cool stuff if the embeddings were somehow solutions of a dynamical model or something.