Many people say:
> these things will get bigger and better much faster than we can learn to discern
I would like to ask “Why?”
Clearly, these models are just one instance of "NNs can learn to map anything from one domain to another", and with enough training/overfitting they can approximate reality to a high degree.
But why would they get better to any significant extent?
Because we can collect an infinite amount of video? Because we can train models to the point where they become generative video compression algorithms that have seen it all?
> But why would they get better to any significant extent?
Two years ago, the very best closed-source image model couldn't represent anything remotely realistic. Today, there are hundreds of open-source models that can generate images literally indistinguishable from reality (like Flux). Not only that, there's an entire ecosystem of tools and techniques around style transfer, facial reconstruction, pose control, etc. It's mind-blowing, and every week there's a new paper making it even better. Some of that progress could have come from more training data. Most of it didn't.
I guess it's fair to extrapolate that same trend to video, since that's the arc text, audio, and images have taken? No reason it would be different.