I feel like you're conflating quality with fidelity. Video generation models have better fidelity than they did a year ago, but they are no closer to producing any kind of compelling content without a human directing them, and the latter is what you would actually need to make the "infinite entertainment machine" happen.
The fidelity of a video generation model is comparable to an LLM's ability to nail spelling and grammar - it's a start, but there's more to being an author than that.
I feel like text models are already at a sufficiently entertaining and useful quality, as you define it. It's definitely possible we never get there for video or 3D modalities, but I think the economic incentives are strong enough that big tech will dump tens of billions of dollars into achieving it.