You can model a generative process, but what you get is necessarily an auto-regressive generative process, not the same as the originating generative process, which is grounded in the external world.
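Concretely, "auto-regressive" here means each output is conditioned only on the outputs before it: the joint distribution factors as p(x_1, ..., x_T) = p(x_1) * p(x_2 | x_1) * ... * p(x_T | x_1, ..., x_{T-1}), with no term anywhere for the state of the world outside the sequence itself.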
Human language, and other actions, exist on a spectrum from almost auto-regressive (generating a stock/practiced phrase such as "have a nice day") to highly interactive. An auto-regressive model is obviously going to have more success modelling an auto-regressive generative process.
Weather prediction is a good illustration of the limitations of auto-regressive models, and of models that don't accurately reflect the inputs to the process you are attempting to predict. "There's a low pressure front coming in, so the weather will be X, same as last time" works some of the time. A crude physical weather model based on limited data points, such as weather balloon readings or satellite observation of hurricanes, also works some of the time. But of course both kinds of model are sometimes hopelessly wrong too.
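As a rough sketch of that "same as last time" style of forecasting (all names and data here are made up for illustration, but the shape of the failure is real):

    from collections import defaultdict, Counter

    def fit_analog_model(history):
        """Map each observed pattern (e.g. 'low pressure front') to the
        outcomes that followed it in the historical record."""
        outcomes = defaultdict(Counter)
        for pattern, weather in history:
            outcomes[pattern][weather] += 1
        return outcomes

    def predict(outcomes, pattern):
        """Predict the most common outcome seen after this pattern,
        or None if the pattern has no precedent -- the failure mode."""
        if pattern not in outcomes:
            return None
        return outcomes[pattern].most_common(1)[0][0]

    model = fit_analog_model([
        ("low pressure front", "rain"),
        ("low pressure front", "rain"),
        ("high pressure ridge", "clear"),
    ])
    print(predict(model, "low pressure front"))  # "rain" -- works some of the time
    print(predict(model, "stalled hurricane"))   # None -- no analog to replay

Such a model can only replay outcomes it has already seen; hand it a situation with no precedent in its history and it has nothing useful to say.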
My real point wasn't about the lack of sensory data, even though that does force a purely auto-regressive (i.e. wrong) model, but rather about the difference between a passive model (such as weather prediction) and an interactive one.
The whole innovation of GPT, and of LLMs in general, is that an auto-regressive model can make alarmingly good next-token predictions with the right inductive bias, a large number of parameters, a long context window, and a huge training set.
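For anyone who hasn't looked inside one: the generation loop itself is almost trivially simple. Here's a minimal sketch using a toy bigram model in place of an LLM; the real thing differs in having a vastly better conditional distribution (and in conditioning on the whole context window rather than the last token). The names and the tiny "corpus" are made up for illustration.

    import random

    def train_bigram(corpus):
        """Count which tokens follow which -- a crude stand-in for the
        learned conditional distribution p(next token | context)."""
        followers = {}
        for sentence in corpus:
            tokens = sentence.split()
            for prev, nxt in zip(tokens, tokens[1:]):
                followers.setdefault(prev, []).append(nxt)
        return followers

    def generate(followers, token, max_new_tokens=10):
        """The auto-regressive loop: each step conditions only on
        already-generated output, never on the outside world."""
        out = [token]
        for _ in range(max_new_tokens):
            options = followers.get(out[-1])
            if not options:
                break  # no continuation seen in training
            out.append(random.choice(options))
        return " ".join(out)

    followers = train_bigram(["have a nice day", "have a good one"])
    print(generate(followers, "have"))  # e.g. "have a nice day"

The stock phrase from earlier falls right out of this, which is the point: the more auto-regressive the process being modelled, the better this loop does.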
It turns out that human communication is quite a lot more "auto-regressive" than people had assumed. And that includes some level of reasoning capability, arising out of a kind of brute-force pattern matching. It has limits, of course, but it's amazing that it works as well as it does.