My understanding is that the true entropy floor of a language is irreducible: regardless of context length there will be “unpredictable” tokens where some cross-entropy loss is unavoidable. Even with infinite parameters and data you’ll still fail to predict the next token correctly a decent chunk of the time.
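To make the floor concrete, here's a rough sketch with a made-up three-token distribution (the numbers and names are purely illustrative assumptions, not measurements of any real language). It shows that even a model that matches the true distribution exactly still pays a cross-entropy equal to the entropy, and that its best single guess is only right as often as the most likely token occurs.

```python
import math

def entropy(p):
    """Shannon entropy of distribution p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected loss in bits when the truth is p and the model predicts q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distribution for some fixed context (assumed values).
true_dist = [0.5, 0.25, 0.25]

perfect_model = list(true_dist)    # a model that has learned the truth exactly
imperfect_model = [0.7, 0.2, 0.1]  # any other model

floor = entropy(true_dist)
loss_perfect = cross_entropy(true_dist, perfect_model)
loss_imperfect = cross_entropy(true_dist, imperfect_model)

# Best possible top-1 accuracy: always guess the most likely token.
top1_accuracy = max(true_dist)

print(f"entropy floor:        {floor:.3f} bits")
print(f"perfect model loss:   {loss_perfect:.3f} bits")
print(f"imperfect model loss: {loss_imperfect:.3f} bits")
print(f"best top-1 accuracy:  {top1_accuracy:.2f}")
```

The gap between the imperfect model's loss and the floor is the KL divergence, which is the only part that better modeling can shrink.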
Also, compute scales quadratically with context length because of attention, so depending on context growth means taking a bath on GPUs for as long as you can, right?
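As a back-of-the-envelope sketch of that quadratic term, assuming plain dense self-attention and counting only the score computation and the weighted sum over values (the function name and the `d_model` value are illustrative assumptions; projections and MLP layers, which scale linearly with context, are ignored, as are any sparse or linear attention variants):

```python
def attention_flops(context_len: int, d_model: int) -> int:
    """Approximate FLOPs for one dense self-attention layer's n-by-n work.

    Q @ K^T costs about n * n * d multiply-adds, and so does applying the
    attention weights to V. Counting 2 FLOPs per multiply-add gives 4 * n^2 * d.
    """
    return 4 * context_len ** 2 * d_model

d_model = 1024  # assumed model width, for illustration only
base = attention_flops(1024, d_model)

for n in (1024, 2048, 4096, 8192):
    ratio = attention_flops(n, d_model) / base
    print(f"context {n:>5}: {ratio:>4.0f}x the attention FLOPs of context 1024")
```

Under these assumptions, every doubling of context length costs four times as much in this term.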
Yeah I mean, if you and I were to play a word-guessing game where you had to guess which word I'm thinking of next, there's always uncertainty in your guess because it's a game of partial information: you can't fully observe my inner state. But that doesn't mean you couldn't evolve a strategy that spends a really long time thinking and analyzing to get asymptotically close to the best guess. There's no limit on that intelligence.