Hacker News

Schlagbohrer · today at 10:32 AM

What does this mean: "In addition, because most AI models are not trained uniformly across their maximum context length, their reasoning quality tends to degrade gradually near the limit rather than fail abruptly."

Models aren't trained across their context, their context is their short term memory at runtime, right? Nothing to do with training. They are trained on a static dataset.


Replies

anon373839 · today at 12:55 PM

When you read technical papers on various models, you’ll find that they often did most of the pretraining and even the supervised fine tuning using relatively short context data; then they “extended” the context window by training on a little bit of long context data. I think this is what is meant by not being trained uniformly.

However, now that RL environments and long-horizon agentic performance have taken such a prominent role in model development, I wonder whether that practice still holds. I know that the most recent Gemma and Qwen models are incomparably more reliable at long context than their predecessors, even though Qwen, for example, already advertised a 256K context. It just didn't work the way it does now.
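The two-phase practice described above can be sketched as a training-length schedule where only a small tail of steps uses long sequences. All numbers here are invented for illustration, not taken from any specific model's paper:

```python
# Hypothetical two-phase schedule: most training happens at a short
# context, then a brief "context extension" phase uses long sequences.
def sample_seq_len(step, total_steps, short=4096, long=131072,
                   extension_frac=0.05):
    """Return the training sequence length used at a given step.

    The final `extension_frac` of steps train on long sequences;
    everything before that uses the short length.
    """
    if step >= total_steps * (1 - extension_frac):
        return long
    return short

total = 100_000
lens = [sample_seq_len(s, total) for s in range(total)]
long_share = lens.count(131072) / total
print(f"share of steps at long context: {long_share:.0%}")  # 5%
```

Under a schedule like this, positions beyond 4K tokens are only ever exercised in that final 5% of training, which is one plausible reading of "not trained uniformly across the maximum context length."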

vessenes · today at 11:56 AM

Context is the vector of tokens (numbers) that goes into the first layers of the neural network.

When you train, you teach the model, among other things, to 'self-attend' over the input vector, ultimately projecting that vector into a large embedding space.

Thought experiment: if 99% of the time the last 100,000 positions of your vector were zero, how likely is it that gradient descent on those outputs would produce high-quality embeddings for those positions?

That’s what the paper is referring to.
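A toy simulation makes the thought experiment concrete. The length distribution here is made up (exponential with a short mean); real training corpora differ, but the shape of the result is the point:

```python
import random

random.seed(0)

# If most training documents are short, positions near the context
# limit receive almost no gradient signal compared to early positions.
MAX_CTX = 1000
doc_lens = [min(int(random.expovariate(1 / 80)) + 1, MAX_CTX)
            for _ in range(10_000)]  # mostly-short document lengths

exposure = [0] * MAX_CTX  # how many documents cover each position
for n in doc_lens:
    for pos in range(n):
        exposure[pos] += 1

print("updates seen by position 0:  ", exposure[0])    # 10000
print("updates seen by position 999:", exposure[999])  # near zero
```

Position 0 appears in every document, while the last position is exercised almost never, so whatever the model learns about attending to it is barely trained.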

andai · today at 10:42 AM

Not sure how it is now, but a while back most of the training data was short interactions.

I noticed that the longer a chat gets, the more unpredictable the model's behavior becomes (and I think that's still a common jailbreak technique, too).

(I think it might also have something to do with RoPE, but that's beyond me.)
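For reference, rotary position embeddings (RoPE) rotate each pair of channels by an angle proportional to the token's position, so attention scores depend on relative position; extending context means feeding positions the model rarely or never saw in training. This is a minimal illustrative NumPy sketch, not any particular model's implementation:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, dim).

    Channels are split in half and each (x1_i, x2_i) pair is rotated
    by positions * base**(-i/half), a geometric series of frequencies.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(4, 8)
out = rope(q, np.arange(4))
print(out.shape)  # (4, 8)
```

Because each pair is a pure rotation, vector norms are preserved; only the angles (and hence dot products between positions) change.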

Jabrov · today at 12:26 PM

They absolutely are. The “maximum context window” of a model is a byproduct of the context length it was trained on.

If your model only ever sees 8K-token samples during training, it won't be as good at a 128K context length as if you had trained on samples ranging from 8K to 128K.

AntiUSAbah · today at 10:59 AM

So for long context to work well, you need an attention mechanism that keeps details from getting lost as the context grows.

Or, to put it differently: the LLM is trained on static data, but it is also trained on the capability of handling context itself.

Kimi introduced this: https://github.com/MoonshotAI/Attention-Residuals, but I'm pretty sure closed labs like Google have had something like this for a while.

smallerize · today at 11:00 AM

I think it means most of the training data is short. And a lot of the long-context examples are conversations where the middle turns are less important.

alansaber · today at 12:32 PM

They mean input token quantity
