
Jabrov · today at 12:26 PM

They absolutely are. The “maximum context window” of a model is a byproduct of the context length it was trained on.

If your model only ever sees 8K-token samples during training, it won't be as good at 128K context as it would be if you had trained on samples ranging from 8K to 128K tokens.
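
A rough sketch of what "training on samples from 8K to 128K" could look like in practice. The bucket sizes, weights, and function names here are made up for illustration; this isn't any particular lab's recipe, just the general idea of mixing sequence lengths up to the target context:

    import random

    # Hypothetical sketch: mix training sample lengths from 8K up to 128K
    # tokens so the model actually sees long contexts during training,
    # instead of only 8K-token samples.
    LENGTH_BUCKETS = [8_192, 16_384, 32_768, 65_536, 131_072]
    # Shorter samples are cheaper, so weight them more heavily (illustrative weights).
    BUCKET_WEIGHTS = [0.4, 0.25, 0.2, 0.1, 0.05]

    def sample_target_length():
        """Pick a training sequence length from the bucketed distribution."""
        return random.choices(LENGTH_BUCKETS, weights=BUCKET_WEIGHTS, k=1)[0]

    def make_training_sample(token_stream):
        """Crop a contiguous window of the chosen length from a long token stream."""
        length = sample_target_length()
        if len(token_stream) <= length:
            return token_stream
        start = random.randrange(len(token_stream) - length + 1)
        return token_stream[start:start + length]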