I struggle to get quality results from the frontier models at contexts > 256k anyway.

droidjj • yesterday at 9:28 PM

Replies

stingraycharles • yesterday at 9:58 PM

Yup, same experience. It’s because standard attention has quadratic complexity in context length (not exponential, but still steep). So at large context windows, they need to approximate the attention (e.g. group multiple tokens together), which then leads to a loss in accuracy.

It’s almost always better to keep your context windows small.
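
To put rough numbers on the scaling point above, here is a minimal sketch in Python. It assumes plain full self-attention, where every token scores against every other token, so one head in one layer computes n * n scores. The function names and the `group_size` pooling are hypothetical, for illustration only; they are not how any particular frontier model handles long contexts.

```python
def attention_score_entries(context_len: int) -> int:
    """Query-key scores computed by one full self-attention head: n * n."""
    return context_len * context_len


def grouped_score_entries(context_len: int, group_size: int) -> int:
    """Scores if tokens are first pooled into groups of `group_size`.

    Hypothetical approximation: each group acts as a single token, so the
    score matrix shrinks, but detail inside each group is lost.
    """
    groups = -(-context_len // group_size)  # ceiling division
    return groups * groups


def relative_cost(small: int, large: int) -> float:
    """How many times more scores the larger context needs."""
    return attention_score_entries(large) / attention_score_entries(small)


if __name__ == "__main__":
    for n in (8_000, 32_000, 256_000, 1_000_000):
        print(f"{n:>9} tokens -> {attention_score_entries(n):.2e} scores per head")
    # Going from 32k to 256k is 8x the tokens but 64x the scores.
    print(relative_cost(32_000, 256_000))
```

Under these assumptions, pooling 256k tokens into groups of 8 brings the score count back down to that of a 32k context, which is the kind of trade of accuracy for cost described above.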
