Hacker News

WarmWash · yesterday at 9:17 PM · 1 reply

Start front-loading the models with 5k, 10k, 50k, or 100k tokens of messy, quasi-related context, and then run the benchmarks.

These models are ridiculously powerful with a blank slate. It's when they get loaded down with all the necessary (and inevitably unnecessary) context to complete the task that they really start to crumble and fold.
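A minimal sketch of what that kind of run could look like, assuming a hypothetical call_model() wrapper for whatever API is under test, a list of (question, answer) benchmark items, and a pile of loosely related filler documents; the 0.75 words-per-token ratio is a rough approximation:

    import random

    FILLER_DOCS = ["..."]  # messy, quasi-related text pulled from anywhere

    def pad_with_context(question: str, target_tokens: int) -> str:
        """Prepend roughly target_tokens of distractor text (~0.75 words per token)."""
        target_words = int(target_tokens * 0.75)
        words = []
        while len(words) < target_words:
            words.extend(random.choice(FILLER_DOCS).split())
        return " ".join(words[:target_words]) + "\n\n" + question

    def call_model(prompt: str) -> str:
        # Placeholder: swap in your actual model client here.
        raise NotImplementedError

    def run_suite(benchmark, budgets=(0, 5_000, 10_000, 50_000, 100_000)):
        # Re-run the same benchmark at each context budget and report accuracy.
        for budget in budgets:
            correct = 0
            for question, answer in benchmark:
                reply = call_model(pad_with_context(question, budget))
                if answer.lower() in reply.lower():
                    correct += 1
            print(f"{budget:>7} filler tokens: {correct}/{len(benchmark)}")

The interesting output isn't any single score but the slope: how fast accuracy degrades as the filler budget grows.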


Replies

jballanc · yesterday at 10:09 PM

We need benchmarks that can distinguish between continuous learning and long-context extrapolation.