Everything in this post is spot on, and it is a rare example of an HN person not saying BS about LLMs!
That said, modern LLM sampling algorithms like min_p, top_n_sigma, etc. heavily mitigate the performance penalty you get from doing long-context tasks. Problems with long context come from small sampling errors accumulating over time.
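For anyone unfamiliar with these two samplers, here is a rough Python sketch of the truncation rules as I understand them. It is an illustration only, not the llama.cpp implementation, and the function names are made up for this example. min_p keeps tokens whose probability is at least min_p times the top token's probability; top_n_sigma keeps tokens whose logit is within n standard deviations of the top logit.

```python
import math

def softmax(logits):
    # Numerically stable softmax; -inf logits come out as probability 0.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_n_sigma_mask(logits, n=1.0):
    # Keep tokens whose logit is within n standard deviations of the max logit.
    mean = sum(logits) / len(logits)
    std = math.sqrt(sum((x - mean) ** 2 for x in logits) / len(logits))
    threshold = max(logits) - n * std
    return [x >= threshold for x in logits]

def min_p_mask(logits, min_p=0.05):
    # Keep tokens whose probability is at least min_p times the top probability.
    probs = softmax(logits)
    cutoff = min_p * max(probs)
    return [p >= cutoff for p in probs]

def filtered_probs(logits, mask):
    # Drop masked-out tokens and renormalize over the survivors.
    kept = [x if keep else float("-inf") for x, keep in zip(logits, mask)]
    return softmax(kept)

# Toy logits (hypothetical values): two plausible tokens and a low-probability tail.
logits = [5.0, 4.5, 1.0, 0.0, -2.0]
print(top_n_sigma_mask(logits, n=1.0))
print(min_p_mask(logits, min_p=0.05))
print(filtered_probs(logits, top_n_sigma_mask(logits, n=1.0)))
```

In both cases the tail tokens are cut before sampling, so a single unlucky draw from the tail cannot derail a long generation.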
My qwen 3.6 27b (the dense one) runs perfectly well on coding tasks at the edge of its context window because I run it with a modern LLM sampling stack: top_n_sigma set to 1, DRY to stop repetitions, and XTC as a superior alternative to temperature for diversification.
Yes, there will be a paper soon on arXiv, and hopefully in the NeurIPS proceedings, about this phenomenon, because it’s not well appreciated by the academic AI community yet.
Can you please share your llama.cpp server parameters for turning on the modern LLM sampling stack?
The docs [1] say that top_n_sigma is already in the default sampler list: "(default: penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)"
[1] https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...