Btw samplers do in fact help with this. Random tokens deep in your output context come from accumulated sampling error when you use crude truncation samplers like top_p and top_k with temperature.
Use a distribution-aware sampler like p-less decoding, top-H, or top-n sigma, and this goes away.
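For anyone curious what top-n sigma actually does: it keeps only the tokens whose logit falls within n standard deviations of the max logit, then renormalizes and samples. A minimal sketch (my own paraphrase, not the authors' reference code; function name and defaults are made up):

```python
import numpy as np

def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=None):
    """Top-n sigma sampling sketch: keep tokens whose logit is within
    n standard deviations of the maximum logit, drop the rest, then
    sample from the renormalized distribution."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Statistical cutoff: anything more than n*sigma below the top logit
    # is treated as noise and masked out entirely.
    threshold = logits.max() - n * logits.std()
    masked = np.where(logits >= threshold, logits, -np.inf)
    # Softmax over the surviving tokens only.
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

The point is that the cutoff adapts to the shape of the whole logit distribution instead of a fixed probability mass (top_p) or token count (top_k), so confident distributions get aggressively truncated while flat ones keep more candidates.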
Yes, the paper for this will be under review at NeurIPS this year.