I think it's too early to declare the Turing test passed. You just need to have a conversation long enough to exhaust the context window. In practice you need less than that, since response quality degrades long before you hit hard window limits, even with compaction.
Neuroplasticity is hard to simulate in a few hundred thousand tokens.
It was not meant as a pass/fail test.
For a Turing test as rigorous as the one you present, I believe many (or even most) humans would also fail.
How many humans seriously have the attention span to have a million "token" conversation with someone else and get every detail perfect without misremembering a single thing?
"You're absolutely right!"
I think for a while the test was passed. Then we learned the hallmark characteristics of these models, and now most of us can easily differentiate. That said -- these models are programmed specifically to be more helpful, more articulate, more friendly, and more verbose than people, so that may not be a fair expectation. Even so, I think if you took all of that away, you'd still be able to differentiate the two. It just might take longer.