I wasn’t talking about human reinforcement.
The discussion has been about CoT in LLMs, so I’ve been referring to the model in isolation from the start.
Here’s how I currently understand the structure of the thread (apologies if I’ve misread anything):
“Is CoT actually thinking?” (my earlier comment)
→ “Yes, it is thinking.”
→ “It might be thinking.”
→ “Under that analogy, self-training on its own CoT should work — but empirically it doesn’t.”
→ “Maybe it would work if you add external memory with human or automated filtering?”
Regarding external memory: without an external supervisor, whatever gets written into that memory is still the model's own self-generated output, which brings us back to the original problem.
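
To make the circularity concrete, here's a minimal sketch of the kind of loop I have in mind. Everything here (`model.generate`, `model.finetune`, `quality_filter`, `sample_prompt`) is a hypothetical placeholder, not any real API — it's just meant to show where the training signal comes from:

```python
# Minimal sketch: self-training on CoT with an "external memory".
# All names below are illustrative placeholders, not a real library.

memory = []  # the external memory

for step in range(num_steps):
    prompt = sample_prompt()
    cot = model.generate(prompt)  # the model's own chain of thought

    # Without an external supervisor, the only filter available is one
    # derived from the model's own outputs (or a heuristic over them).
    if quality_filter(cot):
        memory.append((prompt, cot))

    # Everything in memory is a function of the model's prior outputs,
    # so the loop never injects information the model didn't produce itself.
    model.finetune(memory)
```

The point of the sketch: adding the memory changes where the outputs are stored, not where the signal originates, unless the filter itself is external (human or some independent automated judge).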