Hacker News

Dilettante_ · yesterday at 3:24 PM

Is there a write-up you could recommend about this?


Replies

ACCount37 · yesterday at 4:30 PM

We have this write-up on the "soul" and how it was discovered and extracted, straight from the source: https://www.lesswrong.com/posts/vpNG99GhbBoLov9og/claude-4-5...

There are many pragmatic reasons to take this "soul data" approach, but we don't know exactly what Anthropic's reasoning was in this case. We just know enough to say that it's likely to improve LLM behavior overall.

Now, on consistency drive and compounding errors in LLM behavior: sadly, no really good overview paper comes to mind.

The topic was investigated the most in the early days of chatbot LLMs, in part because some believed it to be a fundamental issue that would halt LLM progress. A lot of those early papers revolve around this "showstopper" assumption, which is why I can't recommend them.

Reasoning training has proven the "showstopper" notion wrong. It doesn't eliminate the issue outright - but it demonstrates that this issue, like many other "fundamental" limitations of LLMs, can be mitigated with better training.

Before modern RLVR training, we had things like "LLM makes an error -> LLM sees its own error in its context -> LLM builds erroneous reasoning on top of it -> LLM makes more errors like it on the next task" happen quite often. Now, we get less of that - but the issue isn't truly gone. "Consistency drive" is too foundational to LLM behavior to be trained out entirely, and it shows up everywhere, including in things like in-context learning, sycophancy or multi-turn jailbreaks - some of which are very desirable and some of which aren't.
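
To make that failure mode concrete, here's a minimal sketch (plain Python, with a hypothetical call_model() stand-in rather than any real vendor API) of how an ordinary chat loop feeds the whole running transcript back in on every turn, which is exactly why one wrong answer keeps conditioning every later one:

    # Minimal sketch of why errors compound in multi-turn use: the running
    # transcript is fed back in on every turn, so an early mistake keeps
    # conditioning later generations. call_model() is a hypothetical
    # stand-in, not any specific provider's API.

    def call_model(context: str) -> str:
        """Hypothetical stand-in for an LLM completion call."""
        raise NotImplementedError("plug in your own client here")

    def chat_loop(tasks: list[str]) -> list[str]:
        context = ""                        # shared transcript across turns
        answers = []
        for task in tasks:
            context += f"\nUser: {task}\nAssistant:"
            reply = call_model(context)     # conditioned on everything so far,
                                            # including any earlier wrong answers
            context += " " + reply          # the error, if any, is now baked in
            answers.append(reply)
        return answers

Reasoning training doesn't change this loop; it just makes the model more likely to catch and correct the earlier mistake instead of building on it.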

Off the top of my head - here's one of the earlier papers on consistency-induced hallucinations: https://arxiv.org/abs/2305.13534
