Hacker News

throw310822 · today at 11:14 AM

Makes me wonder if during training LLMs are asked to tell whether they've written something themselves or not. It should be quite easy: ask the LLM to produce many continuations of a prompt, mix them with many others produced by humans, and then ask the LLM to tell them apart. This should be possible by introspecting on the hidden layers and comparing them with the provided continuation. I believe Anthropic has already demonstrated that models have partially developed this capability, but it should be trivial and useful to train for it.
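The protocol being proposed could be sketched as a small harness. Everything here is hypothetical: `build_eval_set` and `self_recognition_accuracy` are made-up names, and `classify` stands in for however you'd actually query the model (or probe its hidden states); the toy texts and oracle classifier exist only to exercise the harness.

```python
import random

def build_eval_set(model_continuations, human_continuations, seed=0):
    """Mix model-written and human-written continuations into one
    shuffled, labeled evaluation set."""
    items = [(text, "model") for text in model_continuations]
    items += [(text, "human") for text in human_continuations]
    random.Random(seed).shuffle(items)
    return items

def self_recognition_accuracy(items, classify):
    """Fraction of continuations whose source the classifier identifies.
    `classify` maps a continuation text to "model" or "human"."""
    correct = sum(1 for text, label in items if classify(text) == label)
    return correct / len(items)

if __name__ == "__main__":
    # Toy stand-ins: a real setup would sample continuations from the LLM
    # and pair them with genuine human-written ones.
    model_texts = [f"model continuation {i}" for i in range(5)]
    human_texts = [f"human continuation {i}" for i in range(5)]
    items = build_eval_set(model_texts, human_texts)
    # A perfect "oracle" classifier, only to show the harness works end to end.
    oracle = lambda t: "model" if t.startswith("model") else "human"
    print(self_recognition_accuracy(items, oracle))  # 1.0
```

In a real experiment `classify` is the interesting part: it could be the model answering "did you write this?" in-context, or a linear probe trained on hidden-layer activations, and accuracy above chance would indicate the self-recognition capability.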


Replies

8organicbits · today at 1:25 PM

Isn't that something different? If I prompt an LLM to identify the speaker, that's different from keeping track of the speaker while processing a different prompt.