Hacker News

FuckButtons · yesterday at 6:52 AM

I see where you’re coming from, but I think teasing out something that looks like a clear objective function that generalizes to improved intelligence from LLM interaction logs is going to be hellishly difficult. Consider that most of the best LLM pre-training comes from being very, very judicious with the training data: here you'd be selecting the right corpus of LLM interaction logs and then defining an objective function that correctly models… what? Being helpful? That sounds far harder than just working from scratch with RLHF.


Replies

visarga · yesterday at 10:00 AM

The way I see it is to use hindsight, not to come up with predefined criteria. The criterion is the usefulness of an LLM response in the interactions that follow it down the line.

For example, the model might propose "try doing X", and I come back later and say "I tried X but this and that happened"; it can use that as feedback. The feedback might be generated from the real-world outcomes of the X suggestion, or even from my own experience; maybe I have seen X in practice and know whether it works or not. The longitudinal analysis can span multiple days; the more context, the better for self-analysis.
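To make the hindsight idea concrete, here's a minimal Python sketch of scoring an earlier assistant suggestion from the user turns that follow it. Everything here is illustrative: the Turn structure, the keyword lists, and the attribute-to-the-nearest-suggestion window are toy stand-ins for what would really be an LLM judge reading the follow-up conversation.

    from dataclasses import dataclass

    @dataclass
    class Turn:
        role: str   # "assistant" or "user"
        text: str

    # Toy outcome keywords; a real system would use an LLM judge instead.
    POSITIVE = ("it worked", "that fixed", "solved it")
    NEGATIVE = ("didn't work", "did not work", "same error", "made it worse")

    def hindsight_scores(conversation):
        """Score each assistant turn from the user feedback that follows it."""
        scored = []
        for i, turn in enumerate(conversation):
            if turn.role != "assistant":
                continue
            # Simplifying assumption: attribute only the user turns up to the next
            # assistant suggestion; a real longitudinal analysis could span days.
            window = []
            for later in conversation[i + 1:]:
                if later.role == "assistant":
                    break
                window.append(later.text.lower())
            feedback = " ".join(window)
            score = sum(p in feedback for p in POSITIVE) - sum(n in feedback for n in NEGATIVE)
            scored.append((turn.text, float(score)))
        return scored

    log = [
        Turn("user", "The build keeps failing on CI."),
        Turn("assistant", "Try pinning the compiler version (X)."),
        Turn("user", "I tried X but the same error happened."),
        Turn("assistant", "Then clear the dependency cache before the build step."),
        Turn("user", "That fixed it, thanks."),
    ]

    print(hindsight_scores(log))
    # [("Try pinning the compiler version (X).", -1.0),
    #  ("Then clear the dependency cache before the build step.", 1.0)]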

The cool thing is that generating preference scores for LLM responses, training a judge model on them, and then doing RLHF with this judge model on the base LLM ensures isolation, so personal data leaks might not be an issue. Another beneficial effect is that the judge model learns to transfer judgment skills across similar contexts, so there might be some generalization going on.
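And a rough sketch of the judge-model step, under the same caveat: assuming hindsight-derived preference pairs already exist, a reward model can be trained with the standard pairwise (Bradley-Terry) loss and would then supply the reward signal for RLHF on the base model. The featurize hashing encoder, the Judge network, and the example pairs are toy stand-ins I made up for a real LLM backbone; only the loss shape is the actual technique.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    DIM = 256

    def featurize(text):
        """Hash words into a fixed-size bag-of-words vector (toy stand-in for an LLM encoder)."""
        v = torch.zeros(DIM)
        for w in text.lower().split():
            v[hash(w) % DIM] += 1.0
        return v

    class Judge(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, x):
            # x: (batch, DIM) -> one scalar reward per response
            return self.net(x).squeeze(-1)

    # Hindsight-derived pairs: (context, chosen response, rejected response)
    pairs = [
        ("build failing on CI", "clear the dependency cache", "pin the compiler version"),
    ]

    judge = Judge()
    opt = torch.optim.Adam(judge.parameters(), lr=1e-3)

    for _ in range(100):
        chosen = torch.stack([featurize(c + " " + a) for c, a, _ in pairs])
        rejected = torch.stack([featurize(c + " " + b) for c, _, b in pairs])
        # Bradley-Terry / pairwise logistic loss: push the chosen reward above the rejected one.
        loss = -F.logsigmoid(judge(chosen) - judge(rejected)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # The trained judge would then score on-policy generations as the reward for an
    # RLHF loop (e.g., PPO) on the base model; the personal data only touches the
    # judge-training step, not the base model's training corpus.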

Of course there is always the risk of systematic bias and random noise in the data, but I believe AI researchers are equipped to deal with that. It won't be as simple as I described, but the size of the interaction dataset, the human in the loop, and the real-world testing are certainly useful for LLMs.