Hacker News

Legend2440 (yesterday at 11:21 PM)

Almost certainly those weren't even in the training data. They showed up too soon; LLMs are retrained only every 6-12 months.

Instead, the LLM did a web search for 'bixonimania' and summarized the top results. This is not an example of training data poisoning.

>This is an extraordinary claim, and extraordinary claims require extraordinary evidence.

Well, I don't know what to tell you; double descent is widely accepted in ML at this point. Neural networks are routinely larger than their training data, and yet still generalize quite well.

That said, even a model that does not overfit can still repeat false information if the training data contains false information. It's not magic.


Replies

runarberg (today at 12:11 AM)

> even a model that does not overfit can still repeat false information

A good model will disregard outliers, or at the very least the weight of any one outlier is offset by the weight of the rest of the sample. In other words, a good model won’t repeat false information. When you have too many parameters, the model will fit every outlier, even the ones that are not representative of the sample. This is the poison.
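A toy sketch of that distinction, using plain least-squares polynomials rather than neural networks (the data, degrees, and the single planted outlier are made up for illustration): a small model averages the outlier away, while a model with as many parameters as data points passes straight through it.

```python
import numpy as np

# 10 points on the line y = 2x, with one planted "false" value.
x = np.arange(10, dtype=float)
y = 2 * x
y[5] = 30.0  # outlier: the true value here would be 10

# A 2-parameter model: the outlier is offset by the other points.
lo = np.polynomial.polynomial.Polynomial.fit(x, y, deg=1)

# A 10-parameter model on 10 points: it interpolates every point,
# including the outlier.
hi = np.polynomial.polynomial.Polynomial.fit(x, y, deg=9)

print(abs(lo(5.0) - 30.0))  # large: the line stays near the true trend
print(abs(hi(5.0) - 30.0))  # ~0: the big model repeats the false value
```

Double descent is the observation that pushing *past* this interpolation threshold can bring test error back down again, but as the parent notes, interpolating models still reproduce whatever false points they interpolated.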

To me it sounds like data scientists have found an interesting and seemingly real phenomenon, namely double descent, and LLM makers are using it as a magic solution to whisk away all sorts of problems that this phenomenon may or may not help with.

> Instead, the LLM did a web search for 'bixonimania' and summarized the top results. This is not an example of training data poisoning.

Good point, I hadn’t considered this. Although it is probably more likely that it did a web search with the list of symptoms and output the term from there, especially considering that the research papers which cited the fictitious disease probably did not include the made-up term in their prompts.