Hacker News

drob518 · today at 5:29 PM · 1 reply

But does that really help, or do you get distortion? The frequency distribution of human-generated content moves slowly over time as new subjects are discussed. What frequency distribution do those “data generating processes” use? And at root, aren’t those “data generating processes” basically just another LLM (i.e., generating tokens according to a probability distribution)? If so, aren’t we just feeding AI slop into the next training run and flattering ourselves by renaming the slop “synthetic data”? Not trying to be argumentative. I’m far from an AI expert, so maybe I’m missing something. Feel free to explain why I’m wrong.


Replies

htrp · today at 7:03 PM

That's the problem in a nutshell. There is an art to how you generate the synthetic data so that you don't end up with poorly trained models (especially when mistakes cost XX million dollars).
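Part of that "art" is that synthetic data is rarely used raw: one common technique (a general illustration, not a claim about any particular lab's pipeline) is rejection sampling against an independent verifier, so only checkable-correct samples make it into the training set. A minimal sketch with a deliberately noisy toy generator and a programmatic checker, where every name is hypothetical:

```python
import random

def noisy_generator(rng):
    """Stand-in for an LLM emitting (question, answer) pairs.
    Hypothetical: 30% of the raw samples carry a wrong answer."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    ans = a + b
    if rng.random() < 0.3:              # inject "slop"
        ans += rng.choice([-1, 1])
    return f"{a}+{b}", ans

def filtered_synth_data(n, seed=0):
    """Rejection sampling: generate candidates, keep only those an
    independent verifier confirms, so generator errors don't
    compound into the next training run."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        q, ans = noisy_generator(rng)
        a, b = map(int, q.split("+"))
        if a + b == ans:                # programmatic verifier
            kept.append((q, ans))
    return kept
```

The key design point is that the verifier is *not* the generator: for math or code you can check answers programmatically, which breaks the "LLM grading its own output" loop the parent comment is worried about. Domains without a cheap verifier are exactly where synthetic data gets risky.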

It's also, in theory, why Facebook paid $14bn for Alexandr Wang and Scale AI.