So LLMs predict the next token. Basically, you train them by taking training data that's N tokens long and, for X = 1 to N, optimizing the model to predict token X using tokens 1 to X-1.
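Here's a minimal sketch of that loss in PyTorch. The embedding-plus-linear "model" is a made-up stand-in (a real LLM is a transformer with a causal attention mask), but the objective is the same: each position's logits are scored against the token that actually comes next, which does the whole X = 1 to N loop in parallel.

```python
import torch
import torch.nn.functional as F

vocab_size = 100
seq = torch.randint(0, vocab_size, (1, 16))  # one training sequence, N=16 tokens

# Toy stand-in for a causal language model: embeds each token and produces
# logits over the vocabulary at every position. A real LLM would be a
# transformer with a causal mask, but the next-token loss is identical.
embed = torch.nn.Embedding(vocab_size, 32)
head = torch.nn.Linear(32, vocab_size)

hidden = embed(seq)    # (1, 16, 32)
logits = head(hidden)  # (1, 16, vocab_size)

# "Predict token X from tokens 1..X-1" at every position at once:
# position i's logits are compared against the token at position i+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for tokens 2..N
    seq[:, 1:].reshape(-1),                  # the actual next tokens
)
loss.backward()  # gradients flow into embed and head for an optimizer step
```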
There's no reason you couldn't generate training data for a model by taking the output of another model. You could even take the probability distribution over output tokens from the source model and train the target model to reproduce that distribution, instead of a single word. That'd be more efficient: instead of learning to say "Hello!" and "Hi!" from two different examples, one where it says hello and one where it says hi, the model learns to say both from one example whose target is a 50/50 probability distribution over the two outputs.
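A sketch of that soft-label training step, again in PyTorch. The token ids and the teacher's 50/50 split are invented for illustration; the point is that a KL-divergence loss pulls the student's entire output distribution toward the teacher's in one step.

```python
import torch
import torch.nn.functional as F

vocab_size = 100

# Hypothetical teacher distribution for one prediction step: the source
# model puts roughly 50% each on "Hello!" and "Hi!" (token ids 7 and 42
# are made up; the tiny mass everywhere else mimics a real softmax
# output, which is never exactly zero).
teacher_logits = torch.full((vocab_size,), -10.0)
teacher_logits[7] = 5.0
teacher_logits[42] = 5.0
teacher_probs = F.softmax(teacher_logits, dim=-1)

# Random stand-in for the target model's logits at the same step.
student_logits = torch.randn(vocab_size, requires_grad=True)

# KL divergence pushes the student's whole distribution toward the
# teacher's, so one example teaches both answers at once instead of
# needing two separate hard-labeled examples.
loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),  # student log-probabilities
    teacher_probs,                          # teacher probabilities
    reduction="sum",
)
loss.backward()  # gradient for one distillation step
```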
Sometimes DeepSeek said its name is ChatGPT. This could be because they used Q&A pairs from ChatGPT for training, or because they scraped conversations other people posted of themselves talking to ChatGPT. Or for unknown reasons where the model just decided to respond that way, e.g. mixing up the semantics of wanting to say "I'm an AI" with all the scraped data referring to AI as ChatGPT.
Short of an admission or a leak of DeepSeek's training data, it's hard to tell. On the other hand, DeepSeek really did go hard on an architecture that's cheap to train, using a lot of unusual techniques to optimize the training process for their hardware.
Personally, I think they did train on ChatGPT output. Research shows that a model can be greatly improved with a relatively small set of high-quality Q&A pairs. But I'm not sure that should change the cost evaluation much, because the ChatGPT training price was only paid once; it doesn't have to be repaid for every new model that cribs its answers.