I thought this was common knowledge. DeepSeek’s Wikipedia entry says that they trained all their models on Nvidia chips procured before the U.S. export ban on selling them to China took effect. It wouldn’t surprise me if they continued acquiring them through, well, less than legal means.
I also read somewhere (not Wikipedia) that they trained on ChatGPT, Claude, and Gemini queries, basically feeding in the output of competitors’ LLMs as training data. Kinda surprised they didn’t run into model collapse problems, but they stole their training data from other people who stole their training data from data collections that arguably stole it from content creators. It’s bandits all the way down, so adding a little smuggling to that doesn’t surprise me.
There’s no honor among thieves. You don’t get to cry about Chinese “bandits” when Anthropic just had to pay $1 billion to settle a massive copyright infringement lawsuit. All of these models were created through the mass-scale theft of humanity’s intellectual property, personal data, and dignity.
Open always beats closed. Drain the moats. Starve the ClosedAI beast.
It's not even stealing. They paid OpenAI for the tokens. It violates the OpenAI TOS, which specifically forbids using its outputs to train competing models (which is very ironic).
Everyone trains on outputs from other models; it's called distillation.
> It wouldn’t surprise me if they continued acquiring them through, well, less than legal means.
Strictly speaking, it's not illegal for them to acquire the chips; it's illegal for an exporter in the US to sell to them (even transitively).
I think the fact that DeepSeek trains on competitor outputs (i.e., distillation), along with its use of export-controlled Nvidia chips, helps explain how it can achieve such low training costs (USD 6 million vs. billions) while delivering only slightly worse performance than its American counterparts. It also undermines the narrative that DeepSeek or China poses a serious challenge to the U.S. lead in AI. The gap may be closing, but the initial reactions look knee-jerk in hindsight.
That the discussion has been hijacked and shifted to moral superiority is really unfortunate, because that was never the point in the first place.
Training on data isn’t stealing the data, in the same way that learning from a textbook doesn’t mean you're stealing from it.
> Kinda surprised they didn’t run into model collapse problems
Not sure why you would expect this. All the labs started doing this, since it's much more cost-effective to get data for post-training. Don't you remember the first Grok release, where it often started replies with "as a model trained by OpenAI..."?
> I also read somewhere (not Wikipedia) that they trained on ChatGPT, Claude, and Gemini queries, basically feeding in the output of competitors’ LLMs as training data
All the labs that permit synthetic data do that.
They don't need to break any laws for this. Who do you think the customers for those Middle East data centers are? Chinese companies do everything legally, paying for access to data centers that got the chips directly from the US.
Let's take the weapons embargoes placed on Israel by our allies. In the NDAA, a must-pass bill, we set funds aside to procure those weapons and sell them to Israel. We don't really care about these things; we have selective enforcement.
> less than legal means.
This is an absurd concept when it comes to international trade. Even intellectual property is mostly meaningless outside a state. Of course people will evade sanctions; what is the US going to do, invade Singapore or Malaysia?
> Kinda surprised they didn’t run into model collapse problems,
This is just model distillation.
Anyone with the expertise to build a model from scratch (which DeepSeek certainly has) can do this in a careful manner.
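Roughly, this kind of distillation just means sampling completions from a stronger model and fine-tuning your own model on them with ordinary next-token cross-entropy. A toy sketch of the idea (everything here is a placeholder I made up for illustration: `query_teacher`, `TinyStudent`, the vocab size; it's not DeepSeek's pipeline or any real API):

```python
# Sequence-level distillation sketch: train a student on teacher-generated text.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000  # toy vocabulary size


def query_teacher(prompt_ids: torch.Tensor) -> torch.Tensor:
    """Stand-in for calling a teacher model's API and tokenizing its replies."""
    return torch.randint(0, VOCAB, (prompt_ids.size(0), 16))  # fake completions


class TinyStudent(nn.Module):
    """A toy next-token predictor standing in for the student LLM."""

    def __init__(self, vocab: int = VOCAB, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)


student = TinyStudent()
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)

for step in range(100):
    prompts = torch.randint(0, VOCAB, (8, 8))     # sampled prompts
    completions = query_teacher(prompts)          # teacher outputs become targets
    seq = torch.cat([prompts, completions], dim=1)

    logits = student(seq[:, :-1])                 # predict each next token
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))

    opt.zero_grad()
    loss.backward()
    opt.step()
```

The "careful manner" is mostly in the data side: filtering and deduplicating the teacher outputs and mixing them with other data so the student doesn't just amplify the teacher's quirks.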
> but they stole their training data from other people who stole their training data from data collections that arguably stole them from content creators.
Bingo.
I have no problem with pirates pirating other pirates.
Screw OpenAI's and Anthropic's closed-source models built from public data. The law should be that weights trained on non-owned sources are public domain, or that any copyright holder can sue them and demand a model takedown.
Google and Meta are probably the only two AI companies that have a right to license massive amounts of training data from social media and user file uploads, given that their ToSes grant them these rights. But even Meta is pirating stuff.
Even if OpenAI and Anthropic continue pirating training data and keeping the results closed, China's open source strategy will win out in the end. It erodes the crust of value that is carefully guarded by the American giants. Everyone else will be integrating open models and hacking them apart, splicing them in new ways.
"The TBD group is using several third-party models as part of the training process for Avocado, distilling from rival models including Google’s Gemma, OpenAI’s gpt-oss and Qwen, a model from the Chinese tech giant Alibaba Group Holding Ltd., the people said."
LOL. Either distillation doesn't count as plagiarism, or you should call Meta out on it too. They're distilling the Chinese model.
Ref: https://www.moneycontrol.com/news/business/inside-meta-s-piv...
Lol @ citing three companies that have broken every copyright law imaginable as the victims here.
I find it weird that this comment got so much pushback. I don't think it was portraying DeepSeek as any more morally wrong than anyone else, or castigating anyone as morally wrong.
But maybe I gave it a gracious reading.
Singapore is where it happens.
> It’s bandits all the way down, so adding a little smuggling to that doesn’t surprise me.
Implying it’s *morally* wrong for a Chinese company to bypass US sanctions is hilarious. Can you really say that with a straight face when even the president admits this is only protectionism?