The issue here is not whether Anthropic used Common Crawl; Alibaba does that too.
The issue is that by distilling Claude, Alibaba reuses the IP Anthropic used to train the model. That's more akin to historical Chinese reverse-engineering methods and disrespect of IP.
If using Common Crawl or Anna's Archive in your training data is legal, then surely the same is true for using conversations with Claude. I don't see a reasonable framework where training AI on copyrighted data is ok if and only if that data is not generated by AI.
(Granted, only Meta got caught using Anna's Archive, but it seems safe to assume it's common practice. And even if it weren't, the websites in Common Crawl are still covered by copyright.)
I wish people would stop repeating Anthropic’s incorrect use of the term "distill". They don’t share logits, so you can’t distill. You can generate training data, which doesn’t sound nearly so scary.
'Issue' for who?
Anthropic clearly doesn't respect other people's IP; it's real rich that they now insist on theirs being worthy of protection.
Fwiw, I think the concept of IP in general is counter to human progress.
> reuses the IP Anthropic used to train the model
> disrespect of IP
Nobody other than Anthropic cares.
> Alibaba reuses the IP Anthropic used to train the model. That's more akin to historical Chinese reverse-engineering methods and disrespect of IP.
Why is this any worse than Anthropic's disrespect of IP? You've apparently drawn a distinction between the two here, but I'm failing to see what it actually is.
Alibaba paid for that data though, right? They didn't hack Anthropic, they bought accounts and ran them normally.
Also, you can't copyright AI outputs. So worst case they violated the ToS.