> You can do what frontier labs do today, which is to properly license things that are copyrighted and use open source web crawls for things that don’t have copyright issues. You can then also commission new datasets (volume needed goes down when quality is high).
It cost Anthropic a $1.5 billion settlement over the roughly 480k pirated LibGen ebooks it used for training.
Investors will cough up that money if you're already clearly a frontier lab with a model people are paying a lot of money for.
Tough to get that much cash without anything to show.
I thought the joke was that people aren't paying enough money.