You also need to scrape huge amounts of data with no regard for copyright, which is:
1. No longer possible the same way it was for OpenAI and Anthropic, and
2. Much more regulated in the EU
Also, the EU would need state backing since we don't have the same private capital, meaning the regulations would be even tighter.
You can do what frontier labs do today, which is to properly license things that are copyrighted and use open source web crawls for things that don’t have copyright issues. You can then also commission new datasets (the volume needed goes down when quality is high).
The European regulations are the thing that will kneecap anything meaningful coming out of Europe. It's mind-blowing to me that this is considered worth the tradeoff: Europe will be beholden to other frontier labs, be it in China or the US, so the regulations accomplish very little, if anything, in shaping actual AI development, while Europe loses vast amounts of leverage in the process.