Hacker News

xdavidliu · last Friday at 8:57 PM

I should have clarified what I meant. The training data includes, roughly speaking, the entire internet. Open source code is probably a large fraction of the code in the data, but it is a tiny fraction of the total data, which is mostly non-code.

My point was that the hypothetical of "not contributing to any open source code," taken to the extent that LLMs had no code to train on, would not have had as big an impact as that person thought, since the large majority of the internet is text, not code.


Replies

maplethorpe · yesterday at 4:32 AM

I'm sorry, but your point doesn't make sense to me. Training on all the world's text but omitting code means your machine won't know how to write code. That's an enormous impact, not a small one.

Unless you're in the camp that believes ChatGPT can extrapolate outside its training data and do computer programming without ever having trained on any programming material?
