maxloh, yesterday at 10:55 PM

That's a fair point.

It is legal to train on copyrighted materials, provided they were obtained legally. Most companies also train their models using user interactions with previous iterations.

It is impossible to release this data publicly, let alone license it to a third party. However, I believe that at least the training code and the data processing pipeline could, and should, be released in order to claim a model is truly "open source."

That said, Allen AI actually released several models with the full datasets available. It is impressive how they pushed the models' performance despite training on a limited set of publicly available data. Kudos to them.