Hacker News

mschuster91 today at 6:43 PM (5 replies)

"Open training" is something that won't ever happen for large-scale models. For one, probably everyone's training datasets include large amounts of questionable material: copyrighted media first and foremost (court cases have shown that AI models can regurgitate entire books almost verbatim), but also AI slop contaminating the dataset, or on the extreme end CSAM. For Grok to know what the intimate bits of children look like (which is what was shown during the time anyone could prompt it with "show her in a bikini"), it obviously has to have ingested CSAM during training.

And then, a ton of training still depends on human labor. Even at $2/h in exploitative bodyshops in Kenya [1], that adds up to a significant financial investment in training datasets. Image training datasets are expensive to build as well: Google's reCAPTCHA used millions of hours of humans classifying which squares contained objects like cars or motorcycles.

[1] https://time.com/6247678/openai-chatgpt-kenya-workers/


Replies

iamcreasy today at 9:16 PM

> "open training" is something that won't ever happen for large scale models

https://www.swiss-ai.org/apertus

Source: EPFL, ETH Zurich, and the Swiss National Supercomputing Centre (CSCS) have released Apertus, Switzerland’s first large-scale open, multilingual language model, a milestone in generative AI for transparency and diversity. Trained on 15 trillion tokens across more than 1,000 languages – 40% of the data is non-English – Apertus includes many languages that have so far been underrepresented in LLMs, such as Swiss German, Romansh, and many others. Apertus serves as a building block for developers and organizations for future applications such as chatbots, translation systems, or educational tools. The model is named Apertus – Latin for “open” – highlighting its distinctive feature: the entire development process, including its architecture, model weights, and training data and recipes, is openly accessible and fully documented.

hananova today at 7:57 PM

I’m not convinced that Grok’s dataset must contain CSAM for it to generate CSAM. Surely a combination of nude adults and clothed children would allow for it to synthesize CSAM?

(Disclaimer: I’m not in favor of AI in general and definitely not in favor of what Grok is doing specifically. I’m just not entirely sold on the claim that its dataset must contain CSAM, though I think it likely has at least some, because cleaning up such a massive dataset carefully and thoroughly costs money that Elon wouldn’t want to spend.)

oscarmoxon today at 7:54 PM

Agree that this makes it unlikely we'll see frontier training data open-sourced, but this is a separate problem from software and infrastructure transparency, which has none of those constraints. The training stack, the parallelism decisions, and the documented failure modes are engineering knowledge, and there's no principled reason they can't ship.

pfortuny today at 7:46 PM

The human labor aspect is essential, very little discussed, and very abusive, I am sure.

People think of these models as "magic" and "science", but they do not realize the immense amount (in human-years) of clicking yes/no in front of thousands of input/output pairs.

I worked for some months as a Google Quality Rater (wow), and know the job. This must be much worse.

addiefoote8 today at 6:58 PM

I agree that full transparency on data adds several other challenges. Still, even releasing the software and infrastructure aspects would be a huge step from where we are now. Also, some recent work has shown pretraining data filtering to be possible and beneficial, which could help mitigate some concerns about sensitive data in the datasets.