> It remains unclear whether continuing to throw vast quantities of silicon and ever-bigger corpu...

drob518 • today at 4:59 PM • 6 replies • view on HN

> It remains unclear whether continuing to throw vast quantities of silicon and ever-bigger corpuses at the current generation of models will lead to human-equivalent capabilities. Massive increases in training costs and parameter count seem to be yielding diminishing returns. Or maybe this effect is illusory. Mysteries!

I’m not even sure whether this is possible. The current corpus used for training includes virtually all known material. If we make it illegal for these companies to use copyrighted content without remuneration, either the task gets very expensive, indeed, or the corpus shrinks. We can certainly make the models larger, with more and more parameters, subject only to silicon’s ability to give us more transistors for RAM density and GPU parallelism. But it honestly feels like, without another “Attention is All You Need” level breakthrough, we’re starting to see the end of the runway.

Replies

munificent • today at 5:29 PM

There is a whole giant essay I probably need to write at some point, but I can't help but see parallels between today and the Industrial Revolution.

Prior to the industrial revolution, the natural world was nearly infinitely abundant. We simply weren't efficient enough to fully exploit it. That meant that it was fine for things like property and the commons to be poorly defined. If all of us can go hunting in the woods and yet there is still game to be found, then there's no compelling reason to define and litigate who "owns" those woods.

But with the help of machines, a small number of people were able to completely deplete parts of the earth. We had to invent giant legal systems in order to determine who has the right to do that and who doesn't.

We are truly in the Information Age now, and I suspect a similar thing will play out for the digital realm. We have copyright and intellecual property law already, of course, but those were designed presuming a human might try to profit from the intellectual labor of others. With AI, we're in the industrial era of the digital world. Now a single corporation can train an AI using someone's copyrighted work and in return profit off the knowledge over and over again at industrial scale.

This completely unpends the tenuous balance between creators and consumers. Why would a writer put an article online if ChatGPT will slurp it up and regurgitate it back to users without anyone ever even finding the original article? Who will contribute to the digital common when rapacious AI companies are constantly harvesting it? Why would anyone plant seeds on someone else's farm?

It really feels like we're in the soot-covered child-coal-miner Dickensian London era of the Information Revolution and shit is gonna get real rocky before our social and legal institutions catch up.

➕ show 9 replies

xmprt • today at 5:14 PM

I see a lot of researchers working on newer ideas so I wouldn't be surprised if we get a breakthrough in 5-10 years. After all, the gap between AlexNet and Attention is All You Need was only 6 years. And then Scaling Laws was about 3-4 years after that. It might seem like not much progress is being made but I think that's in part because AI labs are extremely secretive now when ideas are worth billions (and in the right hands, potentially more).

Of course 5-10 years is a long time to bang our heads against the wall with untenable costs but I don't know if we can solve our way out of that problem.

➕ show 2 replies

embedding-shape • today at 5:04 PM

> I’m not even sure whether this is possible.

Based on what's happened so far, maybe. At least that's exactly how we got to the current iteration back in 2022/2023, quite literally "lets see what happens when we throw an enormous amount data at them while training" worked out up until one point, then post-training seems to have taken over where labs currently differ.

➕ show 1 reply

htrp • today at 5:10 PM

We pay people to create more high quality tokens (mercor, turing) which are then fed into data generating processes (synthetic data) to create even more tokens to train on

➕ show 1 reply

krainboltgreene • today at 5:25 PM

> The current corpus used for training includes virtually all known material.

This is just totally incorrect. It's one of those things everyone just assumes, but there's an immense amount of known material that isn't even digitized, much less in the hands of tech companies.

➕ show 1 reply

alt Hacker News

Replies