Hacker News

palisade today at 3:14 AM (17 replies)

I've been contemplating a decentralized model training system for some time using volunteer machines that we all contribute. But, it is astronomically difficult. The communication speeds are untenable.

And, there is the issue of data poisoning from untrusted nodes. I've almost cracked that last issue with a self-healing checkpointed rollback system that doesn't have to throw out anything that follows the corrupt datum.

But, I'm just one person with an idea and I don't have infinite funds to make this happen. This isn't a small project.

Maybe there would be interest in something like this, now that entire frontier labs are being banned from making further progress.

The total power of all GPUs on the planet dwarfs their capabilities, if we had a way to harness them in a distributed way efficiently. We wouldn't be able to train a Fable as fast as they can, but eventually having access is better than never having access.


Replies

sho today at 5:31 AM

As I replied to a child comment - this is a nice idea that just isn't tenable in reality. AI hardware isn't just hilariously faster than consumer GPUs, it's also hilariously more power-efficient and has hilariously better connectivity. Every one of these dimensions kills the idea.

The far, FAR superior power efficiency means that even if you did harness every public GPU or GPU-like device on earth, you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.

And even if electricity were free, having those GPUs spread over the world with internet-level latency would slow everything down by factors of thousands to millions - if it's feasible at all. Regardless, you're not getting fable-oss this decade, maybe not even this century.
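A rough back-of-envelope sketch of the bandwidth side of this argument. Every figure below (model size, gradient precision, link speeds) is an illustrative assumption, not a measurement from this thread, and it ignores latency, stragglers, and any compression.

```python
def sync_seconds(num_params: float, bytes_per_param: float, link_bytes_per_s: float) -> float:
    """Time to ship one full copy of the gradients over a single link."""
    return num_params * bytes_per_param / link_bytes_per_s

# All constants are assumptions chosen for illustration only.
PARAMS = 8e9           # assumed: an 8B-parameter model
BYTES_PER_PARAM = 2    # assumed: 16-bit gradients
HOME_UPLINK = 12.5e6   # assumed: 100 Mbit/s home uplink = 12.5 MB/s
DC_LINK = 50e9         # assumed: 400 Gbit/s datacenter link = 50 GB/s

home = sync_seconds(PARAMS, BYTES_PER_PARAM, HOME_UPLINK)
dc = sync_seconds(PARAMS, BYTES_PER_PARAM, DC_LINK)
slowdown = home / dc

print(f"home uplink: {home:.0f} s per sync, datacenter link: {dc:.2f} s per sync")
print(f"slowdown from bandwidth alone: about {slowdown:.0f}x")
```

Under these assumed numbers a single uncompressed gradient exchange takes about 1280 s over the home uplink versus about 0.32 s over the datacenter link, which is already a gap in the thousands before latency is counted.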

It would be better for governments to buy and own their own datacenters, maybe as a coalition, and dedicate their operation to the public good. I believe that is what we actually have to do.

trenchgun today at 5:26 AM

>But when people think of decentralized training, they don’t first think of gigantic datacenters, owned by the same company, training models across large distances. Instead, they imagine thousands of small datacenters, or individual consumers, pooling their spare compute over the internet to orchestrate a training run larger than any single actor could manage alone. Many companies are pursuing this vision: Pluralis Research, Prime Intellect and Nous Research have already successfully decentrally trained models at scale. But in practice, training decentrally over the internet has lagged far behind more centralized training. Even their largest models (Pluralis’ 8B Protocol Model, Prime Intellect’s INTELLECT-1, and Nous’ Consilience 40B) have been trained with 1,000x less compute than today’s frontier models (such as xAI’s Grok 4). https://epoch.ai/gradient-updates/how-far-can-decentralized-...

girvo today at 4:05 AM

> The total power of all GPUs on the planet dwarfs their capabilities

That just isn't true. It misunderstands exactly how much silicon has gone directly to those companies, and exactly how much more powerful said silicon is compared to consumer grade gear.

WithinReason today at 7:56 AM

The gradient info can be compressed 10,000x with the right tricks. I think it is achievable, and Nous claims they did it already:

https://github.com/NousResearch/DisTrO

There are also earlier gradient compression papers reporting large compression rates.
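For readers unfamiliar with the older line of work, here is a minimal sketch of one generic technique from it: top-k gradient sparsification with error feedback. This is not DisTrO's algorithm, and the function names and toy gradient are made up for illustration. Each worker sends only the k largest-magnitude entries and carries the unsent remainder into the next step.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries; return sorted (index, value) pairs."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = sorted(order[:k])
    return [(i, grad[i]) for i in keep]

def densify(pairs, length):
    """Rebuild a dense vector from (index, value) pairs, zeros elsewhere."""
    out = [0.0] * length
    for i, v in pairs:
        out[i] = v
    return out

def compress_with_feedback(grad, residual, k):
    """Add last step's unsent residual, send the top-k, keep the rest for later."""
    corrected = [g + r for g, r in zip(grad, residual)]
    pairs = top_k_sparsify(corrected, k)
    sent = densify(pairs, len(grad))
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return pairs, new_residual

# Toy gradient (hypothetical values): only 2 of 8 entries go over the wire.
grad = [0.1, -2.0, 0.05, 1.5, -0.2, 0.0, 0.3, -0.01]
residual = [0.0] * len(grad)
pairs, residual = compress_with_feedback(grad, residual, k=2)
print(pairs)
```

The error-feedback residual is the design choice that matters: small entries are delayed rather than discarded, so they still reach the other workers once they accumulate. How far this can be pushed toward the ratios claimed above is exactly what the thread is arguing about.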

andai today at 6:38 AM

>The communication speeds are untenable.

Can it be parallelized or not?

If you take a model, make two copies, and fine-tune each one on different data, what happens when you merge them? Does it work if you freeze different layers?

I think this works if the steps are small enough. And the transfer should become tenable if the steps are big enough. Where's the cutoff?
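A toy sketch of the "train copies separately, then merge" idea, in the style of local SGD or federated averaging. It assumes a one-parameter convex model and made-up data, so it says nothing about where the cutoff lies for large non-convex networks. It only shows the mechanics: each worker takes several local steps on its own shard, then the copies are averaged.

```python
def local_sgd(w, data, lr, steps):
    """Run `steps` SGD steps on y = w * x with squared error, on one worker's shard."""
    for t in range(steps):
        x, y = data[t % len(data)]
        grad = 2 * (w * x - y) * x
        w -= lr * grad
    return w

def average(weights):
    """Merge worker copies by plain parameter averaging."""
    return sum(weights) / len(weights)

# Hypothetical shards, both drawn from the same underlying rule y = 3x.
shard_a = [(1.0, 3.0), (2.0, 6.0)]
shard_b = [(3.0, 9.0), (0.5, 1.5)]

w = 0.0
for round_ in range(20):
    # Each worker starts from the shared weights and trains without communicating.
    w_a = local_sgd(w, shard_a, lr=0.05, steps=4)
    w_b = local_sgd(w, shard_b, lr=0.05, steps=4)
    # One communication per round instead of one per step.
    w = average([w_a, w_b])

print(w)
```

Here communication happens once every 4 local steps instead of every step. Raising `steps` cuts traffic further, and in this convex toy the average still converges to 3; whether the same holds as the copies drift apart in a real model is the open question.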

Davidzheng today at 3:25 AM

Is the total compute capacity outside of Meta, Google, Amazon, Anthropic, OAI, and X even higher than the capacity of any one of them? In any case, there's no chance a public collaboration gets to Anthropic levels of compute, even if communication were no issue.

cpdomina today at 6:07 AM

there was a project trying to achieve some of those goals a few years ago using p2p: petals https://github.com/bigscience-workshop/petals

their bloom model was also a collaborative effort https://huggingface.co/docs/transformers/en/model_doc/bloom

whiplash451 today at 6:04 AM

This could be of interest to you: https://thealliance.ai/projects/tapestry

Catloafdev today at 4:17 AM

Ya that'd be an awesome project, the only issue is how do you verify it's not being poisoned? To actually validate it would require more analysis than the training took to run. It would require a trusted network, not an open one, unless that can get solved somehow.

laserx today at 3:28 AM

there are some strong open source groups like Nous Research taking up the fight https://nousresearch.com/

whateverboat today at 7:33 AM

The biggest problem is accuracy and integrity of the actors in the project.

rustcleaner today at 5:24 AM

Could it be done by making a sparse MoE of thousands, or tens of thousands, of smaller experts in very niche domains? Maybe a tree-like structure of experts which can delegate from relatively general but inaccurate to extremely niche but accurate? Also, these experts might be plug-and-play: easily swap out an inferior expert for a stronger one in the future without having to redo the whole pile?

slashdave today at 5:56 AM

Well, I suppose it is understandable why you want to attack the most obvious problem with such a scheme: obtaining sufficient compute.

That does mean you are actually neglecting the more difficult issues.

labbett today at 6:04 AM

Sounds like SETI@home but for AGI... SAGI@home?

thomasjeff1 today at 3:16 AM

I believe we are not the only ones

ai_fry_ur_brain today at 3:46 AM

[flagged]
