I've been contemplating a decentralized model training system for some time using volunteer machines that we all contribute. But, it is astronomically difficult. The communication speeds are untenable.
And, there is the issue of data poisoning from untrusted nodes. I've almost cracked that last issue with a self-healing checkpointed rollback system that doesn't have to throw out anything that follows the corrupt datum.
But, I'm just one person with an idea and I don't have infinite funds to make this happen. This isn't a small project.
Maybe there would be interest in something like this, now that entire frontier labs are being banned from making further progress.
The total power of all GPUs on the planet dwarfs their capabilities, if we had a way to harness them efficiently in a distributed way. We wouldn't be able to train a Fable as fast as they can, but eventually having access is better than never having access.
>But when people think of decentralized training, they don’t first think of gigantic datacenters, owned by the same company, training models across large distances. Instead, they imagine thousands of small datacenters, or individual consumers, pooling their spare compute over the internet to orchestrate a training run larger than any single actor could manage alone. Many companies are pursuing this vision: Pluralis Research, Prime Intellect and Nous Research have already successfully decentrally trained models at scale. But in practice, training decentrally over the internet has lagged far behind more centralized training. Even their largest models (Pluralis’ 8B Protocol Model, Prime Intellect’s INTELLECT-1, and Nous’ Consilience 40B) have been trained with 1,000x less compute than today’s frontier models (such as xAI’s Grok 4). https://epoch.ai/gradient-updates/how-far-can-decentralized-...
> The total power of all GPUs on the planet dwarfs their capabilities
That just isn't true. It misunderstands exactly how much silicon has gone directly to those companies, and exactly how much more powerful said silicon is compared to consumer grade gear.
The gradient info can be compressed 10,000x with the right tricks; I think it is achievable. Nous claims they did it already:
https://github.com/NousResearch/DisTrO
There are also earlier gradient compression papers that report large compression rates.
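For anyone wondering what "gradient compression" can look like in practice, here is a minimal sketch of one well-known family of tricks: top-k sparsification with error feedback. To be clear, this is not DisTrO's method (I have not checked how that works internally); it is a generic illustration, and the function names and the toy gradient are made up for the example.

```python
def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries; return (indices, values)."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    idx = sorted(order[:k])
    return idx, [grad[i] for i in idx]


def compress_with_error_feedback(grad, residual, k):
    """Add the carried-over residual, send the top-k, keep the rest for next round."""
    corrected = [g + r for g, r in zip(grad, residual)]
    idx, vals = topk_sparsify(corrected, k)
    new_residual = list(corrected)
    for i in idx:
        new_residual[i] = 0.0  # these entries were transmitted, nothing left over
    return idx, vals, new_residual


def decompress(idx, vals, size):
    """Rebuild a dense gradient with zeros everywhere that was not sent."""
    dense = [0.0] * size
    for i, v in zip(idx, vals):
        dense[i] = v
    return dense


# Toy gradient (hypothetical values, for illustration only).
grad = [0.02, -1.5, 0.3, 0.0, 2.0, -0.01, 0.4, -0.05]
residual = [0.0] * len(grad)

idx, vals, residual = compress_with_error_feedback(grad, residual, k=2)
sent = decompress(idx, vals, len(grad))

# Ratio of values only; it ignores the cost of sending the indices.
ratio = len(grad) / len(idx)
print(idx, vals, ratio)
```

The unsent part of the gradient is not thrown away: it stays in `residual` and gets added back in the next round, which is what lets aggressive ratios remain usable at all. Whether that stretches to 10,000x on a real model is exactly the open question.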
>The communication speeds are untenable.
Can it be parallelized or not?
If you take a model, make two copies, and fine-tune each one on different data, what happens when you merge them? Does it work if you freeze different layers?
I think this works if the steps are small enough. And the transfer should become tenable if the steps are big enough. Where's the cutoff?
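Here is a toy version of the "copy, train separately, merge" idea, sketched as local gradient descent with periodic parameter averaging. It is a one-parameter linear model on two made-up data shards, so it says nothing definitive about deep networks or frozen layers; `local_steps`, `merge`, and all the numbers are assumptions for illustration.

```python
def local_steps(w, shard, lr, steps):
    """Run plain gradient descent on mean squared error for y = w * x on one shard."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        w -= lr * grad
    return w


def merge(weights):
    """Merge replicas by simple parameter averaging."""
    return sum(weights) / len(weights)


# Two shards whose individual optima disagree slightly (2.8 vs 3.2).
shard_a = [(1.0, 2.8), (2.0, 5.6)]
shard_b = [(1.0, 3.2), (3.0, 9.6)]

w = 0.0
for _ in range(20):  # 20 communication rounds
    # Each replica starts from the shared weight and trains alone for 5 steps.
    replicas = [local_steps(w, shard, lr=0.05, steps=5) for shard in (shard_a, shard_b)]
    w = merge(replicas)  # the only point where data crosses the network

print(w)
```

The `steps` argument is the knob behind the cutoff question: more local steps means fewer exchanges, but the replicas drift further toward their own shard's optimum before each merge. In this convex toy case the merged weight settles between the two shard optima; for real models the drift is where things can go wrong.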
Is the total compute capacity outside of Meta, Google, Amazon, Anthropic, OpenAI and X even higher than the capacity of any one of them? In any case, there's no chance a public collaboration gets to Anthropic levels of compute, even if communication were no issue.
There was a project trying to achieve some of those goals a few years ago using P2P: Petals https://github.com/bigscience-workshop/petals
Their BLOOM model was also a collaborative effort: https://huggingface.co/docs/transformers/en/model_doc/bloom
This could be of interest to you: https://thealliance.ai/projects/tapestry
Ya, that'd be an awesome project. The only issue is: how do you verify it's not being poisoned? Actually validating it would require more analysis than the training took to run. It would require a trusted network, not an open one, unless that can get solved somehow.
There are some strong open source groups like Nous Research taking up the fight: https://nousresearch.com/
The biggest problem is accuracy and integrity of the actors in the project.
Could it be done by making a sparse MoE of thousands, or tens of thousands, of smaller experts in very niche domains? Maybe a tree-like structure of experts which can delegate from relatively general but inaccurate to extremely niche but accurate? Also these experts might be plug-and-play, easily swap out an inferior expert with a stronger one in the future without having to redo the whole pile?
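To make the question concrete, here is a rough sketch of the data structure I have in mind: a tree where each node either answers or delegates to a more specific child, and where a child can be swapped without touching the rest. The "experts" are stub functions and the keyword router is a stand-in for a learned gate; every name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ExpertNode:
    name: str
    answer: Callable[[str], str]  # stub for the expert model at this node
    route: Optional[Callable[[str], Optional[str]]] = None  # returns a child name or None
    children: Dict[str, "ExpertNode"] = field(default_factory=dict)

    def query(self, prompt: str) -> str:
        """Delegate to the most specific expert that claims the prompt."""
        child_name = self.route(prompt) if self.route else None
        if child_name in self.children:
            return self.children[child_name].query(prompt)
        return self.answer(prompt)  # no child claimed it: answer here

    def swap(self, child_name: str, new_expert: "ExpertNode") -> None:
        """Replace one expert in place; siblings and the parent are untouched."""
        self.children[child_name] = new_expert


def keyword_router(table):
    """Toy router: the first keyword found in the prompt picks the child."""
    def route(prompt):
        for keyword, child in table.items():
            if keyword in prompt.lower():
                return child
        return None
    return route


root = ExpertNode("general", lambda p: "general: " + p,
                  keyword_router({"protein": "biology"}))
root.children["biology"] = ExpertNode("biology", lambda p: "biology-v1: " + p)

first = root.query("Protein folding basics")
root.swap("biology", ExpertNode("biology", lambda p: "biology-v2: " + p))
second = root.query("Protein folding basics")
fallback = root.query("tax law")
print(first, second, fallback, sep="\n")
```

The hard part this skips is the router itself: in a real MoE the gate is trained jointly with the experts, so swapping an expert may not be as clean as replacing a dictionary entry.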
Well, I suppose it is understandable why you want to attack the most obvious problem with such a scheme: obtaining sufficient compute.
That does mean you are actually neglecting the more difficult issues.
I believe we are not the only ones
As I replied to a child comment - this is a nice idea that just isn't tenable in reality. AI hardware isn't just hilariously faster than consumer GPUs, it's also hilariously more power-efficient and has hilariously better connectivity. Every one of these dimensions kills the idea.
The far, FAR superior power efficiency means that even if you did harness every public GPU or GPU-like device on earth, you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.
And even if electricity were free, having those GPUs spread over the world with internet-level latency will slow everything down by factors of thousands to millions, if it's feasible at all. Regardless, you're not getting fable-oss this decade, maybe not even this century.
It would be better for governments to buy and own their own datacenters, maybe as a coalition, and dedicate their operation to the public good. I believe that is what we actually have to do.
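To put rough numbers on the connectivity point, here is a back-of-envelope sketch of how long one full exchange of model-sized gradients takes over different links. The 8B parameter count and the 10,000x compression figure come from elsewhere in this thread; the 16-bit values, the 20 Mbit/s home uplink, and the 400 Gbit/s datacenter link are my assumptions. It models bandwidth only: latency, stragglers, and power are not included.

```python
def sync_seconds(n_params, bytes_per_value, link_bits_per_s, compression=1.0):
    """Time to push one full set of gradients through a link, bandwidth only."""
    payload_bits = n_params * bytes_per_value * 8 / compression
    return payload_bits / link_bits_per_s


N_PARAMS = 8e9        # 8B parameters, the smallest model size mentioned in the thread
BYTES_PER_VALUE = 2   # assumption: 16-bit gradients
HOME_UPLINK = 20e6    # assumption: 20 Mbit/s consumer upload
DC_LINK = 400e9       # assumption: 400 Gbit/s datacenter interconnect

home_raw = sync_seconds(N_PARAMS, BYTES_PER_VALUE, HOME_UPLINK)
dc_raw = sync_seconds(N_PARAMS, BYTES_PER_VALUE, DC_LINK)
home_compressed = sync_seconds(N_PARAMS, BYTES_PER_VALUE, HOME_UPLINK,
                               compression=10_000)

print(f"home, uncompressed:      {home_raw:,.0f} s")
print(f"datacenter:              {dc_raw:.2f} s")
print(f"home, 10,000x compressed: {home_compressed:.2f} s")
print(f"raw slowdown factor:     {home_raw / dc_raw:,.0f}x")
```

Under these assumptions the raw gap is in the tens of thousands, which fits the "thousands to millions" range, and it only closes if the claimed compression ratios hold up without hurting convergence.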