It's fascinating that it worked, though. Can we just merge all the open-weight models and get something better?
I imagine it'd work the same as merging all the good-tasting foods to get an even tastier one
Most merges improve a small subset of "feeling" benchmarks (too small, too specific, or out of distribution) and tend to show degradation on actual benchmarks, with especially punishing results on long-chain benchmarks.
Merges also only work on matching architectures (i.e. finetunes/LoRAs of the same model).
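To make the matching-architecture point concrete, here's a rough sketch of the simplest kind of merge: uniform element-wise averaging of parameters. It assumes each model is just a dict of parameter name to a flat list of floats (a hypothetical stand-in for a real state dict), and `merge_uniform` is a made-up name, not any merge tool's API. Real merges usually use fancier weighting, but the constraint is the same: if the names or shapes differ, there is nothing to average.

```python
from typing import Dict, List

# Hypothetical stand-in for a real state dict: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def merge_uniform(models: List[Weights]) -> Weights:
    """Average parameters element-wise across models with identical layouts."""
    if not models:
        raise ValueError("need at least one model")
    reference = models[0]
    for m in models[1:]:
        # Different architectures have different parameter names or shapes,
        # so there is no element-wise correspondence to average over.
        if m.keys() != reference.keys():
            raise ValueError("parameter names differ; architectures do not match")
        for name in reference:
            if len(m[name]) != len(reference[name]):
                raise ValueError(f"shape mismatch in {name!r}")
    return {
        name: [sum(vals) / len(models) for vals in zip(*(m[name] for m in models))]
        for name in reference
    }


# Two toy "finetunes" of the same base model (same names, same shapes).
tune_a = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
tune_b = {"layer0.weight": [3.0, 4.0], "layer0.bias": [1.0]}
merged = merge_uniform([tune_a, tune_b])
```

Passing in a model with an extra layer or a differently sized tensor raises a `ValueError`, which is the toy version of "you can't merge different architectures".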
That kinda worked in the Llama 1/2 era, not between different models but between finetunes of the same model. The briefly legendary MythoMax was, IIRC, a merge of 5+ tunes, some of which were merges themselves.
No, they need the same arch, but you can distill them into a single model. And yes, if you use the API directly, Claude will often say it's an open-weight model (likely one of those it was distilled from).
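For anyone wondering why distillation sidesteps the same-arch requirement: the student is trained to match the teacher's outputs, not its weights. Below is a minimal sketch of one common form, a KL loss between temperature-softened output distributions. It assumes you can see the teacher's logits and that teacher and student share the same output vocabulary; neither holds when distilling through a text-only API, where you would train on sampled text instead. The function names are made up for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities, softened by the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Only the output distributions are compared, so the two models can have
# completely different internals (layer counts, hidden sizes, and so on).
same = distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distill_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as they diverge, so minimizing it over many inputs pulls the student toward the teacher's behavior.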
If you go to Civitai, this is pretty much how it works in that corner of the image generation world.
Everything uses Stable Diffusion as the underlying model, and most of what people actually use is merges of checkpoints.