Can someone explain what the current state of model benchmarking is? If you try to look up what the best locally runnable model is, you get a bunch of random blog posts using idiosyncratic criteria to rank things seemingly based on one dude's opinion.
Ideally I would love to see a leaderboard with relatively objective ranking criteria that 1. lets you filter by open weight / locally runnable, 2. lets you filter by date of release (nothing older than x), and 3. is agnostic to hardware requirements. I just want to know what the best model is. Let me worry about how I will afford to run it.
I love the llmfit project for seeing what will run on your hardware, but it would be nice to know what I'm missing out on by not having better hardware, which is why objective hardware-agnostic ratings would be helpful.
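To make the three criteria above concrete, here is a rough sketch of the filtering I have in mind. Everything in it is an assumption for illustration: the `Entry` fields, the `rank` helper, and the rows themselves are made-up placeholders, not real models or real benchmark scores. The point is only that hardware requirements are recorded but never used for ranking.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    name: str          # placeholder model name, not a real model
    score: float       # placeholder benchmark score, not a real result
    open_weights: bool
    released: date
    min_ram_gb: int    # recorded, but deliberately never used for ranking


def rank(entries, released_after, open_only=True):
    """Return entries that pass the filters, best score first."""
    kept = [
        e for e in entries
        if (e.open_weights or not open_only) and e.released >= released_after
    ]
    return sorted(kept, key=lambda e: e.score, reverse=True)


# Made-up rows purely to exercise the filters.
ENTRIES = [
    Entry("model-a", 71.0, True, date(2025, 3, 1), 16),
    Entry("model-b", 88.0, False, date(2025, 9, 1), 0),
    Entry("model-c", 83.5, True, date(2025, 8, 1), 512),
    Entry("model-d", 90.0, True, date(2023, 1, 1), 64),
]

if __name__ == "__main__":
    for e in rank(ENTRIES, released_after=date(2025, 1, 1)):
        print(e.name, e.score)
```

With these placeholder rows, the closed model and the one released before the cutoff drop out, and the 512 GB model still ranks first because RAM is ignored on purpose.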
I'm not much interested in vibe coding (for those who aren't aware that LLMs have other uses). The specific model I've been using with Ollama is hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL and it's amazing how fast it is on 64 GB of RAM and an i5-13400 CPU. No GPU on this computer. Gemma 4 E4B will think for a couple of minutes vs 3-5 seconds for Qwen. It's hard to believe how much you can do with such limited hardware using their models.
I am very interested in seeing new Qwen models. Qwen3.6 27b is the first one that can do things, doesn't constantly lose "its mind", and can be run on a 3090 with a good context size. But it sometimes gets into a loop.
For the safer link: https://xcancel.com/Alibaba_Qwen/status/2056403591464984753
> Qwen3.7 Preview lands on Arena !
> Here come Qwen3.7-Max-Preview & Qwen3.7-Plus-Preview. Alibaba now #6 lab in Text, #5 in Vision.
> Can't wait to release Qwen3.7 series models!Stay tuned! @arena
Gemma 4 and Qwen 3.6 were the point where my local inference experiments graduated from toy challenges with much hand-holding to actual full-day back-and-forth sessions, with good ability to utilise tool calls to discover how things are glued together.
I'm not talking about greenfield dev, I'm talking about interfacing with an existing decade old codebase.
Qwen 3.6 35B (finetuned) is so good that it became my standard open-weights model for everyday use. It is not far at all from proprietary models: if you give it tools, skills, agents etc., it can actually finish the job. (Thank you Qwen team, appreciated.) With open source we can now reliably design very complicated architectures from scratch and build the full package pretty fast. I wish to see European AI unleashed. Wake up.
Vision has become totally underappreciated, whereas I believe it brings important advantages to a model.
Also, a big caveat in using Qwen models has always been their speech patterns. I do wonder how Google made the Gemma lineup so good at this.
Let's hope Alibaba continues to open source its models.
So glad they’re holding steady on open weights.
At least for now. I'm worried the Chinese team will change their mind once they have parity.
There I was waiting on a smaller version of Qwen 3.6 to drop so I can run it on my Mac, and then bam, they drop this.
I stopped caring about benchmarks at MiniMax M2.5. I no longer want more advanced models. I want cheaper models that don't slow down when everyone else is online.
Will they release the large models as open weight too? So far it seems only models around 27B or 35B are being released, with nothing larger, unlike before.
I have a tangential question. Provided it is correct that current proprietary models are offered at rates below cost (I believe this is the consensus, if I'm not mistaken¹), by roughly what factor would current rates have to be multiplied to reach break-even?
¹: I think I've read this a couple of times, but I'm not sure it is correct to begin with. Can it be substantiated based on annual financial reporting or other published business metrics by OpenAI, Anthropic et al.?
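To pin down what I mean by "factor": a back-of-the-envelope definition, under the (unrealistic) assumptions that usage does not change when prices go up and that costs stay the same. The symbols R, C, k and m are just my own notation, and I have no actual figures to plug in.

```latex
% R: revenue at current rates
% C: total cost of serving the same usage
% k: multiplier applied to current rates
k \cdot R = C \quad\Longrightarrow\quad k = \frac{C}{R}

% Equivalent form using the operating margin m = (R - C)/R,
% which is negative when the provider is losing money:
k = 1 - m
```

So a purely hypothetical provider spending 2 units for every 1 unit of revenue would need k = 2, i.e. double its rates, and any drop in demand from the higher price would push the real number higher.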
The jump from 3.5 to 3.6 was noticeable and set the bar. If they can keep the momentum, I’d pretty much say Qwen and China won the AI wars.
I love that open weight models are catching up so quickly. Also hilarious how far behind Grok is. I guess demand for Grok must be poor if Anthropic is able to rent resources from xAI.
Today I learned Meta's new model is preferred to everything but Claude. That is... a real surprise! Congrats to the Meta team.
I don't think I can handle another small model release by Qwen. I'm still trying to find the limits of 3.6 27B and they are already threatening us with a new one?
But jokes aside, I love the fast iteration. These are most probably finetunes on the 3.5 architecture again that appear better in internal testing, which is still very nice to see. Putting more and more pressure on the bigger labs to perform better is always a good thing.