"The community" is astroturfed as hell though. Anthropic pays influencers to promote Claude Code and likely bots a ton as well, so it's hard to come to any kind of consensus online. Even if everyone was acting in good faith, some people will have a much better experience than others because of the domain they're working in (e.g. AI being much better at frontend and commonly used libraries).
The only real way to evaluate a model is to test it yourself, but that's exhausting for each new model and not comprehensive anyway.
Yeah, it's crazy that there is no trustworthy source for model reviews. I'd love to know how well the new Deepseek 4 actually performs, for example, but I don't want to spend the next week testing it out. Reddit used to be a somewhat useful gauge, but now there are posts on how 4 is useless right next to posts on how amazing it is. And I have no idea if this is astroturfing, or somebody using a quantized version, or different workloads, or what.
I also find it increasingly difficult to evaluate the models I actually do use. Sometimes each new release seems identical or only marginally better than the previous version, but when I then go back two or three versions, I suddenly find that older model to be dramatically worse. But was the older model always that bad, or am I now being served a different model under the same version name?
It's all just so opaque.