> beats Claude in our Cyber Benchmarks
Beats which model in Claude? Whenever a "benchmark" doesn't put precise model numbers in their headlines I am immediately skeptical. Either they don't know the difference (bad) or they are benchmarking against weaker models (misleading, also bad).
It's like when studies say "AI is bad at X" and they used GPT-3.5 in current year.
Here, it appears they compare a single prompt "find IDOR", against a multi-agent system. However, one can also start far more sophisticated skills that spin up subagents and mostly do the same in Claude Code, Codex, OpenCode, Pi, etc.
Which I guess makes what semgrep sells obsolete. Unless they have built a pareto-optimal point in terms of capabilities and token usage maybe?
GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months.
Not that it would make any sense.
You can launch GLM-5.2 in Opencode using Nemesis8: https://github.com/DeepBlueDynamics/nemesis8#nemesis-8
After installing, do a `n8 build` to build the image, then `n8 --danger --provider opencode interactive` to launch it in a container.
Signup for GLM-5.2 here: https://z.ai
It reads like an ad.
Secondly these are "just" IDORs, arguably the easiest class of vulnerabilities.
Thirdly it compares to GPT 5.5 and Opus 4.8.
No, we don't have Mythos at home.
These numbers are seem pretty low compared to what I was able to achieve specifically around windows kernel, win32k<->win32u to be exact. It honestly wouldn't surprise me anymore if china started surpassing models that US makes public, at least in specific categories such as cyber.
GLM 5.2 is already capable enough to assist in self-training which is similar to what we saw happen with frontier models and they appear to be getting there at a significantly lower cost than openai/anthropic.