Claims in the article are incorrect. They conveniently ignore Meta's CWM models, which are open-sourced [1] and open-weight [2], sit at 65% SWE-bench Verified (with TTS) and 54% pass@1, and are the same size (32B dense). So claims like "surpassing prior open-source state-of-the-art coding models of comparable sizes and context lengths", while conveniently leaving the previous OSS SOTA out of your eval tables, are ... sketchy.
[1] https://github.com/facebookresearch/cwm [2] https://huggingface.co/facebook/cwm
The difference is that the Allen Institute models have open training data, not just open code and weights. Meta doesn't share the training data you would need to reproduce their final models. For many uses open-weight models are nearly as good, but for advancing research it's much better to have everything in the open.
The linked open-weight release disallows commercial use and is licensed only for research purposes.
Hey! These are great observations. So first: while TTS (test-time scaling) can improve performance, we wanted to evaluate the raw capability of our model. That meant generating only one rollout per evaluation instance, following other papers in the space like SWE-smith and BugPilot. In addition, TTS adds extra inference cost and depends on how rollouts are ranked, two confounding factors for deployable models where memory and inference speed are extremely important.
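To make the distinction concrete, here's a minimal sketch of pass@1 versus best-of-n TTS; generate(), check(), and the ranker are hypothetical stand-ins, not our actual harness.

    # Minimal sketch: pass@1 vs. best-of-n test-time scaling (TTS).
    # model.generate(), instance.check(), and ranker.score() are
    # hypothetical stand-ins, not the actual eval harness.

    def pass_at_1(model, instance):
        # One rollout per instance: the raw capability we report.
        patch = model.generate(instance.prompt)
        return instance.check(patch)

    def tts_best_of_n(model, instance, ranker, n=8):
        # n rollouts plus a reranker: scores go up, but so does the
        # inference cost, and results now hinge on how rollouts are ranked.
        patches = [model.generate(instance.prompt) for _ in range(n)]
        best = max(patches, key=ranker.score)
        return instance.check(best)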
Following that line of reasoning, context length is another very large confounding factor. Longer context lengths improve performance, but they also result in enormous increases in KV cache size and memory requirements. We decided to control for this in our paper and focus on a 32K context length for 32B-scale models, a context length that already pushes the bounds of what can be "deployable" locally.
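For a rough sense of scale, here's a back-of-the-envelope KV cache calculation; the layer/head/dim values are illustrative for a ~32B dense model with grouped-query attention, not anyone's exact config.

    # Back-of-the-envelope KV cache size for a single sequence in bf16.
    # Layer/head/dim values are illustrative for a ~32B dense model with
    # grouped-query attention, not an exact config.

    def kv_cache_bytes(seq_len, n_layers=64, n_kv_heads=8, head_dim=128,
                       bytes_per_elem=2):
        # 2x for keys and values, cached at every layer.
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

    for ctx in (32_768, 65_536, 131_072):
        print(f"{ctx:>7} tokens -> ~{kv_cache_bytes(ctx) / 2**30:.0f} GiB")

    # ~8 GiB at 32K, ~16 GiB at 64K, ~32 GiB at 128K, per sequence, on top
    # of the weights themselves: the cache grows linearly with context.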
Still, when we evaluate at 64K context length using YaRN, we outperform CWM's 54% (non-TTS) result, which it achieves using 128K context, a substantial increase over what we use. This is also pretty significant because we only ever train at 32K context, whereas CWM trains at a full 128K.
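For reference, extending the context at inference time with YaRN is roughly a config change like the sketch below, assuming a Qwen-style HF config; the repo id, field names, and scaling factor are placeholders, so check the model card for exact values.

    # Sketch: evaluating a 32K-trained model at 64K via YaRN rope scaling.
    # The repo id is a placeholder and the rope_scaling fields are
    # illustrative; exact keys/values depend on the model family.
    from transformers import AutoConfig, AutoModelForCausalLM

    repo = "org/coder-32b"  # placeholder
    config = AutoConfig.from_pretrained(repo)
    config.rope_scaling = {
        "rope_type": "yarn",
        "factor": 2.0,  # 32K * 2 = 64K
        "original_max_position_embeddings": 32768,
    }
    model = AutoModelForCausalLM.from_pretrained(repo, config=config,
                                                 torch_dtype="auto")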