Devstral Small 2 is an open-weights model: https://huggingface.co/mistralai/Devstral-Small-2-24B-Instru...
[1] https://github.com/facebookresearch/cwm [2] https://huggingface.co/facebook/cwm
Following that line of reasoning, context length is another very large confounding factor. Longer context lengths improve performance, but they also cause enormous increases in KV cache size and memory requirements. We decided to control for this in our paper and focus on a 32K context length for 32B-size models, a context length that already pushes the bounds of what can be "deployable" locally.
Still, we evaluate at 64K context length using YaRN and are able to outperform CWM's 54% (non-TTS) performance, which it achieves using 128K context, a substantial increase over what we use. This is also pretty significant because we only ever train at 32K context, while CWM trains at the full 128K.
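To make the KV cache point concrete, here is a rough back-of-the-envelope sketch. The architecture numbers are assumptions loosely modeled on a typical dense 32B model with grouped-query attention, not exact figures from our paper:

```python
# Back-of-the-envelope KV cache sizing. All architecture numbers below are
# assumptions (roughly a dense 32B model with GQA), not figures from the paper.
NUM_LAYERS = 64
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_ELEM = 2     # fp16/bf16 cache

def kv_cache_gib(context_len: int) -> float:
    """KV cache size in GiB: one K and one V tensor per layer, per token."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return context_len * per_token / 2**30

for ctx in (32_768, 65_536, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):5.1f} GiB")
# 32K -> ~8 GiB, 64K -> ~16 GiB, 128K -> ~32 GiB
```

Under these assumptions, going from 32K to 128K context quadruples the cache memory before counting the model weights at all, which is why we treat context length as part of the "deployable locally" budget.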
Used to be that agent = LLM + scaffold/harness/loop/whatever.
These all make the "bare LLMs" better suited for use within the "agent" harness.
I think the more accurate term would be "agentic LLMs" instead of calling them "agents" outright. As for why it's the case now, probably just human laziness and colloquialisms.
I give it 3-4 more weeks before we start to hear about the death of agentic frameworks. Pointing GPT-5+ at a PowerShell or C#/Python REPL is looking way more capable than wiring up a bunch of domain-specific tools. A code-based REPL is the ultimate tool: you only need one, and you can force the model to always call it (a 100% chance of picking the right tool). The amount of integration work around Process.Start is approximately 10-15 minutes, even if you don't use AI assistance.
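As a concrete sketch of the "one tool" idea: below is a minimal Python version using subprocess (the Python analogue of Process.Start). The function name, schema, and lack of sandboxing are all illustrative assumptions, not any particular framework's API:

```python
import subprocess

def run_code(source: str, timeout: int = 30) -> str:
    """The single 'ultimate tool': execute model-written Python and return
    its combined output. Illustrative only -- a real deployment would
    sandbox this and cap resources."""
    try:
        proc = subprocess.run(
            ["python", "-c", source],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return f"error: timed out after {timeout}s"

# Exposed to the model as its only tool (OpenAI-style function schema).
# With exactly one tool, "pick the right tool" can't fail.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_code",
        "description": "Execute Python source and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"source": {"type": "string"}},
            "required": ["source"],
        },
    },
}]
```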
Maybe there are certain problems it excels at, but probably 99% of what I throw at it can be gleaned from the context/nearby code anyway, like you said. Even if I'm using some in-house library (pretty much all of our code), the models are good enough to dig into that library and read the headers if they need to.
Maybe it can help with speed, if it needs to do less research before it can start coding.
It's also a great approach for building custom languages.
At least the https://huggingface.co/facebook/cwm team had the balls to compare to it directly (sort of; see TTS).
What does this model do that gpt-oss-20b does not? AFAIU the base model it was fine-tuned from is not reproducible, and if I flipped a single bit in gpt-oss-20b and told you how (with the instructions under MIT), that would satisfy the "fully open fine-tuning" they claim as an advantage. But that "open" fine-tuned gpt-oss-20b would probably still beat their model.
Am I missing something?
I wonder if this will indeed start prompting more language-specific work.
AFAIK training still requires not just looking at sample code but also being able to write loss functions and to come up with problems the AI can work on. That seems hard.
One random thought: are there training setups that just delete some code from "good" projects and then make the AI get it working again?
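Something like that exists: "fill-in-the-middle"/infilling objectives for code models are built on roughly this idea. Here is an illustrative, deliberately simplified sketch of generating one such repair pair from a known-good file (the function name is made up):

```python
import ast

def make_restore_task(source: str) -> tuple[str, str] | None:
    """Illustrative sketch: cut one function body out of known-good code,
    yielding a (broken_file, original_file) training pair. Assumes a
    plain def with its body on separate, 4-space-indented lines."""
    funcs = [n for n in ast.walk(ast.parse(source))
             if isinstance(n, ast.FunctionDef)]
    if not funcs:
        return None
    target = funcs[0]
    lines = source.splitlines(keepends=True)
    # Keep everything up to the def line, stub out the body, keep the rest.
    start = target.body[0].lineno - 1
    end = target.end_lineno
    broken = ("".join(lines[:start])
              + "    # TODO: restore this body\n"
              + "".join(lines[end:]))
    return broken, source
```

Real pipelines are fancier (random spans, syntax-aware masking, tests as the success signal), but the "delete from good code, train to restore" loop is the same shape.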
I wish AI2 could release a denser model than the 8B on OpenRouter for free, as I was using the Devstral model for agentic purposes.
If we can get a good agentic 32B-class model on OpenRouter for ~free, then I feel like it will be very interesting to see how things go.
Good luck with AI2! The premise of truly open-source models is really interesting, and I feel like it could help bring more innovation to the space!