I added this to a benchmark I've been running of how well agents find security bugs, specifically bugs originally found by Mythos. It performed poorly with only read/grep/ls tools, but in a follow-up test with a full shell and Python, it doubled its findings (still a poor showing, but it does at least indicate it is doing what it says on the tin: making tools to help it solve problems). It also did worse than Qwen AgentWorld, another recent post-train of Qwen 3.6 MoE intended for agentic use.
It looks like they're comparing Ornith 9B to Qwen 3.5 35B, not Qwen 3.6. I guess it kind of makes sense since it's a finetune of 3.5, but I totally missed it until I looked closely.
In my brief tests, Ornith 35B performed quite well. It won't replace DeepSeek V4 Flash for me, but if it were fast and cheap enough it might.
I don't remember being super impressed with Ornith 9B, but I could see it being on par with Qwen 3.5 35B.
https://swelljoe.com/post/will-it-mythos/