Hard to tell. They only mention the few that got better, with no clear results for the others.
You can check the results for Devstral here. Speed limits me, so these are only the results for the first 50 tests of each task in this command:
# Run lm-evaluation-harness
lm_eval --model local-chat-completions \
  --model_args model=test,base_url=http://localhost:8089/v1/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False \
  --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects \
  --apply_chat_template --limit 50 \
  --output_path ./eval_results
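If you want to compare runs without reading the JSON by hand, here is a rough Python sketch for pulling the scores out of the output directory. It assumes the harness wrote one or more `results_*.json` files somewhere under `./eval_results`, each with a top-level `"results"` object mapping task names to metrics; the exact layout can differ between versions, so treat that as an assumption. The helper names `load_latest_results` and `summarize` are made up for this example and are not part of lm-evaluation-harness.

```python
import json
from pathlib import Path


def load_latest_results(output_dir):
    """Return the "results" object from the newest results_*.json under output_dir.

    Assumption: the file name pattern and the "results" key match what your
    lm-evaluation-harness version writes; adjust if yours differs.
    """
    files = sorted(
        Path(output_dir).rglob("results_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not files:
        raise FileNotFoundError(f"no results_*.json found under {output_dir}")
    with files[-1].open() as fh:
        return json.load(fh)["results"]


def summarize(results):
    """Flatten {task: {metric: value}} into sorted (task, metric, value) rows.

    Keeps numeric scores only and skips stderr entries and string fields
    such as aliases.
    """
    rows = []
    for task, metrics in results.items():
        for name, value in metrics.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number and "stderr" not in name:
                rows.append((task, name, value))
    return sorted(rows)
```

Something like `for row in summarize(load_latest_results("./eval_results")): print(*row)` then prints one line per task and metric. With only 50 tests per task the numbers are noisy, so small differences between models should not be over-read.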