It's just Qwen2.5-VL with a sticker on it. The Chinese labs are leading now!
Buried the lede - new benchmark for web tasks: https://huggingface.co/datasets/microsoft/WebTailBench
Why does Microsoft keep releasing models trained on synthetic data? Is it possible their contract with OpenAI won't let them do anything else?
I would think Microsoft, of all companies, would want to be working on their own LLM behind the scenes, even if they're relying on OpenAI for the bulk of their work.
Meta seems to be the only US company releasing big 'open source' models, while Chinese companies continue to release many completely open source LLMs.
If I'm reading this correctly, it's limited to browser use, not general computer use (e.g., you won't be able to orchestrate KiCAD workflows with it). Not disparaging, just noting the limitation.
I've been playing with the Qwen3-VL-30B model using Playwright to automate some common things I do in browsers, and the LLM does "reasonably well", in that it accelerates finding the right ways to wrangle a page with Playwright, but then you want to capture that in code anyway for repeated use.
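The "capture it in code for repeated use" loop above can be sketched roughly like this: have the model emit structured actions instead of free text, then translate those into deterministic Playwright calls you can replay without the LLM. The JSON action schema here is hypothetical (not Fara's or Qwen's actual output format), and the real LLM call and `page.click()`/`page.goto()` dispatch are elided:

```python
import json

# Hypothetical action schema an LLM might emit for browser automation.
# In a real setup you'd feed the page's accessibility tree or a screenshot
# to the model, then dispatch each step to Playwright, e.g.
# page.goto(url) or page.click(selector).
def parse_actions(llm_reply: str) -> list[tuple[str, str]]:
    """Turn the model's JSON reply into (verb, target) steps."""
    steps = []
    for item in json.loads(llm_reply):
        verb = item["action"]  # e.g. "goto", "click", "fill"
        target = item.get("selector", item.get("url", ""))
        steps.append((verb, target))
    return steps

reply = '[{"action": "goto", "url": "https://example.com"}, {"action": "click", "selector": "#login"}]'
print(parse_actions(reply))
```

Once the step list works, you check it into a script and the model is out of the loop entirely, which matches my experience that the LLM is most useful for discovering the right selectors, not for running them every time.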
I wonder how this compares -- supposedly purpose made for the task, but also significantly smaller.
I don't understand the use case here. We've had this kind of automation for years without needing a heavy GPU and without the risk of it going rogue. The worst that might happen is an interface changes once every year or two and you need to update your scripts.
Microsoft is so hell-bent on throwing all of their AI-sh*t at the wall and seeing what sticks.
Looking at the table, I'll admit I don't get most of the use cases (maybe with the exception of comparison shopping / gathering info), but are people really "outsourcing" shopping? Am I really that far outside what "normal" consumers do these days?
| Task Segment | Tasks | SoM GPT-4o-0513 | SoM o3-mini | SoM GPT-4o | GLM-4.1V-9B | OAI Comp-Use | UI-TARS-1.5 | Fara-7B |
|---|---|---|---|---|---|---|---|---|
| **Single-Site Tasks** | | | | | | | | |
| Shopping | 56 | 62.5 | 71.4 | 38.1 | 31.0 | 42.3 | 41.1 | 52.4 |
| Flights | 51 | 60.1 | 39.2 | 11.1 | 10.5 | 17.6 | 10.5 | 37.9 |
| Hotels | 52 | 68.6 | 56.4 | 31.4 | 19.9 | 26.9 | 35.3 | 53.8 |
| Restaurants | 52 | 67.9 | 59.6 | 47.4 | 32.1 | 35.9 | 22.4 | 47.4 |
| Activities | 80 | 70.4 | 62.9 | 41.7 | 26.3 | 30.4 | 9.6 | 36.3 |
| Ticketing | 57 | 58.5 | 56.7 | 37.4 | 35.7 | 49.7 | 30.4 | 38.6 |
| Real Estate | 48 | 34.0 | 17.4 | 20.1 | 16.0 | 9.0 | 9.7 | 23.6 |
| Jobs/Careers | 50 | 49.3 | 44.0 | 32.7 | 22.7 | 20.7 | 20.7 | 28.0 |
| **Multi-Step Tasks** | | | | | | | | |
| Shopping List (2 items) | 51 | 66.0 | 62.7 | 17.0 | 7.8 | 34.0 | 20.9 | 49.0 |
| Comparison Shopping | 57 | 67.3 | 59.1 | 27.5 | 22.8 | 1.2 | 8.8 | 32.7 |
| Compositional Tasks | 55 | 51.5 | 39.4 | 26.7 | 17.0 | 10.3 | 9.1 | 23.0 |
| **Overall** | | | | | | | | |
Are there any agentic models like this that would work for controlling input in arbitrary video games? I've been wanting to have an AI play Kerbal Space Program because I think it would just be pretty hilarious.
I find it kind of hilarious that a 7 billion parameter AI model is necessary to automate the clicking of webpages. I mean, how broken is the software stack if we can't script things? We jumped the shark, clearly.
How much VRAM would this require, if I would want to run this locally?
I bought a 12GB Nvidia card a year ago. In general I'm having a hard time finding the actual hardware requirements for any self-hosted AI model. Any tips/suggestions/recommended resources for that?
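A rough rule of thumb for the weights alone: parameter count times bytes per parameter. This back-of-envelope sketch ignores the KV cache and activation overhead (which add a few more GB depending on context length), so treat it as a lower bound, not a spec:

```python
# Weights-only VRAM estimate: N billion params * bytes/param ~= GB.
# Ignores KV cache and runtime overhead, so real usage is higher.
def vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

for name, bpp in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"{name}: ~{vram_gb(7, bpp):.1f} GB for weights")
```

So for a 7B model: roughly 14 GB at fp16 (doesn't fit your 12GB card), ~7 GB at 8-bit, ~3.5 GB at 4-bit, plus overhead. A 4-bit quant of a 7B model should run comfortably on 12GB.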
Seems like SoM GPT-4o is the one to beat. Also, the table and the plot don't seem to agree.
It's great to see how we went from the first iteration of Claude Computer Use, to now being able to run it locally with just 7B params.
It is not working on my Mac Mini
Forgive me if I can't keep up with the latest AI bubble mania buzzwords, but what is "agentic" even supposed to mean? As far as I can tell it doesn't have a precise definition, and doesn't even sound like proper English.
Buried the lede. Microsoft fine-tuned Qwen2.5-VL-7B. That's the big conversation starter here. Have any of the big providers done this before?
“The model is based on Qwen2.5-VL-7B and trained with supervised fine-tuning.”