Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals

49 points • by suchintan • 01/17/2025 • 31 comments

Comments

happyopossum • 01/17/2025

Many of the examples given for agents such as this are things I just flat wouldn’t trust an LLM to do - buying something on Amazon for example: Will it pick new or ‘renewed’? Will it select an item that is from a janky looking vendor and may be counterfeit? Will it pick the cheapest option for me? What if multiple colors are offered?

This one example alone has so many branches that would require knowing what’s in my head.

On the flip side, it’s a ridiculously simple task for a human to do for themselves, so what am I truly saving?

Call me when I can ask it to check the professional reviews of X category on N websites (plus YouTube), summarize them for me, and find the cheapest source for the top 2 options in the category that will arrive in Y days or sooner.

That would be useful.

mkagenius • 01/18/2025

Steps pre-planned by a Planner will go wrong more often than not, since it tries to guess the UI layers from its memory/training data. It's better to ask only for the "next step", giving it the current state of the UI.

I have built a similar project for mobile automation [1], and the validator phase is not separate; rather, it's inherently baked into each step, since we only ask for the next step based on the current UI and previous actions.

My Planner sometimes goes "Oh, we are still on home screen, let's find the Uber app icon". This sort of self-correcting behaviour was not programmed but the LLM does it on its own.

1. https://github.com/BandarLabs/ClickClickClick - A framework to automate mobile use via any LLM (local/remote)
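
The per-step loop described in this comment can be sketched roughly as below. This is a minimal illustration, not code from ClickClickClick or Skyvern: `run_agent`, `FakePhone`, and `stub_policy` are hypothetical names, and `stub_policy` stands in for what would be an LLM call in a real system. The fake device drops the first tap on purpose, to show how re-reading the UI before every step gives validation and self-correction without a separate validator phase.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    ui_state: str  # what the agent saw before acting
    action: str    # what it decided to do


def run_agent(
    goal: str,
    observe: Callable[[], str],
    choose_next: Callable[[str, str, List[Step]], str],
    act: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Ask for one action at a time, always from the freshly observed UI."""
    history: List[Step] = []
    for _ in range(max_steps):
        ui_state = observe()  # re-read the UI before every decision
        action = choose_next(goal, ui_state, history)
        if action == "done":
            break
        act(action)
        history.append(Step(ui_state, action))
    return history


class FakePhone:
    """Hypothetical stand-in for a device; the first tap is silently dropped."""

    def __init__(self) -> None:
        self.screen = "home"
        self.taps = 0

    def observe(self) -> str:
        return self.screen

    def act(self, action: str) -> None:
        self.taps += 1
        if action == "tap Uber icon" and self.taps >= 2:
            self.screen = "uber"


def stub_policy(goal: str, ui_state: str, history: List[Step]) -> str:
    # Assumption: a real system would send goal, ui_state, and history to an LLM.
    if ui_state == "home":
        return "tap Uber icon"  # still on the home screen, so try again
    return "done"


phone = FakePhone()
history = run_agent("open Uber", phone.observe, stub_policy, phone.act)
```

Because the policy only ever sees the current screen plus past actions, the dropped tap is retried on the next iteration rather than derailing a precomputed plan.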

lyime • 01/17/2025

This is an impressive tool. I especially like the observability around the workflow and the steps it takes to achieve the outcome. We are potentially interested in exploring this if we can get the cost down at scale.

wejick • 01/18/2025

UI is the most common interface but not particularly AI friendly. I'll wait for a more standardized interface that's both human and AI friendly. Hoping it will still be browser based.

skull8888888 • 01/17/2025

Isn't Browser Use SOTA on WebVoyager? At this point WebVoyager seems to be outdated; there's def a need for a new, harder benchmark.

govindsb • 01/17/2025

congrats Suchintan! huge achievement!
