Hacker News

PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

12 points by shahules, yesterday at 8:11 PM | 3 comments

We’re the team at Vibrant Labs (W24). We’ve been building environments for browser agents and quickly realized that existing benchmarks in this space didn’t capture the primary failure modes we were seeing in production, which grew worse as the number of applications and the horizon length increased.

We built PA Bench (Personal Assistant Benchmark) to evaluate frontier computer/web use models on their ability to handle multi-step workflows across simulated clones of Gmail and Calendar.

*What’s next:*

We’re currently scaling the dataset to 3+ tabs and building more high-fidelity simulations of common enterprise workflows. We’d love to hear feedback on the benchmark and notes about what was/wasn’t surprising about the results.

Blog post: https://vibrantlabs.com/blog/pa-bench


Comments

abhijithneil, yesterday at 10:04 PM

Is there a way to automate computer use with multiple computer-use agents from different providers, plus some sort of routing setup so the best course of action can be chosen without hitting failures (e.g., permission issues in OpenAI could be rerouted to Gemini)?
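A minimal sketch of the routing idea in this comment: try each provider's agent in priority order and reroute to the next one when a step fails (e.g., on a permission refusal). The `Agent` class and provider names here are illustrative stand-ins, not real APIs.

```python
class AgentError(Exception):
    """Raised when a provider refuses or fails an action (e.g., a permission issue)."""

class Agent:
    """Toy stand-in for a provider's computer-use agent."""
    def __init__(self, name, denied_actions=()):
        self.name = name
        self.denied = set(denied_actions)

    def run(self, action):
        if action in self.denied:
            raise AgentError(f"{self.name} cannot perform {action!r}")
        return f"{self.name} completed {action!r}"

def route(action, agents):
    """Try each agent in priority order; reroute to the next on failure."""
    errors = []
    for agent in agents:
        try:
            return agent.run(action)
        except AgentError as e:
            errors.append(str(e))
    raise AgentError("all providers failed: " + "; ".join(errors))

# Hypothetical setup: the first provider refuses permission-granting actions,
# so those get rerouted to the second provider.
agents = [Agent("openai-cua", denied_actions={"grant_permission"}),
          Agent("gemini-cua")]
print(route("grant_permission", agents))  # handled by gemini-cua
print(route("click_button", agents))      # handled by openai-cua
```

A production version would need a smarter policy than fixed priority order (cost, capability, and past success rate per provider), but the fallback loop is the core of the rerouting idea.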

shahules, yesterday at 9:43 PM

Founder of Vibrant Labs here. We’re working on automating the synthesis of high-quality evals and RL data for LLM agents.

Some of the things we’re exploring:

1. Automated task and verifier generation

2. Synthesizing coherent worlds for evaluating and training agents

3. Continual learning setups for long-horizon agents
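As a toy illustration of point 1 (my sketch, not Vibrant Labs' actual pipeline): each generated task can be emitted together with a programmatic verifier that checks the agent's final state, so evals score themselves. The calendar-event task shape below is an assumption for illustration.

```python
import random

def generate_task(rng):
    """Generate a calendar-style task prompt and a matching verifier function."""
    hour = rng.randint(9, 17)
    title = rng.choice(["standup", "1:1", "review"])
    prompt = f"Create a calendar event titled '{title}' at {hour}:00."

    def verifier(final_state):
        # The task passes iff the agent's final environment state contains
        # an event with exactly the generated title and hour.
        events = final_state.get("events", [])
        return any(e.get("title") == title and e.get("hour") == hour
                   for e in events)

    return prompt, verifier

# Usage: generate a task, then check a simulated agent result against it.
prompt, verify = generate_task(random.Random(0))
title = prompt.split("'")[1]
hour = int(prompt.split(" at ")[1].split(":")[0])
assert verify({"events": [{"title": title, "hour": hour}]})
assert not verify({"events": []})
```

Because the verifier is constructed from the same parameters as the prompt, every synthesized task is checkable without human labeling, which is what makes this kind of generation useful as RL reward data.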

Would love to talk with anyone who's interested in learning more!