Built something similar for E2E web testing recently. A few observations from running an agentic tes...

zoetaka38 • today at 2:24 PM • 0 replies • view on HN

Built something similar for E2E web testing recently. A few observations from running an agentic test harness in production:

1. The single biggest jump in test quality came from giving the agent BOTH source code analysis AND live browser snapshots, not either alone. With code-only the agent hallucinates selectors; with browser-only it misses project conventions. Two MCP servers feeding the same agent — one local file-read, one Playwright in-process — was the architecture that worked.

2. For the browser snapshot tool, returning the raw DOM ate tens of thousands of tokens per call and the agent struggled to navigate it. Swapping to accessibility-tree refs (e1, e2, ...) cut token usage by ~10x and made the agent reliably target the right elements.

3. We avoided Docker-based MCP servers in production (we run on ECS Fargate). The in-process SDK MCP pattern (create_sdk_mcp_server + @tool decorator) keeps the browser handle in scope of the tool definition, which let us attach page.on('console') listeners and have the agent read them via a separate tool. Hard to do that across stdio process boundaries.

For game testing specifically — your text-renderer detail is interesting because it sidesteps the visual-grounding problem (how does the agent verify what it's seeing?). Curious how you'd extend this to a 2D/3D rendered game where the screen state isn't easily textualized.

alt Hacker News