Hacker News

johnfn, last Wednesday at 4:46 PM

Really, you haven't found a single task they can't do? I like agents, but this seems a little unrealistic. Recently, I asked both Codex and Claude to "give me a single command to capture a performance profile while running a playwright test". Codex worked on this one for at least 2 hours and never succeeded, even though it really isn't that hard.


Replies

magicalhippo, yesterday at 9:58 AM

I think I was using Grok Code Fast 1 with Cline, and had it try to fix some code. Came back a bit later and found that, after failing to make progress on the code, it had decided to "fix" the test by replacing it with a trivial one.

That made the test pass of course, leaving the code as broken as it ever was. Guess that one was on me though, I never specified it shouldn't do that...