I ended up overengineering a LangGraph workflow to handle this. It forces the LLM to generate its own tests and pass them in a sandbox before I even see the PR. The API costs are significantly higher because of the retry loops, but it filters out the low-effort attempts.