We've seen public examples of where LLMs literally disable or remove tests in order to pass. I'm not sure having tests and asking LLMs to not merge things before passing them being "easy" matters much when the failure modes here are so plentiful and broad in nature.
My favourite so far was Claude "fixing" deployment checks with `continue-on-error: true`
[dead]
[dead]
You'd want to have the tests run as a github action and then fail the check if the tests don't pass. Optio will resume agents when the actions fail and tell them to fix the failures.