I only had Claude rewrite the domain policies and generic instructions, not the individual task statements. I updated the blog with a link showing the exact changes.
So no leakage — it wasn’t solving or hinting at any of the specific test cases, since none of the tasks were ever exposed to it.