We know what we're missing (a lot; we didn't implement the full spec). We don't know what weird edge cases the clients/servers will have, and I would bet you decent money an LLM won't either. That's why manual testing and validation are so important to us.
I wouldn’t be so sure about the LLM not helping. The LLM doesn’t need to know about the edge cases itself. Instead, you’d be relying on other client implementations already handling those edge cases, and on the LLM finding that handling in their codebases. Those other implementations have probably been through similar test cycles, so using an LLM to compare them to yours isn’t a bad option.