As I understand it, a big issue is that they are really hard to implement correctly. This means that backdoors and weaknesses might not exist in the theoretical algorithm, but still be common in real-world implementations.
On the other hand, Curve25519 is designed from the ground up to be hard to implement incorrectly: there are very few footguns, gotchas, and edge cases. This means that real-world implementations are likely to be correct implementations of the theoretical algorithm.
This means that, even if P-224/P-256/P-384 are on paper exactly as secure as Curve25519, they could still end up being significantly weaker in practice.
> As I understand it, a big issue is that they are really hard to implement correctly.
Any reference for the "really hard" part? That is a very interesting subject and I can't imagine it's independent of the environment and development stack being used.
I'd welcome any standard that's "really hard to implement correctly" as a testbed for improving our compilers and other tools.
I tried to defend a similar argument in a private forum today and basically got my ass handed to me. In practice, not only would modern P-curve implementations not be "significantly weaker" than Curve25519 (we've had good complete addition formulas for them for a long time, along with widespread hardware support), but Curve25519 causes as many (probably more) problems than it solves --- cofactor problems being more common in modern practice than point validation mistakes.
In TLS, Curve25519 vs. the P-curves are a total non-issue, because TLS isn't generally deployed anymore in ways that even admit point validation vulnerabilities (even if implementations still had them). That bit, I already knew, but I'd assumed ad-hoc non-TLS implementations, by random people who don't know what point validation is, might tip the scales. Turns out guess not.
Again, by way of bona fides: I woke up this morning in your camp, regarding Curve25519. But that won't be the camp I go to bed in.