How do you verify that the AI translation to Lean is a correct formalization of the problem? In other fields, generative AI is very good at making up plausible-sounding lies, so I'm wondering how likely that is for this use case.
It may help to look at this example concretely:
The natural-language statement of the problem is (from https://www.erdosproblems.com/728):
> Let C>0 and ϵ>0 be sufficiently small. Are there infinitely many integers a, b, n with a ≥ ϵn and b ≥ ϵn such that a!b! ∣ n!(a+b−n)! and a+b > n + C log n?
The Lean-language statement of the problem (which can be done either by hand or by AI) is (from https://github.com/plby/lean-proofs/blob/f44d8c0e433ab285541...):
    ∀ᶠ ε : ℝ in 𝓝[>] 0, ∀ C > (0 : ℝ), ∀ C' > C,
      ∃ a b n : ℕ,
        0 < n ∧
        ε * n < a ∧
        ε * n < b ∧
        a ! * b ! ∣ n ! * (a + b - n)! ∧
        a + b > n + C * log n ∧
        a + b < n + C' * log n
Yes, on the one hand one needs to know enough about Lean to be sure that this formulation matches what we intend, and isn't stating something trivial. But on the other hand, this is not as hard as finding an error on some obscure line of a long proof.

(There's also an older formulation at https://github.com/google-deepmind/formal-conjectures/blob/f... but the new one is closer to the spirit of what was intended: see the discussion starting at https://www.erdosproblems.com/forum/thread/728#post-2196, which gives a clear picture, as of course does Tao's thread in the OP that summarizes this discussion.)
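To make the "check the formulation" step concrete: the side conditions in the Lean statement are load-bearing. Mathlib defines Real.log 0 = 0 as a junk value, so the `0 < n` conjunct is part of what keeps the log bounds meaningful. A minimal sketch (my own example, not from the repo):

    import Mathlib

    -- Mathlib's Real.log returns the junk value 0 at 0, so bounds
    -- involving `log n` only say something when `0 < n` is also required.
    example : Real.log 0 = 0 := Real.log_zero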
For this reason, when we announce results on e.g. the IMO, we formalize the statements by hand and inspect the proofs carefully to ensure they capture the full spirit of the problem.
However, there are some good heuristics. If you expect a problem to be hard and the proof is very short, you've probably missed something!
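To make that heuristic concrete, here's a toy Lean sketch (my own example, not from any of the linked repos): a mis-formalization with an unsatisfiable hypothesis is vacuously true, so a one-line proof of a supposedly hard problem is a red flag.

    import Mathlib

    -- No natural number is below zero, so the hypothesis `n < 0` is
    -- unsatisfiable and the whole statement is vacuously true.
    example : ∀ n : ℕ, n < 0 → ∃ p, p.Prime ∧ p ∣ n :=
      fun n h => absurd h (Nat.not_lt_zero n)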
To answer the question a different way: I think you're asking how we know the proof actually matches the description the human provided. And I'd say we can't know for sure, but the idea is that the statement is concise enough to write and check yourself, i.e. "there are infinitely many primes" or whatever, and then even if an LLM goes off and makes up a Lean proof wildly different from what you expected, if Lean says the proof is valid then you have proven the original statement. In theory the actual proof could be very different from what you imagined, but ultimately all the logic still checks out.
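For the primes example, a minimal sketch of what that looks like (the wrapper statement and its name are mine; Nat.exists_infinite_primes is Mathlib's):

    import Mathlib

    -- The statement is one line you can check by eye. However the
    -- proof term was found, the kernel either accepts it or rejects it.
    theorem primes_unbounded : ∀ n : ℕ, ∃ p, n ≤ p ∧ p.Prime :=
      fun n => Nat.exists_infinite_primes n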
I feel like even outside of AI translation, a formalization failing to capture the spirit of the informal description is always a risk.
This is also a big risk when trying to prove code correctness: "prove this algo works" means you gotta define "works" along certain axes, and if you're very unlucky you might end up with a proof that exploits the slack along an axis you left underspecified.
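A toy Lean sketch of that failure mode (all names are mine): if "works" for a sort only means "the output is sorted", a function that throws away its input provably satisfies the spec, because the permutation axis was left out.

    import Mathlib

    -- "Output is sorted" alone is a bad spec: this discards its input...
    def badSort (xs : List ℕ) : List ℕ := []

    -- ...yet provably meets it. The missing axis: the output must be
    -- a permutation of the input.
    theorem badSort_sorted (xs : List ℕ) : List.Sorted (· ≤ ·) (badSort xs) :=
      List.sorted_nil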
The statement is something you provide; it's the proof search you can have the LLM do. If this works for math, it will immediately make code much higher quality via the same tools.
You're looking for the practical answer, but philosophically it isn't possible to translate an informal statement into a formal one 'correctly': it is informal, i.e., vaguely specified. The only certain question is whether the formal axioms and results are interesting, which is independent of the informal statement and can only be established by inspecting the proof independently of the informal spec.
That's what's covered by the "assuming you have formalized the statement correctly" parenthetical.
Given a formal statement of what you want, Lean can validate that the steps in a (tedious) machine-readable purported proof are valid and imply the result from accepted axioms. This is not AI, but a tiny, well reviewed kernel that only accepts correct formal logic arguments.
So, if you have a formal statement that you've verified by some other means to represent what you're interested in, Lean can tell you whether the proof created by genAI is correct. Basically, there is a nigh-infallible checker that won't accept incorrect hallucinations.
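You can even ask a kernel-checked proof what it rests on, via Lean's #print axioms command (a minimal sketch; `demo` is a made-up name, and this needs no imports):

    theorem demo : 2 + 2 = 4 := rfl

    -- Lists the axioms the checked proof depends on (none here;
    -- classical proofs typically report propext, Classical.choice,
    -- and Quot.sound).
    #print axioms demo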