How strong is your internal informal LLM at theorem proving before the formalization stage? Or is it combined with the rest of the pipeline in a way that makes that not measurable?