I just realized the government has probably already trained a lip-reading AI model. Training one would be pretty straightforward: download YouTube videos with uploader-provided captions, cut them down to scenes where only a single face is detected, and use the lip points and facial landmarks together with the caption text (ideally with word-level timings) as training data. Then you could point a camera at anyone from a distance and get a transcript of what they are saying. The longer they talk, the more accurate the output should get, since the model has more context to work with.
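
To make the data-prep idea concrete, here is a rough sketch of just the pairing step: matching timed caption words to the frames they cover, and dropping any word whose frames don't show exactly one face. It assumes face detection and landmark extraction have already been run by some other tool. The names `Frame`, `Word`, `Sample`, and `build_samples` are made up for illustration, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position of one lip landmark


@dataclass
class Frame:
    time_s: float             # timestamp of the frame, in seconds
    faces: List[List[Point]]  # one list of lip landmarks per detected face


@dataclass
class Word:
    text: str
    start_s: float  # caption timing for this word (assumed to be available)
    end_s: float


@dataclass
class Sample:
    text: str
    lip_tracks: List[List[Point]]  # lip landmarks for each frame the word spans


def build_samples(frames: Sequence[Frame], words: Sequence[Word]) -> List[Sample]:
    """Pair each timed word with the lip landmarks of the frames it covers."""
    samples = []
    for word in words:
        span = [f for f in frames if word.start_s <= f.time_s < word.end_s]
        # Keep the word only if every frame in its span shows exactly one face.
        if span and all(len(f.faces) == 1 for f in span):
            samples.append(Sample(word.text, [f.faces[0] for f in span]))
    return samples


# Toy data: four frames, one of which contains two faces.
frames = [
    Frame(0.00, [[(0.1, 0.2)]]),
    Frame(0.04, [[(0.1, 0.3)]]),
    Frame(0.08, [[(0.1, 0.2)], [(0.5, 0.5)]]),  # two faces detected
    Frame(0.12, [[(0.2, 0.2)]]),
]
words = [Word("hello", 0.0, 0.08), Word("there", 0.08, 0.16)]

samples = build_samples(frames, words)
print([s.text for s in samples])  # ['hello']
```

"there" gets dropped because one of its frames has two faces in it. Filtering per word like this is stricter than cutting whole scenes, but it avoids training on frames where it's unclear who is speaking.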