Great points (and I appreciate the coffeebreak recommendation!). Totally agree that the AI evaluation has plenty of inconsistencies and errors.
I do want to clarify: it is more like audio-only practice for digital flashcards, meaning the prompt and response are both expected to be defined ahead of time. That way GPT (as of today) is instructed to evaluate the semantic meaning of the user's response against the correct response.
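
To make that concrete, here is a rough sketch of the idea in Python. It is not the actual implementation: `Flashcard`, `build_grading_prompt`, `grade`, and `ask_model` are names made up for illustration, the grading instructions are hypothetical wording, and `ask_model` stands in for whatever call sends text to GPT and returns its reply.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Flashcard:
    prompt: str    # what the learner hears
    expected: str  # the correct response, defined ahead of time


def build_grading_prompt(card: Flashcard, transcript: str) -> str:
    """Compose the instruction sent to the model (hypothetical wording)."""
    return (
        "You are grading a spoken flashcard answer.\n"
        f"Prompt: {card.prompt}\n"
        f"Expected answer: {card.expected}\n"
        f"Learner's answer: {transcript}\n"
        "Reply with exactly CORRECT if the learner's answer means the same "
        "as the expected answer, otherwise reply with exactly INCORRECT."
    )


def grade(card: Flashcard, transcript: str,
          ask_model: Callable[[str], str]) -> bool:
    """Return True when the model judges the answer semantically equivalent."""
    verdict = ask_model(build_grading_prompt(card, transcript))
    # Normalize whitespace and case before comparing the one-word verdict.
    return verdict.strip().upper() == "CORRECT"


if __name__ == "__main__":
    card = Flashcard("How do you say 'good morning' in Spanish?", "buenos días")
    # A stub stands in for the real model call so the sketch runs offline.
    print(grade(card, "buenos dias", lambda _prompt: "CORRECT"))
```

Because both sides of the card are fixed up front, the model only has to compare two short strings for meaning rather than judge an open-ended answer, though it can still get that comparison wrong.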