Well, this probably isn't a problem with the model, but with the source frame having the wrong eye gaze. Besides, "perceptually lossless" need not be defined in a side-by-side comparison context. If you were only viewing the right-hand-side video, how could you tell the eye gaze is off? The point was more that the movement looks natural, unlike almost all neural avatars up to this year.
Your argument does make sense to me, but it also makes the term "lossless" carry a lot of weight. Lossless in video encoding is usually defined as zero difference between source and target.