Filtering out irrelevant info is taught in grade school and is a tested skill on the SAT, for example.
Basically any kind of model (not just LLMs/ML) has to distill out irrelevant info.
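To make that concrete, here’s a minimal sketch of the kind of check this implies (the `ask_model` helper is hypothetical, a stand-in for whatever model API you’d actually call): a model that has really distilled out the irrelevant info should give the same answer whether or not a distractor sentence is present.

```python
# Minimal sketch of a distractor-robustness check. `ask_model` is a
# hypothetical stand-in for whatever model API you'd actually call.

def ask_model(prompt: str) -> str:
    # Dummy placeholder so the sketch runs end-to-end; swap in a real call.
    return "2 pens and 1 notebook cost $11."

base = ("A pen costs $3 and a notebook costs $5. "
        "What do 2 pens and 1 notebook cost?")
distractor = " The store's cat is named Whiskers."  # irrelevant info

# If the model truly ignores irrelevant info, the distractor shouldn't
# change its answer.
print("robust:", ask_model(base) == ask_model(base + distractor))
```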
The point is having an answer that you can defend logically and that most people would agree with.
If the model said “I’m not sure if this portion is a typo,” I guarantee the model creators would take the RLHF in a different direction, because that response is reasonable and defensible. For your specific question, I personally think there is a single objective answer, though to be fair, that isn’t always the case with misleading or irrelevant prompts. Either way, judging by how they respond, the models are being fooled.
I say this as an RLHF’er who sees, and is sometimes told to write, similar questions.
At the end of the day, this is how the model creators want their models to predict language, and anyone using them is along for the ride.