You are 100% correct in your assessment of the situation. But I do not agree with either of your conclusions:
1. These questions cannot and must not be compared to homework questions. They are in a different league, possibly even a different sport.
2. The "more useful benchmark" that you suggest is already present in the data, since we ran every model exactly once in Stage 1.