That is a great example of the kind of thing they're paying people to create as training data.
You write the prompt, then write rubrics to judge the responses, and you've found something the model fails at. Congratulations, you just earned $500. Now do it again.
Not the worst way to make money, but if internet-scale data were not enough to reduce errors to a somewhat tolerable margin, how much data do they hope to collect in this manner?