I agree with you completely about the trend, which has been going on for years. And it's usually used to trivialize the vast gap between humans and LLMs.
In this case, though, it's a pretty weird and hard job to dynamically assemble a context for a task, cobbling together prompts, tool outputs, and other LLM outputs. It's hard enough and weird enough that you can easily end up with text that not even a human could make sense of to produce the desired output. And there's practical value in taking a context the LLM failed on and checking whether you'd expect a human to succeed.
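To make that concrete, here's a minimal sketch of what I mean by dynamic context assembly plus a human-review hook. All the names here are hypothetical, not from any particular library, and this is just one plausible shape for the workflow, not how any specific system does it:

```python
# Hypothetical sketch: assembling a context from heterogeneous pieces,
# with a hook to dump failed contexts for a human sanity check.

def build_context(task: str, tool_outputs: list[str], prior_llm_notes: list[str]) -> str:
    """Cobble a single context string together from the task, raw tool
    outputs, and earlier LLM outputs."""
    parts = [f"Task: {task}"]
    parts += [f"Tool output:\n{out}" for out in tool_outputs]
    parts += [f"Earlier model notes:\n{note}" for note in prior_llm_notes]
    return "\n\n---\n\n".join(parts)

def save_for_human_review(context: str, path: str = "failed_context.txt") -> None:
    """When the LLM fails on a context, write it out so a person can
    judge whether the assembled text is even humanly intelligible."""
    with open(path, "w") as f:
        f.write(context)

# Usage: if the model's answer on `context` is wrong, dump it.
context = build_context(
    task="Summarize the deploy failure",
    tool_outputs=["$ deploy --prod\nERROR: missing env var DB_URL"],
    prior_llm_notes=["The staging deploy succeeded earlier today."],
)
save_for_human_review(context)
```

The point of the review step is exactly the check described above: if a careful human reading `failed_context.txt` couldn't produce the desired output either, the bug is in the context assembly, not the model.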