This is harder than it sounds, although I agree that, in a vacuum, the idea is a good one.
So much of the value of code review comes from having actual knowledge of the larger context. Mundane stuff like formatting quirks and obvious bad practices should be getting hoovered up by the linters anyway. But what someone new may *not* know is that this cruft is actually important for some arcane reason. Or that this specific line has to be extremely performant, which is why it looks stylistically odd.
The real failure mode I worry about here is how much of this stuff becomes second nature to people on a team. They see it as "obvious" and forget that it's actually a nuance of their specific circumstances. So then a candidate comes in, misses something "obvious", and, well, here's the door.
It's not so hard. One of the interview stages I did at a well-known company used exactly this.
"Here's the neural net model your colleague sent you. They say it's meant to do ABC, but they found limitation XYZ. What is going on? What changes would you suggest, and why?"
Was actually a decent combined knowledge + code question.
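The actual model isn't shared here, but to give a flavor of the shape such a question can take, here's a hypothetical sketch (PyTorch assumed, and the planted limitation is my own invention): a classifier "meant to" learn non-linearly separable data, whose limitation is that it stacks two linear layers with no activation in between, so it collapses to a single linear map.

```python
# Hypothetical review snippet -- not the model from the actual interview.
# The "limitation": there is no non-linearity between fc1 and fc2, so the
# network can only represent linear decision boundaries.
import torch
import torch.nn as nn


class Classifier(nn.Module):
    def __init__(self, in_dim=2, hidden=32, classes=2):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, classes)  # no ReLU between the layers

    def forward(self, x):
        # Composing two linear layers is still a single linear map:
        # fc2(fc1(x)) = (W2 @ W1) x + b, so the hidden layer adds no capacity.
        return self.fc2(self.fc1(x))


if __name__ == "__main__":
    model = Classifier()
    x = torch.randn(8, 2)
    print(model(x).shape)  # torch.Size([8, 2])
```

The expected answer would be along the lines of "add an activation (e.g. ReLU) between the layers", plus an explanation of why the model underperforms without it.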
You can do code review exercises without the larger context.
An example from the interview: the code included a Python web API and SQL schema. Some obvious points I noticed: no input validation, string concatenation for database access (SQL injection), no input scrubbing (XSS), some missing indices given the call pattern, a few bad data type choices (e.g. an integer for the user ID), and a possible infinite loop in one case.
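To make that concrete, here's a minimal sketch of the kind of deliberately flawed endpoint such an exercise might include. Flask and sqlite3 are assumptions on my part (the original stack isn't specified), and it only illustrates a few of the planted issues: missing input validation, SQL built by string concatenation, and unescaped output.

```python
# Hypothetical review exercise: a deliberately flawed endpoint in the spirit
# of the one described above (Flask and sqlite3 are assumed, not the original stack).
import sqlite3
from flask import Flask, request

app = Flask(__name__)


@app.route("/users/<user_id>/comments", methods=["POST"])
def add_comment(user_id):
    # No input validation: user_id is taken straight from the URL,
    # and the comment body is accepted as-is.
    body = request.form.get("body", "")

    conn = sqlite3.connect("app.db")
    # SQL built by string concatenation -> injection via user_id or body.
    conn.execute(
        "INSERT INTO comments (user_id, body) VALUES ("
        + user_id + ", '" + body + "')"
    )
    conn.commit()

    # The raw body is echoed back into HTML unescaped -> XSS.
    return "<p>Saved: " + body + "</p>"
```

A reviewer should flag the injection and XSS, ask for parameterized queries and validation, and then dig into the schema (types, indices) behind it.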
You might be thinking about it the wrong way: what you want to see is whether someone can spot the kinds of logic errors that either a human or an AI copilot might produce, regardless of the larger context.
The juniors will find the formatting issues and obvious bad practices; the senior and staff engineers will find the real gems. This format works really well for stratification.