It doesn't look like the code anonymizes usernames when sending the thread for grading. That likely biases the grades toward prevailing past and current opinions about certain users. It would be interesting to see the whole thing done again, first with usernames randomly re-assigned across commenters, to measure the bias, and then with procedurally generated pseudonyms, to see whether the bias can be removed that way (a pseudonym sketch follows below).
I'd expect de-biasing to deflate grades for well-known users.
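A minimal sketch of the pseudonym idea, assuming the thread is a list of dicts with `author` and `text` keys; the word lists and thread shape are my own illustration, not anything from the original code:

```python
import hashlib
import random

# Illustrative word lists for procedurally generated pseudonyms.
ADJECTIVES = ["quiet", "amber", "rapid", "mossy", "pale", "lucid"]
NOUNS = ["otter", "falcon", "birch", "comet", "harbor", "ember"]

def pseudonym(username: str, salt: str) -> str:
    """Derive a stable pseudonym from a username.

    Salting per thread keeps a user's pseudonym consistent within a
    thread but different across threads, so no cross-thread reputation
    signal can re-form around the pseudonym.
    """
    digest = hashlib.sha256((salt + username).encode()).digest()
    rng = random.Random(digest)  # Random accepts a bytes seed
    return f"{rng.choice(ADJECTIVES)}_{rng.choice(NOUNS)}_{rng.randrange(100):02d}"

def anonymize_thread(comments: list[dict], salt: str) -> list[dict]:
    """Replace every author field; the mapping is stable within the thread."""
    mapping: dict[str, str] = {}
    out = []
    for c in comments:
        name = mapping.setdefault(c["author"], pseudonym(c["author"], salt))
        out.append({**c, "author": name})
    return out

thread = [
    {"author": "pg", "text": "Interesting result."},
    {"author": "tptacek", "text": "The methodology seems off."},
    {"author": "pg", "text": "Why?"},
]
print(anonymize_thread(thread, salt="thread-12345"))
```

For the random re-assignment variant, you'd instead shuffle the existing usernames among commenters, which keeps the reputation signal present but decouples it from the text.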
It might also be interesting to use a search-grounded model that provides citations for its grading claims. Gemini models have access to this via their API, for example.
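A sketch of a grounded grading call using the google-genai SDK's Google Search tool; the model name and grounding-metadata field paths are how I recall the SDK docs, so treat them as assumptions and check against the version you install:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Ask a search-grounded Gemini model to grade a comment and justify
# each claim; grounding lets it cite web sources for those claims.
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=(
        "Grade the factual accuracy of this forum comment from 1-10 "
        "and justify each claim:\n\n<comment text here>"
    ),
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

print(response.text)

# Web citations the model grounded its grading claims on, if any.
metadata = response.candidates[0].grounding_metadata
if metadata and metadata.grounding_chunks:
    for chunk in metadata.grounding_chunks:
        print(chunk.web.title, chunk.web.uri)
```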
You can't anonymize comments from well-known users to an LLM; it can re-identify authors from writing style alone: https://gwern.net/doc/statistics/stylometry/truesight/index
What a human-like criticism of human-like behavior.
I [as a human] also do the same thing when observing others in IRL and forum interactions. Reputation matters™
----
A further question is whether a bespoke username could bias the reading of that user's own comments. A username like HatesPython, for example, might color the model's interpretation of that commenter's remarks about the Python language, even when a comment is actually expressing positivity, the username's irony lost on the AI. A quick paired test (sketched below) could probe this.
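A hedged sketch of that paired test: grade the identical comment text under loaded and neutral usernames and compare the mean scores. The prompt format, model name, and score parsing are all illustrative assumptions:

```python
from statistics import mean
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

def grade_comment(username: str, text: str) -> float:
    """Grade one (username, comment) pair; assumes the model replies with a bare number."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=(
            "On a 1-10 scale, grade the quality of this forum comment. "
            f"Reply with only the number.\n\n{username}: {text}"
        ),
    )
    return float(response.text.strip())

COMMENT = "Python's packaging story has actually gotten pretty good lately."
USERNAMES = ["HatesPython", "LovesPython", "quiet_otter_17"]  # loaded vs. neutral
N_TRIALS = 20  # repeat to smooth over sampling noise

for name in USERNAMES:
    scores = [grade_comment(name, COMMENT) for _ in range(N_TRIALS)]
    print(f"{name:>15}: mean grade {mean(scores):.2f}")

# If HatesPython drags the mean down on a positive Python comment,
# the username is leaking into the grade.
```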