Good catch, there was an issue with the second hardest thing in programming (caching).
Here's an updated eval with the proper models https://a3bmfqfom3.evvl.io/
Wow, I'm surprised. Grok 4.3 actually is noticeably better than the other two for the close-friend variant. Surprisingly I found Claude the cringiest of the three!
Thanks from where I'm looking Grok 4.3 and Claude 4.7 do a better job on the informal close friend/coworker vibe.
ChatGPT sounds fake / formal phrasing (for the specific close friend context) and has em-dashes and uses capitalization. Hence, ChatGPT does not, imo grok the assignment ;)
Is it me or did GPT get noticeably more natural in word choice recently? You can see it between 4.1 and 5.5 here, but I'm not sure when that happened. (My guess would be one of the recent 5.x releases.)
Edit: I meant specifically the absence of bizarre phrasing. That seems to have improved.