I did a quick eval comparing Grok 4.3, Opus 4.7 and GPT 4.1 and they actually seem pretty similar:

michaelbuckbee • today at 12:24 PM • 7 replies • view on HN

https://ofw640g9re.evvl.io/

They all did pretty well at a more "formal" tone, but GPT4.1 was the only one that didn't make me cringe with a "casual" tone.

[edit] fwiw, grok was also the fastest+cheapest model, claude was slowest and priciest.

Replies

sundarurfriend • today at 2:04 PM

This is the most basic level of eval, of whether they can produce output that will be considered by someone somewhere (usually a young urban US American) as informal toned. Real human communication is far more nuanced than this, different groups have different linguistic registers they're used to and things outside it sound odd even if they can't articulate why. You could also want to be informal but not over-familiar with the other person (for eg. in a discord chat to a new acquaintance) - actually looking at the outputs here, the Claude output seems best fitting for that (in my subjective view anyway) than to the one you gave it - or want many other little variations.

What makes one cringe and another recognize as familiar and comfortable is also pretty subtle and hard to define. These things need nuanced descriptions and examples to actually get right, and it's in understanding those nuances and figuring out the register of the examples that Grok outshines the others.

➕ show 1 reply

ActivePattern • today at 5:22 PM

Seeing this makes me wonder if Grok uses Claude conversations for training.

It's otherwise kind of surprising that they both converge on very similar phrases (e.g. "API integration is kicking my ass") that aren't anywhere in the prompt.

jasonjmcghee • today at 2:05 PM

That's Grok 4.2 not 4.3 right?

And why are you comparing to gpt-4.1? (As opposed to one of the 6? model releases since then - would have expected gpt 5.5)

➕ show 1 reply

accrual • today at 1:18 PM

All three did well, and while I'm a Claude user, I found the Opus reply here added some unnecessary detail, like "Impact: Minimal; no downstream dependencies are currently at risk". Downstream dependencies weren't mentioned in the original message; for all we know downstream could be relying on a poorly performing API and is impacted by waiting another week for replacement.

embedding-shape • today at 12:27 PM

I know it's just an evaluation, but seeing an informal message and a prompt to ask to rewrite this informal message to the tone of an "informal message" when the original one sounds just fine, just makes me sad... Not because of this evaluation, but because it reminds me that this is how some people use LLMs, basically asking it to remove your own voice from texts that are generally fine already.

➕ show 1 reply

rafram • today at 2:00 PM

All of these were frankly terrible. I guess Grok’s “informal” version sounded the most like a real human, but only because it reads exactly like an Elon tweet (including his favorite emoji!). It’s obvious what they’ve been training on.

mwigdahl • today at 2:49 PM

GPT 4.1? Why not a 5-class model?

alt Hacker News

Replies