Hacker News

Show HN: We fingerprinted 178 AI models' writing styles and similarity clusters

62 points | by nuancedev | today at 2:06 PM | 18 comments

We have a dataset of 3,095 standardized AI responses across 43 prompts. From each response, we extract a 32-dimension stylometric fingerprint (lexical richness, sentence structure, punctuation habits, formatting patterns, discourse markers).
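The post doesn't list the actual 32 features, but a minimal extractor covering a few of the named categories (lexical richness, sentence structure, punctuation habits, formatting patterns) might look like this in Node.js. All feature names below are illustrative, not the project's real feature set:

```javascript
// Illustrative stylometric feature extractor (a handful of features,
// not the post's actual 32). Feature names are hypothetical.
function fingerprint(text) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const uniqueWords = new Set(words).size;
  return {
    // lexical richness: type-token ratio
    typeTokenRatio: words.length ? uniqueWords / words.length : 0,
    // sentence structure: mean words per sentence
    avgSentenceLength: sentences.length ? words.length / sentences.length : 0,
    // punctuation habits: commas per word
    commaRate: (text.match(/,/g) || []).length / Math.max(words.length, 1),
    // formatting patterns: fraction of lines that are bullet items
    bulletRate: (text.match(/^\s*[-*] /gm) || []).length /
                Math.max(text.split('\n').length, 1),
  };
}
```

Each response then becomes a fixed-length numeric vector, which is what makes the downstream similarity math possible.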

Some findings:

- 9 clone clusters (>90% cosine similarity on z-normalized feature vectors)
- Mistral Large 2 and Large 3 2512 score 84.8% on a composite metric combining 5 independent signals
- Gemini 2.5 Flash Lite writes 78% like Claude 3 Opus, and costs 185x less
- Meta has the strongest provider "house style" (37.5x distinctiveness ratio)
- "Satirical fake news" is the prompt that causes the most writing convergence across all models; "Count letters" causes the most divergence

The composite clone score combines: prompt-controlled head-to-head similarity, per-feature Pearson correlation across challenges, response length correlation, cross-prompt consistency, and aggregate cosine similarity.
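The per-feature Pearson signal could be computed along these lines: for each feature, correlate the two models' values across the 43 prompts. This is a sketch under that reading, not the author's actual implementation:

```javascript
// Pearson correlation between two equal-length series, e.g. one
// stylometric feature's value for model A vs. model B on each prompt.
// Sketch only; not the project's code.
function pearson(x, y) {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}
```

A value near 1 means the two models' feature moves in lockstep from prompt to prompt, which is a stronger clone signal than merely having similar averages.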

Tech: stylometric extraction in Node.js, z-score normalization, cosine similarity for aggregate, Pearson correlation for per-feature tracking. Analysis script is ~1400 lines.
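A minimal sketch of that aggregate step, assuming fingerprints are stored as plain numeric arrays: z-score each feature column across all models, then take cosine similarity between any two normalized vectors. This illustrates the pipeline as described, not the actual ~1400-line script:

```javascript
// Z-score each feature column across all fingerprints, then compare
// two normalized fingerprints by cosine similarity. Sketch only.
function zNormalize(rows) {
  const d = rows[0].length;
  const mean = Array(d).fill(0);
  const variance = Array(d).fill(0);
  for (const r of rows) r.forEach((v, j) => { mean[j] += v / rows.length; });
  for (const r of rows) r.forEach((v, j) => { variance[j] += (v - mean[j]) ** 2 / rows.length; });
  return rows.map(r =>
    r.map((v, j) => (variance[j] > 0 ? (v - mean[j]) / Math.sqrt(variance[j]) : 0))
  );
}

function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

Z-normalizing first keeps high-variance features (like response length) from dominating the cosine score.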


Comments

jefftk today at 2:42 PM

> Models with >75% writing similarity but massive price gaps. The cheap model writes the same way. You are paying for the brand.
>
> ...
>
> Gemini 2.5 Flash Lite Preview 06-17 and Claude 3 Opus: 78.2%

As someone who has tried to use many of these models for writing assistance, you're very wrong here. It really matters whether the model can grasp what I'm trying to communicate well enough to be helpful; otherwise I'll just write it myself. If you actually play with them a bit, it's very clear these models are not substitutes. This goes for many on your list!

kurthr today at 3:00 PM

It would be shocking to me if the large model trainers didn't have tools like this to analyze their outputs, but this is interesting work!

You can see who likely (post)trained/distilled their models or borrowed parameters from each other. I do wonder if the 32 dimensions were chosen/named from principal components or pre-selected and designed, but the tool seems like an effective discriminator in any case.

Were the prompts similarly selected for orthogonality? I've wondered how the different LLMs would respond to iterative zero-shot prompt generation: summarize response_n to produce the next zero-shot prompt, which generates response_n+1. Would it statistically converge to a prompt more distinguishable for that LLM?

sensarts today at 5:11 PM

This is really cool. That's the good stuff. Did you notice any pattern in why models cluster? Shared training data or just similar architecture choices?

leonidasv today at 2:51 PM

I've always wondered if the "typical" AI writing style is just an unavoidable RL artifact or a deliberate fingerprint to prevent model collapse as low-effort AI-generated text floods the training data pool (the web).

emaro today at 4:07 PM

No mention of any linguistic theory, some arbitrary (?*) metrics mixed together and even more arbitrary thresholds. Why does 75% "similarity" mean "writes the same"?

Low quality post imo.

*Generated I assume.

qaid today at 3:12 PM

Ugh. Subheadings were a major turn-off.

I expected it to be an analysis of AI-generated writing styles. Not full of them.

;)

docheinestages today at 3:37 PM

The muted colors on a dark background make everything hard to read.

a960206 today at 4:39 PM

Amazing. Last time I had GPT guess who made some Claude content, it guessed GPT made it.

redox99 today at 2:52 PM

Besides the claim that Opus and Gemini Flash share 99% of style being suspicious, the point that you are wasting money on the expensive model is nonsensical. You pay primarily for the intelligence, not the writing style.

Is this article AI slop?

glaslong today at 3:51 PM

I'm curious about the sorts of users who care about style but will either one-shot with the default style, not providing samples or direction, or who even choose models on that style rather than, you know, substance.

groby_b today at 3:45 PM

Without showing the prompts and responses, it's yet another meaningless AI benchmark.

Many of those numbers don't really match what I've seen in the wild, and without a clear illustration of how you arrived at them, a number isn't helpful.

apercu today at 3:28 PM

Has anyone else used LLMs to fact-check other LLMs?

I hate to say it, but Gemini lies less frequently than the paid models from OpenAI and Anthropic (OpenAI is worst in my use cases).

My guess is that Google has better training data (and uses less synthetic data, which might be creating training feedback loops in other models) and leans more toward a "be calibrated" model than a "be helpful" one, but it could just be that they lean more on RAG than on weights.

But I really shouldn't speculate on the "why," as I'm out of my domain. Just curious if others use all the models they can and compare outputs as much as I do.
