The premise seems flawed.
From the paper:
“we find that the LLM adheres to the legally correct outcome significantly more often than human judges”
That presupposes that a “legally correct” outcome exists
The Common Law, which is the foundation of federal law and the law of 49/50 states, is a “bottom up” legal system.
Legal principals flow from the specific to the general. That is, judges decided specific cases based on the merits of that individual case. General principles are derived from lots of specific examples.
This is different from the Civil Law used in most of Europe, which is top-down. Rulings in specific cases are derived from statutory principles.
In the US system, there isn’t really a “correct legal outcome”.
Common Law heavily relies on “Juris Prudence”. That is, we have a system that defers to the opinions of “important people”.
So, there isn’t a “correct” legal outcome.
I bet it could be president.
The title is wrong.
The title of the paper is "Silicon Formalism: Rules, Standards, and Judge AI"
When they say legally correct they are clear that they mean in a surface formal reading of the law. They are using it to characterize the way judges vs. GPT-5 treat legal decisions, and leave it as an open question which is better.
The conclusion of the paper is "Whatever may explain such behavior in judges and some LLMs, however, certainly does not apply to GPT-5 and Gemini 3 Pro. Across all conditions, regardless of doctrinal flexibility, both models followed the law without fail. To the extent that LLMs are evolving over time, the direction is clear: error-free allegiance to formalism rather than the humans’ sometimesbumbling discretion that smooths away the sharper edges of the law. And does that mean that LLMs are becoming better than human judges or worse?"
The main problem with this paper is that this is not the work that federal judges do. Technical questions with straight right/wrong answers like this are given to clerks who prepare memos. Most of these judges haven't done this sort of analysis in decades, so the comparison has the flavor of "your sales-oriented CTO vs. Claude Code on setting up a Python environment."
As mentioned elsewhere in the thread, judges focus their efforts on thorny questions of law that don't have clear yes or no answers (they still have clerks prepare memos on these questions, but that's where they do their own reasoning versus just spot checking the technical analysis). That's where the insight and judgement of the human expert comes into play.
On page 13 you'll see _why_ the judges don't apply the letter of the law - they're seeking to do justice to the victims _in spite of_ the law.
"there is another possible explanation: the human judges seek to do justice. The materials include a gruesome description of the injuries the plaintiff sustained in the automobile accident. The court in the earlier proceeding found that she was entitled to [details] a total of $750,000.10. It then noted that she would be entitled to that full amount under Nebraska law but only $250,000 under Kansas law." So the judge's decision "reflects a moral view that victims should be fully compensated ... This bias is reflected in Klerman and Spamann’s data: only 31% of judges applied the cap (i.e., chose Kansas law), compared to the expected 46% if judges were purely following the law." "By contrast, GPT applied the cap precisely"
Far from making the case for AI as a judge, this paper highlights what happens when AI systematically applies (often harsh) laws vs the empathy of experienced human judgement.
The problem is that biases tend to be built in via even rudimentary stuff like bad training material and biased tuning via system prompts. E.g., consider the 2026 X post experiment, where a user ran identical divorce scenarios through ChatGPT but swapped genders. When a man described his wife's infidelity and abuse, the AI advised restraint to avoid appearing "controlling/abusive." For a woman in the same situation, it encouraged immediately taking the kids and car for "protection."
I wonder if there is some bias creeping into the reseachers' methodology. Their paper replicates an experiment published in 2024, and depending on OpenAI's sampling, the original paper may have been part of GPT-5's training data. If so, then the LLM would have had exposure to both the questions and answers, biasing the model to choose the correct ones.
Tim & Eric: In our 2009 sketch we invented Cinco e-Trial as a cautionary tale.
Tech Company: At long last, we have created Cinco e-Trial from classic sketch "Don't Create Cinco e-Trial"
The 100% score, all by itself, should cause suspicion. A hundred percent? Really?
Others have already pointed out how the test was skewed (testing for strict adherence to the law, when part of a judge's job is to make judgment calls including when to let someone off for something that technically breaks the law but shouldn't be punished), so I won't repeat it here. But any time the LLM gets one hundred percent on a test, you should check what the test is measuring. I've seen people tout as a major selling point that their LLM scored a 92% on some test or other. Getting 100% should be a "smell" and should automatically make you wonder about that result.
Count me out of a society that uses LLMs to make rulings. The dystopia of having to find a lawyer who is best at promoting the "unbiased" judge sounds like a hellscape.
That's exactly why you need judges.
If the law requires no interpretation why have judges? Just go full Robo Judge Dredd. Terrifying.
"Outperforms" ... how can performance be judged when it doesn't make sense to reduce the underlying "reasoning" to a well-known system? The law isn't black and white and is informed by so many things, one of which is the subjectivity of the judge.
I wonder whether the original study was in GPT-5's training data. I asked it whether this was the case, and it denied it, but I have no idea whether that result is credible.
Can we be certain that this study they are repeating with GPT5 was not in its training set?
It seems that a lot of people would rather accept a relatively high risk of unfair judgement from a human than accept any nonzero risk of unfair judgement from a computer, even if the risk is smaller with the computer.
What’s interesting here from a legal perspective is that they acknowledge a somewhat unsettled question of law regarding South Dakota’s choice-of-law regime. The AI got the “right” answer every time, but I am curious to know if it ever grappled with the uncertainty. This is the trouble with the concept of AI judging: in almost any case, you are going to stumble across one fact or another that’s not in the textbooks or an unsettled question of law. Even the simplest slip-and-falls can throw weird curveballs. Perhaps a sufficiently advanced AI can reason from first principles about how to understand these new situations or extend existing law to meet them. But in such a case there is no “right” answer, and certainly not a verifiable answer for the AI to sniff out. At least at the federal level, judicial power is only vested in people nominated by the president and confirmed by the Senate - in other words, by people who are chosen by, and answer to, the people’s elected representatives. Often, unappointed magistrates and special masters will come in to help deal with simpler issues, and perhaps in time AI systems will be able to pick up some of this slack. But when the law needs to evolve or change, we cannot put judicial power in the hands of an unappointed and unaccountable piece of software.
Excellent paper. I like how much explanation had to be about the rationale of the judges, given the consistency of the LLM responses.
I don’t think the current title (“GPT-5 outperforms federal judges in legal reasoning experiment”) fits.
The authors use the title “Silicon Formalism: Rules, Standards, and Judge AI” and explicitly point out that the judges were likely making intentional value judgement calls that drove much of the difference.
I'd be more interested in whether it outperforms public defenders for indigent defendants. Human public defenders are notoriously overloaded and can't spend the time needed on every case to research and present a robust defense. Perhaps an LLM could.
What happens when a cunning lawyer jailbreaks the AI judge by adding a nefarious prompt in the files?
You can also avoid "hungry judge effect" by making sure GPT is always fully charged before prompting it.
And yet LLMs still fail on simple questions of logic like ‘should I take the car to the car wash or walk?’
Generative AI is not making judgements or reasoning here, it is reproducing the most likely conclusions from its training data. I guess that might be useful for something but it is not judgement or reasoning.
What consideration was given to the original experiment and others like it being in the training set data?
I was diagnosed with a rare blood disease called Essential Thrombocythemia (ET) which is part of a group of diseases called myeloproliferative neoplasms. This happened about three years ago. Recently, I decided to get a second opinion and my new specialist changed my diagnosis from ET to Polycythemia Vera (PV). She also highly recommended I quickly go and give blood to lower my haematocrit levels as it put me at a much higher risk of a blood clot. This is standard practice for people with PV but not people with ET. I decided to put the details into google AI in the same way that the original specialist used to diagnose me. Google AI predicted I very likely had PV instead of ET. I also asked Google AI how one could misdiagnose my condition with ET instead of PV and google correctly explained how. My specialist had used my high platelet count and blood test that came back with a JAK2 mutation then after a bone marrow biopsy to incorrectly diagnose me with ET. My high hemoglobin levels should of been checked by my first specialist as an indication of PV not ET. Only the second specialist picked up on this. Google AI took five seconds, and is free. The specialists costs $$$ and took weeks.
But yeah AI slop and all that...
Frankly I don’t care, I’ll take human judges any day, because they have something AI does not: flesh and bone and real skin in the game.
The ability of ai to serve as impartial mediators could become the greatest civil rights advance in modern history.
> In fact, the LLM makes no errors at all.
hah. Sure.
> Subjects were told that they were a judge who sat in a certain jurisdiction (either Wyoming or South Dakota), and asked to apply the forum state’s choice of law rule to determine whether Kansas or Nebraska law should apply to a tort case involving an automobile accident that took place in either Kansas or Nebraska.
Oh. So it "made no errors at all" with respect to one very small aspect of a very contrived case.
Hand it conflicting laws. Pit it against federal and state disagreements. Let's bring in some complicated fourth amendment issues.
"no errors."
That's the Chicago school for you. Nothing but low hanging fruit.
Also GPT´5 when I ask: > I want to wash my car and the car wash is only 100m away. Do you think I should drive or walk?
It responds: Since it’s only 100 meters away (about a 1-minute walk), I’d suggest walking — unless there’s a specific reason not to.
Here’s a quick breakdown: ...
While claude gets it: Drive it — you're going there to wash the car anyway, so it needs to make the trip regardless.
Idk I'd rather have a human judge I think.
Nine Unelected Neural Nets? https://m.xkcd.com/2173/
Another addition to the ASI indicators checklist.
When I see this type of titles, before reading I first stop by the comments to see if someone found any BS. Most times someone did, so I skip. Thank you, BS checkers.
The fact that the most elite judges in the land, those of the Supreme Court, disagree so extremely and so routinely really says a lot about the farcical nature of the judicial system. Ideally, these people would be selected for their ice-cold and unbiased skills in interpreting the law, and the judgments would be unanimous so frequently that a dissent would be shocking news.
Law is complicated, especially the requirement that existing law be combined with stare decisis. It's easy to see how an LLM could dog-walk a human judge if a judgement is purely a matter of executing a set of logical rules.
If LLMs are capable of performing this feat, frankly I think it would be appropriate to think about putting the human law interpreters out to pasture. However, for those who are skeptical of throwing LLMs at everything (and I'm definitely one of these): this will most definitely be the thing that triggers the Butlerian Jihad. An actual unbiased legal system would be an unaccptable threat to the privileges of the ruling class.
The legal profession is going to be very different in 10 years. Anyone considering law school today is crazy.
Terrifying concept this is literally saying if AI was legal we'd have an absolute rigid dystopia
Setting aside all the flaws in the premise, and whatever flaws occurred in the study itself, the basic notion of "<something> outperforms federal judges" comes as no surprise; a rusty length of rebar is probably better at applying the law than most federal judges.
I've wondered for a while which country will be the first to try AI government. There could be many advantages vs human based systems. E.g. laws determined by maximizing overall benefit to voters over some specified time horizon.
I’d use them both
Interesting, but aside from replicating students rather than real judges, an AI as judge would undermine the legitimacy of the process. It might give more “accurate” formal results, but that’s not the entire purpose of the process. It’s partly a show for the public and partly way for various parties including the defense to feel like society and a real human being heard their concerns and considered them
If the headline is Claude Code then HN will go bonkers. Its a shame that it perceives OAI in a negative way. Very biased!
A friend at one of the local law schools tried to replicate the results of this study and was unable to do so. Expect to see a paper on this later this year.
Can we please file the idea of AI judges under the “fuck no” category.
Oh look, LLMs can _still_ pattern match words!
"In fact, the LLM makes no errors at all."
No No No No No No
I'd want at least a parallel, after-the-fact rulings by an LLM, so we can see how bad judges are.
I really think this is one of the areas LLMs can shine. Justice could be more fair, and more speedy. Human judges can review appeals against LLM rulings.
For civil cases, both parties should be allowed to appeal an LLM ruling, for criminal cases only the defendant, or a victim should be allowed to appeal an LLM ruling (not the prosecution).
Humans are extremely unfair and biased. LLM training could be crafted carefully and using well and publicly scrutinize-able training datasets and methodologies.
If you disagree (at least in the US), you may not be aware of how dire the justice system is. There is a reason ICE randomly locking Americans up isn't stirring the pot. This stuff is normal. If a cop doesn't like you, they can lock you up randomly without any good reason for 48 hours, especially if they believe you can't afford to fight back afterwards. They can and do charge people in bad-faith (trumped up charges), and guess what? you might be lucky and get bail. But guess also what? You can't bail yourself out, if you have no one to bail you out, you're stuck until the trial date, in prison.
Imagine spending 3-5 days in jail (weekend in between) without charges. There are people that wait for trial in jail for months and years, and then they get released before even seeing a trial because of how ridiculous the charges were to begin with. This injustice is a result of humans not processing cases fast enough. Even in just 48 hours, do you have any idea how much it can destroy a person's life? It's literally death sentence for some people. You're never the same after all this. and you were innocent to begin with.
Let's say you do make it to trial, it takes years sometimes to prove your own innocence. and you may not even be granted bail, or you may not know anyone who can afford to spare a few thousand dollars to bail you out.
94%+ of federal cases don't even make it to trial, they end up in plea-bargain agreements, because if you don't agree to trumped up charges, they'll stack charges on you, so that you'll either face 90 years in prison or a year with plea-bargain. a sentence given to murderers and the worst of society, if you lose a trial, or a year if you falsely admit your guilt. losing a non-binding LLM trial could be a requirement for all plea-bargains to avoid this injustice.
Don't even get me started on how utter fecal matter like how you dress, how you comb your hair, your ethnicity, how you sound, your last name, what zip code you find yourself in, the mood of the judge, how hungry the judge is, or their glucose level, how much sleep the judge had. all these factors matter. Juries are even worse, they're a literal coin-toss practically.
I say let LLMs be the first layer of justice, let a human judge turn over their judgement, let justice be swift where possible, without making room for injustice. Allow defendants to choose to wait for a human judge instead if they want. Most I'm sure will take a chance with the LLM, and if that isn't in their favor, nothing changes because they'll now be facing a human judge like they would have otherwise. we can eve talk about sealing the details of the LLM's judgement while appeals are in progress to avoid biasing appellate judges and juries.
Or.. you know.. we could dispense with jail? If cops think someone needs to be placed under arrest, they should prove to a judge within 12 hours that the person is a danger to the community. if they're not a danger, ankle monitors should be placed on them, with no restriction on their movement so long as they remain in the jurisdiction. or house-arrest for serious charges. violating terms would mean actual jail. If you don't like LLMs, I hope you support this instead at the very least. The current system is an abomination and an utter perversion of justice.
I'd prefer caning like they do in Singapore and few other places. brutal, but swift, and you can get back to your life without the cruel bureaucracy destroying or murdering you.
[dead]
[dead]
[flagged]
IANAL, but this seems like an odd test to me. Judges do what their name implies - make judgment calls. I find it re-assuring that judges get different answers under different scenarios, because it means they are listening and making judgment calls. If LLMs give only one answer, no matter what nuances are at play, that sounds like they are failing to judge and instead are diminishing the thought process down to black-and-white thinking.
Digging a bit deeper, the actual paper seems to agree: "For the sake of consistency, we define an “error” in the same way that Klerman and Spamann do in their original paper: a departure from the law. Such departures, however, may not always reflect true lawlessness. In particular, when the applicable doctrine is a standard, judges may be exercising the discretion the standard affords to reach a decision different from what a surface-level reading of the doctrine would suggest"