GPT-5 outperforms federal judges in legal reasoning experiment - A Developer's Story


I Watched GPT-5 Beat Federal Judges at Their Own Game. Now I Can't Stop Thinking About My Law Degree.

I spent $180,000 and three years at law school. Last week, I watched GPT-5 score higher than federal judges on legal reasoning tests, and I felt something I hadn't expected — relief.

Not because AI is "coming for lawyers" or any of that tired narrative.

But because after running my own experiments with GPT-5's legal capabilities for the past two weeks, I finally understood what my actual job is. And it's not what law school taught me.

The Experiment That Started My Crisis

The Stanford study hit my feed on a Tuesday morning.

Researchers had given GPT-5 and 18 federal judges the same set of complex legal reasoning problems — statutory interpretation, constitutional analysis, precedent application. The AI didn't just pass.

It scored 87% compared to the judges' average of 73%.

My first thought was defensive. "They probably cherry-picked easy cases." So I did what any skeptical lawyer-turned-developer would do.

I built my own test.


I fed GPT-5 twenty real cases from my litigation days — messy ones with contradictory precedents, ambiguous statutes, and the kind of factual complexity that makes junior associates cry.

Cases where reasonable judges had disagreed. Where circuit splits existed. Where the "right" answer depended on which interpretive philosophy you subscribed to.

GPT-5 nailed seventeen of them.

Not just the holdings, but the reasoning paths. It cited relevant cases I'd forgotten. It identified statutory tensions I'd missed during my original research.

In one employment discrimination case, it spotted a procedural issue that had taken me three weeks to find back in 2021.

What GPT-5 Actually Does Better Than Humans

Here's what nobody's talking about in the "AI beats judges" headlines. GPT-5 isn't just memorizing case law. It's doing something more interesting — and more unsettling.

Pattern Recognition at Inhuman Scale

When I asked GPT-5 to analyze a complex securities fraud case, it didn't just cite _Tellabs_ and _Dura Pharmaceuticals_.

It identified a pattern across 47 different circuit court decisions that showed how courts subtly shifted their scienter analysis based on the defendant's industry.

I've been practicing for six years. I'd never noticed that pattern.

The prompt I used was embarrassingly simple:

```
Analyze this securities fraud complaint for Rule 9(b) particularity and
PSLRA scienter requirements. Consider circuit-specific variations in how
courts apply these standards.
```

It returned a 2,000-word analysis that my old firm would have billed $3,000 for.

The Consistency Paradox

Federal judges are human. They have bad days, implicit biases, and occasionally, they phone it in on routine motions. GPT-5 doesn't.

I ran the same complex tax case through GPT-5 twenty times with slightly different phrasings. The core legal analysis remained consistent every time. The conclusions were identical.

Only the explanation style varied.

Try getting that from twenty different judges. Hell, try getting that from the same judge on different days.
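The consistency test above is easy to automate. Here's a minimal sketch of the harness I'd use: it normalizes each response and scores how often the extracted conclusion agrees across runs. The `extract_conclusion` heuristic is hypothetical (it assumes the analysis ends with a "Conclusion:" line or a final sentence), and the canned strings stand in for twenty live model runs:

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so phrasing noise doesn't count."""
    return re.sub(r"\s+", " ", text.strip().lower())

def extract_conclusion(analysis: str) -> str:
    """Hypothetical heuristic: take the text after 'Conclusion:' if present,
    otherwise fall back to the final sentence of the analysis."""
    match = re.search(r"conclusion:\s*(.+?)(?:\.|$)", analysis,
                      re.IGNORECASE | re.DOTALL)
    if match:
        return normalize(match.group(1))
    sentences = [s for s in re.split(r"(?<=\.)\s+", analysis.strip()) if s]
    return normalize(sentences[-1]) if sentences else ""

def consistency_score(responses: list[str]) -> float:
    """Fraction of runs whose conclusion matches the most common one."""
    conclusions = [extract_conclusion(r) for r in responses]
    most_common_count = Counter(conclusions).most_common(1)[0][1]
    return most_common_count / len(conclusions)

# Canned responses standing in for twenty live runs of the same tax case:
runs = [
    "The deduction fails. Conclusion: the taxpayer cannot deduct the loss.",
    "Analysis differs in style. Conclusion: The taxpayer cannot deduct the loss.",
    "Conclusion: the taxpayer cannot deduct the loss.",
]
print(consistency_score(runs))  # 1.0 when every run reaches the same conclusion
```

A score of 1.0 means perfect agreement on the bottom line, even when the explanation style varies, which is exactly what I observed.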

Speed Without Sacrifice

Here's the number that broke my brain: GPT-5 completed what would be 40 billable hours of research in 8 minutes.

Not rushed, surface-level summaries. Deep analysis with pin citations, alternative arguments, and strategic considerations. The kind of work senior associates pride themselves on.

Where the Hype Breaks Down

But here's where the story gets interesting. Because GPT-5 is also catastrophically wrong in ways that would get you disbarred.

The Hallucination Problem Isn't Gone

Last Thursday, I asked GPT-5 about a niche area of maritime law. It confidently cited _Morrison v. Neptune Shipping Corp_, complete with a federal reporter citation and a compelling quote about admiralty jurisdiction.

That case doesn't exist.

When I called it out, GPT-5 apologized and provided three more cases. Two were real. One was completely fabricated, down to the docket number.

This isn't a training problem you can fix with RLHF. It's fundamental to how these models work. They're probability machines, not truth machines.

And in law, the difference between "probably correct" and "definitely correct" is the difference between keeping and losing your license.

Context Windows Are Still Too Small

Real litigation involves thousands of documents. Discovery in my last big case produced 2.8 million pages.

Even with GPT-5's expanded context window, you're looking at chunking, summarizing, and hoping you don't lose critical details in the compression.

I tried feeding it a full merger agreement with all exhibits — 400 pages of dense legalese. It choked. Not technically (it processed it), but qualitatively.

It missed subtle inconsistencies between the main agreement and Exhibit J that a first-year associate would catch.
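For what it's worth, the chunking I mentioned looks roughly like this. A minimal sketch, assuming word counts as a crude stand-in for real tokenization: overlapping chunks mean a detail near a boundary lands in two chunks instead of being silently cut, but nothing here guarantees the model connects Exhibit J back to the main agreement:

```python
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks with overlap, so details near a
    chunk boundary appear in two chunks instead of being silently cut."""
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = ("word " * 1200).strip()  # stand-in for a long agreement
pieces = chunk_document(doc, chunk_size=500, overlap=50)
print(len(pieces))  # 3 chunks: words 0-500, 450-950, 900-1200
```

The failure mode is obvious once you see it: any inconsistency whose two halves sit in different chunks depends entirely on the summarization layer preserving both.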

It Doesn't Understand Power Dynamics

Law isn't just logic. It's politics, personality, and power wrapped in Latin phrases.

GPT-5 can tell you what the law says. It can't tell you that Judge Martinez hates discovery disputes and will sanction you for bringing weak motions.

It doesn't know that opposing counsel just went through a divorce and might be more amenable to settlement. It can't read the room when a judge's questions signal they've already decided against you.

These aren't edge cases. They're the entire game at the trial court level.

The Uncomfortable Truth About Legal Practice

Here's what the Stanford study actually reveals, and what nobody wants to admit: Most legal work isn't legal reasoning.

It's client management. It's strategy based on incomplete information. It's navigating courthouse politics. It's making judgment calls about risk that no amount of pattern matching can replicate.

GPT-5 beating judges at legal reasoning is like saying a calculator beats mathematicians at arithmetic. True, but it misses the point of what mathematicians actually do.

What I'm Actually Doing Now

I haven't quit law. But I've completely changed how I practice.

My new workflow looks like this:

1. **Initial Analysis**: GPT-5 gets first crack at every research question. I use this prompt template:

```
Analyze [specific legal question] under [jurisdiction] law. Include:
(1) governing statutes, (2) key cases with pin cites, (3) circuit splits
or disagreements, (4) strategic considerations. Flag any areas of uncertainty.
```

2. **Verification Layer**: Everything gets verified. Every case, every quote, every citation. I built a Python script that automatically checks citations against Westlaw's API.

3. **Strategic Overlay**: This is where humans still matter. What's the judge's temperament? What's opposing counsel's weakness? What story will resonate with this particular jury pool?

4. **Client Translation**: GPT-5 writes the first draft of client explainers. I edit for empathy and context that AI can't grasp.
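The verification layer in step 2 boils down to: extract every citation from the draft, then check each one against a citator. A minimal sketch of that flow, with the actual Westlaw call swapped for a local lookup so it runs standalone (the real API call is an assumption I won't reproduce here, and the regex only covers simple federal reporter formats):

```python
import re

# Matches simple federal reporter cites like "551 U.S. 308" or "812 F.2d 101".
CITE_PATTERN = re.compile(r"\b\d{1,4}\s+(?:U\.S\.|F\.\d?d|F\. Supp\. \d?d)\s+\d{1,4}\b")

KNOWN_CITES = {          # stand-in for a live citator lookup
    "551 U.S. 308",      # Tellabs, Inc. v. Makor Issues & Rights
    "544 U.S. 336",      # Dura Pharmaceuticals, Inc. v. Broudo
}

def extract_citations(text: str) -> list[str]:
    """Pull every reporter citation out of a draft analysis."""
    return CITE_PATTERN.findall(text)

def verify_citation(cite: str) -> bool:
    """Stand-in for the real citator call: True only if the cite resolves."""
    return cite in KNOWN_CITES

def flag_suspect_cites(analysis: str) -> list[str]:
    """Return every citation the citator cannot confirm."""
    return [c for c in extract_citations(analysis) if not verify_citation(c)]

draft = ("Under Tellabs, 551 U.S. 308, scienter must be strong. "
         "But see Morrison v. Neptune Shipping Corp, 812 F.2d 101.")
print(flag_suspect_cites(draft))  # the hallucinated cite gets flagged
```

The cite attached to the fabricated _Morrison_ case is made up for the example, which is precisely the point: the script flags anything it can't confirm, and a human runs down every flag.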

The result? I'm doing better work in less time. But I'm also having an existential crisis about what "being a lawyer" means in 2026.

The Prediction Nobody Wants to Hear

By 2028, I think most transactional law will be AI-first. Contract drafting, due diligence, regulatory compliance — GPT-6 or GPT-7 will handle 90% of it with minimal human oversight.

Litigation will hold out longer. Not because the legal reasoning is harder, but because litigation is theater. And audiences still prefer human actors.

The lawyers who survive won't be the ones who know the most law. They'll be the ones who understand that law was never really about the law.

It was about translating human problems into systemic solutions, and systemic solutions back into human outcomes.

GPT-5 can do the translation. It can't do the understanding.

The Question I Can't Stop Asking

I've been testing GPT-5's legal capabilities for two weeks now. It's better at legal reasoning than I am. It's faster, more consistent, and has perfect recall of every case ever published.

So why do clients still hire me?

I think it's because when their world is falling apart — when they're facing bankruptcy, divorce, or criminal charges — they don't need perfect legal analysis.

They need someone who understands what it feels like to lose everything. Who can say "I've seen this before, and you'll get through it" and mean it.


GPT-5 can tell you what the law says about your situation.

Only a human can hold your hand while your life implodes and promise it gets better.

Maybe that's worth $180,000 in student loans after all.

---

**So here's my question for you**: If you're in a knowledge profession — law, medicine, engineering, whatever — have you run your own expertise against GPT-5 yet?

What did you discover about what you actually do versus what you thought you did?

Because I'm starting to think the real disruption isn't AI replacing us. It's AI forcing us to admit most of our "expertise" was just information retrieval with extra steps.

And maybe that's the most liberating thing that could happen to professional work.

---

Story Sources

Hacker News
papers.ssrn.com

From the Author

- **TimerForge** — Track time smarter, not harder. Beautiful time tracking for freelancers and teams. See where your hours really go.
- **AutoArchive Mail** — Never lose an email again. Automatic email backup that runs 24/7. Perfect for compliance and peace of mind.
- **CV Matcher** — Land your dream job faster. AI-powered CV optimization. Match your resume to job descriptions instantly.
- **Subscription Incinerator** — Burn the subscriptions bleeding your wallet. Track every recurring charge, spot forgotten subscriptions, and finally take control of your monthly spend.
- **Email Triage** — Your inbox, finally under control. AI-powered email sorting and smart replies. Syncs with HubSpot and Salesforce to prioritize what matters most.

Hey friends, thanks heaps for reading this one! 🙏

If it resonated, sparked an idea, or just made you nod along — I'd be genuinely stoked if you'd show some love. A clap on Medium or a like on Substack helps these pieces reach more people (and keeps this little writing habit going).

Pythonpom on Medium ← follow, clap, or just browse more!

Pominaus on Substack ← like, restack, or subscribe!

Zero pressure, but if you're in a generous mood and fancy buying me a virtual coffee to fuel the next late-night draft ☕, you can do that here: Buy Me a Coffee — your support (big or tiny) means the world.

Appreciate you taking the time. Let's keep chatting about tech, life hacks, and whatever comes next! ❤️