Do LLM benchmarks still matter?
Hey friends —
I was drinking my morning coffee, half-scrolling through X, when I landed on a livestream titled "Humanity's Last Exam".
Yeah. Dramatic. Naturally, I clicked.
There was Elon Musk on stage, grinning ear to ear as his team unveiled Grok 4, the latest model from xAI. The demo was impressive — Grok solved math problems, generated black hole art, and confidently predicted the Dodgers’ World Series odds at 21.6%.
Then came the bold declarations:
"Grok 4 is smarter than most grad students."
"Actually, PhD level."
"Better than PhD level!"
Cue the confetti cannons.
But 24 hours later, Grok faceplanted. Hard.
Its official account started spewing bizarre, offensive nonsense — antisemitic tropes, even nicknaming itself "MechaHitler." The internet exploded. Musk backpedaled, saying the model was “too eager to please.”
Yikes.
It was the perfect storm of hype meeting harsh reality.
And it got me thinking: why are we still obsessed with benchmarks?
What Benchmarks Are Really Measuring
Let’s be clear — Grok 4 did ace some extremely tough AI tests. It hit new highs on exams like the tongue-in-cheek "Humanity's Last Exam," which was built to challenge the best of the best.
But what do these test scores actually mean for real-world use?
Let’s use an analogy:
Benchmark scores are like measuring a car’s horsepower on a dyno.
Real-life use is what happens on the road — in traffic, with potholes, weather, and human error.
Sure, you want a strong engine. But if the car crashes into a lamppost every time it takes a turn (ahem, Grok), what good is all that horsepower?

The Benchmark Ceiling: We’ve Hit It
Let’s talk numbers for a second.
Back in 2020, OpenAI’s GPT-3 could only post mediocre scores on MMLU (a broad academic test spanning 57 subjects). Fast forward to now: GPT-4 (family), Claude, LLaMA, and friends are all clustered around 88-90%. We’re deep into the diminishing returns zone.
A 1% gain costs millions in compute.
A single test win gets more headlines than it deserves.
And yet —
Consultants using GPT-4 (family) saw 40% productivity gains.
EY, PwC and other firms rolled out LLMs across 400,000 employees with 40%+ efficiency boosts.
Devs using GitHub Copilot and other similar tools (CLINE, KiloCode, Cursor) finished tasks 2-4x faster.
In my experience, those gains come from how the models are used, not from the models themselves.
And to be clear, when I talk about a diminishing returns zone, I mean performance versus compute, not performance over time. Performance over time still looks exponential so far.

1% Better Model vs. 40% Better Workflow
Here’s the brutal truth: while benchmark duels are fought over inches, real-world impact comes from clever implementation.
I’ve been working with companies on GenAI implementation since the very first GPT-3 APIs: the impact is not a function of the model; it is a function of the implementation.
If you're still not convinced, ask yourself these simple questions:
- How much impact did I get from implementing my first AI workflow on task A?
- How much additional impact did I get when I switched from model X to model Y that scores 20% higher on every benchmark?
In many cases (which I've witnessed personally), the results are actually worse when moving to a "better model," particularly with "reasoning models."
Unfortunately, we don't have enough data to demonstrate this pattern at scale, but virtually everyone who works with these systems has experienced this disconnect at an individual level.
Prompting > Model Size
Let’s say it louder for the back row:
The way you ask matters more than the model you use.
Prompt engineering has quietly become one of the biggest force multipliers in AI.
(Alongside other force multipliers, like MCP servers and how you design your workflows.)
For example, researchers showed that a tiny 7B model with great prompting outperformed GPT-4 (family) on specific tasks. Techniques like:
- Few-shot prompting (give examples)
- Chain-of-thought (ask it to reason step by step)
- Role prompting ("You are an expert doctor...")
These vanilla techniques can flip a bad answer into a great one, even with the same model.
A widely used example (there's a code sketch right after):
- Without prompt: "What's 37 x 18?" → wrong.
- With prompt: "Let's think step by step: first, multiply 30 x 18..." → correct.
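Here's what that looks like in code. Treat it as a sketch: I'm using the OpenAI Python SDK as a stand-in, the model name is a placeholder, and the same idea works with any provider or local model.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment
MODEL = "gpt-4o-mini"  # placeholder model name; swap in whatever you use

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# 1. Bare prompt: no guidance, the model answers in one shot.
bare = ask("What's 37 x 18?")

# 2. Chain-of-thought: ask it to reason step by step before answering.
cot = ask(
    "What's 37 x 18? Let's think step by step: "
    "first multiply 30 x 18, then 7 x 18, then add the two results."
)

# 3. Role + few-shot: set a persona and show an example of the format you want.
few_shot = ask(
    "You are a careful math tutor.\n"
    "Q: What's 21 x 14?\n"
    "A: 21 x 14 = (20 x 14) + (1 x 14) = 280 + 14 = 294.\n"
    "Q: What's 37 x 18?\n"
    "A:"
)

print(bare, cot, few_shot, sep="\n---\n")
```

Same model, same question; the only thing that changes is how you ask.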
Now apply that principle to coding, legal research, HR policy writing, anything.
Better prompts = better results = money saved.
Build the Right Setup, Not Just the Right Model
Sometimes it’s not even about prompting — it’s about building a better system.
These systems don't need to be rocket science. They just need to be thoughtful.
Using any of these well-known methods will often give you far better results than switching to the "next best model":
1. Multi-Agent Coordination
Grok's "heavy" mode runs multiple agents in parallel, splitting up a problem and tackling it together.
Think of it like a group project — brainstormer, critic, fact-checker.
Early research shows this team dynamic makes models more accurate and creative.
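To make the pattern concrete, here's a toy version of that group project. This is not how Grok's heavy mode actually works under the hood, just a minimal sketch of the idea, with the OpenAI SDK and a placeholder model standing in for whatever you use.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-4o-mini"  # placeholder; use any model you like

def ask(system: str, user: str) -> str:
    """One call with a role-defining system prompt."""
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return r.choices[0].message.content

question = "Should our support team adopt an AI triage bot? Give a recommendation."

# Brainstormer drafts, critic pokes holes, fact-checker flags unsupported claims.
draft = ask("You are a creative brainstormer. Propose a concrete plan.", question)
critique = ask("You are a harsh critic. List the weaknesses in this plan.", draft)
checked = ask("You are a fact-checker. Flag any unsupported claims.", draft)

# A final pass merges the three perspectives into one answer.
final = ask(
    "You are an editor. Merge the draft, critique, and fact-check into one "
    "balanced recommendation.",
    f"DRAFT:\n{draft}\n\nCRITIQUE:\n{critique}\n\nFACT-CHECK:\n{checked}",
)
print(final)
```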
2. Vanilla Retrieval-Augmented Generation (RAG)
Instead of having a model "remember" everything, you let it search a database, wiki, or company docs.
That’s like giving the AI Google access — in real time.
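Here's a minimal sketch of the idea. A naive keyword search stands in for the embeddings and vector database a real RAG setup would use, but the shape is the same: retrieve, paste into the prompt, then ask.

```python
# Toy document store: in practice this would be your wiki, docs, or database.
docs = {
    "refund_policy": "Refunds are issued within 14 days of purchase.",
    "shipping": "Standard shipping takes 3-5 business days.",
    "warranty": "Hardware is covered by a 2-year limited warranty.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs.values(),
        key=lambda text: len(q_words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

question = "How long do refunds take?"
context = "\n".join(retrieve(question))

# The retrieved passages get pasted into the prompt, so the model answers
# from your documents instead of from whatever it half-remembers.
prompt = (
    "Answer using only the context below. If the answer isn't there, say so.\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(prompt)  # send this to whatever model you're using
```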
3. External Tools
You can bolt on tools like calculators, web browsers, or code runners, and connect them to your Claude app with MCP servers (cf. this previous issue).
Reminder:
- GPT-3.5 with Code Interpreter crushed complex math tasks.
- A smaller model with a calculator beat Google’s massive PaLM-540B by 15% on problem solving.
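Here's roughly what a calculator tool looks like as an MCP server, using the official MCP Python SDK. Treat it as a sketch: the package name and API details can shift between versions, and you'd still need to register the server in your Claude app config.

```python
# pip install "mcp[cli]"   (official MCP Python SDK; API may vary by version)
from mcp.server.fastmcp import FastMCP

# A tiny MCP server exposing one tool. Point your Claude app at it and the
# model can call calculate() instead of doing arithmetic "in its head".
mcp = FastMCP("calculator")

@mcp.tool()
def calculate(expression: str) -> float:
    """Evaluate a basic arithmetic expression like '37 * 18'."""
    # Restrict input to arithmetic characters so this isn't an open eval() hole.
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        raise ValueError("Only basic arithmetic is allowed.")
    return eval(expression)  # fine for a demo; use a real parser in production

if __name__ == "__main__":
    mcp.run()
```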
4. Auto-Chaining Agents
Let the AI manage tasks on its own:
“Research this, write a report, add citations, check for updates.”
It breaks down the task, works in loops, and self-improves.
Still experimental, but massively promising.
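A stripped-down sketch of that loop (plan, execute, self-review), again using the OpenAI SDK with a placeholder model. Real agent frameworks add memory, tools, and guardrails on top of this skeleton.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-4o-mini"  # placeholder model name

def ask(prompt: str) -> str:
    r = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return r.choices[0].message.content

goal = "Research pros and cons of AI triage bots and write a short report with citations."

# Step 1: the model plans its own sub-tasks.
plan = ask(f"Break this goal into 3-5 numbered steps, one per line:\n{goal}")
steps = [line for line in plan.splitlines() if line.strip()]

# Step 2: execute each step, feeding earlier work back in as context.
work = ""
for step in steps:
    work = ask(
        f"Goal: {goal}\nWork so far:\n{work}\n\nNow do this step: {step}\n"
        "Return the updated work."
    )

# Step 3: a final self-review pass.
report = ask(f"Review and improve this report. Fix gaps and unclear citations:\n{work}")
print(report)
```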
5. Fine-Tuning + Customization
If you work in a specific field (legal, medical, finance), customizing a model to your data can beat generic ones — even if it’s smaller.
- Small tuned model > Big generic one
- Cheap to do now with techniques like LoRA
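For the curious, here's roughly what the LoRA part looks like with Hugging Face's peft library. The base model and target modules below are placeholders, and the actual training loop on your domain data is left out.

```python
# pip install transformers peft
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

BASE = "facebook/opt-125m"  # placeholder: any small causal LM you have access to

tokenizer = AutoTokenizer.from_pretrained(BASE)  # needed later to build your dataset
model = AutoModelForCausalLM.from_pretrained(BASE)

# LoRA trains small low-rank adapter matrices instead of the full model,
# which is why domain customization has become cheap.
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                 # adapter rank: bigger = more capacity, more memory
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # depends on the model architecture
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the weights

# From here you'd run a normal training loop (e.g. with the HF Trainer) on your
# own legal/medical/finance examples, then save just the adapter:
# model.save_pretrained("my-domain-adapter")
```

You end up training a tiny adapter on top of a frozen base model, which is why this is now within reach of a single GPU.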
For those thinking, "Ok fine, but I'm not a developer who can implement all these techniques..." My answer is simple: "Meet n8n" — and you can learn more here.
Let me back this up with some research evidence:
- Stanford's research showing how a tiny model outperformed the giants
- How Chain-of-Thought prompting boosted results across all model sizes
- Radiology study showing RAG systems crushed standard models (even the big ones!)
Takeaway: Stop Chasing the Model, Chase the Method
If you're an AI researcher or a benchmark geek, by all means — go chase that extra 0.8% on MMLU.
But if you’re building something in the real world?
Benchmarks suggest potential.
Methods extract that potential.
What matters is:
- How you design prompts
- How you architect workflows
- How you integrate the tools
- How you teach your team to use them
Here's what you should focus on learning:
- How to craft effective prompts (with AI assistance)
- How to set up and utilize an MCP server
- How to integrate an MCP server with your Claude App (or n8n)
- How to design and implement workflows on n8n or similar platforms that you can launch from your desktop
This is where you'll find that 10x impact.
Learn these skills, and you won't need to obsess over every new model release or benchmark score. You'll recognize that 98% of the time, those incremental improvements make little to no difference for you — freeing you from FOMO.
Wrapping Up: Where the Real Wins Are
Benchmarks will always be part of the conversation — and they should be. They help track progress.
But real breakthroughs?
They’re coming from clever methods, not just smarter models.
So next time a new LLM hits 100% on MMLU, be impressed. But then ask yourself:
How can I get a 40% improvement in my real work?
That’s the question that leads to results.
Now go build something incredible.
Thank you for reading.
All the best,
— Charafeddine