How to Know If Your AI System Is Actually Getting Better
Most AI builders evaluate by vibes and confirmation bias. Here's how to tell real improvement from the illusion of it, whether you're tweaking prompts, tuning retrieval, or measuring an agent loop.
You made a change to your AI system. You ran it a few times. It seems better.
Is it actually better?
Most AI builders answer this by feel. The last three outputs seemed sharper. The bot felt more on-character. The agent finished the task without getting stuck. All three are signals. None of them are evaluation.
This post is about the difference. Evaluation is the work that turns “seems better” into “is better, and here’s how I know.” Most builders skip it. The ones who don’t have a disproportionate advantage, because they stop chasing phantom improvements and start shipping real ones.
The naive failure mode
You change a system prompt. You run your bot three times. It feels better. You ship the change.
What might have happened:
- The three runs were on tasks that happen to favor the new prompt, and on a different task distribution the new prompt is worse.
- The three runs were on seeds where the model happened to pick better sampling paths, and with different seeds it’d be identical to before.
- You’re primed to see improvement because you just spent an hour tuning. Every output reads “better” because you wanted it to.
- The new prompt is better for the specific turn you tested but drifts faster in long conversations, which you didn’t test.
Any of these would produce “seems better” without the change actually being better. If your only evaluation was “did it feel better on three runs,” you’d ship all four of these the same way.
That’s the failure mode. “Seems” is a starting point. It’s not an ending point.
Step 1: Operationalize “better” before you test
Before you run anything, answer one question in writing: what specifically am I expecting to change, and how will I measure it?
“The bot will be smarter” is not an answer. It’s a wish.
“The bot will hallucinate fewer package names in Python code” is an answer. Now you can test: run 20 prompts asking for Python snippets, count the hallucinated imports, compare pre and post.
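That check is mechanical enough to automate. Here's a minimal sketch, assuming a hypothetical allow-list of known packages; in practice you'd check against your installed environment or a package index rather than a hard-coded set:

```python
import ast

# Hypothetical allow-list for illustration only. In a real pipeline you'd
# build this from your environment (e.g. installed distributions) or PyPI.
KNOWN_PACKAGES = {"os", "sys", "json", "csv", "requests", "numpy"}

def hallucinated_imports(code: str) -> list[str]:
    """Return top-level imported names that aren't in the allow-list."""
    tree = ast.parse(code)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found += [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.append(node.module.split(".")[0])
    return [name for name in found if name not in KNOWN_PACKAGES]

snippet = "import numpy\nimport totally_made_up_pkg\n"
print(hallucinated_imports(snippet))  # ['totally_made_up_pkg']
```

Run this over 20 generated snippets before and after the prompt change and you have a number instead of a vibe.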
“The agent will solve more tasks” is not an answer either. Which tasks? How do you know a task is “solved”? What counts as a failure vs. a retry vs. a give-up?
“The agent will complete the user’s stated goal without asking clarifying questions, measured on the 15-task evaluation set I wrote last week” is an answer. You can test it, and you can tell if the test replicates.
If you can’t write down what you’re measuring, you’re not doing evaluation. You’re doing rationalization after the fact.
I wrote about this at length in How to Design a Control Experiment for Your AI System. Operationalizing before you test is step one of control experiment methodology, and it applies to every kind of evaluation, not just control work.
Step 2: One run is a story. Three runs with different seeds are data.
The single biggest mistake in AI evaluation is trusting a single run.
Language models are non-deterministic by default. The same prompt, run twice, gives you different outputs, sometimes noticeably so. Sampling temperature is the knob most responsible, but other factors contribute: the random seed, batching effects on the provider’s side, even silent model updates you’ll never be told about.
One run shows you one path through the probability space. Three runs with different seeds show you whether your result is a property of the system or a property of the seed.
The minimum viable evaluation is three runs per condition. Five is better. Ten or more if the effect you’re trying to measure is small.
Compare the means, not the individual runs. A change that improves the mean across three seeds by 15% is probably real. A change that made run #1 look great while run #2 looked the same as before is not a real effect, even if run #1 is the one you remember.
Write the seeds down. Rerun later with the same seeds if you want to verify your methodology. Deterministic reproducibility across seeds is the cheapest sanity check in AI evaluation and almost nobody does it.
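The multi-seed discipline above fits in a few lines. In this sketch, `run_system` is a hypothetical stand-in for your real system call (an API request with a fixed seed parameter, a score from your metric, whatever you measure); here it's a deterministic toy so the example runs:

```python
import statistics

def run_system(prompt: str, seed: int) -> float:
    # Stand-in for your real system call + metric. A deterministic toy
    # so the sketch is runnable; replace with your actual pipeline.
    return 0.70 + (seed % 5) * 0.01

SEEDS = [101, 102, 103]  # write these down; reuse them to verify reproducibility

scores = [run_system("summarize this ticket", s) for s in SEEDS]
mean = statistics.mean(scores)
spread = statistics.stdev(scores)
print(f"seeds={SEEDS} mean={mean:.3f} spread={spread:.3f}")
```

The point is the shape: fixed, recorded seeds, a mean, and a spread, not three anecdotes.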
Step 3: Watch for evaluator-generator attractor overlap
Here’s a subtle one that catches people who are otherwise doing everything right.
Say your AI system generates output, and another AI (or the same AI with a different prompt) evaluates whether the output is good. You run your evaluator against your generator’s output. The evaluator gives high scores. You conclude the generator is doing well.
Maybe. Maybe not. If your evaluator has the same training-era attractors as your generator, it’ll rate the generator’s output highly because the attractors align. The generator produces “four-paragraph essay with a certain rhythm,” the evaluator recognizes “four-paragraph essay with a certain rhythm” as good writing, the evaluation confirms the vibe the generator was already producing. No improvement was detected because no improvement happened, but also no improvement would be detected if it had happened.
I wrote about why this is so hard to escape in The Attractor Problem: Why AI Systems Collapse Into Themselves. The same dynamics apply when an AI is scoring: its priors are the same kind of priors that produced the output it’s scoring.
Two fixes:
- Evaluate with a different model family than you generated with. Claude generating, GPT evaluating. Or Gemini generating, DeepSeek evaluating. Their priors don’t align perfectly, so the evaluator can notice things the generator’s own family wouldn’t flag.
- Include human spot-checks. Not on everything, just on a sample. If your AI evaluator and your human spot-check agree, the AI evaluator is probably fine for this workload. If they disagree systematically, you’ve got evaluator-generator overlap and your automated scores aren’t worth what you think.
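The spot-check comparison can be as simple as an agreement rate over a shared sample. The pass/fail labels below are hypothetical; in practice they'd come from your AI evaluator and from a human reviewing the same sampled outputs:

```python
def agreement_rate(ai_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of sampled outputs where the AI evaluator and the
    human spot-check gave the same pass/fail verdict."""
    assert len(ai_labels) == len(human_labels)
    matches = sum(a == h for a, h in zip(ai_labels, human_labels))
    return matches / len(ai_labels)

# Hypothetical labels on the same eight sampled outputs.
ai = [True, True, False, True, True, False, True, True]
human = [True, False, False, True, True, False, True, False]
rate = agreement_rate(ai, human)
print(f"agreement: {rate:.0%}")  # 75% here
```

High agreement suggests the automated evaluator is safe to trust for this workload; systematic disagreement (e.g. the AI always passing what the human fails) is the overlap signature.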
Step 4: Pattern recognition is a compass, not evidence
Here’s the trap for anyone who’s been building AI for a while.
After enough time watching model outputs, you develop a sense for what “feels right” and what “feels off.” That sense is real. It’s the result of your brain learning the distribution of outputs a given system produces, and noticing when something is inside or outside that distribution.
But it’s not a validated instrument. It’s a compass. It points at things worth investigating. It does not tell you those things are what you think they are.
I learned this in three separate arcs on this site. The thirty-five chatbots I felt great about, until the drift showed up in the fourth week. The MoE discovery, where I spent ten months believing I was running a 30B model that was actually activating only 3B parameters. The over-engineering post-mortem, where my gut said “more sophistication = more progress” and the stripped-down baseline outperformed the sophisticated build. Every time, my gut told me one thing and the data told me another. The data won every time.
The discipline: treat your gut reaction as a signal worth investigating, not as proof of anything. When your gut says “this seems better,” write the eval that would confirm it, run it, accept whatever comes back. When your gut says “this seems worse,” do the same. Your gut is useful exactly to the extent that you refuse to trust it without verification.
This is how you avoid the confirmation-bias trap that most AI builders never even know they’re in.
The practical evaluation loop
Here’s the minimum working evaluation loop for any AI system change.
- Write down, in one sentence, what you expect the change to do.
- Write down the metric you’ll use to detect it.
- Write down the test set the metric will be computed over.
- Run the old version with at least three seeds on the test set. Record the mean and spread.
- Run the new version with the same seeds on the test set. Record the mean and spread.
- Compare. As a rough guide, if the means differ by more than the spread, the change likely did something real. With small sample sizes, treat this as a signal worth investigating rather than a confirmed result.
- Spot-check the outputs by hand on a few cases. If the automated metric and your spot-check disagree, your metric is probably wrong for this workload.
- If the change is real and desired, ship. If not, revert, and don’t let yourself remember “it felt better” as evidence of anything.
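The compare step of the loop can be sketched as follows. The per-seed scores here are hypothetical task success rates; the decision rule is the rough guide from the loop, a signal worth investigating rather than a significance test:

```python
import statistics

def compare_conditions(old_scores: list[float], new_scores: list[float]):
    """Rough decision rule: a mean difference larger than the spread
    is a signal worth investigating, not a confirmed result."""
    old_mean = statistics.mean(old_scores)
    new_mean = statistics.mean(new_scores)
    spread = max(statistics.stdev(old_scores), statistics.stdev(new_scores))
    delta = new_mean - old_mean
    verdict = "likely real" if abs(delta) > spread else "within noise"
    return delta, spread, verdict

# Hypothetical per-seed success rates, three seeds per condition.
old = [0.60, 0.62, 0.61]
new = [0.70, 0.73, 0.71]
delta, spread, verdict = compare_conditions(old, new)
print(f"delta={delta:.3f} spread={spread:.3f} -> {verdict}")
```

With more seeds or a smaller effect, reach for a proper statistical test instead of this rule of thumb.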
This is not glamorous. It takes time. Most builders skip it. The ones who don’t are building on reality instead of on confirmation bias.
What gets easier once you have this habit
Once you have an evaluation loop you trust, everything downstream gets cheaper.
- You stop chasing phantom improvements. If you can’t measure it, it doesn’t count. You save the hours you’d have spent polishing changes that weren’t actually changes.
- You can A/B your own work. Two prompts, same test set, same seeds: which one actually wins. No more “I feel like prompt A is better.” You know which one is better.
- You can evaluate other people’s claims. When someone posts “this technique doubled my agent’s success rate,” your first question is “on what test set, with how many seeds, against what baseline.” Most of the time they cannot answer. That tells you something.
- You accumulate a test set that grows with your project. Every time you notice a failure mode, add a case to the set. Over months the set becomes the shape of your system’s hard parts. That’s an asset.
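Accumulating that test set can be as simple as an append-only JSONL file. A minimal sketch, with an illustrative file location and record shape (not a standard format):

```python
import json
import pathlib
import tempfile

# Illustrative location; in a real project this would live in your repo.
CASES = pathlib.Path(tempfile.gettempdir()) / "eval_cases.jsonl"

def add_case(prompt: str, failure_mode: str, expected: str) -> None:
    """Append one observed failure to the regression set, one JSON object per line."""
    record = {"prompt": prompt, "failure_mode": failure_mode, "expected": expected}
    with CASES.open("a") as f:
        f.write(json.dumps(record) + "\n")

add_case(
    prompt="Write a Python snippet that reads a CSV",
    failure_mode="hallucinated import",
    expected="uses only stdlib csv or a package that actually exists",
)
```

Every future eval run replays this file, so old failure modes stay caught.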
The investment to set up the first eval is high. The investment to run evals after that is low, because the infrastructure carries over. It’s one of the highest-leverage habits you can build as an AI builder, and one of the least common.
The closing call
“Seems better” is a hypothesis. Evaluation is the work that confirms or kills it.
Operationalize before you test. Use multiple seeds. Watch for evaluator-generator overlap. Trust your gut exactly to the extent that you refuse to act on it without verification. That’s the whole discipline.
If you never ran the test, you never learned anything. You just feel like you did, which is worse than learning nothing.