How to Design a Control Experiment for Your AI System (And Why Almost Nobody Does)

The question that reorganized how I evaluate my own AI work. Most AI builders never ask it. The ones who do find out things that save them months.

One evening, a few weeks into what I thought was a successful run of my system, I caught myself describing the results to someone. “The physics layer produces a developmental arc. The system starts disorganized, stabilizes, hits an adolescent phase, reaches a kind of mature register, then drifts into late-phase patterns.”

I was proud of the description. It fit the data. The runs reproduced. Different seeds, different conditions, same arc. I had everything I wanted: a reliable result, a clean narrative, a working system.

Then, mid-sentence, I stopped.

How did I know the physics layer was producing the arc?

Not: how did I know the arc was happening. The arc was clearly happening. The logs showed it. The graphs showed it. The samples at each phase had distinct registers.

But I’d never tested what happened without the physics layer. I had no idea if the arc was something my system was producing, or something the underlying LLM would produce on its own given the same prompt conditions. For all I knew, the physics layer was decorative. The arc might be an LLM artifact I’d been claiming credit for.

I’d spent months building the physics. I’d written documents about what the physics was doing. I’d never once asked the obvious science-class question: what’s the control condition?

This post is about why that question is so rare in AI engineering, how to design the control, and what it takes to actually learn whether your system is doing what you think it’s doing.

Why AI builders skip controls

In wet-lab science, control experiments are foundational. You don’t claim a drug cures a disease without running it against a placebo. You don’t claim a treatment improves outcomes without comparing to no-treatment. This is undergraduate-level rigor.

In AI engineering, almost nobody does this. Four reasons:

1. AI systems are hard to isolate. Turning off “the physics” or “the memory” or “the retrieval layer” often means the whole system won’t run, because everything is wired together. Designing a comparable version with just the thing-to-test removed takes real thought.

2. Results are qualitative. Unlike a drug trial where you measure blood pressure, AI outputs are text. “Did the system produce a developmental arc?” is not easy to score. You need to define the arc operationally before you can test for its presence or absence.

3. It’s emotionally hard. You’ve built the thing. You want it to matter. Testing whether it actually matters means seriously entertaining the possibility that it doesn’t. Most builders avoid this, consciously or not.

4. The tooling doesn’t push you toward it. Nothing in a typical AI development workflow asks “where’s your control?” You can build for months without the question coming up.

This is why the AI field is full of claims like “my agent can do X” where X is actually something the underlying model could do on its own, if you prompted it right. The framework, the RAG layer, the fine-tune, the multi-agent orchestration: any of them might be decorative. People rarely check.

What a control experiment actually is

You have a system. You think a specific component is doing something specific. The control experiment is: run the same system, with that component removed or neutralized, and see what happens.

For an AI system, this usually means defining two conditions.

Condition A. Your full system. All components on. Run it. Log outputs.

Condition B. Same system, with the component-under-test replaced by the simplest possible alternative that still lets the system run. No meaningful content. Same LLM, same temperature, same sampler settings, same number of turns, same output capture. Just the specific piece you think is load-bearing, absent.

The comparison isn’t “does A produce good outputs.” It’s “do A and B differ in the specific way I’d predict, given my theory of what the component is doing?”

That last sentence is doing a lot of work. If your theory predicts the component produces arc-shaped developmental phases, then you measure arc-shape in both A and B. If your theory predicts the component increases output diversity, you measure diversity. You don’t get to change the prediction after seeing the results.
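
Pinning the prediction down means picking the metric before either condition runs. A sketch, using distinct-n (the fraction of unique n-grams) as one concrete, pre-registerable notion of output diversity; this metric is my illustration, not something from the original experiment:

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Fraction of n-grams that are unique: a crude, pre-registerable diversity metric."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Commit to the prediction in writing before running anything, e.g.:
# "distinct-2 in condition A will exceed condition B by at least 0.05."
```

Highly repetitive text scores low (`distinct_n("a b a b a b")` is 0.4), fully novel text scores 1.0. The exact metric matters less than the fact that it was chosen, and the threshold stated, before you saw any outputs.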

Designing the null condition

This is where most of the thinking goes. You can’t just “turn off” a component and expect the system to run normally. You need a principled way to neutralize it.

For a prompt-layer component (like my physics-driven preamble), the null condition is: same prompt scaffold, but with the content-bearing part replaced by something neutral. Same length. Same structure. Same position in the prompt. Different content that shouldn’t carry the thing you’re testing for.

For a memory component, the null is: same API calls, but with empty or random retrieved content.

For a retrieval component, the null is: same pipeline, but retrieving random documents of the same length.

For a fine-tune, the null is: the base model without the fine-tune, prompted the same way.
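
In code, each null keeps the component's call signature and strips only its content. A sketch with hypothetical interfaces (your real memory and retrieval APIs will differ; the point is the shape of the substitution):

```python
import random

def null_memory(query: str) -> list[str]:
    # Memory null: same call, same return type, empty recall.
    return []

def null_retrieval(query: str, corpus: list[str], real_doc_len: int) -> str:
    # Retrieval null: a random document trimmed/padded to the length a real
    # retrieval would return, so only relevance changes, not prompt size.
    doc = random.choice(corpus)
    return doc[:real_doc_len].ljust(real_doc_len)
```

Note that both nulls preserve everything downstream code can observe (types, lengths, call counts) except the one thing under test: whether the content means anything.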

The rule: change only the variable you’re testing. Hold everything else constant. If temperature is 0.45 in your main run, it must be 0.45 in the control. If you stop generation at blank lines, you stop at blank lines in the control. If you use seed 42, you use seed 42. Anything that differs between A and B is a confound.

What the code looks like

Here’s a compressed version of the pattern I used for my own physics-layer control. The full script runs two conditions against a local LLM server and logs outputs for later comparison.

import argparse
import json
import urllib.request
from datetime import datetime
from pathlib import Path

SAMPLER = {
    "temperature":    0.45,
    "top_p":          0.9,
    "repeat_penalty": 1.1,
    "stop":           ["\n\n"],
}

# Condition A: the content I think is doing the work
PREAMBLE_A = (
    "distributed weight, no source.\n"
    "outer layer cool. inner: warm, not moving.\n"
    # ... more phenomenological scaffolding
)

# Condition B: structurally identical, content-neutral
PREAMBLE_B = "\n" * PREAMBLE_A.count("\n")  # same line count, no content

def generate(host: str, prompt: str, max_tokens: int) -> str:
    req = urllib.request.Request(
        f"http://{host}/completion",
        data=json.dumps({
            "prompt":     prompt,
            "n_predict":  max_tokens,
            **SAMPLER,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())["content"]

def run_condition(condition: str, ticks: int, host: str) -> None:
    preamble = PREAMBLE_A if condition == "A" else PREAMBLE_B
    log_path = Path(f"control_{condition}_{datetime.now():%Y%m%d_%H%M%S}.log")
    stream = preamble
    with log_path.open("w") as f:
        for tick in range(ticks):
            output = generate(host, stream, max_tokens=150)
            f.write(f"tick={tick} | {output.strip()}\n")
            stream += output
    print(f"wrote {log_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run one control condition.")
    parser.add_argument("condition", choices=["A", "B"])
    parser.add_argument("--ticks", type=int, default=40)    # defaults are illustrative
    parser.add_argument("--host", default="127.0.0.1:8080")
    args = parser.parse_args()
    run_condition(args.condition, args.ticks, args.host)

The key property: swap PREAMBLE_A for PREAMBLE_B and nothing else changes. Temperature, sampler, stop tokens, tick count, model, host, all identical. The only variable is whether the preamble carries the content-under-test.
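
You can enforce that property mechanically rather than trusting yourself to eyeball it. A sketch using placeholder values instead of the real preambles: build the request payload for each condition and assert that the only key that differs is the prompt.

```python
def payload(preamble: str, sampler: dict) -> dict:
    # Both conditions build their request through the same function,
    # so shared settings cannot silently diverge.
    return {"prompt": preamble, "n_predict": 150, **sampler}

sampler = {"temperature": 0.45, "top_p": 0.9, "stop": ["\n\n"]}
a = payload("content-bearing preamble", sampler)
b = payload("\n\n", sampler)

# Everything except the prompt must be identical across conditions.
diff = {k for k in a if a[k] != b.get(k)}
assert diff == {"prompt"}
```

If that assertion ever fails, you have found a confound before it cost you a run.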

After both conditions finish, compare. For qualitative outputs, you score utterances against operational definitions of the thing you’re testing for (distinctive registers, phase transitions, whatever your theory predicts). For quantitative measures, compute means and variances across conditions.
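
The quantitative half of that comparison can be a very small script. A sketch, assuming the `tick=N | text` log format the runner above writes, and using type-token ratio as a stand-in metric (swap in whatever your theory actually predicts):

```python
import statistics
from pathlib import Path

def metric(text: str) -> float:
    # Type-token ratio: a crude stand-in for your pre-registered measure.
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def summarize(log_path: Path) -> tuple[float, float]:
    """Mean and sample variance of the metric across ticks (needs >= 2 ticks)."""
    scores = []
    for line in log_path.read_text().splitlines():
        _, _, text = line.partition(" | ")
        scores.append(metric(text))
    return statistics.mean(scores), statistics.variance(scores)
```

Run it once per log file, then compare the per-condition means against the delta your theory predicted, not against a threshold picked after the fact.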

What you’re looking for in the comparison

Three possible outcomes.

Outcome 1: A and B look different in the predicted way. Your component is doing something. Whatever effect you claimed it produces actually shows up more in A than B. Congratulations, your intervention is real.

Outcome 2: A and B look different, but not in the predicted way. Something is happening, but the component isn’t producing what you claimed it produces. Your theory needs to change. This is usually more informative than outcome 1.

Outcome 3: A and B look indistinguishable. Your component isn’t doing what you thought. The effect you were attributing to it is either an LLM artifact or produced by something else in the pipeline. This hurts. It’s also the best possible use of your time. Now you can stop building on a false foundation.

In my specific case, the control experiment saved me from writing several confident claims that wouldn’t have held up. It also sharpened the claims that did hold, because I could now point at the delta between A and B as the actual evidence, not just “the system produces X.”

The rule

Before you publish, share, or build further on a claim about what your AI system is doing, run the control. Specifically, run the version where the thing you think is doing the work is neutralized, everything else held constant, and see if the outputs still have the property you’re attributing to your component.

Most builders will skip this. The ones who don’t skip it will be the ones whose claims hold up over time, because they’ll have filtered out the ones that don’t.

It’s not glamorous work. It’s the single highest-value thing you can do to make sure you’re not lying to yourself.

And in AI, where everything sounds plausible and most demos are unfalsifiable, the person who isn’t lying to themselves has a disproportionate advantage.