How I Use an AI Orchestra to Build AI Research (And Why One AI Isn't Enough)

My actual writing and research workflow uses three AI systems in defined roles, checked by me as the regulator. Here's the methodology, why single-AI workflows underperform it, and how to set one up.

Most posts about “how I use AI to work” describe a single-AI workflow. ChatGPT for drafts. Claude for coding. Maybe both on alternating days.

That’s not how I actually work.

When I write a research post for this site, or investigate a technical question, or prepare an experiment design, three different AI systems are involved in defined roles. They check each other. I sit in the middle as the regulator, making final calls. I call it the AI orchestra. I’ve been running some variant of it for months. It consistently outperforms any single-AI workflow I’ve tried, and I’ll explain exactly why.

Before I get to the methodology: I disclose all of this in this site’s footer. Every post you read here is produced by a human researcher (me) working with AI tools, not by an AI alone. That distinction matters and I’ll come back to it at the end.

The problem with single-AI workflows

You pick one AI. You give it a prompt. It produces output. You use it, or rewrite it, or throw it away.

This feels efficient. It isn’t, for anything that requires real judgment. Three specific failure modes show up once you try to use a single AI for serious work:

1. Confirmation bias. Any single AI, prompted repeatedly on the same question, tends to converge on a consistent answer. That consistency feels like validation. It isn’t. The model is giving you the most probable answer given its priors. If your priors align with the model’s priors, every query reinforces both. You drift deeper into an attractor without knowing it.

2. Style homogenization. Every AI has characteristic patterns: favorite sentence structures, preferred paragraph rhythms, signature transitions. Your output starts to look like the AI’s voice, no matter how much you edit. Readers develop a sixth sense for this and find it exhausting.

3. Missed flaws. AI systems share certain blind spots. If you ask one to critique your work, it’ll find the flaws that AI systems are good at finding. It’ll miss the ones that AI systems reliably miss. You end up with work that passes AI review but has structural issues a different AI (or a human) would catch in thirty seconds.

None of these are solved by “prompt engineering harder.” They’re structural problems with single-AI workflows.

The orchestra: three roles, one regulator

My workflow has four participants. Three AI systems in specialized roles, and me as the regulator.

Role 1: Claude, for expansion and drafting. Given a topic I want to write about or investigate, Claude produces first-pass material: exploring the idea, drafting sections, writing code examples, finding angles I hadn’t considered. Claude is good at this because it’s patient, it reads long context cleanly, and it produces structured output.

Role 2: GPT, for adversarial review. After Claude has produced a draft, I give the draft to GPT with the instruction to poke holes in it. What’s weak? What’s overclaimed? What would a skeptical reader push back on? GPT produces a critique. Some of the critique is useful. Some isn’t. My job is to sort.

Role 3: A smaller local model (in my case Mistral), for claim compilation. When I need to pull the concrete claims out of a draft, to see what I’m actually asserting, I run a smaller local model over the text with a prompt like “list every factual claim this article makes, without interpretation.” The smaller model is useful here precisely because it’s less clever. It extracts what’s on the page rather than filling in what sounds right.

Role 4: Me, as the regulator. I take all three outputs and decide. Which Claude expansions to keep. Which GPT criticisms are real vs performative. Which claims in Mistral’s extraction list I stand behind vs need to walk back. The final piece is mine. The AIs produced material. I assembled it.
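The three AI roles can be sketched as prompt templates. The wording below is illustrative, not my exact production prompts, and the regulator role has no template because it stays human:

```python
# Illustrative prompt templates for the three AI stages.
# The exact wording is a sketch, not the prompts I actually run.
ROLE_PROMPTS = {
    "drafter": (
        "You are drafting first-pass material for a technical post.\n"
        "Topic brief:\n{brief}\n"
        "Explore the idea, draft sections, and suggest angles."
    ),
    "critic": (
        "Critique this draft adversarially. What's weak? What's "
        "overclaimed? What would a skeptical reader push back on?\n"
        "Draft:\n{draft}"
    ),
    "extractor": (
        "List every factual claim this article makes, one per line, "
        "without interpretation.\n\nArticle:\n{draft}"
    ),
}

def build_prompt(role: str, **kwargs) -> str:
    """Fill in the template for one orchestra role."""
    return ROLE_PROMPTS[role].format(**kwargs)
```

Keeping the prompts in one place also makes the division of labor explicit: each stage gets exactly one job.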

Why three different AIs and not three Claudes

A fair question: why not just query Claude three times in three different roles?

Two reasons.

Models disagree in useful ways. Claude and GPT have meaningfully different priors. A critique from GPT of a Claude draft catches different things than a Claude-of-Claude critique would. The friction between their biases is the point. Homogenizing the orchestra to one model loses that friction and collapses back into single-AI drift.

Different models expose different failure modes. GPT hallucinates differently from Claude. Claude’s stylistic tics differ from GPT’s. If all three stages use the same model, the same blind spots carry through all three. Mixing models means each stage has different blind spots, so a flaw that slips one is likely caught by another.

The smaller local model in the third role is also different on purpose. Local Mistral at 7B isn’t going to write nuanced prose. That’s not its job. Its job is to extract, without embellishment, what a draft claims. Smaller models are better at this narrow task than larger ones, because they’re less tempted to “help” by adding interpretation.

The regulator role, which is the actual skill

Here’s the part most “AI workflow” posts skip, because it doesn’t sound exciting.

The three AIs don’t produce the final work. They produce material. The value of the orchestra comes from what the regulator does with the material.

My regulator work breaks into four jobs.

Scope setting. Before any AI does anything, I decide what the piece is, who it’s for, what claim it’s making. I don’t delegate this. If I let an AI set scope, I get whatever the AI’s priors think is a reasonable post on the topic, which is almost always more generic and less opinionated than what I actually want.

Bias watching. When GPT critiques a Claude draft, some of its criticisms are good-faith catches. Some are GPT projecting its own priors onto the draft. I have to tell which is which. Same the other way. Bias detection across models is a specific skill I’ve been developing and it gets better with practice.

Claim ownership. When the compiled claim list comes back, I read each claim and ask: do I actually believe this, would I defend it, is it grounded in experience or did it just sound right? I delete claims I’m not willing to stand behind. This pruning is load-bearing. Without it, the orchestra produces plausible prose full of assertions I don’t actually own.

Voice regulation. Raw AI output doesn’t sound like me. Edited AI output can sound like me, if I spend the edit time. Every piece that goes up has been rewritten in my voice, with the AIs’ output as raw material rather than finished product. Readers can tell when this step is skipped.

What the workflow looks like in practice

A single post, start to finish:

  1. I have an idea. I write 200 words to Claude describing the idea, what I want the post to do, who it’s for, the angle I want.
  2. Claude produces a 2000-word first draft. I read it. Most of it is not what I’d write. Some of it is better than what I’d have written.
  3. I rewrite the draft in my voice, cutting what doesn’t fit, keeping what does, adding personal anecdotes Claude didn’t have, moving sections around.
  4. I paste the rewritten draft to GPT and ask for adversarial critique. Specific prompt: “What’s weakest about this? What would a technical reader push back on? Where am I overclaiming?”
  5. GPT returns a list of concerns. I evaluate each one. Maybe three of ten are real issues I should fix. The other seven are GPT being overcautious or missing context. I fix the real ones.
  6. I run a local model over the near-final draft to list every concrete claim. I read the list, confirm I stand behind each one, revise anything I’m not certain of.
  7. Final pass: I read the whole piece out loud. If any sentence makes me wince, I rewrite it.
  8. Publish.
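The AI stages of those eight steps reduce to a small pipeline once you treat each model as a pluggable function. This is a sketch, not a real implementation: every `*_fn` name is hypothetical and stands in for a call to the corresponding model’s API, and `regulate` stands in for the human rewrite, which in practice is interactive, not a function call:

```python
from typing import Callable

def run_orchestra(
    brief: str,
    draft_fn: Callable[[str], str],       # stand-in for a Claude API call
    critique_fn: Callable[[str], list],   # stand-in for a GPT API call
    extract_fn: Callable[[str], list],    # stand-in for a local Mistral call
    regulate: Callable[[str], str],       # the human step: rewrite in your voice
) -> dict:
    """Run the three AI stages around the human regulator.

    Steps 1, 7, and 8 (scoping, the read-aloud pass, publishing)
    happen outside this function; they don't automate.
    """
    draft = draft_fn(brief)             # step 2: first-pass draft
    rewritten = regulate(draft)         # step 3: human rewrite
    concerns = critique_fn(rewritten)   # steps 4-5: adversarial review
    claims = extract_fn(rewritten)      # step 6: claim compilation
    return {
        "draft": rewritten,
        "concerns": concerns,   # regulator decides which are real
        "claims": claims,       # regulator decides which to stand behind
    }
```

The point of the injection-style signature is that no stage is tied to one provider; swapping GPT for another reviewer means swapping one callable.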

Total time: usually 2-3 hours per serious post. Solo (just me writing from scratch), it’d take 6-8 hours and likely come out worse. Single-AI (me and Claude only), it’d take about 3 hours and carry specific Claude tells a reader could spot.

The orchestra runs in 2-3 hours and produces work that passes multiple different blind-spot checks.

The orchestra’s weaknesses (and I will own them)

Let me be specific about where this workflow falls short.

It’s expensive. Three API subscriptions or heavy usage on free tiers. For someone doing this for fun, that’s real money. For someone building a site, it pays back. For someone with a tight budget, the right answer is Claude-plus-me plus a commitment to self-edit ruthlessly.

It has diminishing returns for short content. Writing a tweet doesn’t need an orchestra. Writing an email doesn’t. The orchestra earns its keep on pieces longer than about 800 words, or pieces where correctness matters, or pieces where the voice has to be consistent.

It can produce a “committee” feel if the regulator is lazy. If I take the AIs’ outputs too literally instead of regulating them, the piece reads like it was written by compromise. Bad writing. The regulator has to assert taste, not just average inputs.

It’s my workflow, not a universal prescription. If you’re writing fiction, the orchestra is overkill. If you’re doing customer support work, overkill. If you’re writing safety-critical material, probably not enough (you need real review from qualified humans, not AIs). It’s calibrated for “serious technical content with real claims and a consistent voice.”

Why I disclose the whole workflow

Hiding AI assistance is a credibility trap. If someone runs an AI detector on posts that don’t disclose, and the detector fires (which it does, accurately or not), the story is “caught using AI.” If someone runs the detector on a disclosed post and it fires, the story is “yep, AI was involved, as stated, moving on.” Same technical fact, completely different read.

Also: the regulator role is the valuable part. If I’m open about the workflow, people can evaluate whether the regulator (me) is doing work worth paying attention to. Hiding the workflow lets people imagine me working in some more flattering way than I actually am. That’s short-term good, long-term bad.

Finally: I’m not going to pretend I figured out an entire research project alone when AI is doing a substantial share of the lifting. That’s not honest, and for a site specifically about AI, it would be actively hypocritical.

How to set up your own version

If you want to try this, the minimum viable orchestra is:

One large commercial model (Claude or GPT) for drafting and expansion. Subscription to one is enough to start.

A second model, ideally from a different provider, for adversarial review. If budget is a concern, use free-tier access rotated across providers. The key property is that it’s a different model, not another instance of the first.

A simple local setup or a third API for extraction and compilation. I use Mistral locally because I already have a local GPU setup, but this role can be handled by any separate API call with an “extract claims, no interpretation” prompt.
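For the local extraction role, the call can be as small as one request to an Ollama-style endpoint. A sketch, assuming an Ollama server running locally with a Mistral model pulled; adapt the host, endpoint, and model name to whatever local runtime you actually use:

```python
import json
import urllib.request

EXTRACTION_PROMPT = ("List every factual claim this article makes, "
                     "without interpretation.\n\n")

def build_extraction_payload(text: str, model: str = "mistral") -> dict:
    """Build the request body for a non-streaming generate call."""
    return {"model": model,
            "prompt": EXTRACTION_PROMPT + text,
            "stream": False}

def extract_claims(text: str, host: str = "http://localhost:11434") -> str:
    """POST the extraction prompt to a local Ollama-style
    /api/generate endpoint and return the model's text response."""
    req = urllib.request.Request(
        host + "/api/generate",
        data=json.dumps(build_extraction_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Nothing here is specific to Mistral; the narrow, literal-minded prompt is doing the work, which is the whole reason a small model suffices for this role.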

Your own taste and time as the regulator. This is the expensive resource. If you don’t invest it, the orchestra produces mush.

Start with two (draft + adversarial). Add the third (extraction) once you’ve used two for a few pieces and see where the gap is. Don’t add more than three without a specific reason. Adding more models doesn’t linearly improve output; it just takes more of your regulator time.

Closing thought

The single-AI workflow narrative is simpler and gets you to “I wrote this with AI” faster. But if you’re doing serious work, the orchestra produces better results per unit of time, catches more real flaws, and reads less like AI prose. It requires more regulator discipline than a one-model loop, which is why most people don’t use it.

The regulator is the job. The AIs are the instruments. Anybody telling you otherwise is selling you a workflow where the regulator role is underplayed, usually because the person selling it hasn’t done the regulator work themselves.

If you read this site and the voice feels like one person’s voice, that’s because it is. Three AIs helped. One person decided.