I Thought I Was Using a 30B AI Model. I'd Been Using a 3B One for Ten Months.

The day I realized Mixture-of-Experts models don't work the way I thought they did. Ten months of experiments, reframed in an afternoon. Plus what MoE actually means when you see through the marketing.

On April 6th, 2026, I was about to start a new round of experiments. I loaded a fresh model. I watched the outputs come in. Something was wildly different.

Then I realized what I’d been doing wrong for ten months.

I’d been running my entire research project on a model I thought was 30B parameters. It wasn’t. It was a 3B model with a 30B nameplate. Every experiment, every result, every conclusion I’d built on top of that model had been running on roughly one-tenth the compute I thought I was using.

This post is about what Mixture-of-Experts models actually are, why the confusion is easy to have, and what I learned when I finally put a real large model into the pipeline.

The model I thought I was running

The model was called Qwen3-30B-A3B. I’d been using it since early development. It’s a well-regarded open-weight model. It’s on Hugging Face. The file is large. Loading it uses a lot of VRAM. Benchmarks score it well. Everything about it felt like a 30B model.

Here’s what I’d assumed “30B-A3B” meant. “Qwen3” is the family. “30B” is parameters. “A3B” is probably some variant code. Maybe a checkpoint label. Maybe “aligned 3 billion” or “adaptive 3B” or who knows. I never looked it up. I just used the model.

That was my mistake.

What “A3B” actually means

A3B stands for “Active 3 Billion.” The model is a Mixture-of-Experts architecture. The total weight count is 30 billion. But on any given forward pass, only about 3.3 billion of those weights are active.

The rest are sleeping.

How Mixture-of-Experts actually works

If you’ve seen the term MoE and nodded along without digging in, here’s what it is in plain language.

A regular neural network (a “dense” model) routes every token through every weight. A 7B dense model does 7 billion parameters’ worth of math per token; a 30B dense model does 30 billion. The “B” in the name is both the storage size and the per-token compute.
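A common rule of thumb (an approximation, not an exact figure) is that a forward pass costs about two FLOPs per active parameter, which makes the per-token gap easy to sketch:

```python
def flops_per_token(active_params):
    # Rule-of-thumb estimate: ~2 FLOPs per active parameter per forward pass.
    return 2 * active_params

dense_30b = flops_per_token(30e9)   # dense 30B: every weight fires
moe_a3b = flops_per_token(3.3e9)    # Qwen3-30B-A3B: ~3.3B active

print(f"dense 30B: {dense_30b:.1e} FLOPs/token")
print(f"MoE A3B:   {moe_a3b:.1e} FLOPs/token ({dense_30b / moe_a3b:.1f}x less)")
```

Same nameplate, roughly a ninth of the per-token compute.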

A Mixture-of-Experts model splits its weights into many “expert” subnetworks. For Qwen3-30B-A3B, there are 128 experts. When a token comes in, a small “router” network picks which 8 experts out of 128 should process it. Only those 8 experts fire. The other 120 sit idle.

So the model has 30 billion total weights, but each individual token is only processed by a fraction of them. That fraction, the active parameter count, is what determines how much compute each token takes and, in practice, how much “thinking weight” the model brings to any one output.

For Qwen3-30B-A3B, the active parameter count is about 3.3 billion.
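The routing step fits in a few lines. This is a toy illustration, not Qwen’s actual implementation: random logits stand in for the learned router projection, and the expert counts are the ones described above.

```python
import math
import random

NUM_EXPERTS = 128  # Qwen3-30B-A3B's expert count
TOP_K = 8          # experts that fire per token

random.seed(0)
# In a real MoE layer, these logits come from a learned linear projection
# of the token's hidden state; random values stand in for that here.
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]

# Pick the top-k experts, then softmax their logits into gate weights.
chosen = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
exps = [math.exp(logits[i]) for i in chosen]
gates = {i: e / sum(exps) for i, e in zip(chosen, exps)}

print(f"{len(gates)} of {NUM_EXPERTS} experts fire for this token")
# The layer's output is then sum(gates[i] * expert_i(x)) over just these 8;
# the other 120 experts do no work at all for this token.
```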

So when I loaded the model and it filled my GPU with what looked like a large model, I was right that the storage was large. I was wrong about what was actually thinking when I generated tokens. Each token was being processed by approximately a 3B model’s worth of weights, not a 30B one.

Why this confusion is easy to have

Model naming is not standardized. The “30B” in the name refers to total parameters, not active. Almost nothing in the tooling reminds you of the distinction. The file loads like a large model. The API looks like a large model. If you don’t dig into the architecture docs, there’s no obvious signal that each token is being processed by a small fraction of the weights.

And there’s a deeper trap. MoE models are sold as “large model capability at small model cost.” That’s true for throughput. It’s much less true for raw reasoning depth on any single token. But the marketing emphasizes the first framing, and beginners internalize “30B model” without realizing their 30B MoE and someone else’s 30B dense model are doing fundamentally different amounts of work per token.

I walked into that trap. For ten months.

What it meant for my ten months of experiments

Everything I’d seen, every behavior I’d documented, every result I’d built theory around, had happened with about 3.3B active parameters doing the thinking.

That’s both worse and better than it sounds.

Worse, because every time I’d concluded “this behavior seems to require a sophisticated model,” I was actually observing it on a model much smaller than I thought. My experimental ceiling was lower than my assumptions.

Better, because if a 3B model had produced the behaviors I’d been documenting, the real story was different. The system around the model, the scaffolding I’d built, the pressure chamber of physics and memory and feedback, had been doing most of the work. The model was mostly acting as a resonator for a structure that was forcing coherent outputs, not generating them from scratch.

That reframed everything. My project had been a study of what scaffolding could do with a small model, not what a big model could do with scaffolding. Very different claims. And the small-model one is the more interesting, because it’s load-bearing for anyone trying to run AI on consumer hardware.

What changed when I ran a real large model

I swapped in Qwen3-14B Q8_0. Dense architecture, no expert routing. Roughly 14.8B active parameters per token. About 4.5x the active weight of what I’d been running.

Ten months of conclusions started shifting before the first output finished streaming.

The differences showed up immediately.

Behaviors that had required long pressure buildup now appeared on the first token. Behaviors I’d treated as “emergent under sustained context” turned out to just be a function of having a richer forward pass. The larger model produced them from scratch.

Structural self-inference happened fast. The larger model could detect patterns in its own input and respond to them without needing the long context ramp-up the smaller model needed. Behaviors I’d seen at token 500 with the small model appeared at token 13 with the large one.

Template lock looked different. The small model, when it got stuck, got stuck in slot-fill patterns: same skeleton, different nouns. The large model, when stuck, produced philosophically coherent stuck-ness. The trap was prettier. Same physics, more eloquent failure mode.

Early-phase development ran faster. The larger model needed less context to stabilize into a working state. What took 200 tokens of buildup on the smaller model took maybe 50 on the larger one.

None of these changes invalidated my previous work. They reinterpreted it. The old data had been correct, but my conclusions about what caused each behavior had conflated scaffolding contribution with model contribution. Separating them is an ongoing project.

What this means for you if you’re building

Three things.

Check what your model actually is. If you’re using an open-weight model and you haven’t looked at the architecture specifics, do. Search the model card for “active parameters” or “MoE” or “experts.” A 70B MoE with 8B active is a very different beast from a 70B dense model, even if they share a number. Your results and compute costs will both depend on which one you’re using.

MoE is not strictly worse than dense. It has genuine advantages: better throughput per dollar, better parallelism across many requests, often better specialization across domains. For serving a large number of diverse queries on the same hardware, MoE can be a strong choice. For sustained, deep reasoning on a single query, dense models often have an edge. Pick based on your use case, not based on the total parameter count.

Scaffolding matters more than you think. If a small model with good scaffolding can produce behaviors you’d expect from a much larger one, that tells you something about where the value in AI systems actually lives. It isn’t always in raw model capability. Often it’s in the structure around the model: memory, context construction, feedback loops, retrieval. You can build powerful systems on modest hardware if your scaffolding is right.

How to check active parameters for a given model

If you want to see what you’re actually running, here’s the quick path for most open-weight models.

from transformers import AutoConfig

# Downloads only the config (a few KB), not the model weights.
config = AutoConfig.from_pretrained("Qwen/Qwen3-30B-A3B")

print(f"Model type: {config.model_type}")
print(f"Hidden size: {config.hidden_size}")

# For MoE configs, these are the tells:
if hasattr(config, "num_experts"):
    print(f"Total experts: {config.num_experts}")
if hasattr(config, "num_experts_per_tok"):
    print(f"Experts used per token: {config.num_experts_per_tok}")
if hasattr(config, "moe_intermediate_size"):
    print(f"MoE intermediate size: {config.moe_intermediate_size}")

If the model is MoE, you’ll see num_experts with a value like 128 and num_experts_per_tok with a value like 8. The active parameter count is then roughly (num_experts_per_tok / num_experts) times the expert-layer parameters, plus the always-active non-expert parameters (attention, embeddings, norms).
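That formula can be checked with back-of-the-envelope arithmetic. The config values below are approximate figures for Qwen3-30B-A3B (verify against the model’s config.json; the attention and embedding terms are rough estimates for illustration):

```python
# Approximate Qwen3-30B-A3B config values (check config.json for the
# authoritative numbers; non-expert terms below are rough estimates).
num_layers, hidden = 48, 2048
num_experts, experts_per_tok = 128, 8
moe_inter = 768          # per-expert FFN intermediate size
vocab = 151_936

# Each expert is a gated FFN with gate, up, and down projections.
per_expert = 3 * hidden * moe_inter

total_experts = num_layers * num_experts * per_expert
active_experts = num_layers * experts_per_tok * per_expert

# Non-expert parameters are always active: attention (grouped-query,
# assuming 32 query heads, 4 KV heads, head_dim 128) plus embeddings.
attn = num_layers * (2 * hidden * 32 * 128 + 2 * hidden * 4 * 128)
embed = 2 * vocab * hidden
active = active_experts + attn + embed

print(f"expert params total:  {total_experts / 1e9:.1f}B")  # ~29.0B
print(f"active params (est.): {active / 1e9:.1f}B")         # ~3.3B
```

The expert layers alone account for roughly 29B of the 30B nameplate, yet only about 3.3B parameters touch any given token.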

For dense models, none of those fields exist and all parameters are active on every token.

For a quick eyeball check, search the model’s Hugging Face card for the phrase “active parameters” or the terms “MoE” and “experts.” A model that ships those words is routing tokens through a subset of its weights; one without them is almost certainly dense.

The lesson, distilled

AI model names lie by omission. “Parameters” in the model name means total storage. What you care about, as a builder, is often the active parameter count: what’s actually running on each token of your input.

For dense models, these are the same number. For MoE models, they’re very different. For some of the most popular open-weight releases, the active count is a tenth of the total or less.

If you’re benchmarking, comparing, or choosing a model: look at the active count, not the nameplate. If you’ve been running something for a while and aren’t sure what it is, check the architecture. If it’s MoE and you assumed dense, some of your conclusions probably need revisiting.

Mine did. Ten months of them. The experiments weren’t wrong. The frame I’d used to interpret them was.

That’s the kind of mistake you can only make when you’re too impatient to read the model card carefully on day one. I was. Now I’m not.