Over-Engineering AI: A Post-Mortem on "Let's Add More Features"

I spent months building a companion with telemetry, phase detection, and 12 behavioral archetypes. It worked worse than a 200-line version with none of that. Here's what I learned about when sophistication is the enemy.

In March 2026 I had what I thought was my best AI companion design yet. Call it AIO Companion. It had:

  • A telemetry pipeline tracking 9 signals about the user: response latency, typo density, message length, hedge-word frequency, question ratio, negation patterns, vocabulary complexity, sentiment, topic abandonment.
  • A phase detector mapping those signals to roughly 12 recurring emotional/contextual archetypes.
  • Transition-probability forecasting: given you’re in phase X, predict likelihood of moving to phase Y.
  • Causal probes to disambiguate phases when signals were mixed.
  • A context packer doing “30B survivability” work, keeping internal resolution high while emitting low-entropy model input.
  • Structured memory with four taxonomies: corrections, constraints, soft preferences, positive examples.
  • Chat history sidebar with read-only session isolation.
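For a sense of scale, one telemetry sample per user message looked roughly like this (a reconstructed sketch; the field names and types are mine, not from the original codebase):

```python
# One row of the 9 per-message signals the pipeline tracked.
# Field names and units are illustrative, not the original schema.
from dataclasses import dataclass

@dataclass
class TelemetrySample:
    response_latency_s: float   # seconds between bot message and user reply
    typo_density: float         # typos per 100 characters
    message_length: int         # characters in the user message
    hedge_word_freq: float      # "maybe", "I guess", ... per sentence
    question_ratio: float       # fraction of sentences that are questions
    negation_rate: float        # "no" / "not" / "never" per sentence
    vocab_complexity: float     # e.g. mean word length or type-token ratio
    sentiment: float            # -1.0 .. 1.0
    topic_abandoned: bool       # did the user drop the previous topic?
```

Nine floats and a bool per message, times a phase detector, times transition forecasting: a lot of machinery pointed at the user.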

It was beautiful. I spent weeks on it. I wrote design docs. I felt like a serious engineer.

It didn’t work.

Not in a small way. In the way where the bot felt less alive than my weekend hack.

What the symptoms looked like

  • The bot would respond competently to most messages but miss the emotional register.
  • It would occasionally make a phase-appropriate choice that felt uncanny, then three messages later be generic again.
  • When I turned features off one by one to debug, the bot got better, not worse.
  • Eventually I turned off most of the machinery and the bot became usable.

That last one is the important data point. Every piece of sophistication I added made it worse. What was going on?

The diagnosis, six weeks later

I was solving the wrong problem. Every piece of that architecture was aimed at understanding the user. None of it was aimed at anchoring the agent.

Here’s the thing. The user is generally fine. They’ll type whatever they type. You don’t need to model their typo density to respond to them well. The model is already good at reading users. That’s what LLMs are for.

What LLMs are not good at is maintaining a consistent identity across a long conversation. That’s the actual hard problem of building an AI companion. And none of my machinery was addressing it.

I had built an elaborate observer, pointed at the wrong thing.

Why the sophistication backfired

Three reasons.

1. Diffuse identity is no identity.

My structured memory stored the agent’s character across corrections, constraints, soft preferences, and positive examples. The idea was that richer memory would produce a richer persona. In practice, it meant the persona was spread across dozens of memory entries, each one pulling in slightly different directions. When the model read them all, it averaged. Average of many directions is nothing.

A concentrated, tight persona spec (~250 tokens, re-injected every turn) produced a stronger character than my distributed 3000-token memory structure. Tight beats rich.
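The contrast is easy to see side by side. This is an illustrative sketch (the entries and the spec are invented, not my actual memory contents): the distributed memory rendered dozens of small, slightly conflicting entries into the prompt, while the tight spec says the same thing once, in one voice.

```python
# Invented examples of the four memory taxonomies, rendered the way
# the distributed version fed them to the model.
DISTRIBUTED = [
    "correction: don't open with 'Great question!'",
    "constraint: keep answers under 4 paragraphs",
    "soft preference: light sarcasm is okay sometimes",
    "positive example: 'Honestly? No idea. Let's find out.'",
    # ... dozens more, each pulling in a slightly different direction
]

# One voice, one direction. The real spec was ~250 tokens.
TIGHT_SPEC = (
    "You are Ada: dry, direct, curious. No filler, no cheerleading. "
    "Say 'I don't know' plainly. Four paragraphs max. Light sarcasm welcome."
)

def render_distributed(entries):
    """How the old version flattened memory into the prompt."""
    return "Remember:\n" + "\n".join(f"- {e}" for e in entries)
```

Read `render_distributed(DISTRIBUTED)` as a model would: a list of directives to average over. Read `TIGHT_SPEC` as a model would: a character.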

2. Telemetry-as-input, but for the wrong consumer.

The telemetry pipeline was smart. But it was feeding into agent steering logic: “user is in phase 7, adjust response style accordingly.” Steering the agent with reactive signals means the agent’s identity gets continuously perturbed by the user’s state. The user has a bad morning, the bot’s personality shifts. That’s not a companion. That’s a mirror.

The right use of telemetry is to inform the persona compiler, not steer the agent directly. Let the compiler decide whether to tweak context (“user seems tired, bias toward shorter responses today”) and keep the agent stable. Same data, different consumer, completely different results.
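"Same data, different consumer" can be sketched in a few lines (function and signal names are mine): telemetry folds into the compiled spec offline, and the per-turn agent only ever sees that stable spec.

```python
# Sketch of telemetry feeding the persona compiler instead of the agent.
# The compiler runs slowly (say, once per day or per session), so the
# agent's input is stable for the whole conversation.
def compile_persona(base_spec: str, signals: dict) -> str:
    """Fold slow-moving telemetry into the spec; never mutate mid-chat."""
    spec = base_spec
    if signals.get("avg_message_length", 100) < 30:
        spec += " Bias toward shorter replies today."
    if signals.get("sentiment", 0.0) < -0.5:
        spec += " Be a little gentler than usual."
    return spec
```

The agent-steering version this replaces would have applied the same conditions per message, perturbing the personality every turn. Here the data still gets used; it just can't shake the identity mid-conversation.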

3. Rule accumulation is a regression to the mean of “weird.”

Every time the bot did something wrong, I added a correction. Over weeks this produced dozens of rules. The rules addressed specific symptoms. They didn’t address the underlying cause, because I didn’t know the underlying cause.

What you get from that process is a bot that’s been patched against specific failures but still fails generically. Rules don’t compose into identity. They compose into a Frankenstein of restrictions that still has no coherent center.

The simpler thing that worked

I rewrote the companion in about 300 lines of Python.

  • A 250-token persona spec, flat text. Re-injected every turn as the system message.
  • A simple facts-only memory, key-value.
  • No observer. No phase detection. No transition probabilities. No telemetry pipeline. No context packer.
  • The chat client sends: persona spec, last 5 turns of conversation, new user message. Gets response.
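The whole loop above can be sketched in under 30 lines (names are mine and the model call is stubbed; the original passed the messages to a real chat API):

```python
# Minimal sketch of the rewritten client: tight spec + facts memory +
# short history tail, re-assembled fresh every turn.
PERSONA_SPEC = "You are Ada: dry, direct, curious. ..."  # ~250 tokens in practice

facts = {"user_name": "Sam", "timezone": "UTC-5"}  # facts only, never identity
history = []  # list of {"role": ..., "content": ...} dicts

def build_prompt(user_message, max_turns=5):
    """Persona first, then a short tail of history, then the new message."""
    system = PERSONA_SPEC + "\nKnown facts: " + "; ".join(
        f"{k}={v}" for k, v in facts.items()
    )
    messages = [{"role": "system", "content": system}]
    messages += history[-max_turns * 2:]   # last 5 turns, both sides
    messages.append({"role": "user", "content": user_message})
    return messages

def respond(user_message, call_model):
    """call_model: messages -> reply string (your LLM client goes here)."""
    reply = call_model(build_prompt(user_message))
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": reply})
    return reply
```

No observer, no phases, no router. The persona lands at full strength every single turn because nothing dilutes it.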

That version held personality longer, felt more alive, and was easier to debug. I had 10x less code and 10x better results.

The lesson: sophistication earns its place or gets cut

This is the hardest lesson for builders who like to build. You want to add the clever thing. The clever thing feels like engineering. But:

Complexity you add to an AI system has to justify itself against the baseline of “do nothing and re-inject a short spec every turn.”

That baseline is shockingly hard to beat. If your fancy feature isn’t making things better than the baseline, it’s making them worse, because it’s adding noise without adding signal.

Before you add any feature to an AI system, ask:

  1. Does this make identity more stable or less stable across long conversations?
  2. Does it steer the agent, or inform a compiler that produces the spec?
  3. Does it reduce the size of the identity spec, or fragment it?
  4. If I removed it, would the baseline (tight spec + re-injection) still work?

Most features you want to add will fail at least one of these tests. Those features are net-negative. Cut them.

When sophistication does earn its place

Not all complexity is bad. Here’s what I kept after the post-mortem:

  • Telemetry as input to the persona compiler, not the agent. If the compiler tweaks tomorrow’s spec based on today’s signals, fine. That’s a separate channel from agent steering.
  • Chat history with session isolation. UX pattern, not identity machinery. Good for the user, doesn’t corrupt the bot.
  • Facts memory. Still valuable. Just don’t let it store identity.

That’s it. The telemetry pipeline went in the trash. Phase detection went in the trash. Transition probabilities, causal probes, context packer, structured identity memory: all in the trash.

A broader pattern

I see this in other people’s AI projects constantly. Someone builds a companion, notices drift, and responds by adding machinery. Observer systems. Rule engines. Memory taxonomies. Context routers. Multi-agent orchestrators. Each addition feels like progress.

It’s not progress. It’s mass in the wrong direction.

The hard problem is identity stability. Everything that doesn’t directly address that problem is decoration. Decoration is fine in finished systems. Decoration on a foundation that’s still wobbling makes the wobble worse, because now you can’t even see the foundation through the decoration.

If your AI system isn’t working, the fix is probably not a new module. The fix is probably to strip it down to 300 lines and rebuild the foundation.

I know. It hurts. Do it anyway.