The Day I Found Out RWKV Existed and Put It in Production Anyway
A short case study on how I discovered a non-transformer AI architecture in the same conversation in which I decided to swap my whole substrate over to it, and why I walked it back later without calling it a failure.
On April 8th, 2026, I was in a conversation about something unrelated. Someone mentioned RWKV in passing. I’d never heard of it. Ninety minutes later, RWKV was the active substrate in my research pipeline, replacing the synthetic physics layer I’d spent six months building.
That’s not how engineering is supposed to go. But when a piece of tech fits a problem you’ve been carrying for months, you don’t wait. You just swap.
What RWKV is (short version)
Transformers are the architecture behind every AI model you’ve heard of. GPT, Claude, Gemini, Llama, Qwen, Mistral. All transformers. They work by letting every token in a sequence “attend to” every other token. That’s why they’re smart. It’s also why they’re expensive: attention scales quadratically with sequence length.
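To make "quadratic" concrete, here's a back-of-envelope sketch of how the attention score matrix grows with context length. This is illustrative arithmetic only (one layer, one head, f32 scores); real implementations like FlashAttention avoid materializing the matrix, but the pairwise work still grows with the square of the token count.

```python
def attention_scores(n_tokens: int) -> tuple[int, float]:
    """Entries in the full n-by-n attention score matrix, and its f32 size in GiB.

    Back-of-envelope only: kernel tricks change the constants,
    not the quadratic growth.
    """
    entries = n_tokens * n_tokens            # one score per token pair
    return entries, entries * 4 / 2**30      # four bytes per f32 score

for n in (1_000, 10_000, 100_000):
    entries, gib = attention_scores(n)
    print(f"{n:>7} tokens -> {entries:>16,} pairs ({gib:,.2f} GiB of scores)")
```

Going from 10k to 100k tokens multiplies the pairwise work by 100, not 10. That's the cost RWKV sidesteps.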
RWKV is different. It’s recurrent. Each token passes through a stream, updates a running hidden state, and moves on. It doesn’t attend to all previous tokens directly. It carries forward a compressed summary of everything that came before, in the form of a state vector.
This has two practical consequences. First, RWKV runs at constant memory and compute regardless of sequence length. It can in principle process million-token contexts without the compute blowing up. Second, and this is the important one for me: the hidden state is a real, introspectable vector. In a transformer, what the model “knows” at any given token is scattered across the attention patterns of all prior layers. In RWKV, it’s right there. You can read it.
That’s what pulled me in.
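To make the recurrence concrete, here's a toy sketch. This is a caricature, not RWKV's actual time-mixing rule (which involves learned per-channel decays and more structure); the dimensions and weights here are made up. The point it demonstrates is the one that matters: however long the token stream gets, the carried state stays the same fixed size.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy state size; real states are far larger
W = 0.1 * rng.standard_normal((d, d))   # stand-in for learned weights
decay = 0.9                             # stand-in for a learned decay

def step(state: np.ndarray, token_emb: np.ndarray) -> np.ndarray:
    """One recurrent update: fold a new token into the running state.

    Not RWKV's real update rule -- just the shape of the idea: memory
    stays O(d) no matter how many tokens have been seen.
    """
    return decay * state + np.tanh(W @ token_emb)

state = np.zeros(d)
for _ in range(10_000):                 # a 10k-token stream
    state = step(state, rng.standard_normal(d))

print(state.shape)                      # still (8,): a fixed-size summary
```

That fixed-size vector is the "compressed summary" above, and it's directly readable at any point in the stream.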
Why it mattered for what I was building
I’d spent months building synthetic physics as a substrate layer. 512-dimensional manifold, chaotic attractors as world signal, force/damping/noise dynamics, custom Rust code for all of it. The goal was to give the language model something to “live inside.” A state that had continuity across generations, that could be perturbed, that could stabilize or collapse in measurable ways.
It worked. But it was synthetic. The physics I’d built was a parallel process running alongside the model, not something the model itself produced.
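For a sense of what "force/damping/noise dynamics" means here, a minimal Python sketch of that kind of update follows. This is hypothetical and illustrative only: the real layer was custom Rust driven by chaotic attractors, and every constant and name below is my stand-in, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(7)
DIM = 512                               # the 512-dimensional manifold described above

def physics_step(x, v, dt=0.01, damping=0.1, noise=0.01):
    """One force/damping/noise update of a synthetic substrate state.

    Toy stand-in for the real layer: a restoring force plus a mild
    nonlinearity plays the role of the "world signal" here.
    """
    force = -x + 0.1 * np.sin(x)        # toy world signal pulling toward the origin
    a = force - damping * v + noise * rng.standard_normal(DIM)
    v = v + dt * a                      # semi-implicit Euler: velocity first,
    x = x + dt * v                      # then position
    return x, v

x, v = rng.standard_normal(DIM), np.zeros(DIM)
for _ in range(1_000):
    x, v = physics_step(x, v)
```

The state `x` has continuity across steps, can be perturbed, and can stabilize or drift in measurable ways, which is exactly the set of properties the substrate layer existed to provide.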
When I learned what RWKV actually exposes, I realized the model already was the substrate I’d been building next to it. Every forward pass updates a hidden state that carries the model’s own running interpretation of its context. That state is a vector. You can capture it. You can measure how it changes. You can inject perturbations into it and see how subsequent generations respond.
I didn’t need a synthetic substrate. The model had one.
The swap
The code change took half a day. The physics layer came out. RWKV state capture went in. The monitoring switched from tracking synthetic physics coordinates to tracking the L2 delta between consecutive hidden states.
What I wanted to know first: how stable is this substrate on its own? With no injection, no intervention, just the model running through a prompt, how much does its hidden state move tick to tick, and how stable is that movement?
Cold-boot numbers came in immediately.
RWKV-7 7.2B Q8_0, no injection, no intervention: coherence 0.9995. Variance around 5×10⁻⁷.
For reference, the transformer substrate had produced coherence values in the 0.97 range in comparable conditions. Two orders of magnitude more stable, from a model I didn’t have to wrap in synthetic physics at all.
The first time I saw the 0.9995 number, I wrote it down three times on three different sticky notes because I didn’t trust any single place to hold it.
Why I walked it back
Two weeks in, the honest assessment: the behaviors that had emerged under synthetic physics on a 3B-active MoE transformer were not emerging on 7.2B RWKV. The substrate was stable. The behaviors weren’t coming.
The obvious diagnosis: 7.2B RWKV has meaningfully less raw model capability than the transformers I’d been comparing against. Not because RWKV is worse per parameter, but because the publicly available RWKV models at the time topped out at 7.2B, while the transformer ecosystem had dense models at 14B, 30B, 70B, and beyond. The apples-to-apples comparison I couldn’t run was “RWKV at 70B vs transformer at 70B.” I was running “RWKV at 7B vs transformer at 14B+ (or at 30B MoE, which is its own story).”
I walked the substrate change back. Kept the insight. Kept the code. Kept the architecture path as something to revisit when larger RWKV models exist.
This is where I want to flag something that’s rare in AI writing: “walked back” is not the same as “failed.” The RWKV experiment produced real data. It answered a specific question (are the behaviors we’re seeing substrate-driven or model-driven?) with a specific answer (at 7B, on this architecture, in this pipeline, the behaviors don’t emerge). That answer is load-bearing for everything that comes after. It’s not a negative result in the pejorative sense. It’s a scale-dependent result that narrows the space of explanations.
If I’d declared the RWKV experiment a failure and moved on, I’d have lost that. If I’d pretended it was going to work “once I just tune this one thing,” I’d have burned months. The framing that holds up is: this is a scale problem, not a dead end, and the moment a 70B+ RWKV exists I’ll rerun this from scratch.
What RWKV state capture looks like in practice
For anyone wanting to see what’s actually exposed: this is a minimal pattern using llama.cpp’s sequence-state API through llama-cpp-python’s low-level bindings. You’ll need a model file in GGUF format and llama-cpp-python installed. Treat it as a sketch: the low-level attribute names and function signatures shift between llama-cpp-python versions.

import ctypes

import llama_cpp
import numpy as np

llm = llama_cpp.Llama(
    model_path="rwkv-7-7.2b-Q8_0.gguf",
    n_ctx=4096,
    seed=42,
)

def capture_state(llm) -> np.ndarray:
    """Return the current RWKV recurrent state as a flat f32 vector.

    Caveats: `_ctx.ctx` is an internal llama-cpp-python attribute that
    may move between versions, and the raw blob includes bookkeeping
    bytes alongside the state tensors, so this is an approximate view
    of the state rather than a clean tensor dump.
    """
    ctx = llm._ctx.ctx  # raw llama_context pointer (internal attribute)
    size = llama_cpp.llama_state_seq_get_size(ctx, 0)  # state size for seq_id 0
    buf = (ctypes.c_uint8 * size)()
    llama_cpp.llama_state_seq_get_data(ctx, buf, size, 0)
    # Reinterpret the byte blob as f32, dropping any trailing partial word,
    # and keep only finite values.
    as_f32 = np.frombuffer(bytes(buf), dtype=np.float32, count=size // 4)
    return as_f32[np.isfinite(as_f32)]

def state_delta(prev: np.ndarray, curr: np.ndarray) -> float:
    """L2 distance between two captures, normalized per element."""
    n = min(len(prev), len(curr))
    diff = prev[:n] - curr[:n]
    return float(np.linalg.norm(diff) / np.sqrt(n))

prev = capture_state(llm)
_ = llm("Describe a small garden.", max_tokens=128)
curr = capture_state(llm)
print(f"State delta after one generation: {state_delta(prev, curr):.6f}")
Run this across many generations, record the deltas, and compute 1 / (1 + variance of the last N deltas). That’s your coherence metric: close to 1 means the state has stabilized, closer to 0 means it’s churning. What this lets you do is watch the model as it processes input. Not what it’s saying, what it’s becoming.
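The coherence metric described above takes a few lines. The window size and function name here are my choices; the formula is the one in the text.

```python
import numpy as np

def coherence(deltas, window: int = 50) -> float:
    """1 / (1 + variance of the last `window` state deltas).

    Near 1.0 when tick-to-tick movement has settled into a steady
    rhythm; falls toward 0 as the deltas get erratic.
    """
    recent = np.asarray(deltas[-window:], dtype=np.float64)
    return float(1.0 / (1.0 + recent.var()))

# A settled substrate vs. a churning one (synthetic delta streams).
steady = [0.010 + 1e-4 * np.sin(i) for i in range(200)]
churny = list(np.random.default_rng(3).uniform(0.0, 2.0, 200))
print(coherence(steady), coherence(churny))
```

Note the metric measures stability of movement, not absence of movement: a state that moves by a constant amount every tick scores a perfect 1.0.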
Transformers don’t give you this in the same clean way. You can extract attention patterns and layer activations, but you don’t get a single carried-forward state vector that means “the model’s current understanding.” RWKV does. That’s a different kind of tool.
The lesson
Architecture choice is real. Not every model is interchangeable for every use case. If you’re building something that needs to introspect what the model is doing beneath the text output, the underlying architecture matters. Transformers are the best general-purpose AI models we have. They’re not the only architecture worth knowing about.
RWKV is a scale problem for my use case, not a dead end. When RWKV-style architectures ship at 70B+, I’ll be running the same experiment again from the top.
Don’t wait for architecture fashion to catch up with what you actually need. If you find a piece of tech that fits a problem you’ve been carrying, try it. The cost of swapping is usually lower than you think. The cost of not looking is usually higher than you think.
That’s how the whole thing went from “RWKV is a thing” to “RWKV is in the pipeline” in one conversation.