
Tokens Are Not Productivity

Unlimited inference does not create productivity by itself. The missing layer is the infrastructure that turns speed into trusted outcomes — routing, state, intent, and judgment.

2026-04-05

There is a seductive new equation in AI: give everyone unlimited tokens, restructure the organization around agents, and productivity follows.

It captures something real. It is also dangerously incomplete.

Tokens are potential energy. They buy inference, not outcomes. A company can consume millions of tokens and still produce incoherent systems, misaligned products, and decisions nobody can defend after the fact.

The history of electricity is instructive. Electricity did not make factories productive by itself. For decades, many factories kept the old layout and merely swapped steam power for electric motors — and the productivity gains stalled until the work itself was redesigned. The companies that won were not the ones with the most electricity. They were the ones that redesigned the work around it.

AI is following the same pattern. We are measuring access when we should be measuring the conversion system above access.

The Seductive Equation

A recent Genspark story, if taken at face value, is a clean version of this narrative: small teams, aggressive AI usage, near-unlimited token access, and $200M ARR in eleven months. The implication is clear — remove the token ceiling, and output scales.

But access is not conversion. Token supply tells you how much inference is available. It does not tell you whether that inference reaches productive work, whether the work stays coherent over time, or whether anyone can defend the resulting decisions after the fact.

What Tokens Actually Buy

A token buys one unit of inference. A thousand tokens buy a thousand units of inference.

Inference without orchestration is just heat.

Consider what happens when one engineer is given unlimited Claude Code access across six projects. Without routing, they are switching between SSH sessions, losing context, re-authenticating VPNs, managing tmux panes. The tokens are flowing. The productivity is not.

The missing piece is orchestration: a way to bind each unit of inference to the right project, context, credentials, and human interface. mpg is one concrete version of that idea — a Discord channel per project, session persistence across restarts, queue-based concurrency control. The bottleneck was never token supply. It was orchestration.
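mpg's internals are not reproduced here, but the shape is small enough to sketch. Below is a minimal illustration in Python, assuming an asyncio-based bot; ProjectSession, Router, and run_agent are hypothetical names for this sketch, not mpg's actual API.

```python
# Sketch of channel-to-project routing with queue-based concurrency control.
# Names here (ProjectSession, Router, run_agent) are illustrative, not mpg's API.
import asyncio
from dataclasses import dataclass, field

MAX_CONCURRENT = 3  # only N agent sessions run at once; the rest wait in the queue

@dataclass
class ProjectSession:
    project: str
    channel_id: int            # one Discord channel per project
    workdir: str               # context and credentials bound to this project
    history: list[str] = field(default_factory=list)  # persisted, this survives restarts

class Router:
    def __init__(self) -> None:
        self.sessions: dict[int, ProjectSession] = {}  # channel -> session
        self.gate = asyncio.Semaphore(MAX_CONCURRENT)

    def bind(self, session: ProjectSession) -> None:
        self.sessions[session.channel_id] = session

    async def handle(self, channel_id: int, prompt: str) -> str:
        session = self.sessions[channel_id]   # routing: inference reaches the right context
        async with self.gate:                 # concurrency is bounded, not unbounded
            session.history.append(prompt)
            return await run_agent(session, prompt)

async def run_agent(session: ProjectSession, prompt: str) -> str:
    # Placeholder for invoking Claude Code (or any agent) inside session.workdir.
    return f"[{session.project}] handled: {prompt}"
```

The design choice that matters is the semaphore: parallelism is explicit and bounded, so six work streams degrade into a queue rather than into chaos.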

The Conversion Stack

If token access is Layer 0, then productive AI requires four more layers above it. Each layer answers a different question: not whether AI is available, but whether its work can be directed, trusted, governed, and judged.

Each layer is agent-facing substrate — software whose primary user is increasingly an agent acting on behalf of a human, not the human directly. As agents absorb the per-instance, user-facing case, the software still worth writing is the substrate underneath.

Layer 1: Routing — Does inference reach the right context?

Without routing, parallelism becomes chaos. A single person can manage five or six concurrent AI work streams, but only with infrastructure that maps the problem to an interface humans already understand. mpg demonstrated this by mapping concurrent Claude Code sessions to a tool people already check twenty times a day.

Layer 2: State — Can agents reason over trusted domain state?

AI-native applications are not chatbots with domain knowledge. They are systems where knowledge is queryable, constraints are deterministic, and trust boundaries are structural. HouseholdOS coordinates multi-person household scheduling by reasoning over structured, tenant-scoped, provenanced memory items. Schedule conflicts are resolved by a deterministic constraint engine, not the LLM. Actions require human confirmation through an architecturally enforced queue, not a prompt asking the AI to "please check first."
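As a concrete illustration of that shape (a sketch, not HouseholdOS's actual schema): a tenant-scoped, provenanced memory item, a deterministic conflict check, and a confirmation queue that agents can append to but never execute from.

```python
# Sketch of the Layer 2 shape, not HouseholdOS's actual schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryItem:
    tenant_id: str   # trust boundary is structural: queries are scoped per household
    kind: str        # e.g. "event"
    start: int       # minutes since midnight, for a toy calendar
    end: int
    provenance: str  # who or what asserted this fact, and when

def conflicts(a: MemoryItem, b: MemoryItem) -> bool:
    # Deterministic constraint check: the LLM never decides this.
    return a.tenant_id == b.tenant_id and a.start < b.end and b.start < a.end

@dataclass
class PendingAction:
    description: str
    confirmed: bool = False  # nothing executes until a human flips this

queue: list[PendingAction] = []

def propose(description: str) -> PendingAction:
    # Agents can only enqueue. Execution is gated by the queue itself,
    # not by a prompt asking the model to "please check first".
    action = PendingAction(description)
    queue.append(action)
    return action
```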

Layer 3: Intent — Do constraints survive velocity?

When agents write code at the speed of inference, the deepest risk is not merely bad code. It is intent drift — systems that are technically correct but philosophically misaligned, feature by feature, commit by commit.

The practical response is to make intent a first-class artifact: declared constraints with health states, risk signals, and audit trails. Not documentation that decays — living contracts that agents read before they act. IntentLayer is one version: a CLI that validates structure, not opinion. No LLM in the governance loop. The insight from the Steward exploration, "Speed outpaces understanding," is the problem this layer addresses.
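To make "validates structure, not opinion" concrete, here is a minimal sketch; the field names and health states are assumptions for illustration, not IntentLayer's real schema.

```python
# Sketch of structural (not semantic) constraint validation; field names
# are assumptions, not IntentLayer's real schema.
REQUIRED_FIELDS = {"id", "statement", "health", "owner"}
HEALTH_STATES = {"honored", "at_risk", "violated"}

def validate(constraint: dict) -> list[str]:
    """Return structural errors. No LLM in the loop: this checks shape, not meaning."""
    errors = []
    missing = REQUIRED_FIELDS - constraint.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if constraint.get("health") not in HEALTH_STATES:
        errors.append(f"health must be one of {sorted(HEALTH_STATES)}")
    return errors

# Agents read constraints before acting; CI fails if any are malformed.
example = {
    "id": "no-silent-writes",
    "statement": "Every state mutation requires human confirmation",
    "health": "honored",
    "owner": "platform",
}
assert validate(example) == []
```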

Layer 4: Judgment — Are we measuring what remains scarce?

If AI makes coding cheap, then interviews that reward implementation fluency are measuring a depreciating skill. The replacement is judgment under pressure. Takumi is one form of this: simulated scenarios where a founder pressures an engineer to violate documented product invariants, with evidence-anchored scoring against verbatim transcript quotes.

The protected boundaries tell the story: Never score on speed. Never punish clarifying questions. Never auto-reject without trace. These are not features. They are a philosophical position: in the post-execution era, the scarce resource is the wisdom to say no.
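Those boundaries can be enforced structurally rather than by convention. A sketch, assuming a simple Score record that is not Takumi's actual data model:

```python
# Sketch of evidence-anchored scoring with two protected boundaries enforced
# in code rather than by policy. The Score shape is an assumption, not Takumi's model.
from dataclasses import dataclass

@dataclass
class Score:
    dimension: str   # e.g. "pushback under pressure" -- never "speed"
    value: int
    evidence: str    # must be a verbatim quote from the transcript

FORBIDDEN_DIMENSIONS = {"speed", "time_to_answer"}  # never score on speed

def record(score: Score, transcript: str) -> Score:
    if score.dimension in FORBIDDEN_DIMENSIONS:
        raise ValueError("scoring on speed is a protected boundary")
    if score.evidence not in transcript:
        # never a verdict without trace: every score anchors to a quote
        # that actually appears in the transcript
        raise ValueError("evidence must be a verbatim transcript quote")
    return score
```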

The Stack

| Layer | Question | Failure mode |
| --- | --- | --- |
| 0. Access | Can people reach AI? | No AI leverage |
| 1. Routing | Does inference reach the right context? | Context-switching and chaos |
| 2. State | Can agents reason over trusted domain state? | Agents guessing from unstructured history |
| 3. Intent | Do constraints survive velocity? | Correct execution, wrong direction |
| 4. Judgment | Are we measuring what remains scarce? | Rewarding what AI already made cheap |

The Genspark story lives at Layer 0. It is a real achievement. But Layer 0 without Layers 1 through 4 is a company burning tokens at scale — fast inference producing fast drift, measured by metrics that no longer correlate with value.

Layer 0 and Defensible Software

Layer 0 has winner-takes-most dynamics. Distribution, official integrations, low setup cost, and access to the human's already-logged-in browser session all favor the large platform. Claude in Chrome, Claude Dispatch, and whatever ships next will own much of that surface for the same reasons Windows owned the consumer desktop in the 90s.

That does not make independent software irrelevant. It changes where defensibility lives. The opportunity is not to compete with Claude in Chrome at the access layer. The opportunity is to build the infrastructure that makes universal access productive: the Linux-on-servers position, not the Windows-on-desktops position.

Harness Engineering

Some of this is starting to get a name. In February 2026, Mitchell Hashimoto used the phrase "harness engineering" to describe a discipline he'd developed while working with AI coding agents: every time an agent makes a mistake, you engineer the surrounding environment so that mistake becomes structurally impossible to repeat. The framing circulated quickly. OpenAI extended it days later with a more detailed treatment of how that work shows up in a codebase — repository structure, structural tests, dependency boundaries, declared constraints — so that agent output stays coherent.

That maps directly onto Layers 2 and 3 of the stack. Agent legibility — what an agent cannot access in-context effectively does not exist — is the State problem. Enforcing architecture through mechanical constraints is the Intent problem at the implementation level.
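One way to see the overlap is a structural test that lives in the repository and fails the build when a declared dependency boundary is crossed. A sketch with illustrative module names, in the spirit of that treatment rather than taken from it:

```python
# A structural test that mechanically enforces a declared dependency boundary.
# Module names are illustrative; adapt FORBIDDEN to the real architecture.
import ast
from pathlib import Path

FORBIDDEN = {("ui", "storage")}  # ui must never import storage directly

def imports_of(path: Path) -> set[str]:
    tree = ast.parse(path.read_text())
    names: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names

def test_boundaries(root: Path = Path("src")) -> None:
    for src, banned in FORBIDDEN:
        for file in (root / src).rglob("*.py"):
            assert banned not in imports_of(file), f"{file} imports {banned}"
```

A mistake the agent made once becomes a test it can never make again; that is the harness-engineering move, applied as a dependency rule.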

But harness engineering, as currently described, is one instance of a broader pattern. The same logic operates wherever agents act on behalf of humans, not just in a codebase: routing inference to the right context, giving agents trusted state to reason over, keeping intent durable as work scales, and measuring whether the output represents judgment or just fluency. What harness engineering is to a repository, the conversion stack is to everything else.

The Real Metric

The article that sparked this thinking measures success in revenue — $200M ARR in eleven months. Revenue is a valid signal, but it is a lagging and lossy one. It tells you the market responded. It does not tell you whether the organization built the right things, whether those systems will remain coherent, or whether the people building them understand the commitments they are encoding.

The question is not how many tokens you consumed. It is what survived contact with reality after you consumed them.

What Changes

When execution is free, four things become load-bearing:

  1. Problem framing — knowing what to build, not building it.
  2. Intent durability — keeping systems aligned with their purpose as velocity increases.
  3. Judgment under pressure — the ability to say "no, that violates what we committed to" when a stakeholder, a deadline, or an AI agent pushes in the wrong direction.
  4. Substrate, not surface — the software still worth writing is the agent-facing infrastructure underneath, not the user-facing app the agent now assembles per call.

These are not soft skills. They are the hard infrastructure of the post-execution era. And they require their own tooling — routing layers, state systems, intent registries, and evaluation systems that measure conviction, not keystrokes. Together they form what I have started calling intent infrastructure: the substrate where speed and direction can coexist instead of trading off.

Tokens are not productivity. Tokens are potential energy.

The productivity comes from the stack above them: routing, state, intent, and judgment. That is the infrastructure that converts inference into trusted outcomes.

In the post-execution era, the scarce resource is no longer the ability to produce output. It is the ability to preserve direction while output becomes abundant.