Yamakei.info

Notes on building reliable software with AI in the loop.

The Arc: From Building With Agents to Directing Them

Five months, nine projects, sixteen essays. The progression was organic: agent as tool, agent as user, agent as collaborator I could measure, and finally agent as proxy for the smaller judgment calls I used to make myself.

2026-04-29

I didn't plan any of this.

Five months ago I started building a pickleball app with AI — first ChatGPT, GitHub Copilot, and Antigravity, then Claude Code, which became the primary harness after a few weeks. I thought I was learning a faster way to ship software. What happened was stranger: each project exposed a new layer of friction, and each layer of friction became the brief for the next project.

First I built software with agents. Then I built infrastructure for agents. Then I built tools to measure whether I was working with them well. Now I am exploring whether an agent can absorb enough context to make smaller judgment calls on my behalf.

Over time, the hard part stopped being getting agents to write code. The hard part became building the environment where agents could be trusted, steered, measured, and eventually delegated judgment.

The arc only became visible afterward.


Phase 1: Building Software with Agents

The project names matter less than the progression. Each one exposed a constraint in how I worked with agents, and the next one tried to remove that constraint.

RallyHub came first — a full-stack pickleball match tracker with identity, offline support, and DUPR integration, built solo in ten days. The follow-up essay, What AI-Assisted Software Development Actually Changes, came out of one observation: implementation time collapsed, but design clarity didn't get any easier. The bottleneck moved.

Moyai followed in January — a privacy-first coordination platform where private wants could be matched with nearby offers without exposing either side prematurely. Built in about a week. It never reached broader use, and the project is paused — not because the build hit a wall, but because getting traction for a coordination product takes a different kind of work, and that bandwidth was already spoken for.

Kokoro was the third — a persona-driven chat app I started after asking my wife what she'd want to build if software development were free. Multi-channel delivery via web and LINE, immutable transcript replay, around a hundred commits co-authored with Claude. It's the project that prompted What Does 'Good Engineer' Mean When AI Builds the Software? — traditional productivity metrics stop measuring anything meaningful when an agent writes most of the code. Like Moyai, Kokoro is currently dormant: the build is shippable, but the work of finding the right users and iterating with them is its own project, and a slower one.

Three projects, three case studies. The lesson the writing kept circling: AI compresses execution time, not judgment time. Everywhere implementation got faster, design ambiguity got more expensive — and the parts AI doesn't compress at all (recruitment, distribution, the patience to learn what a small group of users actually needs) became the visible ceiling on how many bets I could carry at once.


Interlude: From Coding Agents to Resident Agents

Somewhere in this stretch I came across OpenClaw (later Clawbot). It was the kind of demo that quietly resets your mental model: an agent reading email, parsing an invite, and writing to a shared calendar — not as an API call I made, but as a tenant operating inside someone's environment.

The category change matters. Software you invoke is one shape. Software you host is another. Once that distinction landed, every project that came after carried a question Phase 1 hadn't been asking: not just what the agent can do, but where it lives, what it can touch, and how the trust boundary holds.

This is where the arc stops being only about coding agents.


Phase 2: Building for Agents

The shift happened by accident.

I was running Claude Code sessions across five or six projects at once — SSH-ing into a Linux box, juggling tmux panes, hoping nothing broke. The official Discord plugin almost solved it, except one bot pairs with one session. Five projects, five bots. No.

I described the limitation to Claude Code as a problem I was thinking through. It started building. Three to four hours later, Multi-Project Gateway was running — one bot, multiple channels, each channel mapped to a project directory. The full story is in From tmux to Discord and From Message Router to Agent Team.

MPG is the inflection point of this whole arc. Not because it's clever — it isn't, particularly — but because it directly changed how I work with coding agents. Six channels, six concurrent sessions, no SSH, no tmux, no machine state to babysit.
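The core mechanism fits in a page. Here is a minimal Python sketch of the routing idea, assuming a one-process-per-channel model; the channel names, directory layout, and launch command are illustrative, not MPG's actual internals.

```python
# Hypothetical sketch of MPG-style routing: one shared bot, one agent
# session per Discord channel, each rooted in its project directory.
import subprocess
from pathlib import Path

# One channel per project; the bot itself is shared across all of them.
CHANNEL_TO_PROJECT = {
    "rallyhub": Path("~/projects/rallyhub").expanduser(),
    "moyai": Path("~/projects/moyai").expanduser(),
    "kokoro": Path("~/projects/kokoro").expanduser(),
}

_sessions: dict[str, subprocess.Popen] = {}

def session_for(channel: str) -> subprocess.Popen:
    """Lazily start one long-lived agent process per channel."""
    if channel not in _sessions or _sessions[channel].poll() is not None:
        _sessions[channel] = subprocess.Popen(
            ["claude"],  # placeholder launch command, not MPG's real one
            cwd=CHANNEL_TO_PROJECT[channel],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
    return _sessions[channel]

def route(channel: str, message: str) -> None:
    """Forward an incoming Discord message to that channel's session."""
    proc = session_for(channel)
    proc.stdin.write(message + "\n")
    proc.stdin.flush()
```

The design choice that matters is the map itself: the channel is the session key, so the chat surface carries all the state that tmux panes used to.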

MPG solved the coordination problem for coding agents. HouseholdOS asked the same question one layer up: where can agents safely touch real life?

HouseholdOS is the practical version of the host-not-invoke thesis, built for my own household — a trust-first coordination layer over Google Calendar, Drive, Gmail, and Discord. Conversational scheduling, tenant-scoped household memory, inbox triage. Every state-changing action requires explicit confirmation, because hosting an agent in a family's calendar makes the trust boundary structural rather than behavioral. Still active.
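What "structural rather than behavioral" means here is that the confirmation isn't a prompt-level instruction the agent might forget; it's a gate in the code path. A minimal sketch, with invented names, of how such a gate could work:

```python
# Hypothetical confirm-before-mutate gate: reads pass through, writes
# are staged and execute only after explicit human confirmation.
import uuid
from dataclasses import dataclass
from typing import Callable

@dataclass
class PendingAction:
    description: str
    execute: Callable[[], None]

class ConfirmGate:
    def __init__(self) -> None:
        self._pending: dict[str, PendingAction] = {}

    def stage(self, description: str, execute: Callable[[], None]) -> str:
        """Stage a state-changing action; return a token for the human."""
        token = uuid.uuid4().hex[:8]
        self._pending[token] = PendingAction(description, execute)
        return token

    def confirm(self, token: str) -> None:
        """Run a staged action; raises KeyError if it was never staged."""
        self._pending.pop(token).execute()

# The agent proposes; the human disposes.
gate = ConfirmGate()
token = gate.stage(
    "Create calendar event: dentist, Tue 10:00",
    lambda: print("event created"),  # stand-in for the real Calendar call
)
# ...the bot posts the description and token to Discord; a human replies...
gate.confirm(token)
```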

Once agents could operate across projects, I needed a way to preserve intent across sessions, not just code changes. That became IntentLayer, a CLI for intent-driven governance. It scaffolds and validates an intent registry so durable human intent stays tracked in-repo when agents write code. The essay What 100 Amazon Interviews Taught Me About Broken Tech Hiring belongs to this phase too: if traditional interviews were already measuring the wrong thing, AI makes the gap worse, and we need governance that catches what interviews don't.
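In practice, that kind of governance can be mechanically dull, which is the point. Here is a sketch of what "scaffold and validate an intent registry" might reduce to; the field names and statuses are my guesses, not IntentLayer's actual schema:

```python
# Hypothetical registry validator: intents live in-repo as structured
# records, and a check like this fails CI when an entry is malformed.
REQUIRED_FIELDS = {"id", "statement", "status"}
VALID_STATUSES = {"active", "superseded", "retired"}

def validate_registry(intents: list[dict]) -> list[str]:
    """Return a list of human-readable problems; empty means valid."""
    problems, seen_ids = [], set()
    for i, intent in enumerate(intents):
        missing = REQUIRED_FIELDS - intent.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
            continue
        if intent["id"] in seen_ids:
            problems.append(f"entry {i}: duplicate id {intent['id']!r}")
        seen_ids.add(intent["id"])
        if intent["status"] not in VALID_STATUSES:
            problems.append(f"entry {i}: unknown status {intent['status']!r}")
    return problems

registry = [
    {"id": "INT-001", "statement": "All writes require confirmation.", "status": "active"},
    {"id": "INT-002", "statement": "No LLM in metric scoring.", "status": "active"},
]
assert validate_registry(registry) == []
```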

The pattern across this phase: I stopped building things with agents and started building things that make agents more useful. In retrospect, this was the shift into harness engineering: building the surrounding system that keeps agents directed, observable, and safe enough to rely on.


Phase 3: Measuring Person-Agent Performance

Once I'd built tools to maximize agent throughput, the next question was obvious — how do you tell if any of it is working?

Takumi was the vetting platform — placing candidate "AI Architects" in realistic simulations where a fake non-technical founder pushes them toward privacy-violating decisions, with an AI auditor scoring whether the candidate pushed back. A companion CLI, takumi-eval, scores behavior and interaction quality together during real Claude Code sessions. Like Moyai and Kokoro, Takumi is built and shippable; the active constraint now is the same one — the recruitment and feedback loop that turns a working product into a useful one.

Pulse followed — a CLI that scores how well you interact with AI coding agents on three dimensions: convergence (how many exchanges it takes to reach an outcome), intent anchoring (whether the work references declared intents), and decision quality (whether commits explain why, not just what). Every metric comes from observable data; no LLM is involved in scoring.
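Because nothing in scoring calls a model, each dimension has to reduce to arithmetic over session logs and git history. A sketch of what that could look like; the exact heuristics below are my assumptions, not Pulse's published rules:

```python
# Hypothetical Pulse-style scoring from observable data only.
import re

def convergence(exchange_count: int, budget: int = 10) -> float:
    """Fewer exchanges to reach the outcome scores higher."""
    return max(0.0, 1.0 - (exchange_count - 1) / budget)

def intent_anchoring(messages: list[str], intent_ids: set[str]) -> float:
    """Share of messages that reference a declared intent (e.g. INT-001)."""
    if not messages:
        return 0.0
    hits = sum(any(i in m for i in intent_ids) for m in messages)
    return hits / len(messages)

# Crude proxy for "why, not what": rationale markers in commit messages.
RATIONALE = re.compile(r"\b(because|so that|to avoid|rather than)\b", re.I)

def decision_quality(commit_messages: list[str]) -> float:
    """Share of commits whose message carries a why, not just a what."""
    if not commit_messages:
        return 0.0
    return sum(bool(RATIONALE.search(m)) for m in commit_messages) / len(commit_messages)
```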

This phase was about making the work legible. If a person and an agent are co-authoring software, what does "good" actually look like?


Phase 4: Delegating Judgment

Ayumi is the current project. Calling it a "personal life-context agent system" undersells the aim.

The aim is pointed: build an agent that has enough of my context, my prior decisions, and my way of weighing trade-offs that it can make the smaller calls without me. Not as a replacement for judgment, but as a proxy for the lower-altitude calls that currently consume judgment anyway.

The scaffolding is unsurprising — Gmail and Calendar history mined into topic-scoped memory (work, travel, health, social), available to topic-specialized agents on top of MPG (orchestration), HouseholdOS (credential brokering), and Drive (storage). What matters is what that scaffolding is being pointed at.
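To make "topic-scoped memory" concrete, here is a deliberately naive sketch of the routing step; the four topics come from the essay, but the keyword classifier and storage layout are placeholders for whatever Ayumi actually does:

```python
# Hypothetical topic router: each mined item lands in exactly one topic
# store, so a topic-specialized agent never reads outside its scope.
TOPICS = {
    "work":   {"standup", "deadline", "review", "invoice"},
    "travel": {"flight", "hotel", "itinerary", "booking"},
    "health": {"dentist", "clinic", "checkup", "prescription"},
    "social": {"dinner", "birthday", "meetup", "party"},
}

def classify(text: str) -> str:
    """Pick the topic with the most keyword overlap."""
    words = set(text.lower().split())
    scores = {t: len(words & kw) for t, kw in TOPICS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else "social"  # arbitrary fallback bucket

memory: dict[str, list[str]] = {t: [] for t in TOPICS}

for item in ["Dentist checkup moved to Friday", "Flight booking confirmed"]:
    memory[classify(item)].append(item)

assert memory["health"] == ["Dentist checkup moved to Friday"]
assert memory["travel"] == ["Flight booking confirmed"]
```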

The previous three phases made the agent available, useful, and measurable. Phase 4 is about making it directable at the right altitude. Today I describe a task; the agent executes. What Ayumi is testing is whether an agent can absorb enough of how I think to take broader briefs — "draft the next essay arc," "investigate this product idea and tell me whether it's worth pursuing," "decide whether this scheduling conflict is mine to resolve" — and return work I can ship without re-litigating each decision.

This is also where my own work changes. Software development, essay publication, research, investigation — these are the places I currently spend judgment. If an agent can replicate enough of it for the lower-altitude calls, my role moves up another rung: from operator to editor, from editor to director. Whether that endpoint is desirable, or even reachable, is the open question. Ayumi is the experiment.


What Changed

The arc, compressed:

Phase 1: Building software with agents
Phase 2: Building infrastructure for agents
Phase 3: Measuring person-agent performance
Phase 4: Delegating judgment to agents

Practitioner → toolsmith → researcher → architect. None of which I planned. Each phase came from a friction the previous phase exposed.

If I started RallyHub today, I think it would take a third of the time. Not because the model is dramatically better — it's roughly the same — but because the operating system around the model is. MPG removed the SSH/tmux tax. IntentLayer pinned my durable decisions in-repo. Pulse tells me when an interaction is dragging. Takumi keeps me honest about what skill means now.

The pattern is recursive: every layer I build to amplify agents amplifies the next layer of work I can attempt. That's the part I didn't see coming.


This isn't a plan. It's a record. The next phase will probably start the same way every previous one did: not with a roadmap, but with a project whose friction tells me what to build next.