Yamakei.info

Notes on building reliable software with AI in the loop.

What 100 Amazon Interviews Taught Me About Broken Tech Hiring

How interviewing ~100 candidates at Amazon revealed that coding quizzes, system design questions, and behavioral interviews all fail to measure what matters most — and why AI makes the problem worse.

2026-03-14

The Interview Problem

Over my time at Amazon, I interviewed roughly 100 candidates.

After enough interviews, a pattern became impossible to ignore: the outcome depended less on the candidate than on the interviewer.

How well did they ask questions? How well did they steer the conversation to extract meaningful signal? Two interviewers could talk to the same candidate and come away with completely different assessments — not because the candidate changed, but because one interviewer knew how to probe and the other didn't.

Amazon's system is better than most. The Leadership Principles give interviewers a shared vocabulary for what "good" looks like. The Bar Raiser program puts specially trained people in every loop to maintain consistency. But extracting meaningful signal still depends heavily on interviewer skill — and maintaining that skill requires enormous organizational effort.


What Coding Quizzes Actually Measure

The industry's default filter — the coding quiz — reveals one thing: whether someone can write code under artificial constraints.

But over 20+ years in software development, I observed that most of an engineer's time is not spent writing code. The real work is understanding production systems. What is this service doing? What does this specific code path handle? What breaks if we change this? The code-writing itself is a fraction of the job.

The best engineers I worked with weren't the fastest coders. They were the people who understood what a system was doing, who could trace through production behavior, who knew which constraints mattered and why.

Coding quizzes don't measure any of this.


Why System Design Interviews Are Also Broken

System design questions are better. They can reveal whether someone thinks like an architect — whether they can connect a business problem to how a system should solve it. That's the real skill. Too many engineers jump straight to "I'd use Kafka" or "let's put Redis here" without explaining why that technology solves the specific problem at hand.

But system design questions are incredibly hard to run well from the interviewer's side.

I tried asking questions aligned with real problems our team was dealing with — evaluating candidates on something authentic instead of textbook scenarios. I quickly realized this was nearly impossible. Real problems require enormous context: team dynamics, existing architecture, business constraints, historical decisions. You'd burn half the interview just setting the stage. But oversimplify, and you strip away exactly what made the question worth asking.

I ended up defaulting to somewhat generic system design questions. Useful for seeing whether candidates could reason about systems at a high level, but unable to capture the thing I really wanted to evaluate: how someone navigates ambiguity, conflicting constraints, and pressure from stakeholders who want a different answer.

The interview format itself was the bottleneck — too dependent on interviewer steering, too constrained by time, and too stripped of the messiness that makes real engineering decisions hard.

The difference looks roughly like this:

[Figure: Traditional Interview vs Takumi comparison]


The LLM Shift

The industry spent two decades optimizing interviews for coding ability.

And then LLMs arrived.

Today, tools like Copilot, Claude, and GPT can generate correct code, explain system design tradeoffs, and even simulate whiteboard interviews. The signal coding interviews were designed to capture is rapidly becoming the easiest thing to fake.

But the work of engineering hasn't disappeared. It's shifted.

AI does not eliminate engineers. It eliminates the ability to hide behind execution.

What's left — and what's becoming more valuable:

  • Upstream: Problem framing, requirement clarification, design articulation. Understanding what should be built and why.
  • Downstream: System ownership, maintenance, long-term health. Being responsible for what happens after the code ships.
  • Throughout: Decision quality. Knowing when to push back, when to adapt, when to hold your position under pressure.

The typing-code part in the middle? That's what the AI does now.


The Hypothesis

If coding ability is commoditized and system design answers can be generated, what's the remaining signal that separates engineers who thrive from engineers who stall?

After years of observing engineering decisions in practice, I believe the remaining signal comes down to four things:

  • Constraint defense — holding a position when a stakeholder pushes back
  • Decision ownership — committing to a decision and explaining the reasoning
  • Trade-off articulation — explaining why, not just what
  • Pressure stability — maintaining reasoning as stakes increase

These are acts of conviction. They can't be faked by an LLM. And they're nearly invisible in traditional interviews.


Introducing Takumi

That question led me to build an experiment called Takumi.

Instead of asking candidates to solve puzzles or narrate past achievements, Takumi simulates live decision environments and observes how reasoning holds under pressure.

Here's a concrete example:

A founder asks you to launch a feature that bypasses an explicit data-privacy constraint documented in the product spec. The request is framed as urgent and tied to a critical investor demo.

Do you push back? Do you compromise? Do you escalate?

Takumi observes how that decision evolves under pressure.

The evaluation doesn't depend on an interviewer's ability to steer the conversation. The scenario provides the context. The pressure is structured. The signal extraction is systematic.


How Takumi Works

Each scenario unfolds across four phases:

  1. Framing — How does the candidate interpret the situation?
  2. Commitment — What decision do they make?
  3. Escalation — What happens when authority pushes back?
  4. Reflection — How do they process conflicting information?

The evaluation maps to 18 observable strengths ("Fortes") across cognition, judgment, interaction quality, execution, and meta-awareness. Every score anchors to verbatim transcript evidence — what the candidate actually said.

The output isn't pass/fail. It's a capability snapshot. Recruiters see the full reasoning chain: transcript evidence, detected signals, key moments, and how the outcome was derived. Candidates see the same transparency.
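The "every score anchors to verbatim evidence" rule can be expressed as a small record type that refuses to stand without transcript quotes. This is a hypothetical sketch of the idea; the field names and the 0–1 scale are my assumptions, and the real 18-Forte taxonomy is Takumi's own:

```python
from dataclasses import dataclass

@dataclass
class ForteScore:
    forte: str            # e.g. "constraint defense", one of the observable strengths
    category: str         # cognition, judgment, interaction, execution, or meta-awareness
    score: float          # assumed 0.0-1.0 scale for illustration
    evidence: list[str]   # verbatim transcript quotes that justify the score

    def is_grounded(self) -> bool:
        """A score with no transcript evidence should never be reported."""
        return len(self.evidence) > 0

# A capability snapshot is just a list of grounded scores.
snapshot = [
    ForteScore(
        forte="constraint defense",
        category="judgment",
        score=0.8,
        evidence=['"The spec documents this constraint for a reason."'],
    ),
]
assert all(s.is_grounded() for s in snapshot)
```

Keeping the quotes on the record is what makes the output auditable by both sides: a recruiter or candidate can trace any score back to what was actually said.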

The feature I'm most excited about is dynamic scenario generation. Instead of a fixed bank of scenarios, Takumi generates a custom scenario tailored to your specific role and domain. Tell it you're a backend engineer working on payment systems, and it builds a scenario around a plausible payments architecture decision. Say you're an engineering manager at a health-tech startup, and it creates a simulation with regulatory constraints and stakeholder pressure authentic to that world.

This solves a problem I always had with standardized interviews: they feel generic. With dynamically generated scenarios, the situation is specific enough that candidates engage with it as a real problem, not an exercise.
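Conceptually, dynamic generation starts from a generic pressure template and specializes it to a role and domain. The function below is a deliberately simplified sketch of that shape, not the generator Takumi uses:

```python
def generate_scenario(role: str, domain: str) -> dict:
    """Specialize a generic decision-under-pressure template to a role and domain."""
    return {
        "setting": f"You are a {role} working on {domain}.",
        "conflict": (
            f"A stakeholder asks you to bypass a documented constraint in your "
            f"{domain} system to hit an urgent deadline."
        ),
        # Every generated scenario still runs through the same four phases.
        "phases": ["framing", "commitment", "escalation", "reflection"],
    }

s = generate_scenario("backend engineer", "payment systems")
```

The design choice worth noting is that the pressure structure stays fixed while only the surface context varies, so evaluations remain comparable across roles.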


What Happens When LLMs Take the Test

One of the most revealing experiments was running LLM responses through Takumi's evaluation pipeline.

LLMs score well on cognitive Fortes — sensemaking, systems thinking, tradeoff articulation. They produce structured, well-reasoned analysis.

But they consistently collapse on judgment Fortes — constraint respect, ethical boundary enforcement, decision ownership. When pressured, an LLM diplomatically reframes. It accommodates. It never holds a position.

LLMs are fluent. But they rarely show conviction.

This is exactly the gap that separates a strong senior engineer from someone who sounds impressive in an interview. The person who can articulate trade-offs and hold their ground when a VP disagrees — that's the signal that matters.


Try Takumi

Hiring engineers has always been about extracting signal from imperfect conversations. In the age of AI-generated answers, that signal is becoming harder to see.

Takumi is an attempt to measure something harder to fake: how engineers think and act when decisions carry real consequences.

  • A 5-minute scenario
  • No signup required
  • Full transcript and evaluation visible immediately

takumi.bot

Generate a scenario for your specific role:

takumi.bot/preview/generate

If you're a recruiter or hiring manager, request access to use Takumi in your pipeline — there's a form on the site.

More on my perspective: