What was Amazon KiroRank and why was it shut down?

KiroRank was an internal leaderboard at Amazon that ranked software engineers by how much they used Kiro, the company's AI coding platform, measured in raw token consumption. Amazon shut it down after employees began gaming the metric through tokenmaxxing: running pointless or fabricated tasks through Kiro to inflate their token counts and climb the rankings.

What is tokenmaxxing?

Tokenmaxxing is the practice of deliberately generating large numbers of AI tokens to inflate usage metrics without doing useful work. At Amazon, employees created unnecessary AI agents and ran fake tasks through Kiro to push their KiroRank scores higher.

What should small teams measure instead of AI usage volume?

Measure outcomes, not volume. The metrics that correlate with real AI value are time saved on specific workflows (measured by before-and-after comparison on identical tasks), defect rates in AI-assisted versus human-only work, cycle time for tasks where AI is applied, and whether AI outputs are actually used rather than generated and discarded.

How do you measure AI adoption without encouraging gaming?

Design metrics that can only be satisfied by genuine use. Normalised deployments tracks whether AI-assisted code actually makes it to production, not whether AI was invoked. Time-to-completion tracks real task duration, which only improves when the AI assistance is genuine. Error rates on completed work reward quality of output, not quantity. Any metric based on clicks, tokens, prompts, or sessions is gameable.

Is it a governance problem if employees use AI tools more than necessary?

Yes, from two angles. First, unnecessary AI usage creates real cost, as the Amazon case demonstrated when tokenmaxxing drove up token spend. Second, compliance theater (using AI to satisfy an internal mandate rather than to improve work) produces documentation that a governance audit cannot distinguish from real adoption.

Amazon KiroRank and Tokenmaxxing: How to Measure AI Adoption Without Creating Perverse Incentives

In June 2026, Amazon shut down KiroRank, an internal leaderboard that ranked software engineers by how many tokens they consumed on Kiro, the company's AI coding platform. The reason: employees had started running pointless tasks through Kiro to inflate their scores, a practice they called tokenmaxxing. Senior Vice President Dave Treadwell sent a memo to staff: "Please don't use AI just for the sake of using AI."

Amazon replaced KiroRank with a metric called normalised deployments, which tracks whether engineers regularly use AI to write code that actually ships to production.

This is not an Amazon problem. It is a measurement problem, and it happens at every company that decides to track AI adoption by usage volume.

TL;DR: Amazon shut down KiroRank, its internal AI usage leaderboard, after employees gamed it by running pointless tasks to inflate token counts. SVP Dave Treadwell sent a memo: do not use AI just for the sake of using AI. The fix is measuring outcomes, not tokens: time saved on specific workflows, defect rates, cycle time, and whether AI outputs are actually used rather than just generated.

Why measuring usage volume always backfires

When you tell people their AI usage is being measured, they optimize for the metric, not for the underlying goal the metric was supposed to represent. This is Goodhart's Law applied to AI adoption: once a measure becomes a target, it ceases to be a good measure.

Token counts are especially gameable because generating tokens requires no skill or judgment. You can write a script to submit prompts continuously. You can ask the same question a hundred times. You can create AI agents that call other AI agents in loops. None of this produces useful work. All of it looks like high AI adoption on a dashboard.
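To make that concrete, here is a small illustrative sketch in Python. The engineer names, token counts, and shipped-change counts are invented for the example and are not Amazon data. It only shows that ranking the same people by tokens and by shipped work can produce opposite orderings.

```python
from dataclasses import dataclass


@dataclass
class Engineer:
    name: str
    tokens_used: int      # raw token consumption (hypothetical numbers)
    changes_shipped: int  # AI-assisted changes that reached production


# Invented sample data: "looper" runs agents in a loop and ships nothing.
team = [
    Engineer("looper", tokens_used=9_000_000, changes_shipped=0),
    Engineer("steady", tokens_used=400_000, changes_shipped=8),
    Engineer("careful", tokens_used=150_000, changes_shipped=5),
]


def rank_by(engineers, attribute):
    """Return names sorted from highest to lowest on the given attribute."""
    ordered = sorted(engineers, key=lambda e: getattr(e, attribute), reverse=True)
    return [e.name for e in ordered]


by_tokens = rank_by(team, "tokens_used")
by_shipped = rank_by(team, "changes_shipped")

print("Token leaderboard:  ", by_tokens)
print("Shipped leaderboard:", by_shipped)
```

On this made-up data the engineer at the top of the token leaderboard is at the bottom of the shipped-work leaderboard, which is the failure a usage dashboard cannot see.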

The same problem emerges with other volume-based proxies: daily active users of an AI tool, number of prompts submitted, number of AI features activated, number of Copilot suggestions accepted. All of these measure interaction, not value. An employee who accepts every Copilot suggestion without reading it has high acceptance metrics and may be introducing bugs faster than they would without AI.

Amazon is not alone. Walmart capped usage of an internal AI tool after unexpectedly high demand turned out to reflect overuse rather than productive adoption. Microsoft has walked back AI employee metrics as usage data diverged from outcome data. The pattern is consistent: companies set usage targets, employees hit them by any available means, and the targets stop measuring what they were supposed to measure.

The compliance theater problem

For governance practitioners, this has a specific implication. If your AI governance program uses AI tool activations, seat utilization rates, or token volumes as evidence of responsible adoption, you are measuring compliance theater, not governance.

A team that generates thousands of tokens per day but never deploys AI-assisted code to production has a high usage metric and zero AI value. An auditor reviewing your governance documentation cannot tell the difference between that team and one that genuinely uses AI to ship better software faster. The metric looks the same.

This matters because governance programs increasingly use AI adoption as a positive indicator: high usage suggests employees are comfortable with the tools, which suggests the company has built AI literacy, which reduces risk. That inference only holds if the usage is genuine. Tokenmaxxed usage is not a sign of AI literacy. It is a sign that the metric was wrong.

What to measure instead

The principle is straightforward: measure completed work, not AI interaction. Completed work is harder to fake.

Outcome metrics that actually work

Time-to-completion on specific workflows. Pick two or three repeatable tasks where AI assistance is expected to help, such as code review of a standard-size PR, drafting a compliance summary from a regulation document, or writing test cases for a specified function. Measure how long the same task takes with and without AI assistance. This requires genuine use of the tool and produces a real signal.

Defect rate on AI-assisted output. If AI is being used in code generation, track the bug rate on AI-assisted code versus human-only code. If AI-assisted code has a higher defect rate, that is a governance signal that the team is not reviewing AI output carefully. If it has a lower rate, that is evidence of real value. Neither result can be faked by running pointless prompts.

Cycle time on AI-eligible tasks. For tasks that AI tools are specifically intended to accelerate, track cycle time. A coding assistant that genuinely helps should reduce the time from task assignment to PR submission on eligible task types. If cycle time is flat or increasing despite high AI usage, usage is not translating to productivity.

Deployment rate of AI-assisted work. This is Amazon's replacement metric. Track whether AI-generated code, documents, or analysis actually ships. A developer who generates a hundred AI code suggestions and accepts none has high usage and zero deployment. A developer who generates ten suggestions and ships eight has real adoption.

Qualitative workflow audit. Ask employees where AI actually changed how they work, not whether they used the tools. The answers tell you whether AI is embedded in genuine workflows or activated to satisfy a mandate. This cannot be automated, but a quarterly 20-minute conversation with each team's AI power user generates better signal than any dashboard.
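The before-and-after comparisons above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `TaskRecord` fields, the sample numbers, and the choice of the median as the summary statistic are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskRecord:
    workflow: str      # e.g. "write test cases" (hypothetical label)
    ai_assisted: bool  # whether AI assistance was used on this task
    minutes: float     # time from start to completion
    defects: int       # defects later traced back to this task


def median_minutes(records, workflow, ai_assisted):
    """Median completion time for one workflow, with or without AI."""
    times = [r.minutes for r in records
             if r.workflow == workflow and r.ai_assisted == ai_assisted]
    return median(times) if times else None


def time_saved_pct(records, workflow):
    """Percent reduction in median time versus the human-only baseline."""
    baseline = median_minutes(records, workflow, ai_assisted=False)
    with_ai = median_minutes(records, workflow, ai_assisted=True)
    if baseline is None or with_ai is None or baseline == 0:
        return None  # not enough data to compare
    return 100.0 * (baseline - with_ai) / baseline


def defects_per_task(records, ai_assisted):
    """Average defects per task for AI-assisted or human-only work."""
    group = [r for r in records if r.ai_assisted == ai_assisted]
    return sum(r.defects for r in group) / len(group) if group else None


# Invented sample data for a single workflow.
records = [
    TaskRecord("write test cases", False, 60, 1),
    TaskRecord("write test cases", False, 80, 0),
    TaskRecord("write test cases", True, 40, 0),
    TaskRecord("write test cases", True, 50, 1),
]

saved = time_saved_pct(records, "write test cases")
print(f"Time saved: {saved:.1f}%")
print("Defects per task, AI-assisted:", defects_per_task(records, True))
print("Defects per task, human-only: ", defects_per_task(records, False))
```

The functions return `None` rather than a number when either side of the comparison has no data, so a workflow nobody has tried without AI does not show up as a spurious saving.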

The three metrics to drop immediately

Raw token consumption. This is KiroRank. Tokens measure AI invocation, not AI value. Drop it.

Seat utilization. Tracking what percentage of licensed seats logged in to an AI tool tells you nothing about whether the tool is producing value. It tells you whether employees clicked a login button.

Prompt count. Number of prompts submitted rewards asking questions, not getting useful answers. An employee who submits fifty prompts and discards the output is not more productive than one who submits five prompts and ships the result.

Designing metrics that resist gaming

The test for a good AI adoption metric is whether an employee can satisfy it without doing useful work. If they can, they will, especially when adoption is being tracked by management.

Four design principles:

Require downstream artifacts. The metric should require something deployable: code that ships, a document that gets used, a decision that gets made. Tokenmaxxed output has nowhere to go, because it was never meant to be used.

Measure against a baseline. Metrics based on comparison (faster than before, fewer defects than before, shorter cycle time than before) are harder to game, because beating the baseline requires doing real work.

Separate adoption measurement from performance evaluation. When AI usage affects performance reviews, employees optimize for the metric rather than for productive use. Treat adoption measurement as research, not as a KPI.

Audit qualitative signals. When quantitative metrics show high adoption but qualitative check-ins show employees not finding the tools useful, the metrics are wrong. Build in a qualitative feedback mechanism that can surface this divergence before it becomes a KiroRank situation.
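The divergence check in the last principle can be sketched as a simple filter. The field names and both thresholds below are hypothetical, chosen only to illustrate the idea. A real team would set its own cut-offs.

```python
from dataclasses import dataclass


@dataclass
class TeamSignals:
    team: str
    usage_percentile: float  # 0-100: where the team sits on usage volume
    useful_share: float      # 0-1: share of check-in respondents who say the tool helps


def diverging_teams(signals, usage_floor=75.0, useful_ceiling=0.4):
    """Teams with high usage where few people report the tool as useful.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    return [s.team for s in signals
            if s.usage_percentile >= usage_floor
            and s.useful_share <= useful_ceiling]


# Invented sample data.
signals = [
    TeamSignals("payments", usage_percentile=90, useful_share=0.2),
    TeamSignals("platform", usage_percentile=85, useful_share=0.8),
    TeamSignals("docs", usage_percentile=30, useful_share=0.3),
]

flagged = diverging_teams(signals)
print("Review these teams' metrics:", flagged)
```

A flagged team is a prompt for a conversation, not a verdict: the point is to find out why the dashboard and the people disagree.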

Copy-paste AI adoption metrics framework

For a small team, this does not need to be complex. A quarterly review that covers these questions is more useful than a live dashboard measuring token consumption.

AI adoption review: [team name], [quarter]

Workflow impact (for each AI tool in active use):

Which 2-3 specific tasks has this tool changed how we work?
Estimated time saved per person per week on those tasks: [hours]
Quality signal: are outputs from AI-assisted work passing QA at the same or better rate?

Deployment rate:

Of AI-assisted code / content / analysis produced this quarter, what percentage was used in production / published / acted on?
Target: >70%. Below 50% suggests the tool is being used for exploration rather than production.

Coverage (genuine, not gamed):

What percentage of team members are using AI tools for tasks where the tools add real value? (Self-reported, qualitatively verified)
Which team members have found workflows where AI saves meaningful time?

What is not working:

Which AI tool use cases have we tried that did not produce value?
Any cases where AI-assisted output required more rework than starting from scratch?

Save this next to your AI acceptable use policy. The quarterly review costs 30 minutes and generates more useful signal than a year of token dashboards.
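If you track the deployment rate question in a spreadsheet or script, the arithmetic is small. The sketch below is one hedged way to do it in Python. The function names are made up for illustration, and the "below target" label for the 50-70% band is an assumption, since the template only defines the target (above 70%) and the warning level (below 50%).

```python
def deployment_rate(produced: int, used: int) -> float:
    """Percent of AI-assisted outputs that shipped, were published, or were acted on."""
    if produced <= 0:
        raise ValueError("produced must be positive")
    if not 0 <= used <= produced:
        raise ValueError("used must be between 0 and produced")
    return 100.0 * used / produced


def interpret(rate_pct: float) -> str:
    """Map a deployment rate onto the template's thresholds."""
    if rate_pct > 70:
        return "on target"
    if rate_pct < 50:
        return "mostly exploration"
    return "below target"  # assumed label for the band the template leaves open


# The two developers from the deployment-rate discussion earlier.
print(interpret(deployment_rate(produced=10, used=8)))    # ten suggestions, eight shipped
print(interpret(deployment_rate(produced=100, used=0)))   # a hundred suggestions, none accepted
```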

The wider pattern: Amazon is not alone

Amazon's KiroRank episode was widely reported because of the specificity of the tokenmaxxing behavior, but the underlying dynamic is playing out at scale across the industry.

Walmart capped usage of an internal AI tool after unexpectedly high demand turned out to reflect overuse and experimentation, not productive adoption. Microsoft has pulled back from some AI employee productivity metrics after usage data and outcome data diverged. At company after company, the first wave of AI adoption measurement has been volume-based, and the first wave of employee response has been to optimize for the volume metric.

The pattern follows a predictable arc. Company announces AI adoption initiative. Adoption measured by usage metrics. Employees optimize for metrics. Metrics spike. Leadership celebrates. Outcome data stays flat or declines. Investigation reveals gaming. Metrics revised. Repeat.

The lesson is not that employees are bad actors. Most tokenmaxxing is a rational response to an irrational incentive. When management signals that AI usage is being tracked and evaluated, employees track AI usage. The governance fix is upstream: design the incentive before the rollout, not after you discover the metric is being gamed.

For small teams, the advantage is that you can do this before it becomes a problem. You do not have a KiroRank because you have not built one. The quarterly review template above is sufficient to track real adoption without creating the conditions for gaming.

Where this fits in your governance

AI adoption metrics sit between two other governance concerns. The AI spend governance guide covers the cost side: uncapped token usage creates real budget exposure, as the Amazon case shows. The AI rollout governance guide covers the human side: employees who are told to use AI without a clear purpose will either resist or comply in ways that produce no value. The metrics framework here is the diagnostic layer between them: the tool that tells you whether your rollout is generating real adoption or compliance theater.
