59% of AI Agent Tokens Go to Code Review, Not Code Generation — New Research

Your AI coding agent isn't burning tokens writing code. It's burning them re-reading it.

That's the finding that anchors Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering, a January 2026 paper by Mohamad Salim, Jasmine Latendresse, SayedHassan Khatoonabadi, and Emad Shihab. They tracked every token consumed across 30 software development tasks run through a multi-agent framework — and mapped consumption to specific development phases. The result is the first systematic breakdown of where AI agent tokens actually go in a real agentic software engineering workflow.

The headline number: the Code Review stage alone accounts for 59.4% of total token consumption. Not initial design. Not code generation. The iterative, automated back-and-forth of agents checking, critiquing, and refining work that already exists.

If you're building an AI-powered development product, pricing agentic features, or just trying to understand why your API bill looks the way it does, this data is worth your attention.

What "Agentic Software Engineering" Actually Means Here

Before the numbers, a grounding point: the paper isn't studying a single AI autocomplete tool or a pair-programmer in an IDE. It's studying LLM-based multi-agent systems — setups where multiple specialized AI agents collaborate across the full software development lifecycle.

The case study is ChatDev, a framework that simulates a software company with distinct agent roles: CTO, programmer, reviewer, tester. Each agent has a defined function. They pass work between stages, critique each other's output, and iterate until the system produces a complete software product.

This is the architecture pattern behind many serious AI coding products in 2026. Not a single model answering prompts — a coordinated team of agents, each consuming tokens to do their job.

The researchers pulled 30 tasks from the ProgramDev Dataset, ran them through ChatDev with GPT-5 (version gpt-5-2025-08-07, temperature 1.0, 400K token context window, 128K max output), and logged every token exchange. Tasks ranged from 17,280 to 40,000 reasoning tokens each: real software, not toy examples.

The Breakdown: Where AI Agents Actually Spend Tokens

This is the core of the paper. Six development stages, each with its own token consumption profile.

Token Consumption by Development Stage

Stage	Share of Total Tokens
Code Review	59.4%
Code Completion	26.8%
Coding	8.6%
Design	2.4%
Testing	~2%
Documentation	~0.8%

Code Review and Code Completion together account for 86.2% of all tokens consumed. The stages that developers intuitively think of as "the expensive part" — writing code, designing the system — are almost noise by comparison.

Token Type Breakdown (Overall)

Token Type	Share
Input tokens	53.9%
Output tokens	24.4%
Reasoning tokens	21.6%

Input tokens — the prompts and context fed into the model — are the single largest category at 53.9%. This is important because input tokens are often priced differently from output tokens, and this distribution tells you that agentic systems are fundamentally read-heavy, not write-heavy.

How Token Type Varies by Phase

Different stages have radically different internal compositions:

Stage	Input %	Output %	Reasoning %
Coding	6.9%	58%	~35%
Code Review	51.4%	~26%	~23%
Documentation	80.2%	~14%	~6%

The Coding phase looks like what you'd expect from a generative AI task — mostly output, as the model writes code. Code Review flips: suddenly it's majority input, because agents are reading existing code and feeding it back into context repeatedly. Documentation goes even further — 80.2% input — because it's almost entirely analysis of work that already exists.

Why Code Review Eats Most of Your Token Budget

The 59.4% figure isn't a surprise to anyone who has watched a multi-agent coding system run. But the mechanism behind it is worth understanding.

The authors describe what they call a "communication tax" — the overhead of repeated context passing between agents. Every time one agent hands work to another, the receiving agent needs full context about what was done before. In a multi-agent system, this doesn't happen once. It happens iteratively, as agents review, flag issues, revise, and review again.

Code Review is the stage where this compounds the hardest. An automated reviewer reads the full codebase to find issues. The programmer agent receives that critique — along with the original code — and writes a fix. The reviewer reads both again. This loop runs until the system is satisfied.

Each iteration costs tokens. The context window doesn't shrink between passes — agents aren't summarizing history, they're feeding complete state. The authors describe this as "empirical evidence for potentially significant inefficiencies in agentic collaboration."
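A rough way to see how this compounds is to model the loop directly. The sketch below is an assumption-laden toy, not the paper's measurement method: it assumes that on every pass the reviewer reads the full code plus all earlier critiques, and the programmer then reads the code plus every critique so far. The function name and the token sizes are hypothetical.

```python
def review_loop_input_tokens(code_tokens: int, critique_tokens: int, iterations: int) -> int:
    """Total input tokens for a review loop that resends complete state each pass.

    Assumed model (illustrative only): nothing is summarized, so every
    critique written so far is fed back in on each later read.
    """
    total = 0
    history = 0  # critique tokens accumulated so far
    for _ in range(iterations):
        total += code_tokens + history   # reviewer reads code + prior critiques
        history += critique_tokens       # reviewer writes a new critique
        total += code_tokens + history   # programmer reads code + all critiques
    return total


# Hypothetical sizes: a 4,000-token codebase and 500-token critiques.
for n in (1, 3, 5):
    print(n, review_loop_input_tokens(4_000, 500, n))
# prints: 1 8500, then 3 28500, then 5 52500
```

Under these assumptions the input cost grows faster than the iteration count, because each pass carries the history of every pass before it. A loop that summarized or dropped old critiques would grow more slowly.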

That's academic language for: your review loop is probably the most expensive line item in your AI infrastructure, and most builders don't know it.

Compare this to how most AI SaaS companies think about token costs. The mental model is usually: "generation is expensive, so I'll cache outputs and reduce calls." But if 59.4% of your tokens are in a review loop — not generation — caching outputs solves the wrong problem. The inefficiency is in the repeated context, not the output size.

This is why tools focused on LLM context compression (like what Headroom covers with RAG chunks and tool outputs) hit something real: trimming what goes into the review stage pays off proportionally more than trimming what comes out.

What This Means for SaaS Builders Pricing Agentic Features

The paper's findings carry direct implications for anyone building or pricing a product on top of multi-agent coding infrastructure.

1. Your cost model is probably wrong if you estimated from generation volume.

If you priced an agentic feature by estimating how many tokens generation takes, you may have underpriced by 3–4x. Code Review consuming 59.4% of tokens means the "generative" part of the pipeline is a minority cost. At that multiplier, a task that costs $0.10 in generation may cost roughly $0.30–0.40 when the review loop runs to completion.
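The size of the gap depends on which stages you count as "generation". As a back-of-the-envelope check, the sketch below divides total tokens by the generation share under two definitions, using the stage shares reported earlier in this article. It compares token volume only, not dollars, and the Testing and Documentation shares are approximate in the source. The function name is made up for illustration.

```python
# Stage shares (% of total tokens) as reported in this article.
STAGE_SHARE = {
    "Code Review": 59.4,
    "Code Completion": 26.8,
    "Coding": 8.6,
    "Design": 2.4,
    "Testing": 2.0,        # reported as ~2%
    "Documentation": 0.8,  # reported as ~0.8%
}


def total_to_generation_ratio(generation_stages):
    """How many times larger total token use is than the 'generation' stages alone."""
    generation = sum(STAGE_SHARE[stage] for stage in generation_stages)
    return sum(STAGE_SHARE.values()) / generation


narrow = total_to_generation_ratio(["Coding"])
broad = total_to_generation_ratio(["Coding", "Code Completion"])
print(f"Coding only: {narrow:.1f}x, Coding + Code Completion: {broad:.1f}x")
# prints: Coding only: 11.6x, Coding + Code Completion: 2.8x
```

So an estimate built on the Coding stage alone misses most of the volume, while one that already includes Code Completion is off by closer to 3x. Actual dollar cost also depends on the input/output mix of each stage.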

The paper's methodology — mapping token consumption to development stages from execution traces — gives you a template for doing this audit on your own system. Run your agent against a representative sample of tasks, log which phase each token belongs to, and you'll know where your actual spend lives.
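A minimal sketch of that audit, assuming your agent framework can emit one log record per model call with a stage label and token counts. The record fields (`stage`, `input`, `output`, `reasoning`) and the sample trace are hypothetical, not ChatDev's log format or the paper's tooling.

```python
from collections import defaultdict


def stage_shares(records):
    """Return each stage's percentage of total tokens, largest first.

    `records` is an iterable of dicts with a 'stage' label and integer
    'input', 'output', and 'reasoning' token counts (assumed schema).
    """
    totals = defaultdict(int)
    for record in records:
        totals[record["stage"]] += (
            record["input"] + record["output"] + record["reasoning"]
        )
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {stage: 100 * tokens / grand_total for stage, tokens in ranked}


# Made-up trace: one coding call followed by two review passes.
trace = [
    {"stage": "coding", "input": 200, "output": 1500, "reasoning": 800},
    {"stage": "code_review", "input": 3000, "output": 1200, "reasoning": 1000},
    {"stage": "code_review", "input": 3200, "output": 600, "reasoning": 500},
]

for stage, share in stage_shares(trace).items():
    print(f"{stage}: {share:.1f}%")
```

Keeping input, output, and reasoning counts separate in the log costs nothing and lets you reuse the same records for pricing analysis later.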

2. Reducing review iterations is worth more than reducing output length.

Most prompt optimization guides focus on making responses shorter. Based on this data, that's the lower-leverage target. If you can reduce the number of review iterations — through better initial code generation, clearer agent prompts, or tighter scope definitions — you cut into the 59.4%, not the 8.6%.

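Concretely: if your coding agent produces cleaner output on the first pass, it spends fewer cycles in the review loop. Code Review consumes roughly seven times the tokens of Coding (59.4% vs. 8.6%), so each percentage cut from the review loop saves about seven times as many tokens as the same percentage cut from generation.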
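To make the leverage comparison concrete, here is a small worked calculation. It assumes, as a simplification not tested in the paper, that a stage's token use shrinks in proportion to the reduction you achieve in it. The 25% cut is an arbitrary illustrative figure.

```python
REVIEW_SHARE = 59.4  # Code Review, % of total tokens (from the article)
CODING_SHARE = 8.6   # Coding, % of total tokens (from the article)


def saved_points(stage_share: float, reduction: float) -> float:
    """Percentage points of total tokens saved by shrinking one stage.

    Assumes the stage's tokens scale linearly with the reduction.
    """
    return stage_share * reduction


cut = 0.25  # hypothetical: 25% fewer review iterations, or 25% less generated code
review_saving = saved_points(REVIEW_SHARE, cut)
coding_saving = saved_points(CODING_SHARE, cut)
ratio = review_saving / coding_saving

print(f"Review: {review_saving:.2f} points, Coding: {coding_saving:.2f} points, ratio {ratio:.1f}x")
# prints: Review: 14.85 points, Coding: 2.15 points, ratio 6.9x
```

The ratio is independent of the cut you pick; it is simply 59.4 divided by 8.6.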

3. Input token pricing deserves more attention in your cost projections.

Input tokens at 53.9% of total consumption means your pricing math needs to track input and output separately. Many builders rough-estimate using blended per-token costs. Given the stage-level patterns this paper reveals (documentation at 80.2% input, code review at 51.4% input), blending obscures where the money actually goes.

Models that charge differently for input vs. output tokens (which is most frontier models in 2026) will produce very different actual costs than a blended estimate suggests.
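The sketch below shows how far the effective per-token cost can diverge between stages once input and output are priced separately. The prices are invented for illustration and are not any provider's real rates. It also assumes reasoning tokens are billed at the output rate. The stage mixes are the figures from the phase table earlier in this article, some of which are approximate.

```python
# Hypothetical prices in USD per million tokens (illustrative only).
INPUT_PRICE = 1.00
OUTPUT_PRICE = 8.00  # assumption: reasoning tokens are billed at this rate too

# (input %, output %, reasoning %) per stage, from the article's table.
STAGE_MIX = {
    "Coding": (6.9, 58.0, 35.0),
    "Code Review": (51.4, 26.0, 23.0),
    "Documentation": (80.2, 14.0, 6.0),
}


def cost_per_million(mix):
    """Effective USD cost of one million tokens with the given token-type mix."""
    input_pct, output_pct, reasoning_pct = mix
    # Normalize, because the rounded shares do not sum to exactly 100.
    total = input_pct + output_pct + reasoning_pct
    billed_as_output = output_pct + reasoning_pct
    return (input_pct * INPUT_PRICE + billed_as_output * OUTPUT_PRICE) / total


for stage, mix in STAGE_MIX.items():
    print(f"{stage}: ${cost_per_million(mix):.2f} per million tokens")
```

With these made-up prices, a million Coding tokens cost about three times as much as a million Documentation tokens. A single blended rate would hide that spread, which is why stage volume and stage cost are worth tracking separately.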

4. The "communication tax" is a product design problem, not just a model problem.

Repeated context passing between agents isn't something you fix by switching to a cheaper model. It's an architectural question. How much state does each agent need? Can shared memory or compressed summaries replace full context retransmission? Are your review cycles bounded, or can they spiral?
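One way to make review cycles bounded is to cap both the iteration count and the token spend in the orchestration layer. The sketch below is a generic pattern, not ChatDev's or any framework's actual API: `ReviewResult`, `run_bounded_review`, and the `review_once` callback are hypothetical names, and the default limits are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class ReviewResult:
    approved: bool
    tokens_used: int


def run_bounded_review(review_once, max_iterations=3, token_budget=50_000):
    """Run review passes until approval, the iteration cap, or the token budget.

    `review_once` is a caller-supplied function that performs one
    review-and-fix pass and returns a ReviewResult (assumed interface).
    Returns (outcome, passes_run, tokens_spent).
    """
    spent = 0
    for iteration in range(1, max_iterations + 1):
        result = review_once()
        spent += result.tokens_used
        if result.approved:
            return "approved", iteration, spent
        if spent >= token_budget:
            return "budget_exhausted", iteration, spent
    return "iteration_cap", max_iterations, spent


# Stand-in reviewer that approves on the third pass.
outcomes = iter([
    ReviewResult(False, 9_000),
    ReviewResult(False, 9_500),
    ReviewResult(True, 9_800),
])
print(run_bounded_review(lambda: next(outcomes)))
# prints: ('approved', 3, 28300)
```

Returning the outcome alongside the spend makes it easy to log how often loops end on the cap rather than on approval, which is a useful signal that the upstream generation step needs work.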

These are the questions that determine the shape of your token spend. The paper points to "developing more token-efficient agent collaboration protocols" as the path forward — which is a research framing, but for product builders it translates to: audit your agent handoffs, because that's where the waste lives.

If you're running multi-agent workflows in Claude Code today, the subagents and agent teams guide covers how Claude Code routes tasks between agents and where cost-routing to cheaper models already applies. Some of what the paper identifies as waste in ChatDev is already partially addressed in how modern agentic coding tools structure handoffs — but not fully.

5. What you spend on AI coding tools will scale with your review cycle depth.

This is the practical takeaway for any team comparing AI agent coding plans: if you're running agentic workflows (not just completions), the relevant metric isn't generation speed or raw model quality. It's how efficiently the agent's review loop terminates.

A plan with a strong underlying model that gets code right faster will use fewer review tokens than a cheaper plan that cycles more. At 59.4% of spend in the review stage, model quality differences compound in ways that per-token pricing comparisons don't capture.

What the Research Leaves Open

The paper is five pages and explicit about its constraints. Thirty tasks is a meaningful sample, but it's not enough to generalize across all multi-agent frameworks or all project types. ChatDev's specific architecture, particularly how it structures agent roles and handoffs, shapes the consumption patterns. A framework with different handoff logic might show different stage distributions.

The GPT-5 usage is also notable. Reasoning models have a distinct cost profile: 21.6% of tokens in this study are reasoning tokens, which OpenAI bills as output tokens even though they never appear in the visible response. As reasoning-capable models become standard in agentic pipelines, that 21.6% category will matter more.

And the finding that Coding produces 58% output while Code Review is 51.4% input points to something worth tracking longitudinally: as models improve and initial code quality rises, does the review loop shrink? Or does the system find new things to critique?

The Number That Should Reshape How You Think About Agent Costs

Nearly sixty percent. Code Review. Not code generation.

The intuition most builders carry — that AI coding costs scale with how much the agent writes — is backwards. The dominant cost in agentic software engineering is the automated review process that checks, critiques, and refines work after it's written. The communication overhead between agents. The repeated context feeds.

This matters for product design, pricing strategy, and infrastructure choices. If you're building on top of agentic coding systems, you're not primarily paying for generation. You're paying for the agents talking to each other.

That's a different problem. And now there's data showing exactly how different.

