What is the Better Models Worse Tools problem?

Armin Ronacher discovered that Anthropic newer models like Opus 4.8 and Sonnet 5 invent nonexistent fields in tool call arguments more often than older models. The models produce correct payloads but add garbage keys like requireUnique or oldText2 that break schema validation, failing roughly 20% of the time in agentic sessions.

Why are newer models worse at tool calling?

Anthropic likely post-trains on Claude Code, whose client silently absorbs malformed calls through retry logic and unknown-key filtering. Sloppy calls still complete tasks and receive reward, so there is no training signal against inventing invalid fields. Models become adapted to Claude Code flat tool shapes and struggle with different schemas.

How can SaaS builders fix this?

Enable strict tool invocation mode, use constrained decoding, design flat tool schemas that match what the model was trained on, and build forgiving client-side retry logic since the model often produces correct payloads with only extra fields appended.

Should I avoid upgrading Anthropic models?

Benchmark your actual tool-calling reliability on your specific schemas, not just published benchmarks. The upgrade reflex can backfire. Test before deploying and use strict mode for production agent workloads.

Better Models, Worse Tools: Why Newer AI Models Are Breaking Your Agent Tool Calls

Armin Ronacher just found something that should make every SaaS founder shipping AI agents stop and read carefully.

Ronacher — creator of Flask, and an engineer at Sentry — noticed something wrong while running his own agent harness, Pi, against Anthropic's latest models. His writeup is worth reading in full, but the core finding is worth understanding now. Opus 4.8 and Sonnet 5, the newest and supposedly most capable models in the lineup, were failing tool calls that older models like Opus 4.5 handled without issue. Not occasionally. Often enough to be a real production problem — he clocked failure rates around 20% in agentic sessions.

This isn't a minor quirk. If you're building a SaaS product on top of Claude's tool-calling API — and a lot of you are, whether it's a coding assistant, a support bot, or an internal automation layer — this bug can silently corrupt your agent's actions in production.

What Ronacher Actually Found

The failure mode is oddly specific. The model generates a tool call — say, an edit operation with old text and new text — and the core payload is correct. The file path is right. The strings to replace are right. But then the model appends extra fields that were never part of the schema: things like requireUnique, oldText2, newText2. Fields that don't exist in any tool definition, invented out of nowhere.

Because these keys aren't part of the schema, strict validation rejects the whole call. The agent either errors out or, worse, silently drops the edit and moves on as if nothing happened.

Here's the part that makes this genuinely interesting instead of just a bug report: Ronacher traced it to how tool calls actually work under the hood. They aren't structured API objects the way you might assume. They're in-band text signaling — the model emits special markers (Anthropic's antml tags) inline in its output stream, and the harness parses that text to reconstruct the tool call. The model isn't filling out a form. It's writing text that looks like a form, and sometimes it keeps writing past the end of the form.

Why Newer Models Got Worse, Not Better

Ronacher's hypothesis is that this comes down to what Anthropic trains against. Claude Code — Anthropic's own coding agent client — is remarkably forgiving. It has retry paths, parameter aliasing, and unknown-key filtering baked in. If a model appends garbage fields to a tool call, Claude Code often silently absorbs the error and the task still completes.

That's fine for Claude Code's own user experience. It's a problem for training. If Anthropic post-trains heavily on transcripts and reward signals generated inside Claude Code, a malformed tool call that still succeeds looks identical to a clean one from the reward model's perspective. There's no gradient telling the model "don't invent fields" — because inventing fields didn't cost it anything.

The result is a model that has adapted specifically to Claude Code's tool shapes and its tolerance for sloppiness. Opus 4.5, trained with less of this specific reinforcement, generalized better across different harnesses. Opus 4.8 and Sonnet 5 have a stronger prior — and when that prior meets a schema shaped differently than Claude Code's, the model fights you harder instead of adapting.

Claude Code's Flat Schema vs. Everyone Else's

Part of the mismatch is structural. Claude Code's own edit tool uses a flat argument shape: file_path, old_string, new_string, replace_all — four keys, no nesting. Ronacher's Pi harness uses a nested edits[] array, where each edit is its own object inside a list. Structurally reasonable, arguably cleaner, but not what the model has been most heavily drilled on.

Aspect	Claude Code (native)	Pi harness (Ronacher's)
Argument shape	Flat: `file_path`, `old_string`, `new_string`	Nested: `edits[]` array of objects
Malformed call handling	Silently absorbed via retries, aliasing, key filtering	Rejected by strict schema validation
Effect of extra invented fields	Hidden from the user, task still completes	Tool call fails outright
Model's effective training exposure	Heavy — this is Anthropic's own agent	Low — third-party, non-canonical shape

The mismatch is: Anthropic's model treats one specific tool shape as the default, and everything else — even reasonable, well-designed alternatives — becomes a place where the model's confidence outruns its accuracy.

What Actually Fixed It

Ronacher tested a few mitigations. Stripping the model's extended thinking blocks out of context before the tool call roughly halved the failure rate — suggesting the model was talking itself into inventing fields during its own reasoning trace. Switching to strict tool invocation mode, which constrains what the model is allowed to emit at the token level, eliminated the failures entirely.

That's a meaningful signal: the underlying capability to produce a correct call is there. The model isn't confused about the task. It's a decoding-time problem, and decoding-time problems have decoding-time fixes.

What SaaS Builders Should Actually Do

If you're running Claude models behind an agent product, this deserves action, not just awareness.

Benchmark your own schemas, not published leaderboards. Model benchmarks test general capability. They don't test whether your specific tool definitions, with your specific field names and nesting, survive contact with the model's trained priors. Run your own tool-calling reliability tests before you flip a version number in production.

Turn on strict tool invocation mode. If Anthropic's API exposes constrained decoding or strict schema enforcement for tool use, use it for anything that touches production state. The cost is some flexibility; the payoff is a model that literally cannot emit a field outside your schema.

Design flat schemas where you can. Nested arrays are cleaner engineering, but if the model's priors favor flat argument shapes, fighting that preference costs you reliability. Weigh that tradeoff deliberately instead of discovering it in an incident report.

Build forgiving retry logic, but log everything. Claude Code's silent-absorption approach isn't wrong for a chat UI — it's wrong for opacity. Absorb the extra fields if you must, but log every instance so you know how often it's happening and to what.

The bigger lesson: "upgrade to the newest model" is a reflex, not a strategy. Ronacher's writeup is a reminder that newer isn't automatically better for every axis that matters to your product — sometimes it's better at reasoning and measurably worse at the boring, load-bearing plumbing that keeps your agent from corrupting a customer's data. Test before you ship the upgrade.

Better Models, Worse Tools: Why Newer AI Models Are Breaking Your Agent Tool Calls

What Ronacher Actually Found

Why Newer Models Got Worse, Not Better

Claude Code's Flat Schema vs. Everyone Else's

What Actually Fixed It

What SaaS Builders Should Actually Do

Get your SaaS in front of founders

Founder resources

Related articles

59% of AI Agent Tokens Go to Code Review, Not Code Generation — New Research

Oak VCS: The Git Alternative Built for AI Agents (And Why It Changes How You Think About Version Control)

AWS Lambda MicroVMs: Why Your AI Coding Tool Needs Stateful Sandboxes