Skip to main content
Back to Blog
AI agentstool callingAnthropicLLM reliabilityClaude APIAI engineering

Better Models, Worse Tools: Why Newer AI Models Are Breaking Your Agent Tool Calls

ghosty
Founder, SaaSCity
Better Models, Worse Tools: Why Newer AI Models Are Breaking Your Agent Tool Calls

Armin Ronacher just found something that should make every SaaS founder shipping AI agents stop and read carefully.

Ronacher — creator of Flask, and an engineer at Sentry — noticed something wrong while running his own agent harness, Pi, against Anthropic's latest models. His writeup is worth reading in full, but the core finding is worth understanding now. Opus 4.8 and Sonnet 5, the newest and supposedly most capable models in the lineup, were failing tool calls that older models like Opus 4.5 handled without issue. Not occasionally. Often enough to be a real production problem — he clocked failure rates around 20% in agentic sessions.

This isn't a minor quirk. If you're building a SaaS product on top of Claude's tool-calling API — and a lot of you are, whether it's a coding assistant, a support bot, or an internal automation layer — this bug can silently corrupt your agent's actions in production.

What Ronacher Actually Found

The failure mode is oddly specific. The model generates a tool call — say, an edit operation with old text and new text — and the core payload is correct. The file path is right. The strings to replace are right. But then the model appends extra fields that were never part of the schema: things like requireUnique, oldText2, newText2. Fields that don't exist in any tool definition, invented out of nowhere.

Because these keys aren't part of the schema, strict validation rejects the whole call. The agent either errors out or, worse, silently drops the edit and moves on as if nothing happened.

Here's the part that makes this genuinely interesting instead of just a bug report: Ronacher traced it to how tool calls actually work under the hood. They aren't structured API objects the way you might assume. They're in-band text signaling — the model emits special markers (Anthropic's antml tags) inline in its output stream, and the harness parses that text to reconstruct the tool call. The model isn't filling out a form. It's writing text that looks like a form, and sometimes it keeps writing past the end of the form.

Why Newer Models Got Worse, Not Better

Ronacher's hypothesis is that this comes down to what Anthropic trains against. Claude Code — Anthropic's own coding agent client — is remarkably forgiving. It has retry paths, parameter aliasing, and unknown-key filtering baked in. If a model appends garbage fields to a tool call, Claude Code often silently absorbs the error and the task still completes.

That's fine for Claude Code's own user experience. It's a problem for training. If Anthropic post-trains heavily on transcripts and reward signals generated inside Claude Code, a malformed tool call that still succeeds looks identical to a clean one from the reward model's perspective. There's no gradient telling the model "don't invent fields" — because inventing fields didn't cost it anything.

The result is a model that has adapted specifically to Claude Code's tool shapes and its tolerance for sloppiness. Opus 4.5, trained with less of this specific reinforcement, generalized better across different harnesses. Opus 4.8 and Sonnet 5 have a stronger prior — and when that prior meets a schema shaped differently than Claude Code's, the model fights you harder instead of adapting.

Claude Code's Flat Schema vs. Everyone Else's

Part of the mismatch is structural. Claude Code's own edit tool uses a flat argument shape: file_path, old_string, new_string, replace_all — four keys, no nesting. Ronacher's Pi harness uses a nested edits[] array, where each edit is its own object inside a list. Structurally reasonable, arguably cleaner, but not what the model has been most heavily drilled on.

AspectClaude Code (native)Pi harness (Ronacher's)
Argument shapeFlat: file_path, old_string, new_stringNested: edits[] array of objects
Malformed call handlingSilently absorbed via retries, aliasing, key filteringRejected by strict schema validation
Effect of extra invented fieldsHidden from the user, task still completesTool call fails outright
Model's effective training exposureHeavy — this is Anthropic's own agentLow — third-party, non-canonical shape

The mismatch is: Anthropic's model treats one specific tool shape as the default, and everything else — even reasonable, well-designed alternatives — becomes a place where the model's confidence outruns its accuracy.

What Actually Fixed It

Ronacher tested a few mitigations. Stripping the model's extended thinking blocks out of context before the tool call roughly halved the failure rate — suggesting the model was talking itself into inventing fields during its own reasoning trace. Switching to strict tool invocation mode, which constrains what the model is allowed to emit at the token level, eliminated the failures entirely.

That's a meaningful signal: the underlying capability to produce a correct call is there. The model isn't confused about the task. It's a decoding-time problem, and decoding-time problems have decoding-time fixes.

What SaaS Builders Should Actually Do

If you're running Claude models behind an agent product, this deserves action, not just awareness.

Benchmark your own schemas, not published leaderboards. Model benchmarks test general capability. They don't test whether your specific tool definitions, with your specific field names and nesting, survive contact with the model's trained priors. Run your own tool-calling reliability tests before you flip a version number in production.

Turn on strict tool invocation mode. If Anthropic's API exposes constrained decoding or strict schema enforcement for tool use, use it for anything that touches production state. The cost is some flexibility; the payoff is a model that literally cannot emit a field outside your schema.

Design flat schemas where you can. Nested arrays are cleaner engineering, but if the model's priors favor flat argument shapes, fighting that preference costs you reliability. Weigh that tradeoff deliberately instead of discovering it in an incident report.

Build forgiving retry logic, but log everything. Claude Code's silent-absorption approach isn't wrong for a chat UI — it's wrong for opacity. Absorb the extra fields if you must, but log every instance so you know how often it's happening and to what.

The bigger lesson: "upgrade to the newest model" is a reflex, not a strategy. Ronacher's writeup is a reminder that newer isn't automatically better for every axis that matters to your product — sometimes it's better at reasoning and measurably worse at the boring, load-bearing plumbing that keeps your agent from corrupting a customer's data. Test before you ship the upgrade.

Get your SaaS in front of founders

List your product on the SaaSCity live city map — a permanent listing, real discovery, and a backlink from a high-DR directory. Free to start; upgrade for a dofollow link and a building on the map.