Degrading Changes: What Do You Do When A Better Model is Worse?

May 5, 2026

When Anthropic announced Claude Opus 4.7, they said it was the best coding model they'd ever released.

But the developer response was swift and frustrated: 4.7 seemed much worse than 4.6 in real coding workflows and agentic applications.

4.7 is a smarter, more capable model. But its behavior has changed, and existing prompts, tools, and architectures are no longer the right fit.

It's not technically a breaking change; the API still works the same, and the model can still call the same tools or generate the same output schemas. With AI models, we have a new problem: the degrading change, where a new model performs worse inside existing scaffolding.

With model releases coming faster than ever, how should engineering teams manage degrading changes?

The contract used to be the API. Now it's the behavior.

Software vendors have spent decades building discipline around backwards compatibility. Stripe, AWS, and Linux all have careful versioning, deprecation timelines, and contracts. A breaking change is loud. You know it's coming and what to fix.

Foundation model APIs might look like normal APIs, but they aren't. The contract used to be the API spec; now it's the behavior. New models often have different:

  • Prompt sensitivity: How responsive the model is to instruction language
  • Tool use: When and how the model decides to call tools
  • Reasoning: How much it thinks, and how it approaches problems
  • Context handling: How it manages long histories and conflicts between older and newer information
  • Refusal patterns: What it does or doesn’t refuse to do, under what conditions
  • Judgment quality: The tone of its responses and the calls it makes on ambiguous decisions

These are hard to measure directly, and they interact. When a new model lands, the regression is often subtle. Users might feel it before evals catch it.
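
To make that concrete, here is a minimal sketch of a behavioral probe: run the same scenario against the old and new model and diff the one dimension you care about (tool use, in this case). The client, scenario, and tool names are illustrative assumptions, not any particular vendor's API.

```python
# Illustrative only: probe one scenario against two model versions and diff
# the tool-use behavior. call_model() is a stand-in for whatever client your
# stack already uses; the scenario and tool names are made up.

SCENARIO = {
    "system": "You are a coding agent. Use tools when they help.",
    "user": "Find where rate limiting is configured and raise the limit.",
    "tools": ["search_repo", "read_file", "edit_file", "run_tests"],
}

def call_model(model: str, scenario: dict) -> dict:
    """Stand-in for your provider client; returns the raw response,
    including any tool calls the model chose to make."""
    raise NotImplementedError("wire this to your existing model client")

def tool_calls(response: dict) -> list[str]:
    """Extract the ordered list of tool names the model decided to call."""
    return [call["name"] for call in response.get("tool_calls", [])]

def behavior_diff(old_model: str, new_model: str, scenario: dict) -> dict:
    old = tool_calls(call_model(old_model, scenario))
    new = tool_calls(call_model(new_model, scenario))
    return {
        "old": old,
        "new": new,
        "changed": old != new,                        # did the pattern shift?
        "dropped": [t for t in old if t not in new],  # tools no longer used
    }
```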

Sometimes the fix is straightforward: a few lines tweaked in a prompt get the new model back on track. Often it isn't: maybe the old system had dozens of specialized agents, while the new model works best as a single orchestrator running the whole workflow.

But it's never as simple as bumping a package version. This is the defining tax on agent companies.

The labs aren’t making promises

API users are not foundation model companies’ top priority. The labs’ main goal is to push the frontier of model capabilities, and they’ll use whatever scaffolding is needed to maximize performance. In fact, they’re incentivized to co-develop models with their own harnesses to drive differentiation for products like Claude Code and Cowork. 

The labs aren't operating like traditional software companies. They're operating like Google Search.

Entire businesses were built on Google's search algorithm, and were periodically destroyed or created by announced and unannounced updates that changed its behavior in qualitative ways. SEO became a discipline of constant adaptation.

That is the world agent companies now live in. Frontier labs aren’t offering behavioral backwards compatibility, and structurally they can't: their differentiation comes from pushing the frontier of the models and their coupled systems. This is a permanent dynamic, not a temporary growing pain.

How agent companies build adaptive engineering teams

A new, smarter model comes out but makes your agent perform worse. The fix might be simple or might require a full system redesign. The new model might be dramatically better but only after you've rebuilt around it. Or maybe you spend months refactoring, and the difference is negligible. What should you do? 

  • Invest in excellent evals: Subtle behavior changes make strong evals table stakes. Evals must track complex multi-step workflows to understand where and how models fail. They also have to measure qualitative end-to-end performance: if an eval passes but new users complain, your eval isn't capturing what matters.
  • Design modular, model-agnostic scaffolding: Prompts, tools, and agent boundaries should be swappable (a rough sketch follows this list). You can't encode current-model quirks deep in the architecture.
  • Treat scaffolding as disposable: On every major release, default to: "what would I build today if I were starting clean?" Sunk cost in scaffolding is the enemy of adaptation.
  • Watch the labs' first-party agents: Claude Code and Operator are co-developed with the model. Their architecture signals where the puck is going.
  • Build self-optimizing systems: The most sophisticated teams we see are starting to build systems that improve themselves: agents that read their own changelogs, interpret their own evals, and tune their own prompting or scaffolding. It's early, but we think this is where the world is moving.
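
As a rough sketch of what model-agnostic scaffolding can look like (the model names, prompts, and fields below are made up for illustration), one option is to keep everything that tends to shift between releases in a per-model profile, so adopting a new model means adding a profile rather than editing call sites:

```python
from dataclasses import dataclass, field

@dataclass
class ModelProfile:
    """Everything that tends to change between model releases, kept in one place."""
    model: str
    system_prompt: str
    tools: list[str]
    orchestration: str          # e.g. "multi-agent" vs. "single-orchestrator"
    params: dict = field(default_factory=dict)

PROFILES = {
    "old-model": ModelProfile(
        model="old-model",
        system_prompt="You are one of several specialized coding agents...",
        tools=["search_repo", "read_file", "edit_file"],
        orchestration="multi-agent",
    ),
    "new-model": ModelProfile(
        model="new-model",
        system_prompt="You are a single orchestrator running the whole workflow...",
        tools=["search_repo", "read_file", "edit_file", "run_tests"],
        orchestration="single-orchestrator",
        params={"reasoning": "high"},
    ),
}

def build_agent(model_name: str) -> ModelProfile:
    """Call sites depend only on the profile, never on a specific model's quirks."""
    return PROFILES[model_name]
```

Swapping models then becomes a config change plus an eval run, and throwing a profile away is cheap when a new release calls for a different architecture.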

The common thread is adaptation over optimization. The best AI agent teams will separate from the pack because their systems adapt to and benefit from each model release; everyone else will spend the cycle just scrambling to catch up.

What should the labs do?

Even if it's not their top priority, the labs could do better at supporting API customers through new model releases.

Today, guidance comes from ad-hoc tweets and developer word of mouth. Two things would help. 

  • First, structured release notes on behavioral changes and best practices for prompting and system design: official, concrete documentation of how the new model differs from the previous one, with prompt-level examples and named anti-patterns.
  • Second, treating migration as a first-class problem. Labs could ship tooling that rewrites existing prompts for new models, diagnostics that ingest your eval suite and flag regressions (sketched below), and migration assistants released alongside each model.
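
To illustrate the diagnostic idea (the harness hook and names below are assumptions, not any lab's actual tooling), flagging regressions can start as simply as replaying an existing eval suite against both models and reporting cases that flip from pass to fail:

```python
# Illustrative only: replay an existing eval suite against the old and new
# model and flag cases that used to pass but now fail. run_case() is a
# stand-in for however your eval harness scores a single case.

def run_case(model: str, case: dict) -> bool:
    """Return True if `model` passes this eval case; wire to your harness."""
    raise NotImplementedError

def regression_report(old_model: str, new_model: str, suite: list[dict]) -> list[str]:
    regressions = []
    for case in suite:
        if run_case(old_model, case) and not run_case(new_model, case):
            regressions.append(case["name"])  # passed before, fails now
    return regressions
```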

If you're building an agent company and thinking about these kinds of problems, I'd love to hear from you: at@theoryvc.com
