
It’s becoming clear that AI coding agents will do much more than draft PRs and ship webapps to localhost. They will be able to refactor your authentication stack, migrate a database schema, or deploy a microservice.
Models are quickly getting there: even smaller models are now capable of multi-step planning and of taking actions across code, infrastructure, and live systems.
But they don’t get it right every time. Agents iterate, explore dead ends, and sometimes make mistakes. That’s fine in a sandbox. It’s terrifying in production.
For code, we already have guardrails: version-controlled git repos, CI/CD pipelines, unit tests. But what happens when agents need to interact with real data? With a deployed product your customers depend on? With sprawling cloud infrastructure?
Git and offline evals won’t cut it. We need entire isolated worlds for agents to operate: sophisticated sandbox environments that mirror complex production systems, where mistakes can be made freely.
We think this represents one of the most exciting infrastructure opportunities as agentic development matures.
Human developers have used sandboxes for decades: isolated environments where code can be run, broken, and thrown away without consequence.
But agents work differently from humans, so their sandboxes will need to be different too. We believe agent sandboxes must also be:
1. Stateful: Existing DevOps tools like Terraform or Kubernetes are great at spinning up the blueprint of an application: new servers, empty or mocked databases, and so on. But for an agent tasked with fixing a multi-step checkout bug or migrating a legacy database, a blueprint is useless. It needs to interact with a real, stateful environment: manipulating real records, messages in a Kafka queue, and the like. Current pipelines cannot generate massive, interconnected, stateful environments; we need new infrastructure that makes state as branchable and deployable as code.
2. Scalable: Human developers will work for hours, then take a 15-minute coffee break while their preview environment builds. Agents may want a new environment every minute as they iterate on a solution, or may spin up many environments at once to test candidate solutions in parallel. Agent sandboxes need to be instantly available, lightweight, and highly concurrent.
3. Child-proof: Traditional staging environments are not fully isolated; they rely on human developers knowing not to run a script that sends 10,000 test emails to real customers or hits the Stripe API 10,000 times. We cannot rely on agents to have this intuition. A “child-proof” agent sandbox will need to intercept outbound API calls and synthesize realistic responses, containing the agent’s blast radius so it truly cannot impact the real world.
4. Machine-readable: When a human deploys code, they might click around a Vercel preview UI, look at line charts in Datadog, and read through logs. Agents will want much more structured telemetry in their environment: in addition to rendering the UI, they might receive a structured event stream, descriptive state diffs, and detailed network payloads.
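The “child-proof” and “machine-readable” properties are concrete enough to sketch. Below is a minimal, illustrative Python egress layer (all class and field names are hypothetical, not a real library): outbound calls are never forwarded to the real world; known third-party endpoints get canned synthetic responses, everything else is blocked, and every attempt lands in a structured audit log the agent can read.

```python
# Sketch of a "child-proof" egress interceptor for an agent sandbox.
# Hypothetical names throughout; a real system would sit at the network layer.
from dataclasses import dataclass, field

@dataclass
class SyntheticResponse:
    status: int
    body: dict

@dataclass
class EgressInterceptor:
    # Hosts the sandbox may reach (internal mocks only).
    allowed_hosts: set = field(default_factory=set)
    # Canned responses for known third-party APIs, keyed by (host, path).
    stubs: dict = field(default_factory=dict)
    # Structured, machine-readable record of every attempted call.
    audit_log: list = field(default_factory=list)

    def request(self, method: str, host: str, path: str, payload=None):
        self.audit_log.append({"method": method, "host": host, "path": path})
        if host in self.allowed_hosts:
            # In a real system, forward to the sandbox-internal service here.
            return SyntheticResponse(200, {"proxied": True})
        # Call aimed at the real world: never forward, synthesize instead.
        stub = self.stubs.get((host, path))
        if stub is not None:
            return stub
        return SyntheticResponse(503, {"error": f"egress to {host} blocked"})

# The agent can "charge a card" 10,000 times with zero real-world impact.
sandbox = EgressInterceptor(
    stubs={
        ("api.stripe.com", "/v1/charges"): SyntheticResponse(
            200, {"id": "ch_fake", "paid": True}
        )
    }
)
resp = sandbox.request("POST", "api.stripe.com", "/v1/charges", {"amount": 1000})
assert resp.body["paid"] is True
assert len(sandbox.audit_log) == 1
```

The audit log doubles as the structured event stream described in point 4: instead of parsing dashboards, the agent inspects exactly what it attempted and what the sandbox returned.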
AI won’t just make software engineers faster. It will collapse the boundaries between engineering, product, and design. We’re seeing early signals of this already: with Claude Code, some designers can inspect a front end and make simple changes to the codebase themselves.
But rendering a UI in a local environment (or looking at a hosted Vercel preview) isn’t enough. In the future, we expect agents and humans will build together on deployed full-stack products: collaborating on code, interfaces, and data in a coherent, shared development environment.
This will not just democratize and accelerate product development, but also enable new types of learning: agents can take signals from real usage and associate them directly with product components, observing the impact of changes and using those observations to guide further iterations (potentially even simulating user behavior). This will shift product evolution from a slow, build-wait-measure cycle to an ongoing loop of learning and refinement.
Working with data is fundamentally harder than working with code. Code is text: it’s easy to branch it, diff it, and roll it back if needed. With data, scale alone makes duplication impractical or impossible. Reprocessing is slow and expensive. And mistakes can be devastating: if an agent drops a critical table, you can’t just revert it.
Even experienced data engineers struggle with the operational complexity: understanding what’s current, running backfills, managing migrations. It’s no surprise that production data is one of the last things any enterprise would let an AI agent touch.
Agents will need infrastructure that lets them work with real data in an environment that feels like production but doesn’t impact live systems, one where changes can be tested, version-controlled, and reversed like code.
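To make “version-controlled and reversed like code” concrete, here is a toy Python sketch of branchable data state (a hypothetical API, not a real product): an agent forks the dataset, mutates its branch, diffs it against main, and either merges or throws the branch away. Real systems would implement this copy-on-write at the storage-engine layer rather than copying dicts.

```python
# Toy sketch of git-like semantics over a key-value dataset:
# branch, diff, merge, and drop. Illustrative only.
class BranchableStore:
    def __init__(self, data=None):
        self.branches = {"main": dict(data or {})}

    def branch(self, src: str, name: str):
        # Eager copy here; a real engine would share pages copy-on-write.
        self.branches[name] = dict(self.branches[src])

    def put(self, branch: str, key, value):
        self.branches[branch][key] = value

    def diff(self, a: str, b: str):
        # Keys whose values differ between the two branches.
        left, right = self.branches[a], self.branches[b]
        keys = set(left) | set(right)
        return {k: (left.get(k), right.get(k))
                for k in keys if left.get(k) != right.get(k)}

    def merge(self, src: str, dst: str):
        self.branches[dst].update(self.branches[src])

    def drop(self, name: str):
        # A failed experiment disappears without touching main.
        del self.branches[name]

store = BranchableStore({"user:1": {"plan": "free"}})
store.branch("main", "agent/migration-test")
store.put("agent/migration-test", "user:1", {"plan": "pro"})
assert store.diff("main", "agent/migration-test") == {
    "user:1": ({"plan": "free"}, {"plan": "pro"})
}
store.drop("agent/migration-test")  # dead end: discard the whole world
assert store.branches["main"]["user:1"] == {"plan": "free"}
```

The point of the sketch is the workflow, not the implementation: an agent that can cheaply fork, diff, and discard state can iterate on real data with the same safety net git gives it on code.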
Enterprises operate with sprawling webs of infrastructure and external tools/services: systems of record, ticketing systems, identity providers, cloud consoles, communication platforms, each with its own APIs, permissions, and data models.
For agents to explore, learn, and improve in these environments, they need sandboxes that replicate the full complexity. This means spinning up real infrastructure, connecting to real or mocked services, and even potentially rendering tool UIs.
Two of our AI security portfolio companies, Dropzone and Maze, have already invested heavily in this capability: their engineering teams can programmatically spin up hundreds of realistic enterprise environments (with first- and third-party platforms) on demand to train, test, and iterate. This isn’t a nice-to-have; it’s core infrastructure for building and testing reliable agents.
We see analogous opportunities across enterprise domains. Whether it’s operational/supply chain platforms, IT/SRE automation, or sales/GTM teams, any domain where agents need to interact with complex, multi-tool environments will need this kind of simulation infrastructure.
—
The best agents won’t just be smarter. They’ll have better worlds to practice in. The companies building these simulation layers will become foundational infrastructure for the agentic era, just as CI/CD and preview deploys became foundational for human-driven development.
If you’re building agent environments, simulation infrastructure, or sandboxing tools, I’d love to hear from you: at@theoryvc.com.
Thanks to Cris Dobbins for feedback on this piece.