
As AI systems have become more complex over the past 3 years, their orchestration has too. More tools. More sophisticated retrieval and context engineering pipelines. And more routing, prompt engineering, and workflow logic to help the models figure out what capabilities to use.
But a different approach is quickly gaining traction among leading AI application builders: Scrap nearly all of this and replace it with a coding agent that has a virtual computer. Give it a filesystem, a code interpreter, and a small set of basic tools, then let it figure out how to manage itself.
Companies including Anthropic, Cursor, Cloudflare, and Vercel have all shared variants of this approach, with huge impacts: equivalent or improved performance while using an order of magnitude fewer tokens (meaning a huge reduction in cost and latency). In our fall hackathon where 100 participants built agents for complex data analysis, we also saw it emerge as the dominant approach.
This shift, from agents defined by systems carefully designed by humans, to agents dynamically constructing their own, is one of the most important architectural changes in how we build AI systems. And it has major implications for the next generation of AI infrastructure.
So is a coding agent all you need?
The dominant architecture for agentic systems over the past two years has been built on tool calling and programmatic orchestration.
Teams define a set of tools (search_database, send_email, process_file, etc.), each described through MCP in a JSON schema so an LLM knows how to select and use them.
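For concreteness, a tool definition in this style is a name, a description, and a JSON schema for the arguments, all of which gets pushed into the model's context. The snippet below is a minimal sketch of a hypothetical search_database tool written as a plain Python dict; the field names follow the MCP convention, but the tool and its parameters are illustrative.

```python
# A hypothetical tool definition in the JSON-schema style used by MCP and
# function-calling APIs. Descriptions like this are loaded into the model's
# context so it can decide when and how to call the tool.
search_database_tool = {
    "name": "search_database",
    "description": "Search the orders database and return matching rows.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Full-text search query."},
            "limit": {"type": "integer", "description": "Maximum rows to return.", "default": 20},
        },
        "required": ["query"],
    },
}
```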
Agent behavior is controlled with external frameworks like LangChain and the OpenAI Agents SDK, which use a combination of programmatic logic and human-written prompts to instruct the model. Teams also build bespoke systems to handle context management: retrieving documents and data, delivering tool descriptions, and storing memories.
This approach has worked remarkably well, but it is becoming evident it won’t scale to the next generation of applications.
The clearest limitation is around context. In the existing model, all information is pushed to the model through its context window. It’s hard to figure out precisely the information the model will need to achieve its outcome, and the context window can quickly get filled with unnecessary tool descriptions, lengthy tool responses, or full documents retrieved in search. This adds cost and latency, and can even degrade performance from context bloat.
These systems also become very hard to maintain and optimize. Context engineering, tool design, and agent orchestration are all closely coupled, so changes are hard to implement and evaluate, and any regression can break the whole workflow.
The result is agents whose performance (with a given base model) stalls: it becomes harder to make improvements as context length, codebase complexity, and maintenance burden all increase.
The emerging alternative is surprisingly simple: give the agent its own compute environment that it can manage autonomously.
In this architecture, data, prompts, and tool descriptions live in a sandboxed filesystem. The agent gets simple instructions to start, then writes code to explore the directory structure to learn more. It uses code to read and manipulate data. And it can use code to call tools (or build its own).
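As a rough illustration of the pattern (not any particular product's API), the sketch below gives a model a scratch directory and a single capability, running Python inside it, and lets the model decide what to list, read, and execute. `call_llm` is a hypothetical stand-in for whatever model client you use, assumed to return a JSON string.

```python
import json
import subprocess
from pathlib import Path

WORKDIR = Path("/tmp/agent_workspace")  # the sandboxed filesystem the agent owns
WORKDIR.mkdir(parents=True, exist_ok=True)

def run_python(code: str) -> str:
    """Execute agent-written code inside the workspace and return its output."""
    result = subprocess.run(
        ["python", "-c", code], cwd=WORKDIR, capture_output=True, text=True, timeout=60
    )
    return result.stdout + result.stderr

def agent_loop(task: str, call_llm) -> str:
    # The starting prompt is deliberately small: a task plus an invitation to explore.
    messages = [{
        "role": "user",
        "content": f"Task: {task}\nYou have a workspace at {WORKDIR}. "
                   'Reply with JSON: {"code": "..."} to run Python there, '
                   'or {"answer": "..."} when you are done.',
    }]
    while True:
        reply = json.loads(call_llm(messages))   # hypothetical model client
        if "answer" in reply:
            return reply["answer"]
        output = run_python(reply["code"])       # the agent explores and acts via code
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "user", "content": f"Output:\n{output[:4000]}"})
```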
There are a few huge benefits to this approach:
LLMs are best at writing code: LLMs are amazing at writing code, and they are improving faster in this domain than anywhere else thanks to concentrated focus, investment, and verifiability. For many tasks, it's easier for an LLM to write code that solves the problem than to solve it directly on its own. Coding also gives agents the ability to handle novelty: if a predefined tool fails, the agent is stuck, but a coding agent can write a script, hit an error, debug it, and try again.
They’re also better at pulling in their own context: Context engineering is engineering, and it’s becoming clear that LLMs can find the information they need for a task much more efficiently than humans can supply it. Instead of loading all 100 tool descriptions into context, the agent can find and open just the one it needs. Instead of loading a full CSV as plaintext, it can read just the header or the processed results (see the sketch after this list). Early examples show a dramatic reduction in context length: Anthropic illustrated a >90% potential decrease in total tokens used, and in real-world applications Vercel saw a 37% decrease and Cursor saw 47%.
Files are a great place to store information: A filesystem gives the agent durable memory and a scratchpad for long-horizon reasoning. Agents that write and re-read their own plans perform materially better on multi-step tasks. While other space-saving techniques are lossy (e.g. asking an LLM to summarize the preceding conversation), a filesystem can store near-unlimited raw source material that the agent can refer back to if needed.
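To make the last two benefits concrete, here is the kind of code an agent might write for itself inside its workspace: peek at a CSV's header instead of pasting the file into context, open only the tool description it needs, and keep a plan file it can re-read later. The paths and filenames are illustrative.

```python
import csv
from pathlib import Path

workspace = Path("/tmp/agent_workspace")  # illustrative sandboxed filesystem

# 1. Read only the header of a large CSV instead of loading it into context.
with open(workspace / "data" / "orders.csv", newline="") as f:
    header = next(csv.reader(f))
print("columns:", header)

# 2. Open just the one tool description relevant to the current step,
#    rather than carrying every tool description in the context window.
tool_doc = (workspace / "tools" / "search_database.md").read_text()

# 3. Use a file as durable, lossless memory: write the plan now, re-read it later.
notes = workspace / "notes.md"
notes.write_text("Plan:\n1. Inspect orders.csv schema\n2. Flag anomalous orders\n")
plan_so_far = notes.read_text()
```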
The result for AI agent builders is game-changing: equivalent or better performance, an order of magnitude fewer tokens, and far less orchestration code to build and maintain.
What’s especially exciting is that these skills emerged naturally out of general pre- and post-training for code generation. Foundation model companies can (and are starting to) train models specifically for this particular flavor of coding: agents that write code to make and use tools, retrieve their own context, and store information in files. We expect these capabilities will continue to improve rapidly.
While coding agents are powerful, they are not sufficient to power most agentic applications on their own. We see two main limitations:
They require lots of domain knowledge: Almost any function, whether finance, operations, legal, or security, needs a huge amount of context and domain knowledge to do the job well (see our earlier articles on the Business Context Layer and Context Platforms). This knowledge isn’t just facts that can be retrieved from data: it’s procedural, normative instruction in how to think like an expert. As LLMs improve, they’ll naturally build better expertise at engineering-like tasks, but they are unlikely to learn the nuance of how to investigate a niche security vulnerability or follow a complex transaction through the supply chain. Even in coding-adjacent domains like data analysis, our hackathon showed that agents still require human expert guidance to figure out how to analyze the kind of complex, messy data that exists in real-world businesses.
They need reliability: In many enterprises, the process by which a result is derived (and the data and tools used along the way) is as important as the result itself. A coding agent could generate new logic or tooling for each request; even if it had a similar overall success rate, that variability and non-reproducibility could cause problems down the road. If there are heuristics an agent should always follow (“always start with step A” or “if you use tool X, you must check its results with tool Y”), those should be specified explicitly, as sketched below.
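One lightweight way to make such heuristics explicit, sketched here under the assumption that the agent's actions are logged as an ordered list of tool names, is to validate the trace against hard-coded rules before accepting a result.

```python
def validate_trace(tool_calls: list[str]) -> list[str]:
    """Check an agent's tool-call trace against explicit, non-negotiable heuristics.

    Returns a list of violations; an empty list means the trace is acceptable.
    The rules mirror the examples above: "always start with step A" and
    "if you use tool X, you must check its results with tool Y".
    """
    violations = []
    if not tool_calls or tool_calls[0] != "step_a":
        violations.append("Trace must start with step_a.")
    if "tool_x" in tool_calls:
        x_index = tool_calls.index("tool_x")
        if "tool_y" not in tool_calls[x_index + 1:]:
            violations.append("tool_x was used without a follow-up tool_y check.")
    return violations

# Example: this trace violates both rules.
print(validate_trace(["search_database", "tool_x"]))
```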
The benefits of coding agents are real, and almost every AI agent company we work with or talk to is moving to this model. But because of their limitations, coding agents do not spell the death of agent systems as we know them; instead, the shift is mostly a change in how those systems are architected. Here’s what we predict:
Companies will rearchitect from complex orchestration and retrieval infrastructure to coding agents with access to an execution environment and filesystem
Today, a huge amount of engineering effort goes into complex orchestration logic and context engineering systems. Teams use frameworks like LangChain and the OpenAI Agents SDK to manage multi-step workflows, select tools, and maintain state. Alongside these, they've built sophisticated retrieval systems: conditional logic, reranking pipelines, etc., all designed to get the right information to the model at the right time.
We think much of this complexity will collapse into something simpler: sandboxed execution environments like E2B or Modal, with a well-specified set of tools and a filesystem containing all the relevant data. Instead of orchestration code deciding what the agent sees, the agent largely orchestrates itself and uses tools/code/data as needed.
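In this world, most of the remaining "orchestration" is provisioning: seed the sandbox's filesystem with data, prompts, and tool descriptions, then hand control to the agent. The sketch below does that seeding against a local directory; in production you would use your sandbox provider's file APIs (E2B, Modal, etc.), whose exact calls we leave to their docs. All paths and contents here are illustrative.

```python
import shutil
from pathlib import Path

def provision_workspace(root: Path) -> Path:
    """Seed a fresh workspace with everything the agent may need, then get out of the way."""
    for sub in ("data", "tools", "prompts"):
        (root / sub).mkdir(parents=True, exist_ok=True)

    # Domain data the agent can inspect with code instead of receiving in-context.
    shutil.copy("exports/orders.csv", root / "data" / "orders.csv")

    # Tool descriptions and expert instructions live as files, not as schemas
    # hardcoded into orchestration logic.
    (root / "tools" / "search_database.md").write_text(
        "search_database(query, limit): full-text search over the orders DB.\n"
    )
    (root / "prompts" / "instructions.md").write_text(
        "Always read the headers of data/*.csv before running any analysis.\n"
    )
    return root

workspace = provision_workspace(Path("/tmp/agent_workspace"))
# From here, an agent loop like the one sketched earlier takes over inside `workspace`.
```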
Human expert-designed prompts and tools will remain a critical source of differentiation, but will (mostly) reorganize into files.
This shift doesn't eliminate the importance of prompt engineering, tool design, or context curation; it just changes where that work lives and how agents consume it.
Domain knowledge, procedural instructions, and expert heuristics will still be carefully crafted and will remain a main source of differentiation. The change will be more organizational: instead of being injected directly into LLM requests, most of this context will live in files that agents open as needed.
Excellent tools will remain critical, especially for interfacing with complex external systems: ERPs, CRMs, SIEMs, etc. Here, the pattern shifts in a similar way: tool descriptions and instructions become files the agent reads, not schemas hardcoded into orchestration logic.
Agents will gain more flexibility in deciding how and when to pull these resources into their context windows, but their decisions will be bounded by guardrails or explicit instructions (you guessed it, written in a file) where desired behaviors are already known or guarantees of specific procedures are required; a sketch follows below.
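One way to picture this, as a sketch rather than a prescription: the guardrails themselves become a file in the workspace that the agent is instructed to read, while the machine-checkable rules are also enforced deterministically by the harness (for example with a check like validate_trace above). Filenames and contents are illustrative.

```python
from pathlib import Path

workspace = Path("/tmp/agent_workspace")
(workspace / "prompts").mkdir(parents=True, exist_ok=True)

# Guardrails live in the workspace like any other context file, so the agent can
# open them on demand; the hard rules are additionally enforced outside the model,
# so compliance does not depend on the model remembering them.
(workspace / "prompts" / "guardrails.md").write_text(
    "- Always start with step_a.\n"
    "- If you use tool_x, check its results with tool_y.\n"
    "- Never read or write outside the workspace.\n"
)
```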
Theory portfolio company Maze is building agents that evaluate massive numbers of potential security vulnerabilities, investigating each one like an experienced analyst to determine which are real issues and how to fix them.
The architecture behind these agents has gone through multiple iterations.
Thanks to Santiago Castiñeira at Maze, Wiley & Arnav at Doss, Phil Cerles, and Dan Shiebler for our discussions about coding agents. If you're interested in learning more about our research, or building agent architectures like this yourself, I'd love to hear from you: at@theoryvc.com.