A blow to the head from a surfing incident left me couch-ridden over the holidays. For someone who struggles with taking time off, the forced sedentary lifestyle was a challenge. Naturally, my brother and I became prediction market fanatics. With news and sports now one of my primary information flows, the order book became our competitive arena.
We traded everything: from geopolitical tail risks like the probability of a U.S. invasion of Venezuela (a short position that paid out at the end of the year – just barely), to the NBA. We eventually landed on a strategy that 12x’d our account over 10 days, primarily by identifying mispriced underdogs where the order book skewed too heavily toward favorites. You could calculate the deviation between the Vegas lines and the prediction market books pretty easily. For instance, we took the Timberwolves against OKC at nearly 10:1 odds. If you’ve seen Anthony Edwards play recently, you know the man is a monster; it was a fundamental mispricing. Lastly, we looked at arbitrage across platforms, where some opportunities offered almost 10% “risk-free” profit if you ignore settlement risk.
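For the curious, here is a minimal sketch of the math we were eyeballing; all odds, prices, and venues below are hypothetical placeholders, not our actual positions.

```python
# Rough sketch of the kind of math we were doing by hand. All prices and
# odds below are hypothetical, not the actual lines we traded.

def implied_prob_from_moneyline(ml: int) -> float:
    """Convert an American moneyline (e.g. +650 or -300) to an implied probability."""
    if ml > 0:
        return 100 / (ml + 100)
    return -ml / (-ml + 100)

# Vegas has the underdog at +650; the prediction market sells YES at $0.09.
vegas_prob = implied_prob_from_moneyline(650)        # ~0.133
market_price = 0.09                                  # ~10:1 payout on a $1 contract
print(f"edge vs. Vegas: {vegas_prob - market_price:+.3f}")  # positive => underdog looks cheap

# Cross-platform arb: buy YES on venue A and NO on venue B for the same event.
yes_a, no_b = 0.45, 0.46                             # total outlay $0.91, guaranteed $1.00 payout
gross_return = 1.0 / (yes_a + no_b) - 1.0
print(f"gross arb return: {gross_return:.1%}  (before fees and settlement risk)")
```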
But as we watched the liquidity move and risk adjust in real-time, the cracks in the current infrastructure became obvious. While the wisdom of the crowds is a poetic concept, the current reality of prediction markets faces a structural ceiling.
The primary hurdle for prediction markets today is a lack of deep liquidity. Liquidity is the bridge between a theoretical price and a tradable one; it ensures that a high-conviction trade reveals new information to the market rather than simply breaking the order book. If a prediction market is thin (low liquidity), it's just a place where a few people are guessing. For it to become a real financial tool, it needs to be deep (high liquidity) so that big players can move millions of dollars without accidentally breaking the price. Liquidity and execution reliability are the crux of trading infrastructure. When market makers (MMs) can't update their quotes quickly across a deep order book, they get picked off. When MMs can't hedge or price accurately, liquidity vanishes.
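To make "breaking the price" concrete, here's a toy sketch (with made-up order books, not real market data) of how the same dollar order fills against a thin book versus a deep one.

```python
# Toy illustration of why depth matters: the same $50,000 buy order barely moves
# a deep book but blows through a thin one. Order books are hypothetical.

def fill(book, dollars):
    """Walk the asks (price, size-in-contracts) until the order is filled; return avg price."""
    spent, bought = 0.0, 0.0
    for price, size in book:
        take = min(size, (dollars - spent) / price)
        spent += take * price
        bought += take
        if spent >= dollars - 1e-9:
            break
    return spent / bought if bought else float("nan")

thin_book = [(0.52, 20_000), (0.56, 20_000), (0.62, 20_000), (0.70, 40_000)]
deep_book = [(0.52, 500_000), (0.53, 500_000), (0.54, 500_000)]

print(f"thin book avg fill:  {fill(thin_book, 50_000):.3f}")   # well above the 0.52 quote
print(f"deep book avg fill:  {fill(deep_book, 50_000):.3f}")   # ~0.52, price barely moves
```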
Today, most of the liquidity is concentrated in sports. This is because a Designated Contract Market (DCM) license (only 20 have been issued in total) allows people to trade sports contracts even in states where sports betting is illegal.
To move beyond sports betting, these markets must solve the Toxic Flow problem.
Prediction markets inherently encourage participants with inside information to place asymmetric bets. In a traditional equity market, insider trading is a crime; in a prediction market, it is a mechanism for price discovery. This creates a hostile environment for MMs. If an MM knows they are constantly being run over by insiders, they widen their spreads or leave the book entirely. This is known as the adverse selection problem.
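One way to see the mechanics: a stylized, back-of-the-envelope model (not how any real desk prices, and with purely illustrative numbers) of how the breakeven quote moves as the share of informed flow grows.

```python
# A stripped-down, Glosten-Milgrom-style intuition for why informed ("toxic") flow
# widens spreads. Numbers are illustrative only.

def breakeven_ask(p: float, alpha: float) -> float:
    """
    p:     market maker's fair probability that the event resolves YES
    alpha: fraction of incoming buyers who are informed (they only buy when YES is certain)
    The MM's expected PnL per contract sold at ask `a` is
        alpha * (a - 1) + (1 - alpha) * (a - p)
    Setting it to zero gives the minimum ask the MM can quote without losing money.
    """
    return p + alpha * (1 - p)

p = 0.30
for alpha in (0.0, 0.05, 0.20, 0.50):
    print(f"informed share {alpha:>4.0%} -> breakeven ask {breakeven_ask(p, alpha):.2f}")
# 0% informed flow lets the MM quote at fair value (0.30); at 50% informed flow
# the ask has to sit at 0.65, and the book looks "broken" to everyone else.
```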
In addition, one of the most significant barriers to professionalizing prediction markets is the lack of capital efficiency. Currently, most prediction markets are fully collateralized (1:1 margin). If you want to bet $100, you must put up $100.
In the world of professional trading, this is an anomaly. In traditional derivatives, like Perpetual Swaps (Perps) or Futures, traders utilize leverage without tying up their entire balance sheet. Without leverage, you cap the Return on Equity (ROE) that would make these markets worth prioritizing. And because there is no margining, unwinding a position is harder. In a fully collateralized market, you can't simply 'net' your way out of a losing position; you are forced to hold your high-cost, binary bet to the bitter end unless you find a new buyer willing to pay the full face value upfront. This capital lock-up is a non-starter for high-frequency market makers who need to recycle capital every millisecond.
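A quick back-of-the-envelope comparison makes the ROE point concrete; the edge, the size, and the 10% margin requirement below are all hypothetical.

```python
# Illustrative-only comparison of return on equity for the same market-making edge
# under full collateralization versus margined trading.

edge_per_contract = 0.01      # capture 1 cent of spread per $1 binary contract
contracts = 1_000_000
gross_pnl = edge_per_contract * contracts          # $10,000

full_collateral = 1.00 * contracts                 # post $1 per $1 of max payout
margined = 0.10 * contracts                        # hypothetical 10% initial margin

print(f"ROE fully collateralized: {gross_pnl / full_collateral:.1%}")   # 1.0%
print(f"ROE at 10x leverage:      {gross_pnl / margined:.1%}")          # 10.0%
# Same trade, same edge: the margined desk can recycle the other 90% of its
# balance sheet into other books, which is why 1:1 collateral caps ROE.
```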
If prediction markets are to become a multi-trillion dollar asset class, they must evolve into something resembling Lloyd’s of London – a marketplace where specialized groups compete to underwrite unique, high-stakes risks.
The value proposition isn't just knowing who will win an election; it’s the ability to represent and trade risk that was previously unquantifiable and specific to your financial situation. Imagine a world where corporations can hedge the specific events that actually threaten their businesses.
The price signal provided to the rest of the world might be a prediction market's greatest utility. By allowing corporations to hedge specific existential risks, prediction markets move from a retail speculator's tool to a fundamental piece of financial infrastructure.
There is, however, a final boss in this evolution: Counterparty Risk.
An old colleague of mine, who built the derivatives desk at a major bank, reminded me that in high-stakes finance, people sometimes prefer counterparty risk, provided they know whom to sue. There is a psychological and legal comfort in facing a regulated bank that has a history of government backstops (the Too Big to Fail insurance).
This leads to a fundamental question for the next generation of traders and corporate hedgers:
Would you rather face the settlement risk of a decentralized protocol like UMA or Polymarket, or would you rather face a Tier-1 bank?
While many people view code as law, the institutional world still prefers a throat to choke. For prediction markets to reach the scale of the global derivatives market, they must bridge this gap between the trustless efficiency of the blockchain and the legal recourse of traditional finance.
To fix this, prediction markets need to look more like the ISDA (International Swaps and Derivatives Association).
The ISDA Master Agreement is the Holy Grail of finance. It’s a standardized contract that dictates exactly what happens when things go wrong: defaults, bankruptcies, or settlement disputes. It removes the need for a middleman to decide the winner because the rules are pre-agreed upon globally.
We are starting to see a move away from centralized processors and human oracles deciding outcomes.
The future is Standardized Event Contracts. These are environments where the settlement source (e.g., a specific BLS data point or a federal court filing) is hard-coded into the contract structure. By removing the centralized processor, or the human oracle, we move toward a world where a prediction market contract is as legally and financially robust as a Japanese yen swap. One could argue that event contracts are overexposed to definitional edge cases, which makes them inherently difficult to compare to ISDA.
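To make the idea of a hard-coded settlement source concrete, here is a sketch of what a standardized event contract term sheet could look like; the field names and the CPI example are hypothetical, not any exchange's actual spec.

```python
# A sketch of a "standardized event contract" whose settlement source is part of the
# contract itself. Field names and the example instance are hypothetical.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class EventContract:
    underlying_event: str        # the question being traded
    settlement_source: str       # the single authoritative data source
    settlement_field: str        # the exact field/series that decides the outcome
    resolution_rule: str         # deterministic rule applied to that field
    expiration: date             # when the contract settles
    payout: float                # payout per contract on a YES resolution

cpi_contract = EventContract(
    underlying_event="U.S. CPI YoY for March prints at or above 3.0%",
    settlement_source="BLS CPI-U news release (bls.gov)",
    settlement_field="All items, 12-month percent change, not seasonally adjusted",
    resolution_rule="YES if the published value >= 3.0, otherwise NO",
    expiration=date(2026, 4, 15),
    payout=1.00,
)
```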
Definitional edge cases aside, standardization enables the institutionalization of finite outcomes: elections, referenda, regulatory decisions as tradeable financial instruments. Events like Brexit or the 2016 U.S. election created enormous economic consequences, yet there was no direct way for institutions to express, hedge, or transfer that risk. Standardized event contracts turn discrete outcomes into portfolio components, allowing investors to size exposure, hedge downstream impacts, and construct event-driven strategies with the same rigor applied to rates, FX, or credit. That capability unlocks a new layer of demand from asset managers, corporates, and risk desks who have historically had views on these events, but no clean way to trade them. This is how prediction markets become a trillion-dollar asset class.
The potential is there. The monster in the room isn't just Anthony Edwards; it's the untapped liquidity of corporate risk waiting for a robust enough venue to call home.
As AI systems have become more complex over the past 3 years, their orchestration has too. More tools. More sophisticated retrieval and context engineering pipelines. And more routing, prompt engineering, and workflow logic to help the models figure out what capabilities to use.
But a different approach is quickly gaining traction among leading AI application builders: Scrap nearly all of this and replace it with a coding agent that has a virtual computer. Give it a filesystem, a code interpreter, and a small set of basic tools, then let it figure out how to manage itself.
Companies including Anthropic, Cursor, Cloudflare, and Vercel have all shared variants of this approach, with huge impacts: equivalent or improved performance while using an order of magnitude fewer tokens (meaning a huge reduction in cost and latency). In our fall hackathon where 100 participants built agents for complex data analysis, we also saw it emerge as the dominant approach.
This shift, from agents defined by systems carefully designed by humans, to agents dynamically constructing their own, is one of the most important architectural changes in how we build AI systems. And it has major implications for the next generation of AI infrastructure.
So is a coding agent all you need?
The dominant architecture for agentic systems over the past two years has been built on tool calling and programmatic orchestration.
Teams define a set of tools (search_database, send_email, process_file, etc.), each described through MCP in a JSON schema so an LLM knows how to select and use them.
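For reference, a typical tool definition in this model is just a name, a description, and a JSON Schema for its inputs, all of which get pushed into the model's context; the search_database tool below is a hypothetical example.

```python
# What a tool definition looks like in today's architecture: a name, a description,
# and a JSON Schema for its inputs, which the model reads to decide when and how to
# call it. The tool itself is hypothetical.

search_database_tool = {
    "name": "search_database",
    "description": "Search the internal orders database and return matching rows.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text search query"},
            "limit": {"type": "integer", "description": "Max rows to return", "default": 20},
        },
        "required": ["query"],
    },
}
# Every one of these definitions is pushed into the model's context window on
# every request, whether or not the tool ends up being used.
```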
Agent behavior is controlled with external frameworks like LangChain and OpenAI SDK, which use a combination of logic and human prompting to instruct the model. Teams also build bespoke systems to handle context management: retrieving documents/data, delivering tool descriptions, and storing memories.
This approach has worked remarkably well, but it is becoming evident it won’t scale to the next generation of applications.
The clearest limitation is around context. In the existing model, all information is pushed to the model through its context window. It’s hard to know in advance precisely which information the model will need to achieve its goal, and the context window quickly fills with unnecessary tool descriptions, lengthy tool responses, or full documents retrieved in search. This adds cost and latency, and can even degrade performance through context bloat.
These systems also become very hard to maintain and optimize. Context engineering, tool design, and agent orchestration are all closely coupled, so changes are hard to implement and evaluate, and any regression can break the whole workflow.
The result is agents whose performance (with a given base model) stalls; it becomes harder to make improvements as context length, codebase complexity, and maintenance burden all increase.
The emerging alternative is surprisingly simple: give the agent its own compute environment that it can manage autonomously.
In this architecture, data, prompts, and tool descriptions live in a sandboxed filesystem. The agent gets simple instructions to start, then writes code to explore the directory structure to learn more. It uses code to read and manipulate data. And it can use code to call tools (or build its own).
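Here is a deliberately minimal sketch of that loop, with the model call stubbed out so the file runs standalone; a real system would wire in an actual LLM client and run the exec step in a proper sandbox (e.g. a container or a service like E2B or Modal) rather than in-process.

```python
# A minimal sketch of the "coding agent with a virtual computer" loop. The LLM is
# stubbed out as `fake_model` so this runs standalone; the exec step is unsandboxed
# here for brevity, which you would never do in production.

import io, contextlib

SYSTEM_PROMPT = (
    "You have a working directory at ./workspace. Reply with Python code to run, "
    "or with FINAL: <answer> when you are done."
)

def fake_model(transcript: list[str]) -> str:
    # Placeholder for a real LLM call; here it just lists the workspace, then stops.
    if len(transcript) <= 1:
        return "import os\nprint(sorted(os.listdir('workspace')))"
    return "FINAL: inspected the workspace"

def run_python(code: str) -> str:
    """Execute model-written code and capture stdout (demo only)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        buf.write(f"ERROR: {e!r}")
    return buf.getvalue()

def agent_loop(model=fake_model, max_steps: int = 10) -> str:
    transcript = [SYSTEM_PROMPT]
    for _ in range(max_steps):
        reply = model(transcript)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        transcript += [reply, run_python(reply)]   # observation goes back into context
    return "step limit reached"

if __name__ == "__main__":
    import os
    os.makedirs("workspace", exist_ok=True)
    print(agent_loop())
```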
There are a few huge benefits to this approach:
LLMs are best at writing code: LLMs are amazing at writing code, and improving more quickly in this domain than anywhere else thanks to the focus, investment, and verifiability. For many tasks, it’s easier for an LLM to write code that solves the problem than to solve it directly in its own reasoning. This also gives agents the ability to handle novelty. If a predefined tool fails, the agent is stuck. But a coding agent can write a script, encounter an error, debug it, and try again.
They’re also better at pulling in their own context: Context engineering is engineering, and it’s becoming clear that LLMs can find the right information for a task much more efficiently than humans can pre-select it for them (see the sketch after this list). Instead of loading all 100 tool descriptions into context, an agent can find and open just the one it needs. Instead of loading a full CSV as plaintext, it can read just the header or the processed results. Early examples show a dramatic reduction in context length: Anthropic illustrated a >90% potential decrease in total tokens used, and in real-world applications Vercel saw a 37% decrease and Cursor saw 47%.
Files are a great place to store information: A filesystem gives the agent durable memory and a scratchpad for long-horizon memory and reasoning. Agents that write and re-read their own plans perform materially better on multi-step tasks. While other space-saving techniques are lossy (e.g. asking an LLM to summarize preceding conversation), a filesystem can store near-unlimited raw source material that it can refer back to if needed.
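As promised above, a small sketch of the "pull your own context" pattern; the directory layout and file names are hypothetical, and in practice the agent writes snippets like these itself.

```python
# Rather than pushing 100 tool descriptions and a full CSV into the prompt, the agent
# writes small snippets like these to load only what it needs. Paths are hypothetical.

import csv, os
from itertools import islice

# 1. Discover tools lazily: list the directory, then open only the relevant file.
tool_files = os.listdir("workspace/tools")            # e.g. ['send_email.md', 'search_db.md', ...]
relevant = [f for f in tool_files if "search" in f]
for f in relevant:
    with open(os.path.join("workspace/tools", f)) as fh:
        print(fh.read())                               # only this description enters context

# 2. Peek at data instead of ingesting it: read just the header and a couple of rows.
with open("workspace/orders.csv", newline="") as fh:
    reader = csv.reader(fh)
    header = next(reader)
    sample = list(islice(reader, 2))
print(header, sample)                                  # a few hundred tokens, not the whole file
```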
The result for AI agent builders is game-changing: equivalent or better performance, an order of magnitude fewer tokens, and far less orchestration code to build and maintain.
What’s especially exciting is that these skills emerged naturally out of general pre- and post-training for code generation. Foundation model companies can (and are starting to) train models specifically for this particular flavor of coding: agents who code to make and use tools, retrieve their own context, and store information in files. We expect these capabilities will continue to improve rapidly.
While coding agents are powerful, they are not sufficient to power most agentic applications on their own. We see two main limitations:
They require lots of domain knowledge: Almost any function, whether finance, operations, legal, or security, needs a huge amount of context and domain knowledge to do the job well (see our earlier articles on the Business Context Layer and Context Platforms). This knowledge isn’t just facts that can be retrieved from data: it’s procedural, normative instructions of how to think like an expert. As LLMs improve, they’ll naturally build better expertise at engineering-like tasks, but will likely not learn the nuance of how to investigate a niche security vulnerability or follow a complex transaction through the supply chain. Even in coding-adjacent domains like data analysis, our hackathon showed that agents still require human expert guidance to figure out how to analyze the kind of complex, messy data that exists in real-world businesses.
And need reliability: In many enterprises, the process by which a result is derived (and data/tools used along the way) is as important as the result itself. A coding agent could generate new logic or tooling for each request. Even if it had a similar overall success rate, it would create variability and non-reproducibility that could cause problems down the road. If there are heuristics an agent should always follow (“always start with step A” or “if you use tool X, you must check its results with tool Y”), those should be specified explicitly.
The benefits of coding agents are real, and almost every AI agent company we work with or talk to is moving to this model. But because of their limitations, coding agents do not spell the death of agent systems as we know them; the shift is mostly a change in how those systems are architected. Here’s what we predict:
Companies will rearchitect from complex orchestration and retrieval infrastructure to coding agents with access to an execution environment and filesystem
Today, a huge amount of engineering effort goes into complex orchestration logic and context engineering systems. Teams use frameworks like LangChain and the OpenAI Agents SDK to manage multi-step workflows, handle tool selection, and maintain state. Alongside these, they've built sophisticated retrieval systems: conditional logic, reranking pipelines, etc., all designed to get the right information to the model at the right time.
We think much of this complexity will collapse into something simpler: sandboxed execution environments like E2B or Modal, with a well-specified set of tools and a filesystem containing all the relevant data. Instead of orchestration code deciding what the agent sees, the agent largely orchestrates itself and uses tools/code/data as needed.
Human expert-designed prompts and tools will remain a critical source of differentiation, but will (mostly) reorganize to files.
This shift doesn't eliminate the importance of prompt engineering, tool design, or context curation; it just changes where that work lives and how agents consume it.
Domain knowledge, procedural instructions, and expert heuristics will still be carefully crafted and will remain a main source of differentiation. The change is more organizational: instead of being injected directly into LLM requests, most of it will live in files that agents open as needed.
Excellent tools will remain critical, especially for interfacing with complex external systems: ERPs, CRMs, SIEMs, etc. Here, the pattern shifts in a similar way: tool descriptions and instructions become files the agent reads, not schemas hardcoded into orchestration logic.
Agents will gain more flexibility in deciding how and when to pull these resources into their direct context window, but their decisions will be bounded by guardrails or explicit instructions (you guessed it, written in a file) where desired behaviors are already known, or guarantees of specific procedures are required.
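One hypothetical way this reorganization could look on disk (file names and contents are illustrative, not a standard): expert knowledge, tool docs, and guardrails become files the agent is pointed at once and opens on demand.

```python
# Illustrative-only workspace layout: all file names and contents are hypothetical.

WORKSPACE_LAYOUT = {
    "instructions/overview.md":            "What the agent is for and how success is judged.",
    "instructions/guardrails.md":          "Hard rules, e.g. 'always start with step A'; "
                                           "'if you use tool X, verify its output with tool Y'.",
    "knowledge/investigation_playbook.md": "Procedural, expert-written heuristics for the domain.",
    "tools/search_db.md":                  "When to use search_db, its arguments, and example calls.",
    "tools/send_email.md":                 "Same, for send_email.",
    "data/":                               "Raw inputs the agent can inspect with code.",
}

BOOTSTRAP_PROMPT = (
    "Your working directory is laid out as described in instructions/overview.md. "
    "Read instructions/guardrails.md before acting; open other files only as needed."
)
```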
Theory portfolio company Maze is building agents that evaluate massive numbers of potential security vulnerabilities, investigating each one like an experienced analyst to determine which are real issues and how to fix them.
The architecture behind these agents has gone through multiple iterations.
Thanks to Santiago Castiñeira at Maze, Wiley & Arnav at Doss, Phil Cerles, and Dan Shiebler for our discussions about coding agents. If you're interested in learning more about our research, or building agent architectures like this yourself, I'd love to hear from you: at@theoryvc.com.