AI systems are typically evaluated with humans as the gold standard. How many college-level math problems can a model solve? How many medical questions can it answer? How accurately can it extract information from a contract or purchase order?
Even as LLMs become superhuman test-takers, it’s clear they still can’t reliably do many actual jobs. Most of the thought and work going into AI applications is oriented towards building value with a system that’s only ~80% as good as a human. How do you limit AI to tasks that it can do more reliably? How do you incorporate humans to check AI activities? This is critical, and drives our research in many areas of AI software.
But there are also jobs where AI systems are structurally better than humans. For these, AI won’t be struggling to keep up with the average employee; it will be many multiples better than any person could be. And while automation in general can reduce costs, integrating AI into jobs it does better than humans can create new company capabilities, improve customer experience, and drive revenue.
We think these categories are particularly ripe for new AI software startups.
No matter how smart humans are, they are fundamentally limited in how much they can do. You can only read so many pages, or draft so many emails.
For jobs that require high throughput, humans must build systems to make decisions in aggregate. They might just search for specific keywords across those pages, or use rules-based decision trees to select from pre-drafted email responses. Perhaps a fraction of the tasks are escalated to human review.
Regardless of how carefully you draft your email campaigns or how many rules are in your decision tree, we know that these systems will always fail. Every page or email is slightly different. While humans can handle that variability, the rules-based systems they implement cannot.
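To make the failure mode concrete, here is a minimal sketch of keyword-based triage. The rule and the example emails are hypothetical: the rule handles the phrasing it was written for, but a rewording any human would understand slips straight through.

```python
# A minimal sketch of why rules-based triage breaks on variability.
# The rule and the example emails are hypothetical.

def rule_based_triage(email: str) -> str:
    """Route an email with a hard-coded keyword rule."""
    text = email.lower()
    if "refund" in text or "money back" in text:
        return "billing"
    if "cancel" in text:
        return "retention"
    return "general"

# The rule handles the phrasing it was written for...
assert rule_based_triage("I would like a refund, please.") == "billing"

# ...but a human-obvious rewording lands in the wrong queue.
assert rule_based_triage("Please reverse the charge on my card.") == "general"
```

An LLM reading the second email would recognize it as a refund request; the keyword rule cannot, no matter how many keywords you add.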
In an earlier post, we described LLMs as an infinite supply of near-free interns.
As with a new intern, the tasks they’re given should be straightforward. That could be because the task:
When tasks are simple, imagine how much work an infinite number of these AI interns can do. They can process 10,000 pages just as easily as they process 100. They can operate 24/7, without getting tired or bored. While there can be issues with hallucinations, when provided with contextual information they are less likely than a human to forget a name or make a typo.
Perhaps most important of all, the AI interns look at each task and apply human-like reasoning independently for each one. For high-throughput jobs, the alternative to an AI doing a task is not actually a human doing it: it’s a legacy rules-based system, or nobody doing it at all.
What will happen to these jobs that AI does best? As discussed in a previous post, they won’t go away entirely, but could dramatically change in scope. Generally, they will uplevel – from execution to orchestration, or first-pass to escalation/review. Without entry-level work, it may be challenging to onboard and train talent.
Security operations
Analysts in security operations centers (SOCs) get overwhelmed by a deluge of alerts generated by their detection tools. Often there are dozens of near-identical ones; each could take the better part of an hour to investigate fully. To stay afloat, they compile rules-based filters and playbooks, even though they know these let things slip through the cracks.
Dropzone AI, a Theory portfolio company, has shown that agentic systems can replicate manual investigations. Their AI systems have expert skillsets – for example, they have deep knowledge of dozens of tool-specific querying languages. But they wouldn’t need to be the best analyst in the world to have a massive impact. The fact that AI agents can review each individual alert, in minutes, at any time of day or night, and with perfect memory, is a dramatic step change from the small fraction of alerts that get reviewed (often far too late) today.
Customer engagement
Customer engagement platforms like Salesforce and Braze send billions of texts, notifications, and emails per week. Of course, it would be impossible for a human to write each one of them. Instead, marketing teams must map out rules-based journeys. This is the sequence we’ll send to new users. Here’s what we’ll do when a customer leaves something in their cart. But every user is different. Even if you could identify and define micro cohorts/segments, it’s just not possible to manage many thousands of different messaging campaigns at once.
AI agents don’t have this constraint. They can use millions of different strategies for millions of users, experimenting with content, channel, and timing. It doesn’t matter how perfectly crafted a message is if it’s directed at the wrong person – agents that personalize messages for each individual are much more likely to find the specific attributes that drive business outcomes. We’ll share more on this space next week!
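As an illustration of the kind of per-user experimentation described above, here is a minimal epsilon-greedy bandit sketch over messaging strategies. The strategy names and conversion rates are invented, and a real agent would also personalize the message content itself rather than just pick from a fixed menu.

```python
import random

# A minimal sketch of per-user experimentation with an epsilon-greedy
# bandit over message strategies. Strategies and conversion rates are
# hypothetical; a real agent would personalize content with an LLM.

STRATEGIES = ["email_morning", "push_evening", "sms_weekend"]

def pick_strategy(stats: dict, epsilon: float, rng: random.Random) -> str:
    """Explore a random strategy with prob. epsilon, else exploit the best so far."""
    if rng.random() < epsilon or not any(s["sends"] for s in stats.values()):
        return rng.choice(STRATEGIES)
    return max(STRATEGIES, key=lambda s: stats[s]["wins"] / max(stats[s]["sends"], 1))

def record(stats: dict, strategy: str, converted: bool) -> None:
    """Update the running send/conversion counts for one strategy."""
    stats[strategy]["sends"] += 1
    stats[strategy]["wins"] += int(converted)

rng = random.Random(42)
stats = {s: {"sends": 0, "wins": 0} for s in STRATEGIES}
for _ in range(1000):
    s = pick_strategy(stats, epsilon=0.1, rng=rng)
    # Hypothetical ground truth: push_evening converts best for this user.
    rates = {"email_morning": 0.02, "push_evening": 0.08, "sms_weekend": 0.04}
    record(stats, s, converted=rng.random() < rates[s])
```

Running one of these loops per user is trivial for software and impossible for a marketing team managing campaigns by hand.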
Investment research
In investment research, ideas mean nothing unless they can be translated into actions. Say you want to invest in businesses that will benefit from increased AI usage. Of course, Microsoft and NVIDIA will be on the list, but there are scores more companies along the value chain – datacenter component manufacturers, system integrators, REITs, etc. Researching this thesis would require analysts to comb through thousands of pages of documents and create massively complex financial models.
Human analysts can cover a small number of companies and race to update models when new earnings reports drop. AI analysts can easily screen hundreds of companies on an ongoing basis, just as quickly and accurately as they could screen one. We expect this will dramatically change how investment firms operate, allowing firms to systematize qualitative strategies the way they run quant strategies today.
Supply chain operations
Supply chain organizations manage hundreds of vendors supplying thousands of goods and services (if not more). The challenging cognitive work is dealing with the inevitable problems that arise daily, and figuring out how to optimize procurement/logistics over time. But most of the day-to-day work is data collection and relationship management. Each day, professionals spend hours tracking status updates, copying numbers, reviewing RFP responses, and matching invoices.
AI systems can easily maintain thousands of email conversations at once. They can instantly read through lengthy PDFs and spreadsheets and extract just the relevant information. In addition to freeing up time for humans to focus on more important work, they will enable new strategic capabilities – like dramatically expanding the frequency and scope of RFPs, or proactively monitoring and alerting for supplier issues.
—
If you’re building automation for jobs where AI has a structural advantage, we’d love to hear from you! Reach out to at@theory.ventures.
Recent online discussions have raised alarms about “the end of scaling laws” and a plateau in LLM performance.
The main catalyst for these musings was a Reuters article with anecdotes of disappointing performance from frontier models in development, along with quotes from experts in the field like Ilya Sutskever. In parallel, some great research has demonstrated the limits of scaling effective compute via quantization/low-precision training.
It is certainly possible that we are starting to see diminishing returns directly scaling foundation model pre-training, though there isn’t enough evidence to say for sure.
But even if true, we have high confidence that AI application capabilities will continue to expand dramatically in the coming years. Foundation model progress slowing would not impact the prospects for teams building AI products. In fact, it might benefit them.
Foundation models like GPT, Gemini, and Claude are pre-trained with massive amounts of compute. Over the past 2 years, scaling these training runs with more compute (via more data and/or more model parameters) has resulted in improvements in capabilities across the board. When people refer to the end of scaling, they mean that we will see diminishing returns as training runs get larger and larger.
But when you think about what AI applications can actually do, the pre-trained base model is only part of the picture. Capabilities will depend just as much on:
Surrounding systems & infrastructure: We have long believed that LLMs are only a small part of AI applications. LLM inference endpoints are surrounded by systems to (1) find and retrieve relevant information, (2) orchestrate and execute actions, and (3) integrate responses into the broader application/interface. In each of these areas, we see rapid development and emerging best practices.
Today’s LLMs are capable enough to power years of new AI applications and workflow automations, as surrounding infrastructure improves. For example, a workplace assistant will answer questions more accurately as its internal search/retrieval system improves. It will be more functional when deeply integrated with every enterprise tool, and easier to use as interfaces for human-AI collaboration mature.
Domain-specific reasoning data: As discussed in more depth in our previous blog post, OpenAI’s o1 model showed the massive impact of adding reasoning data. For an LLM, human reasoning is just another data distribution, no different from basketball stats or European history. The problem is that we typically don’t write down the inner monologue of our reasoning and all the assumptions behind it. OpenAI generated a lot of math, physics, and general reasoning data; but consider the wealth of domain-specific data yet to be harnessed. How does a security analyst work through solving a problem? What about a software engineer, accountant, or lawyer?
Building a dataset of collected and synthetic reasoning data dramatically increases today’s LLMs’ performance in that application. Out of the box, a foundation model might get stuck on a particularly difficult security investigation. But when augmented (via fine-tuning or in-context learning) with thousands of analyses done by security experts, it will be able to reason through more and more complex ones on its own.
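One way this augmentation can look in practice is simple in-context learning: retrieve prior expert investigations similar to the new case and prepend them to the prompt. In this minimal sketch, the alert fields and the hand-rolled example list are hypothetical placeholders; a real system would retrieve from a vector store and send the prompt to an actual model.

```python
# A minimal sketch of augmenting a model with domain-specific reasoning
# via in-context examples. The alerts and analyses are hypothetical.

EXPERT_ANALYSES = [
    {
        "alert": "impossible travel login",
        "analysis": "Checked VPN egress IPs first; benign if both IPs map to the corporate VPN.",
    },
    {
        "alert": "mass file rename",
        "analysis": "Compared rename rate to the backup-job baseline before escalating as ransomware.",
    },
]

def build_prompt(new_alert: str, examples: list[dict]) -> str:
    """Prepend prior expert investigations so the model can imitate their reasoning."""
    shots = "\n\n".join(
        f"Alert: {ex['alert']}\nExpert analysis: {ex['analysis']}" for ex in examples
    )
    return f"{shots}\n\nAlert: {new_alert}\nExpert analysis:"

prompt = build_prompt("impossible travel login from new device", EXPERT_ANALYSES)
```

The same pattern generalizes: swap in-context examples for fine-tuning when the example corpus outgrows the context window.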
Inference-time scaling laws: OpenAI’s o1 announcement also formalized inference-time scaling laws. Giving a model more time to try different options, evaluate paths, and iterate on its responses substantially improved its ability to find the right answer. This is particularly true in applications with complex multi-step reasoning or tool calls.
Many use cases will involve multi-step reasoning – say you want an AI assistant to analyze some data, or find and book a restaurant. Inference time search/scaling improvements will make these work more reliably, so long as it’s possible for a model to do it in the first place.
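One simple flavor of inference-time scaling is best-of-n sampling: draw several candidate responses and keep the one a verifier scores highest. In this sketch, `generate` and `score` are stand-ins for a real model call and a real verifier; with a shared random seed, sampling more candidates can only improve the selected answer.

```python
import random

# A minimal sketch of inference-time scaling via best-of-n sampling.
# generate() and score() are stand-ins for a model call and a verifier.

def generate(prompt: str, rng: random.Random) -> float:
    """Stand-in for one sampled model response (here: a noisy guess)."""
    return rng.gauss(mu=0.5, sigma=0.3)

def score(candidate: float) -> float:
    """Stand-in verifier: higher means closer to the 'right answer' (1.0)."""
    return -abs(candidate - 1.0)

def best_of_n(prompt: str, n: int, seed: int = 0) -> float:
    """Sample n candidates and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# Spending more inference compute (larger n) yields at least as good an answer,
# since the n=4 candidates are a prefix of the n=64 candidates here.
assert score(best_of_n("q", 64)) >= score(best_of_n("q", 4))
```

Real systems layer search trees, self-critique, and tool calls on top of this, but the core trade of more compute for better answers is the same.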
Model cost & speed: Applications today are often limited by model cost and latency. As the cost and latency of model inference continues to drop precipitously, developers can step up to larger model sizes, iterate more during inference, or create new product experiences that wouldn’t be feasible today.
A huge number of AI applications will be offered for free because model costs are so low. Other applications will be able to process more data, or take more tries to get to an answer, because they can iterate hundreds of times cheaply and quickly.
No matter the job, many of the fundamental tasks we do are simple enough that today’s LLMs can do them. With levers to pull in application infra, domain-specific data, and inference-time compute, there is massive headroom to continue to expand the capabilities of AI applications – even if foundation model progress were to halt today.
We expect to see no slowdown in the number of new AI applications or new capabilities in those domains. For the vast majority of companies building AI apps, incremental improvement on today’s models will be sufficient to build nearly anything they want to.
(Note: There are a small number of extremely difficult frontier tasks, like general software engineering. Companies tackling these may face more risk if foundation model progress slows.)
If anything, a plateau in foundation model performance is a net positive for app developers: it decreases the risk that competitive advantages are obviated by the next generation of models. Investments in complex engineered systems and domain-specific data will be more durable moats. Companies can focus on these areas and benefit from rapidly dropping inference costs.
We remain as excited as ever about the next generation of AI applications, and would love to hear from anyone working on them — please reach out to at@theory.ventures.