Recent online discussions have raised alarms about “the end of scaling laws” and a plateau in LLM performance.
The main catalyst for these musings was a Reuters article with anecdotes of disappointing performance from frontier models in development, along with quotes from experts in the field like Ilya Sutskever. In parallel, some great research has demonstrated the limits of scaling effective compute via quantization/low-precision training.
It is certainly possible that we are starting to see diminishing returns directly scaling foundation model pre-training, though there isn’t enough evidence to say for sure.
But even if true, we have high confidence that AI application capabilities will continue to expand dramatically in the coming years. A slowdown in foundation model progress would not dim the prospects of teams building AI products. In fact, it might benefit them.
Foundation models like GPT, Gemini, and Claude are pre-trained with massive amounts of compute. Over the past two years, scaling these training runs with more compute (more data and/or more model parameters) has improved capabilities across the board. When people refer to the end of scaling, they mean that returns will diminish as training runs get larger and larger.
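One widely cited form of these pre-training scaling laws is the Chinchilla fit (Hoffmann et al., 2022), which models loss as a function of parameter count N and training tokens D; the constants below are the paper’s approximate fitted values, reproduced for illustration:

```latex
% Chinchilla-style pre-training scaling law (Hoffmann et al., 2022);
% fitted constants are approximate.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\quad A \approx 406.4,\quad B \approx 410.7,\quad
\alpha \approx 0.34,\quad \beta \approx 0.28
```

Both correction terms decay as power laws, so each additional order of magnitude of compute buys a smaller drop in loss – that is the diminishing-returns dynamic at issue.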
But when you think about what AI applications can actually do, the pre-trained base model is only part of the picture. Capabilities will depend just as much on:
Surrounding systems & infrastructure: We have long believed that LLMs are only a small part of AI applications. LLM inference endpoints are surrounded by systems to (1) find and retrieve relevant information, (2) orchestrate and execute actions, and (3) integrate responses into the broader application/interface. In each of these areas, we see rapid development and emerging best practices.
Today’s LLMs are capable enough to power years of new AI applications and workflow automations, as surrounding infrastructure improves. For example, a workplace assistant will answer questions more accurately as its internal search/retrieval system improves. It will be more functional when deeply integrated with every enterprise tool, and easier to use as interfaces for human-AI collaboration mature.
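As a minimal sketch of how these three surrounding pieces compose around an inference endpoint, the snippet below wires naive keyword retrieval and a toy tool registry around a stubbed model call; every name in it (retrieve, call_llm, lookup_tickets) is illustrative, not any particular framework’s API:

```python
# Minimal sketch of an LLM surrounded by (1) retrieval, (2) tool execution,
# and (3) response integration. Every component is an illustrative stand-in.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Document:
    text: str
    score: float = 0.0

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[Document]:
    """(1) Find relevant context. Real systems use embeddings and a vector
    store; naive keyword overlap keeps this sketch self-contained."""
    words = set(query.lower().split())
    scored = [Document(t, len(words & set(t.lower().split()))) for t in corpus]
    return sorted(scored, key=lambda d: d.score, reverse=True)[:k]

def call_llm(prompt: str) -> str:
    """Stand-in for a hosted inference endpoint."""
    return f"<model answer conditioned on {len(prompt)} prompt characters>"

def answer(question: str, corpus: list[str],
           tools: dict[str, Callable[[str], str]]) -> str:
    docs = retrieve(question, corpus)                # (1) retrieval
    tool_output = tools["lookup_tickets"](question)  # (2) orchestrate an action
    context = "\n".join(d.text for d in docs)
    prompt = (f"Context:\n{context}\n\nTool result: {tool_output}\n\n"
              f"Question: {question}\nAnswer:")
    return call_llm(prompt)                          # (3) integrate the response

corpus = ["VPN access requires an approved IT ticket.", "Payroll runs on the 15th."]
tools = {"lookup_tickets": lambda q: "ticket #123: VPN request approved"}
print(answer("How do I get VPN access?", corpus, tools))
```

Swapping the naive retriever for a real embedding index, or the stub for a hosted model, improves the application without touching the base model at all – which is the point.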
Domain-specific reasoning data: As discussed in more depth in our previous blog post, OpenAI’s o1 model showed the massive impact of adding reasoning data. For an LLM, human reasoning is just another data distribution, no different from basketball stats or European history. The problem is that we typically don’t write down the inner monologue of our reasoning and all the assumptions behind it. OpenAI generated a lot of math, physics, and general reasoning data; but consider the wealth of domain-specific data yet to be harnessed. How does a security analyst work through an investigation? What about a software engineer, accountant, or lawyer?
Building a dataset of collected and synthetic reasoning data for a domain dramatically improves today’s LLMs’ performance in that domain. Out of the box, a foundation model might get stuck on a particularly difficult security investigation. But when augmented (via fine-tuning or in-context learning) with thousands of analyses done by security experts, it will be able to reason through increasingly complex investigations on its own.
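Here is a minimal sketch of the in-context-learning route, assuming a hypothetical store of expert traces; a production system would retrieve the most similar cases with embeddings rather than taking the first few:

```python
# Sketch: augment the prompt with expert reasoning traces so the model can
# imitate domain-specific reasoning. All data and names are illustrative.

expert_traces = [
    {"case": "Phishing email with a spoofed domain",
     "reasoning": "Checked SPF/DKIM headers -> mismatch -> quarantined sender."},
    {"case": "Impossible-travel login alert",
     "reasoning": "Compared login geos and timestamps -> flagged stolen credentials."},
]

def build_prompt(incident: str, traces: list[dict], k: int = 2) -> str:
    # A production system would embed and retrieve the k most similar traces;
    # taking the first k keeps the sketch self-contained.
    examples = "\n\n".join(
        f"Case: {t['case']}\nExpert reasoning: {t['reasoning']}"
        for t in traces[:k]
    )
    return (f"You are a security analyst. Worked examples:\n\n{examples}\n\n"
            f"New incident: {incident}\nReason step by step, then conclude:")

print(build_prompt("Unusual outbound traffic from a build server", expert_traces))
```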
Inference-time scaling laws: OpenAI’s o1 announcement also formalized inference-time scaling laws. Giving a model more time to try different options, evaluate paths, and iterate on its responses substantially improved its ability to find the right answer. This is particularly true in applications with complex multi-step reasoning or tool calls.
Many use cases will involve multi-step reasoning – say you want an AI assistant to analyze some data, or find and book a restaurant. Inference-time search/scaling improvements will make these workflows more reliable, so long as the model can accomplish the task at all.
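The simplest version of this is self-consistency: sample several candidate answers and keep the majority vote. A toy sketch, with a stochastic stand-in where a real system would sample an LLM at non-zero temperature:

```python
# Sketch of inference-time scaling via self-consistency: sample many answers,
# return the majority vote. sample_answer is a toy stand-in for an LLM call.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """One stochastic sample; a real system would call an LLM with
    temperature > 0. This toy 'model' is right about 70% of the time."""
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

def self_consistency(question: str, n_samples: int) -> str:
    """Spend more inference-time compute to get a more reliable answer."""
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

for n in (1, 5, 25):
    print(f"{n:>2} samples -> {self_consistency('What is 6 * 7?', n)}")
```

o1-style systems search over intermediate reasoning steps rather than just final answers, but the underlying trade – more inference-time compute for more reliability – is the same.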
Model cost & speed: Applications today are often limited by model cost and latency. As the cost and latency of model inference continue to drop precipitously, developers can step up to larger models, iterate more during inference, or create new product experiences that wouldn’t otherwise be feasible.
A huge number of AI applications will be offered for free because model costs are so low. Other applications will be able to process more data, or take more tries to get to an answer, because they can iterate hundreds of times cheaply and quickly.
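A back-of-envelope calculation shows why; the price below is an illustrative assumption, not any provider’s actual rate:

```python
# Back-of-envelope inference cost; the price is an illustrative assumption,
# not any provider's actual rate card.
price_per_million_tokens = 0.15  # dollars, hypothetical small-model input rate
tokens_per_attempt = 2_000       # prompt + output for one try at the task
attempts = 200                   # e.g., best-of-N retries or agent iterations

cost = attempts * tokens_per_attempt * price_per_million_tokens / 1_000_000
print(f"${cost:.2f} per task")   # $0.06: hundreds of tries for pennies
```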
No matter the job, many of the fundamental tasks we do are simple enough that today’s LLMs can do them. With levers to pull in application infra, domain-specific data, and inference-time compute, there is massive headroom to continue to expand the capabilities of AI applications – even if foundation model progress were to halt today.
We expect no slowdown in the pace of new AI applications or in the capabilities they deliver. For the vast majority of companies building AI apps, incremental improvements on today’s models will be sufficient to build nearly anything they want.
(Note: a small number of extremely difficult frontier tasks remain, like general software engineering; companies tackling them may face more risk if foundation model progress slows.)
If anything, a plateau in foundation model performance is a net positive for app developers: it decreases the risk that competitive advantages are obviated by the next generation of models. Investments in complex engineered systems and domain-specific data will be more durable moats. Companies can focus on these areas and benefit from rapidly dropping inference costs.
We remain as excited as ever about the next generation of AI applications, and would love to hear from anyone working on them — please reach out to at@theory.ventures.