This is the third post in our 4-part series on the LLM infra stack. You can catch up on our last two posts on the Data Layer and Model Layer. If you are building in this space, we’d love to hear from you at info@theory.ventures.
LLMs allow teams to build transformative new features and applications. However, bringing LLM features to production creates a new set of challenges: LLMs behave non-deterministically, are hard to evaluate, introduce new security and safety risks, and are expensive and slow to run at scale.
Product owners will feel these issues acutely: it may be easy to build a demo, but it is hard to get that feature to work reliably in production.
LLM product launches also pull in other teams. Security and compliance teams will be concerned about the risks LLMs introduce. Platform engineering teams will need to make sure the system can run performantly without exploding the cloud budget.
A new generation of infrastructure will be required to observe and manage LLM behavior. When we talk to executives who are building LLM features, this layer is often the main blocker for deployment.
The key components of the Deployment Layer include:
Security and governance
LLMs create frightening new security and safety vulnerabilities. They take untrusted text (e.g. from user inputs, documents, or websites) and effectively run it as code. They are often connected to data or other internal systems. Their behavior is unpredictable and hard to evaluate.
It’s easy to imagine a variety of ways this can go wrong.
If you’re building a chat support agent, a malicious user could ask the bot to leak details about someone else’s account.
If you’re building an LLM assistant that’s connected to email and the web, someone could make a malicious website that, when browsed, would instruct the LLM agent to send spam from your account.
Anyone seeking to harm a company's brand image could convince its LLM to write inappropriate or illegal content and post a screenshot online.
Today, there is no known way to make LLMs safe by design.
Large foundation models like GPT-4 are trained specifically to avoid bad behavior but can be tricked into breaking these rules in minutes by a non-expert. If these large models can be fooled, it’s not clear why a smaller supervisor model would catch an attack.
There is some emerging research around new architectures, like a dual-LLM design where one LLM handles only risky external data and the other handles only internal systems. This work is still in its infancy.
So what is a product owner to do? The 33% of companies that say security is a top issue (Retool 2023 State of AI) can’t be paralyzed waiting for these issues to be solved. While the state of the art evolves, companies should take the traditional approach of defense in depth.
To do this, product owners can implement security and governance controls throughout the LLM stack.
At the Data Layer, they will need data access and governance controls so that LLMs can only ever access data that is appropriate to the user and context.
At the Model Layer, supervisor/firewall systems can try to identify and block things like PII leakage or obscene language. These systems will have limited effectiveness today: it is not hard to circumvent them, and model size will be limited due to latency. But it’s a place to start (see the sketch after this list).
Post-deployment monitoring systems should flag anomalous or concerning behavior for review.
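Here is a minimal sketch of what a supervisor/output filter at the Model Layer might look like: it scans each response for crude PII patterns and blocked terms before it reaches the user. The `call_llm` helper, the regexes, and the blocklist are illustrative assumptions, not a production-grade control.

```python
import re

# Hypothetical helper that calls whatever model you are serving.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model endpoint")

# Very rough PII patterns -- illustrative only, not exhaustive.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US SSN-like numbers
    re.compile(r"\b\d{13,16}\b"),              # long digit runs (card numbers)
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),    # email addresses
]
BLOCKED_TERMS = {"internal-only", "api_key"}    # assumed org-specific blocklist

def guarded_response(prompt: str) -> str:
    """Generate a response, then block it if it trips a simple output filter."""
    response = call_llm(prompt)
    if any(p.search(response) for p in PII_PATTERNS):
        return "Sorry, I can't share that information."
    if any(term in response.lower() for term in BLOCKED_TERMS):
        return "Sorry, I can't share that information."
    return response
```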
There’s a lot to dig into in this space. We’re doing more research and expect to share a model security/governance-specific blog post in the coming months.
Observability/evaluation
LLMs show non-deterministic behavior that is sensitive to changes in model inputs as well as underlying infrastructure and data.
Today, product teams try to identify edge cases and bugs pre-launch. Most issues are deterministic: the code behind a button is broken, and then it gets fixed. There are usually few enough features that they can be tested manually.
LLM systems must respond to infinite variations of input text and documents. Manually testing them all would be impossible. Teams will need evaluation tooling to measure performance programmatically. This might involve re-running historical queries or generating synthetic ones. They will need to convert unstructured outputs into qualitative and quantitative metrics. Because model outputs can change dramatically based on surrounding components, each test example will need to be traced throughout the LLM infra stack.
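A minimal sketch of what such an evaluation harness might look like: it replays a set of historical queries against the current system and turns the unstructured outputs into simple metrics. The `run_pipeline` entry point and the keyword-overlap metric are assumptions standing in for your own stack and evaluators.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    query: str
    reference: str          # a known-good answer or rubric

# Hypothetical entry point into your LLM pipeline (retrieval + prompt + model).
def run_pipeline(query: str) -> str:
    raise NotImplementedError

def keyword_overlap(output: str, reference: str) -> float:
    """Crude proxy metric: fraction of reference keywords present in the output."""
    ref_terms = set(reference.lower().split())
    out_terms = set(output.lower().split())
    return len(ref_terms & out_terms) / max(len(ref_terms), 1)

def evaluate(cases: list[EvalCase], metric: Callable[[str, str], float]) -> dict:
    """Replay historical queries through the pipeline and summarize the scores."""
    results = [(metric(run_pipeline(c.query), c.reference), c.query) for c in cases]
    scores = [score for score, _ in results]
    return {
        "n": len(results),
        "mean_score": sum(scores) / max(len(scores), 1),
        "worst_cases": sorted(results)[:5],   # lowest-scoring queries to triage
    }
```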
Post-launch, today’s observability platforms monitor for performance degradation, system outages, or security breaches.
The high cost and complexity of running LLMs (described in The Model Layer) make this observability critical. On top of this, LLM observability platforms must provide real-time insights into LLM inputs and outputs. They will need to identify anomalous responses – e.g., ones that are zero-length or just a single repeating character. They can search for potential data loss or malicious use to flag to security teams. They also can evaluate responses for content and qualitative measures of correctness or desirability.
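The anomaly checks mentioned above (zero-length or single-repeating-character responses) can start as simple heuristics run on every response before any deeper analysis. A minimal sketch follows; the threshold is an arbitrary assumption.

```python
def looks_anomalous(response: str, max_repeat_ratio: float = 0.9) -> bool:
    """Flag responses that are empty or dominated by one repeating character."""
    if not response.strip():
        return True                               # zero-length / whitespace-only
    most_common = max(response.count(ch) for ch in set(response))
    if most_common / len(response) > max_repeat_ratio:
        return True                               # e.g. "AAAAAAAA..." degeneration
    return False
```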
These systems (along with product analytics described below) will be the foundation for a new workflow that must be created in LLM product orgs. Tracking poor responses will create an evaluation dataset of edge cases and undesired behavior.
These issues will need to go to a cross-functional team, since it will often not be clear what change(s) are required to fix them. A data engineer might need to fix an issue in the data retrieval system. A different engineer might need to fine-tune or switch to a different model. A product or marketing person might need to change the natural language prompt instructions.
These workflows don’t yet exist, but we think they will be a core competency for companies building the next generation of LLM products.
Traditional ML observability platforms serve some of these functions today. However, LLMs have unique characteristics – unstructured inputs and outputs, non-deterministic behavior, and quality measures that require qualitative as well as quantitative evaluation – that we believe will be best served by a purpose-built platform.
Product analytics
Product analytics for LLM systems look different from the analyses product teams conduct today.
Today’s product analytics are designed to track a series of discrete user actions. On a traditional customer support page, a user might open section A and then click on subheader B.
LLM applications must track very different usage patterns. For a customer support chatbot, every user action might be the same: a sent message. How should a PM analyze which help sections the users are asking about? How can they evaluate the quality of the chatbot’s responses?
LLM applications are also complex, multi-component systems. Let’s say the chatbot’s answer didn’t satisfy the user. Did it not understand the question? Did the data retrieval system not find the right document for the LLM to refer to? Or did the model just make a reasoning error in its generation?
LLM analytics platforms will need to be built for this new paradigm. They will use intelligent systems (often language models themselves) to categorize and quantify unstructured data. They will also provide interfaces and workflows to evaluate individual examples and trace issues through the system.
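One way to make this concrete: use a language model to tag each chat message with a help-section category, then aggregate the tags like any other analytics event. The `call_llm` helper and the category list are illustrative assumptions.

```python
from collections import Counter

CATEGORIES = ["billing", "shipping", "returns", "account", "other"]   # assumed taxonomy

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model endpoint")

def categorize(message: str) -> str:
    """Ask the model to map an unstructured message onto a fixed category list."""
    prompt = (
        "Classify this customer support message into exactly one of: "
        + ", ".join(CATEGORIES) + f".\nMessage: {message}\nCategory:"
    )
    answer = call_llm(prompt).strip().lower()
    return answer if answer in CATEGORIES else "other"

def topic_breakdown(messages: list[str]) -> Counter:
    """Turn unstructured chat logs into a countable metric a PM can chart."""
    return Counter(categorize(m) for m in messages)
```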
LLM product analytics platforms share many core capabilities with LLM observability/evaluation tools. Differences in user types and workflows (e.g., platform engineers monitoring latency vs. product managers analyzing user flows) might require separate platforms. But it’s possible these two categories converge.
Orchestration & LLMOps
LLM applications will often need to coordinate multiple processes to complete a task.
Let’s say I am building an LLM-based travel agent. To plan an activity, it might need to, for example, interpret the user’s request, search for suitable options, check them against the traveler’s preferences and schedule, and then make a booking.
Simple implementations can be built in the application code itself. More complex systems will require purpose-built platforms to orchestrate all of these actions and maintain memory/state throughout.
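A minimal sketch of that kind of orchestration for the travel-agent example. The tool functions (`search_activities`, `check_calendar`, `book`) and the planning prompt are hypothetical stand-ins; the point is the control loop and the shared state, not the specific tools.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model endpoint")

# Hypothetical tools the agent can invoke; each takes a single string argument
# purely to keep the sketch simple.
def search_activities(query: str) -> str: ...
def check_calendar(date: str) -> str: ...
def book(details: str) -> str: ...

TOOLS = {"search_activities": search_activities,
         "check_calendar": check_calendar,
         "book": book}

def plan_activity(request: str, max_steps: int = 5) -> str:
    history = []                                    # memory/state carried across steps
    for _ in range(max_steps):
        prompt = (f"User request: {request}\n"
                  f"Steps so far: {history}\n"
                  "Reply with either 'TOOL <name> <argument>' or 'DONE <answer>'.")
        decision = call_llm(prompt).strip()
        if decision.startswith("DONE"):
            return decision.removeprefix("DONE").strip()
        _, name, argument = decision.split(" ", 2)
        result = TOOLS[name](argument)              # run the chosen tool
        history.append((name, argument, result))
    return "Could not complete the request within the step budget."
```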
Because an orchestration layer acts as a control plane, there’s a natural opportunity for it to expand into more components of the LLM infra stack. We broadly call these companies “LLMOps”. There are about 40 of them today.
LLMOps companies include a variety of features across the LLM infra stack. Some are focused on production inference, while others also provide tooling for fine-tuning. There are sub-categories of LLMOps companies focused on developer tools vs. low/no-code business applications.
Some companies will choose LLMOps platforms for convenience. With limited team capacity, they may be willing to pay extra or give up flexibility for an end-to-end solution.
It’s also possible that end-to-end LLMOps platforms can be more effective than standalone components. For example, they might integrate an observability platform with data storage and model training infrastructure to facilitate regular fine-tuning. They also might be easiest to instrument with security or governance controls.
In the modern data stack, we have seen best-of-breed modular components emerge. We expect that for most companies building LLM applications, the same will be true.
If you are building in this space, we’d love to hear from you at info@theory.ventures. In our next post, we’ll explore the Interface Layer.
This is the second post in our 4-part series on the LLM infra stack. You can catch up on our last post on the Data Layer here. If you are building in this space, we’d love to hear from you at info@theory.ventures.

The core component of any LLM system is the model itself, which provides the fundamental capabilities that enable new products.
LLMs are foundation models: models trained on a broad set of data that can be adapted to a wide range of tasks. They tend to fall into two categories: large, hosted, closed-source models like GPT-4 and PaLM 2 or smaller, open-source models like Llama 2 and Falcon.
It’s unclear which models will dominate. The state of the art is changing rapidly – a new best-in-class model is announced monthly – and so few products have been built with LLMs that the trade-offs remain largely untested. There are two emerging paths:
1. Large foundation models dominate: Closed-source LLMs continue to provide capabilities other models can’t match. A managed service offers simplicity & rapid cost reduction. These vendors solve security concerns.
2. Fine-tuned models provide the best value: Smaller, less-expensive models prove just as good for most applications after fine-tuning. Businesses prefer them because they control the model and can create intellectual property.
Large foundation models will work best for use cases that require broad knowledge and reasoning. Asking a model to plan a month-long vacation itinerary in Southeast Asia for a vegetarian couple interested in Buddhism requires planning and lots of working memory. These models are also best for new tasks where you don’t have data.
Smaller, fine-tuned models will excel in products composed of repeatable, well-defined tasks. They’ll perform just as well as larger models for a fraction of the cost. Need to categorize customer feedback as positive or negative? Extracting details from an email into a spreadsheet? A smaller model will work great.
Which models are safer and more secure? Which are best for compliance and governance? It’s still unclear.
Many companies would prefer to self-host their models to avoid sending their data to a third party. However, emerging research suggests fine-tuning may compromise safety. Foundation models are tuned to avoid inappropriate language and content. LLMs, especially smaller ones, can forget these rules when fine-tuned. Large model providers may also find it easier to demonstrate that their models and data are compliant with nascent regulations. More work needs to be done to establish these conclusions.
In the long term, we can’t predict how LLMs will develop. A large research org could discover a new type of model that is an order of magnitude better than what we have today.
Progress in model development will drastically change the structure of the Model Layer. Regardless of the direction, it will have the following key components:
Core model:
Training LLMs from scratch (also known as pre-training) costs hundreds of thousands to tens of millions of dollars for each model. OpenAI is said to have spent over $100 million on GPT-4. Only companies with huge balance sheets can afford to develop them.
Conveniently, fine-tuning a pre-trained model with your own data can provide similar results at a fraction of the cost. This can cost well under $100 with data you already have on hand. During inference, fine-tuned models can be faster and an order of magnitude cheaper to boot.
Fine-tuning is supported by a rapidly improving set of open-source models. The most capable ones, like Meta’s Llama 2, perform similarly to the best LLMs from 6-12 months ago (e.g., GPT-3.5). When fine-tuned for a concrete task, open-source models often reach near-parity with today’s best closed-source LLMs (e.g., GPT-4).
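As a rough sketch of what this looks like in practice, here is a parameter-efficient fine-tuning example using Hugging Face transformers and peft (LoRA), which trains only a small set of adapter weights on top of a frozen base model. The base model name, the `support_tickets.jsonl` dataset, and the hyperparameters are illustrative assumptions, not recommendations from this post.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "meta-llama/Llama-2-7b-hf"              # assumed open-source base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model with low-rank adapters so only a small fraction of the
# parameters are trained, which is what keeps fine-tuning cheap.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Data you already have on hand, e.g. past support tickets (assumed file name).
data = load_dataset("json", data_files="support_tickets.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```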
Serving/compute:
Training LLMs is expensive, but inference isn’t cheap, either. LLMs require an extreme amount of memory and substantial compute. For a product owner experimenting on OpenAI, each GPT-3.5 query will cost $0.01-0.03. GPT-4 is an order of magnitude higher, at up to $3.00 per query.
If you try to self-host for privacy/security or cost reasons, you won’t find it easy to match OpenAI. Virtual machines with NVIDIA A100 GPUs (the best one for most large foundation models) run $3.67 to $40.55 per hour on Google Cloud. Availability of these machines is very limited and sporadic, even through major cloud providers.
For LLM applications with thousands or millions of users, product owners will face new challenges. Unlike most classic SaaS applications, they’ll need to think about per-query costs. At $3 per query, it’s hard to build a viable product: a feature handling 100,000 queries a month would spend $300,000 a month on inference alone. Similarly, users won’t hang around long if it takes 30 seconds to return an answer.
It will be critical to serve LLMs in a performant and cost-effective way. Even “smaller” models described above still have billions of parameters.
How can you reduce cost and latency in production? There are lots of approaches. On the inference side, batch queries or cache their results. Optimize memory allocation to reduce fragmentation. Use speculative decoding (when a smaller model suggests tokens and a large model “accepts” them). Infrastructure usage can be optimized by comparing GPU prices in real-time and sending workloads to the cheapest option. Use spot pricing if you can manage spotty availability and failovers.
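As one small example of the inference-side tactics above, caching repeated queries avoids paying for the same generation twice. A minimal sketch, assuming a hypothetical `call_llm` helper and exact-match caching; real systems often use embedding-based (semantic) caches instead.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model endpoint")

_cache: dict[str, str] = {}

def generate(prompt: str) -> str:
    """Serve repeated queries from an in-memory cache instead of re-running the model."""
    key = " ".join(prompt.lower().split())   # cheap normalization so trivially
                                             # different prompts share a cache entry
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]
```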
Most companies won’t have the in-house expertise to do all of this themselves. Startups will rise to fill this need.
Model routing/abstraction:
For some LLM applications, it will be obvious which model to use where. Summarize emails with one LLM. Transform the data to a spreadsheet with another.
But for others, it can be unclear. You might have a chat interface where some customer messages are simple, and others are complex. How will you know which should go to a large model vs. a small one?
For these types of applications, there may be an abstraction layer to analyze each request and route it to the best model.
As a product owner, it is already difficult to evaluate the behavior of an LLM in production. If your system might route to many different LLMs, that problem could get even more challenging.
However, if dynamic routing improves capabilities and performance, the trade-off might be worth it. This layer could almost be thought of as a separate model itself and evaluated on its own merit.
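A minimal sketch of such a routing layer, using a cheap heuristic to decide which backend handles each request. The `small_model`/`large_model` helpers and the complexity heuristic are illustrative assumptions; a real router might itself be a small classifier model.

```python
# Hypothetical backends: a fine-tuned open-source model and a hosted frontier model.
def small_model(prompt: str) -> str: ...
def large_model(prompt: str) -> str: ...

def looks_complex(prompt: str) -> bool:
    """Crude heuristic: long, multi-question, or multi-step requests go to the big model."""
    return (len(prompt.split()) > 150
            or prompt.count("?") > 1
            or "step by step" in prompt.lower())

def route(prompt: str) -> str:
    backend = large_model if looks_complex(prompt) else small_model
    return backend(prompt)
```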
Fine-tuning & optimization:
As we wrote above, many product owners will fine-tune a model for their use case.
For most use cases, fine-tuning will not be a one-off endeavor. It will start in product development and continue indefinitely as the product is used.
The key to great fine-tuning will be fast feedback loops. Today, product owners triage user feedback/bugs in code and ask engineers to fix them. For LLM applications, product owners will monitor qualitative and quantitative data on LLM behavior. They will work with engineers, data ops teams, product/marketing, and evaluators to curate new data and re-fine-tune models.
We’ll discuss post-deployment monitoring more in our next post on the Deployment Layer.
Fine-tuning a model requires two major components: curated training data and the compute infrastructure to run the training. There is an important cohort of LLM infrastructure companies serving each of them.

If you are building in this space, we’d love to hear from you at info@theory.ventures! In our next post, we’ll explore the Deployment Layer.