This is the third post in our 4-part series on the LLM infra stack. You can catch up on our last two posts on the Data Layer and Model Layer. If you are building in this space, we’d love to hear from you at info@theory.ventures.
LLMs allow teams to build transformative new features and applications. However, bringing LLM features to production creates a new set of challenges.
LLMs:
Product owners will feel these issues acutely: it may be easy to build a demo, but it is hard to get that feature to work reliably in production.
LLM product launches also pull in other teams. Security and compliance teams will be concerned about the risks LLMs introduce. Platform engineering teams will need to make sure the system can run performantly without exploding the cloud budget.
A new generation of infrastructure will be required to observe and manage LLM behavior. When we talk to executives who are building LLM features, this layer is often the main blocker for deployment.
The key components of the Deployment Layer include:
Security and governance
LLMs create frightening new security and safety vulnerabilities. They take untrusted text (e.g. from user inputs, documents, or websites) and effectively run it as code. They are often connected to data or other internal systems. Their behavior is unpredictable and hard to evaluate.
It’s easy to imagine a variety of ways this can go wrong.
If you’re building a chat support agent, a malicious user could ask the bot to leak details about someone else’s account.
If you’re building an LLM assistant that’s connected to email and the web, someone could make a malicious website that, when browsed, would instruct the LLM agent to send spam from your account.
Anyone seeking to harm a company's brand image could convince its LLM to write inappropriate or illegal content and post a screenshot online.
Today, there is no known way to make LLMs safe by design.
Large foundation models like GPT-4 are trained specifically to avoid bad behavior but can be tricked into breaking these rules in minutes by a non-expert. If these large models can be fooled, it’s not clear why a smaller supervisor model would catch an attack.
There is some emerging research around new architectures, like a dual-LLM system where one LLM system deals only with risky external data and the other deals only with internal systems. This work is still in its infancy.
So what is a product owner to do? The 33% of companies that say security is a top issue (Retool 2023 State of AI) can’t be paralyzed waiting for these issues to be solved. While the state of the art evolves, companies should take the traditional approach of defense in depth.
To do this, product owners can implement security and governance controls throughout the LLM stack.
At the Data Layer, they will need data access and governance controls so that LLMs can only ever access data that is appropriate to the user and context.
At the Model Layer, supervisor/firewall systems can try to identify and block things like PII leakage or obscene language. These systems will have limited effectiveness today: it is not hard to circumvent them, and model size will be limited due to latency. But it’s a place to start.
Post-deployment monitoring systems should flag anomalous or concerning behavior for review.
There’s a lot to dig into in this space. We’re doing more research and expect to share a model security/governance-specific blog post in the coming months.
Observability/evaluation
LLMs show non-deterministic behavior that is sensitive to changes in model inputs as well as underlying infrastructure and data.
Today, product teams try to identify edge cases and bugs pre-launch. Most common issues are deterministic: the code for a button is broken, then it gets fixed. There are usually few enough features that they can be tested manually.
LLM systems must respond to infinite variations of input text and documents. Manually testing them all would be impossible. Teams will need evaluation tooling to measure performance programmatically. This might involve re-running historical queries or generating synthetic ones. They will need to convert unstructured outputs into qualitative and quantitative metrics. Because model outputs can change dramatically based on surrounding components, each test example will need to be traced throughout the LLM infra stack.
Post-launch, today’s observability platforms monitor for performance degradation, system outages, or security breaches.
The high cost and complexity of running LLMs (described in The Model Layer) make this observability critical. On top of this, LLM observability platforms must provide real-time insights into LLM inputs and outputs. They will need to identify anomalous responses – e.g., ones that are zero-length or just a single repeating character. They can search for potential data loss or malicious use to flag to security teams. They also can evaluate responses for content and qualitative measures of correctness or desirability.
These systems (along with product analytics described below) will be the foundation for a new workflow that must be created in LLM product orgs. Tracking poor responses will create an evaluation dataset of edge cases and undesired behavior.
These issues will need to go to a cross-functional team since it will often not be clear what change(s) are required to fix it. A data engineer might need to fix an issue in the data retrieval system. A different engineer might need to fine-tune or switch to a different model. A product or marketing person might need to change the natural language prompt instructions.
These workflows don’t yet exist, but we think will be a core competency for companies building the future generation of LLM products.
Traditional ML observability platforms serve some of these functions today. However, there are unique characteristics of LLMs:
We believe these will be best served by a purpose-built platform.
Product analytics
Product analytics for LLM systems look different from the analyses product teams conduct today.
Today’s product analytics are designed to track a series of discrete user actions. On a traditional customer support page, a user might open section A and then click on subheader B.
LLM applications must track very different usage patterns. For a customer support chatbot, every user action might be the same: a sent message. How should a PM analyze which help sections the users are asking about? How can they evaluate the quality of the chatbot’s responses?
LLM applications are also built with complex systems. Let’s say the chatbot’s answer didn’t satisfy the user. Did it not understand the question? Did the data retrieval system not find the right document for the LLM to refer to? Or did the model just make a reasoning error in its generation?
LLM analytics platforms will need to be built for this new paradigm. They will use intelligent systems (often language models themselves) to categorize and quantify unstructured data. They will also provide interfaces and workflows to evaluate individual examples and trace issues through the system.
LLM product analytics platforms share many core capabilities with LLM observability/evaluation tools. Differences in user types and workflows (e.g., platform engineers monitoring latency vs. product managers analyzing user flows) might require separate platforms. But it’s possible these two categories converge.
Orchestration & LLMOps
LLM applications will often need to coordinate multiple processes to complete a task.
Let’s say I am building an LLM-based travel agent. To plan an activity, it might need to:
Simple implementations can be built in the application code itself. More complex systems will require purpose-built platforms to orchestrate all of these actions and maintain memory/state throughout.
As a control plane, there’s a natural opportunity to expand control to more components of the LLM infra stack. We call these companies “LLMOps” broadly. There are about 40 of them today.
LLMOps companies include a variety of features across the LLM infra stack. Some are focused on production inference, while others also provide tooling for fine-tuning. There are sub-categories of LLMOps companies focused on developer tools vs. low/no-code business applications.
Some companies will choose LLMOps platforms for convenience. With limited team capacity, they may be willing to pay extra or give up flexibility for an end-to-end solution.
It’s also possible that end-to-end LLMOps platforms can be more effective than standalone components. For example, they might integrate an observability platform with data storage and model training infrastructure to facilitate regular fine-tuning. They also might be easiest to instrument with security or governance controls.
In the modern data stack, we have seen best-in-breed modular components emerge. We expect that for most companies building LLM applications, the same will be true.
If you are building in this space, we’d love to hear from you at info@theory.ventures. In our next post, we’ll explore the Interface Layer.