This is the third post in our 4-part series on the LLM infra stack. You can catch up on our last two posts on the Data Layer and Model Layer. If you are building in this space, we’d love to hear from you at info@theory.ventures.
LLMs allow teams to build transformative new features and applications. However, bringing LLM features to production creates a new set of challenges: LLMs behave non-deterministically, are hard to evaluate, introduce new security and safety risks, and are expensive and slow to run at scale.
Product owners will feel these issues acutely: it may be easy to build a demo, but it is hard to get that feature to work reliably in production.
LLM product launches also pull in other teams. Security and compliance teams will be concerned about the risks LLMs introduce. Platform engineering teams will need to make sure the system can run performantly without exploding the cloud budget.
A new generation of infrastructure will be required to observe and manage LLM behavior. When we talk to executives who are building LLM features, this layer is often the main blocker for deployment.
The key components of the Deployment Layer include:
Security and governance
LLMs create frightening new security and safety vulnerabilities. They take untrusted text (e.g. from user inputs, documents, or websites) and effectively run it as code. They are often connected to data or other internal systems. Their behavior is unpredictable and hard to evaluate.
It’s easy to imagine a variety of ways this can go wrong.
If you’re building a chat support agent, a malicious user could ask the bot to leak details about someone else’s account.
If you’re building an LLM assistant that’s connected to email and the web, someone could make a malicious website that, when browsed, would instruct the LLM agent to send spam from your account.
Anyone seeking to harm a company's brand image could convince its LLM to write inappropriate or illegal content and post a screenshot online.
Today, there is no known way to make LLMs safe by design.
Large foundation models like GPT-4 are trained specifically to avoid bad behavior but can be tricked into breaking these rules in minutes by a non-expert. If these large models can be fooled, it’s not clear why a smaller supervisor model would catch an attack.
There is some emerging research around new architectures, like a dual-LLM design where one LLM handles only risky external data and the other handles only internal systems. This work is still in its infancy.
So what is a product owner to do? The 33% of companies that say security is a top issue (Retool 2023 State of AI) can’t be paralyzed waiting for these issues to be solved. While the state of the art evolves, companies should take the traditional approach of defense in depth.
To do this, product owners can implement security and governance controls throughout the LLM stack.
At the Data Layer, they will need data access and governance controls so that LLMs can only ever access data that is appropriate to the user and context.
At the Model Layer, supervisor/firewall systems can try to identify and block things like PII leakage or obscene language. These systems will have limited effectiveness today: it is not hard to circumvent them, and model size will be limited due to latency. But it’s a place to start (see the sketch after this list).
Post-deployment monitoring systems should flag anomalous or concerning behavior for review.
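Here is a minimal sketch of what a supervisor/output filter at the Model Layer might look like: it scans each response for crude PII patterns and blocked terms before it reaches the user. The `call_llm` helper, the regexes, and the blocklist are illustrative assumptions, not a production-grade control.

```python
import re

# Hypothetical helper that calls whatever model you are serving.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model endpoint")

# Very rough PII patterns -- illustrative only, not exhaustive.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US SSN-like numbers
    re.compile(r"\b\d{13,16}\b"),              # long digit runs (card numbers)
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),    # email addresses
]
BLOCKED_TERMS = {"internal-only", "api_key"}    # assumed org-specific blocklist

def guarded_response(prompt: str) -> str:
    """Generate a response, then block it if it trips a simple output filter."""
    response = call_llm(prompt)
    if any(p.search(response) for p in PII_PATTERNS):
        return "Sorry, I can't share that information."
    if any(term in response.lower() for term in BLOCKED_TERMS):
        return "Sorry, I can't share that information."
    return response
```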
There’s a lot to dig into in this space. We’re doing more research and expect to share a model security/governance-specific blog post in the coming months.
Observability/evaluation
LLMs show non-deterministic behavior that is sensitive to changes in model inputs as well as underlying infrastructure and data.
Today, product teams try to identify edge cases and bugs pre-launch. Most issues are deterministic: the code behind a button is broken, and then it gets fixed. There are usually few enough features that they can be tested manually.
LLM systems must respond to infinite variations of input text and documents. Manually testing them all would be impossible. Teams will need evaluation tooling to measure performance programmatically. This might involve re-running historical queries or generating synthetic ones. They will need to convert unstructured outputs into qualitative and quantitative metrics. Because model outputs can change dramatically based on surrounding components, each test example will need to be traced throughout the LLM infra stack.
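A minimal sketch of what such an evaluation harness might look like: it replays a set of historical queries against the current system and turns the unstructured outputs into simple metrics. The `run_pipeline` entry point and the keyword-overlap metric are assumptions standing in for your own stack and evaluators.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    query: str
    reference: str          # a known-good answer or rubric

# Hypothetical entry point into your LLM pipeline (retrieval + prompt + model).
def run_pipeline(query: str) -> str:
    raise NotImplementedError

def keyword_overlap(output: str, reference: str) -> float:
    """Crude proxy metric: fraction of reference keywords present in the output."""
    ref_terms = set(reference.lower().split())
    out_terms = set(output.lower().split())
    return len(ref_terms & out_terms) / max(len(ref_terms), 1)

def evaluate(cases: list[EvalCase], metric: Callable[[str, str], float]) -> dict:
    """Replay historical queries through the pipeline and summarize the scores."""
    results = [(metric(run_pipeline(c.query), c.reference), c.query) for c in cases]
    scores = [score for score, _ in results]
    return {
        "n": len(results),
        "mean_score": sum(scores) / max(len(scores), 1),
        "worst_cases": sorted(results)[:5],   # lowest-scoring queries to triage
    }
```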
Post-launch, today’s observability platforms monitor for performance degradation, system outages, or security breaches.
The high cost and complexity of running LLMs (described in The Model Layer) make this observability critical. On top of this, LLM observability platforms must provide real-time insights into LLM inputs and outputs. They will need to identify anomalous responses – e.g., ones that are zero-length or just a single repeating character. They can search for potential data loss or malicious use to flag to security teams. They also can evaluate responses for content and qualitative measures of correctness or desirability.
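The anomaly checks mentioned above (zero-length or single-repeating-character responses) can start as simple heuristics run on every response before any deeper analysis. A minimal sketch follows; the threshold is an arbitrary assumption.

```python
def looks_anomalous(response: str, max_repeat_ratio: float = 0.9) -> bool:
    """Flag responses that are empty or dominated by one repeating character."""
    if not response.strip():
        return True                               # zero-length / whitespace-only
    most_common = max(response.count(ch) for ch in set(response))
    if most_common / len(response) > max_repeat_ratio:
        return True                               # e.g. "AAAAAAAA..." degeneration
    return False
```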
These systems (along with product analytics described below) will be the foundation for a new workflow that must be created in LLM product orgs. Tracking poor responses will create an evaluation dataset of edge cases and undesired behavior.
These issues will need to go to a cross-functional team, since it will often not be clear what change(s) are required to fix them. A data engineer might need to fix an issue in the data retrieval system. A different engineer might need to fine-tune or switch to a different model. A product or marketing person might need to change the natural language prompt instructions.
These workflows don’t yet exist, but we think they will be a core competency for companies building the next generation of LLM products.
Traditional ML observability platforms serve some of these functions today. However, LLMs have unique characteristics – unstructured inputs and outputs, non-deterministic behavior, and quality measures that require qualitative as well as quantitative evaluation – that we believe will be best served by a purpose-built platform.
Product analytics
Product analytics for LLM systems look different from the analyses product teams conduct today.
Today’s product analytics are designed to track a series of discrete user actions. On a traditional customer support page, a user might open section A and then click on subheader B.
LLM applications must track very different usage patterns. For a customer support chatbot, every user action might be the same: a sent message. How should a PM analyze which help sections the users are asking about? How can they evaluate the quality of the chatbot’s responses?
LLM applications are also complex, multi-component systems. Let’s say the chatbot’s answer didn’t satisfy the user. Did it not understand the question? Did the data retrieval system not find the right document for the LLM to refer to? Or did the model just make a reasoning error in its generation?
LLM analytics platforms will need to be built for this new paradigm. They will use intelligent systems (often language models themselves) to categorize and quantify unstructured data. They will also provide interfaces and workflows to evaluate individual examples and trace issues through the system.
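One way to make this concrete: use a language model to tag each chat message with a help-section category, then aggregate the tags like any other analytics event. The `call_llm` helper and the category list are illustrative assumptions.

```python
from collections import Counter

CATEGORIES = ["billing", "shipping", "returns", "account", "other"]   # assumed taxonomy

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model endpoint")

def categorize(message: str) -> str:
    """Ask the model to map an unstructured message onto a fixed category list."""
    prompt = (
        "Classify this customer support message into exactly one of: "
        + ", ".join(CATEGORIES) + f".\nMessage: {message}\nCategory:"
    )
    answer = call_llm(prompt).strip().lower()
    return answer if answer in CATEGORIES else "other"

def topic_breakdown(messages: list[str]) -> Counter:
    """Turn unstructured chat logs into a countable metric a PM can chart."""
    return Counter(categorize(m) for m in messages)
```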
LLM product analytics platforms share many core capabilities with LLM observability/evaluation tools. Differences in user types and workflows (e.g., platform engineers monitoring latency vs. product managers analyzing user flows) might require separate platforms. But it’s possible these two categories converge.
Orchestration & LLMOps
LLM applications will often need to coordinate multiple processes to complete a task.
Let’s say I am building an LLM-based travel agent. To plan an activity, it might need to, for example, interpret the user’s request, search for suitable options, check them against the traveler’s preferences and schedule, and then make a booking.
Simple implementations can be built in the application code itself. More complex systems will require purpose-built platforms to orchestrate all of these actions and maintain memory/state throughout.
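A minimal sketch of that kind of orchestration for the travel-agent example. The tool functions (`search_activities`, `check_calendar`, `book`) and the planning prompt are hypothetical stand-ins; the point is the control loop and the shared state, not the specific tools.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model endpoint")

# Hypothetical tools the agent can invoke; each takes a single string argument
# purely to keep the sketch simple.
def search_activities(query: str) -> str: ...
def check_calendar(date: str) -> str: ...
def book(details: str) -> str: ...

TOOLS = {"search_activities": search_activities,
         "check_calendar": check_calendar,
         "book": book}

def plan_activity(request: str, max_steps: int = 5) -> str:
    history = []                                    # memory/state carried across steps
    for _ in range(max_steps):
        prompt = (f"User request: {request}\n"
                  f"Steps so far: {history}\n"
                  "Reply with either 'TOOL <name> <argument>' or 'DONE <answer>'.")
        decision = call_llm(prompt).strip()
        if decision.startswith("DONE"):
            return decision.removeprefix("DONE").strip()
        _, name, argument = decision.split(" ", 2)
        result = TOOLS[name](argument)              # run the chosen tool
        history.append((name, argument, result))
    return "Could not complete the request within the step budget."
```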
Because an orchestration layer acts as a control plane, there’s a natural opportunity for it to expand into more components of the LLM infra stack. We broadly call these companies “LLMOps”. There are about 40 of them today.
LLMOps companies include a variety of features across the LLM infra stack. Some are focused on production inference, while others also provide tooling for fine-tuning. There are sub-categories of LLMOps companies focused on developer tools vs. low/no-code business applications.
Some companies will choose LLMOps platforms for convenience. With limited team capacity, they may be willing to pay extra or give up flexibility for an end-to-end solution.
It’s also possible that end-to-end LLMOps platforms can be more effective than standalone components. For example, they might integrate an observability platform with data storage and model training infrastructure to facilitate regular fine-tuning. They also might be easiest to instrument with security or governance controls.
In the modern data stack, we have seen best-of-breed modular components emerge. We expect that for most companies building LLM applications, the same will be true.
If you are building in this space, we’d love to hear from you at info@theory.ventures. In our next post, we’ll explore the Interface Layer.
This is the second post in our 4-part series on the LLM infra stack. You can catch up on our last post on the Data Layer here. If you are building in this space, we’d love to hear from you at info@theory.ventures.

The core component of any LLM system is the model itself, which provides the fundamental capabilities that enable new products.
LLMs are foundation models: models trained on a broad set of data that can be adapted to a wide range of tasks. They tend to fall into two categories: large, hosted, closed-source models like GPT-4 and PaLM 2 or smaller, open-source models like Llama 2 and Falcon.
It’s unclear which models will dominate. The state of the art is changing rapidly – a new best-in-class model is announced monthly – and so few products have been built with LLMs that the trade-offs remain largely untested. There are two emerging paths:
1. Large foundation models dominate: Closed-source LLMs continue to provide capabilities other models can’t match. A managed service offers simplicity & rapid cost reduction. These vendors solve security concerns.
2. Fine-tuned models provide the best value: Smaller, less-expensive models prove just as good for most applications after fine-tuning. Businesses prefer them because they control the model and can create intellectual property.
Large foundation models will work best for use cases that require broad knowledge and reasoning. Asking a model to plan a month-long vacation itinerary in Southeast Asia for a vegetarian couple interested in Buddhism requires planning and lots of working memory. These models are also best for new tasks where you don’t have data.
Smaller, fine-tuned models will excel in products composed of repeatable, well-defined tasks. They’ll perform just as well as larger models for a fraction of the cost. Need to categorize customer feedback as positive or negative? Extracting details from an email into a spreadsheet? A smaller model will work great.
Which models are safer and more secure? Which are best for compliance and governance? It’s still unclear.
Many companies would prefer to self-host their models to avoid sending their data to a third party. However, emerging research suggests fine-tuning may compromise safety. Foundation models are tuned to avoid inappropriate language and content. LLMs, especially smaller ones, can forget these rules when fine-tuned. Large model providers may also find it easier to demonstrate that their models and data are compliant with nascent regulations. More work needs to be done to establish these conclusions.
In the long term, we can’t predict how LLMs will develop. A large research org could discover a new type of model that is an order of magnitude better than what we have today.
Progress in model development will drastically change the structure of the Model Layer. Regardless of the direction, it will have the following key components:
Core model:
Training LLMs from scratch (also known as pre-training) costs hundreds of thousands to tens of millions of dollars for each model. OpenAI is said to have spent over $100 million on GPT-4. Only companies with huge balance sheets can afford to develop them.
Conveniently, fine-tuning a pre-trained model with your own data can provide similar results at a fraction of the cost. This can cost well under $100 with data you already have on hand. During inference, fine-tuned models can be faster and an order of magnitude cheaper to boot.
Fine-tuning is supported by a rapidly improving set of open-source models. The most capable ones, like Meta’s Llama 2, perform similarly to the best LLMs from 6-12 months ago (e.g., GPT-3.5). When fine-tuned for a concrete task, open-source models often reach near-parity with today’s best closed-source LLMs (e.g., GPT-4).
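As a rough sketch of what this looks like in practice, here is a parameter-efficient fine-tuning example using Hugging Face transformers and peft (LoRA), which trains only a small set of adapter weights on top of a frozen base model. The base model name, the `support_tickets.jsonl` dataset, and the hyperparameters are illustrative assumptions, not recommendations from this post.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "meta-llama/Llama-2-7b-hf"              # assumed open-source base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model with low-rank adapters so only a small fraction of the
# parameters are trained, which is what keeps fine-tuning cheap.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Data you already have on hand, e.g. past support tickets (assumed file name).
data = load_dataset("json", data_files="support_tickets.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```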
Serving/compute:
Training LLMs is expensive, but inference isn’t cheap, either. LLMs require an extreme amount of memory and substantial compute. For a product owner experimenting on OpenAI, each GPT-3.5 query will cost $0.01-0.03. GPT-4 is an order of magnitude higher, at up to $3.00 per query.
If you try to self-host for privacy/security or cost reasons, you won’t find it easy to match OpenAI. Virtual machines with NVIDIA A100 GPUs (the best one for most large foundation models) run $3.67 to $40.55 per hour on Google Cloud. Availability of these machines is very limited and sporadic, even through major cloud providers.
For LLM applications with thousands or millions of users, product owners will face new challenges. Unlike most classic SaaS applications, they’ll need to think about per-query costs. At $3 per query, it’s hard to build a viable product: a feature handling 100,000 queries a month would spend $300,000 a month on inference alone. Similarly, users won’t hang around long if it takes 30 seconds to return an answer.
It will be critical to serve LLMs in a performant and cost-effective way. Even “smaller” models described above still have billions of parameters.
How can you reduce cost and latency in production? There are lots of approaches. On the inference side, batch queries or cache their results. Optimize memory allocation to reduce fragmentation. Use speculative decoding (when a smaller model suggests tokens and a large model “accepts” them). Infrastructure usage can be optimized by comparing GPU prices in real-time and sending workloads to the cheapest option. Use spot pricing if you can manage spotty availability and failovers.
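As one small example of the inference-side tactics above, caching repeated queries avoids paying for the same generation twice. A minimal sketch, assuming a hypothetical `call_llm` helper and exact-match caching; real systems often use embedding-based (semantic) caches instead.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model endpoint")

_cache: dict[str, str] = {}

def generate(prompt: str) -> str:
    """Serve repeated queries from an in-memory cache instead of re-running the model."""
    key = " ".join(prompt.lower().split())   # cheap normalization so trivially
                                             # different prompts share a cache entry
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]
```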
Most companies won’t have the in-house expertise to do all of this themselves. Startups will rise to fill this need.
Model routing/abstraction:
For some LLM applications, it will be obvious which model to use where. Summarize emails with one LLM. Transform the data to a spreadsheet with another.
But for others, it can be unclear. You might have a chat interface where some customer messages are simple, and others are complex. How will you know which should go to a large model vs. a small one?
For these types of applications, there may be an abstraction layer to analyze each request and route it to the best model.
As a product owner, it is already difficult to evaluate the behavior of an LLM in production. If your system might route to many different LLMs, that problem could get even more challenging.
However, if dynamic routing improves capabilities and performance, the trade-off might be worth it. This layer could almost be thought of as a separate model itself and evaluated on its own merit.
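A minimal sketch of such a routing layer, using a cheap heuristic to decide which backend handles each request. The `small_model`/`large_model` helpers and the complexity heuristic are illustrative assumptions; a real router might itself be a small classifier model.

```python
# Hypothetical backends: a fine-tuned open-source model and a hosted frontier model.
def small_model(prompt: str) -> str: ...
def large_model(prompt: str) -> str: ...

def looks_complex(prompt: str) -> bool:
    """Crude heuristic: long, multi-question, or multi-step requests go to the big model."""
    return (len(prompt.split()) > 150
            or prompt.count("?") > 1
            or "step by step" in prompt.lower())

def route(prompt: str) -> str:
    backend = large_model if looks_complex(prompt) else small_model
    return backend(prompt)
```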
Fine-tuning & optimization:
As we wrote above, many product owners will fine-tune a model for their use case.
For most use cases, fine-tuning will not be a one-off endeavor. It will start in product development and continue indefinitely as the product is used.
The key to great fine-tuning will be fast feedback loops. Today, product owners triage user feedback/bugs in code and ask engineers to fix them. For LLM applications, product owners will monitor qualitative and quantitative data on LLM behavior. They will work with engineers, data ops teams, product/marketing, and evaluators to curate new data and re-fine-tune models.
We’ll discuss post-deployment monitoring more in our next post on the Deployment Layer.
Fine-tuning a model requires two major components: curated training data and the compute infrastructure to run the training. There is an important cohort of LLM infrastructure companies serving each of them.

If you are building in this space, we’d love to hear from you at info@theory.ventures! In our next post, we’ll explore the Deployment Layer.