“When the cost of creating things approaches zero, where does the value lie?” Earlier this month, we met in Berkeley, California, to talk about AI workflows—not in the abstract, but the real ones people actually run every week: AI automations crafted by the users themselves, not handed down on a disconnected stakeholder’s ticket. These demonstrations of ‘personal software’ offered a preview of the future that so many folks see coming. Three friends opened their laptops and showed us how they’re using models as patient, tenacious co-workers: capturing information, keeping creative work on the rails, and taming calendars and inboxes to create more time for human work.
Backing up, this event was conceived by Hamel Husain and me to sate his frequent inquiries: ‘how are you using AI’, ‘tell me about your workflows’, and ‘are you using any new AI tools this week?’. We wanted to craft a show-and-tell with some of the folks living in the future, to tell us what’s actually saving them time or expanding their aperture of focus.
Tomasz Tunguz went first and made a strong case for compressing the firehose. Tomasz leads Theory Ventures as its General Partner, and his workflow is designed to support that. He’s built a pipeline that watches a slate of podcasts, turns speech into text, then synthesizes what matters: the three or four themes worth caring about, the quotes that actually support those themes, and even a “contrarian take” prompt emulating Peter Thiel’s strategy. The punchline isn’t a dashboard—it’s a daily email he can read in fifteen minutes instead of forty hours of listening.
Hamel responded by asking whether any of these podcasts are ‘incompressible’. In a testament to the fidelity of the system, Tom admitted that he rarely “goes back to listen”, but he does interrogate the transcripts: he asks targeted questions against them, adds additional research queries, and collates follow-ups into a task list so the research yields outcomes. Along the way, the system quietly flags net-new companies and pushes them into the CRM, feeding the rest of the firm’s virtuous cycle.

Tom’s pipeline starts with the audio. He pulls episodes down locally, runs Whisper on his GPU to transcribe (cheap and fast), then passes the raw transcript through a small local model whose only job is to strip filler—ums, uhs, throat-clears—so the expensive part that follows stays lean. In his words, input volume drives 80–90% of LLM cost, so the pre-shrink matters. With the text compacted, he hands it to a “real” model (Claude Sonnet) that executes a long prompt: split the episode into 5–15 minute segments, summarize each segment’s argument, lift direct quotes, and extract entities (people, companies) that might deserve a look. From there, a little glue kicks off a downstream routine to check those entities against the CRM and create records for the net-new ones. The outputs are deliberately boring and useful: a daily summary email for him, plus a PDF to his Kindle; behind the scenes, the system often surfaces a couple of new startups worth tracking. He’s running it as a local CLI right now to keep iteration tight while he gets the loop right.
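To make the shape of that loop concrete, here is a rough sketch of what a pipeline like this might look like. The Whisper and Anthropic calls follow those libraries' standard Python APIs, but everything else (the filler-stripping pass, the entity extraction, the CRM push, and the exact model string) is an assumption standing in for Tom's actual code.

# Sketch – a podcast-to-digest loop in the spirit of Tom's pipeline (illustrative only)
import re
import whisper
import anthropic

SUMMARY_PROMPT = (
    "Split this transcript into 5-15 minute segments. For each segment, "
    "summarize the argument, lift direct quotes, and list people and companies mentioned."
)

def strip_filler(text: str) -> str:
    # Stand-in for the small local model that pre-shrinks the transcript
    return re.sub(r"\b(um+|uh+|you know)\b[,.]?\s*", "", text, flags=re.IGNORECASE)

def process_episode(audio_path: str) -> str:
    # 1. Transcribe locally on the GPU (cheap and fast)
    transcript = whisper.load_model("medium").transcribe(audio_path)["text"]

    # 2. Pre-shrink before the expensive call; input volume drives most of the LLM cost
    lean_text = strip_filler(transcript)

    # 3. Hand the compacted text to Claude Sonnet with the long analysis prompt
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model string
        max_tokens=4096,
        messages=[{"role": "user", "content": f"{SUMMARY_PROMPT}\n\n{lean_text}"}],
    )
    analysis = response.content[0].text

    # 4. Downstream glue would check extracted entities against the CRM, e.g.
    # for entity in extract_entities(analysis):      # hypothetical helpers
    #     upsert_company_into_crm(entity)
    return analysis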
Greg Ceccarelli followed by zooming in on intent; his workflow is deeply spec-driven, communicating objectives rather than tool-by-tool directions. He showed a code-driven video workflow where scenes live as components, the timeline updates the instant you change a prop, and “edits” are just commits. It’s the opposite of pixel-pushing: the source of truth is the spec and prompt. One of Greg’s keys to success is monitoring and iterating on the trail of chats that led to the code. Via a tool his company built, he stores the upstream conversations alongside the generated files.
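We can't reproduce Greg's stack here, but the idea of scenes as components is easy to sketch: the timeline is computed from declarative scene definitions, so changing a prop changes the cut, and an edit is just a commit to the spec. The toy below is purely illustrative, not his tooling.

# Toy sketch – scenes as declarative components; the timeline is derived, never hand-edited
from dataclasses import dataclass

@dataclass
class Scene:
    name: str
    duration_s: float   # change this prop and every downstream cut shifts
    voiceover: str

def timeline(scenes: list[Scene]) -> list[tuple[float, float, str]]:
    cuts, t = [], 0.0
    for scene in scenes:
        cuts.append((t, t + scene.duration_s, scene.name))
        t += scene.duration_s
    return cuts

spec = [
    Scene("cold-open", 4.0, "When creation is free, where does value live?"),
    Scene("demo", 12.5, "Three laptops, three workflows."),
]
print(timeline(spec))   # an "edit" is a change to `spec`, committed like any other code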
Greg also introduced the term “dead looping” for that familiar feeling of issuing the same prompt three times and waiting for the model to suddenly become insightful. He advised changing the frame, starting a new session, and resetting the context that’s poisoning the agent.
Finally, Claire Vo closed the loop by putting agents on the calendar and content grind. Claire seems to have conquered the “Sunday Scaries”: an agent walks three calendars, pulls event names and attendees, checks email to remember who a meeting is with and why, and spots places to consolidate time. In case you are worried these tools lack heart: hers auto-blocks a morning window to walk her kids to school if the day allows it. The brief lands in Slack—automation that takes the rough edges off the start of the week.
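We don't have Claire's code, but the scheduling pass is easy to picture. The toy below operates on already-fetched events; the calendar, email, and Slack plumbing is omitted entirely, and the consolidation and school-walk rules are our guesses at the logic, not hers.

# Toy sketch – the Sunday-scaries scheduling pass (illustrative logic only)
from datetime import date, time

def build_weekly_brief(events: list[dict]) -> list[str]:
    brief, by_day = [], {}
    for event in sorted(events, key=lambda e: (e["day"], e["start"])):
        by_day.setdefault(event["day"], []).append(event)

    for day, day_events in by_day.items():
        # Flag adjacent meetings with overlapping attendees as candidates to consolidate
        for a, b in zip(day_events, day_events[1:]):
            if a["attendees"] & b["attendees"]:
                brief.append(f"{day}: consider merging '{a['name']}' and '{b['name']}'")
        # Protect a morning window for the school walk if nothing starts before 9am
        if all(e["start"] >= time(9, 0) for e in day_events):
            brief.append(f"{day}: morning blocked for the school walk")
    return brief

sample = [
    {"day": date(2025, 9, 8), "start": time(9, 30), "name": "Pipeline review", "attendees": {"sam"}},
    {"day": date(2025, 9, 8), "start": time(10, 0), "name": "Pipeline follow-up", "attendees": {"sam"}},
]
print("\n".join(build_weekly_brief(sample)))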

Similarly, her content flow reduces the frictional steps of producing a podcast: feed a recording and transcript into a generator primed on her own archive, and the draft comes back 80–85% there, getting her straight to the part of the workflow that requires real taste.
Tomasz, Greg, and Claire offered varying perspectives but a shared underlying theme: selection, specification, and synthesis. The practical recommendation is to start small: pick one place where you already spend energy—the media you consume, assets you produce, logistics that drain you—and sketch the loop you wish existed. Capture, compress, and connect it to the surface where you’ll actually act. Then turn the crank once.
“All teams will henceforth expose their data and functionality to LLMs, and anyone who doesn't do this will be fired.” - Jeff Bezos (theoretically)
Jeff Bezos famously mandated this for web services when starting AWS; the version above is a likely update for the LLM era.
When building for developers, this discipline made AWS services “externalizable by default,” propelled Amazon’s platform strategy, and helped cement microservices as modern dogma.
When building for LLMs, this discipline means having the right context from a variety of systems easily accessible. Although they are incredibly powerful, LLMs are only as intelligent as the context they have.
At Theory Ventures, we’re investors, but we’re also all builders, creating internal software to move fast. Our goal is to answer investor questions about any company in seconds, from “What’s new since the last call?” to “Show revenue by cohort” to “Summarize the last three meetings.”
As a simple example, consider an investor writing a company memo to share internally; the relevant information is drawn from several different sources: the CRM, notes in Notion, historical activity, and financial data.
Remembering and managing all of these different data sources and copying them is a lot of work, but what if all that context could be available to the investor’s LLM just by mentioning the company name?
Model Context Protocol (MCP) has emerged as a simple and robust way to give LLMs the right context. MCP is a standardized protocol that empowers LLMs to connect with data sources, tools, and workflows.
A well-designed MCP enables users and agents to complete their work wherever that work is happening: in chat or in a container. MCP is intentionally boring in all the right places: predictable schemas, explicit descriptions, and a clean contract between the system and the model.
We deploy an MCP server on FastMCP Cloud so the LLM client can call it from anywhere without custom infrastructure.
LLMs don’t magically know our world; they need the right context delivered to them.
So we exposed one MCP tool that the model can reason about: company_context. Given a company name or domain, it returns a structured summary with IDs, core metadata, notes, historical activity, and (when applicable) financials. Internally, this tool orchestrates multiple services, but the LLM only sees a single, well‑documented interface.
At a high level, the tool returns core metadata (name, canonical ID, domains), Notion notes, historical activity, and, for portfolio companies, financials.
Here’s the company_context tool’s internal orchestration:
# Pseudocode – orchestrating a company context query
def get_company_context(company_name: str):
    company_id = get_company(company_name)
    core = fetch_core_company_data(company_id)
    notes = fetch_notion_pages(company_id)
    history = fetch_historical_data(company_id)
    financials = None
    if core.get("is_portfolio_company"):
        financials = fetch_financials(company_id)
    return serialize_company_data(core, notes, history, financials)
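Exposing that function over MCP is then a thin layer. Here is a sketch of what the FastMCP registration might look like; the decorator pattern follows FastMCP's documented usage, but the server name and docstring are placeholders, not our production code.

# Sketch – registering the tool with FastMCP (illustrative)
from fastmcp import FastMCP

mcp = FastMCP("company-context")

@mcp.tool
def company_context(company: str) -> dict:
    """Given a company name or domain, return structured context: core metadata,
    Notion notes, historical activity, and financials for portfolio companies."""
    return get_company_context(company)

if __name__ == "__main__":
    mcp.run()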
The tool returns a compact, documented schema so the model knows what each field means and how to consume it. For example:
{
  "name": {
    "value": "Company A",
    "description": "Public-facing name of the company"
  },
  "id": {
    "value": "123444ee-e7f4-4c9f-9a4e-e018eae944d6",
    "description": "Canonical company UUID in our system"
  },
  "domains": {
    "value": ["example.com", "example.ai"],
    "description": "Known web domains used by the company"
  },
  "notion_pages": {
    "value": [
      {"page_id": "abcd1234", "title": "Intro & thesis", "last_edited": "2025-07-28"}
    ],
    "description": "Notion pages with analyst/investor notes"
  },
  "is_portfolio_company": {
    "value": true,
    "description": "Whether the company is in our portfolio"
  }
}
This isn’t fancy agent pixie dust; it’s just clear contracts that let the model get access to the context it needs without human input.

If you remember only one thing, make it this: Expose your service’s core functionality as MCP tools and make them excellent. That’s the shortest path to truly AI‑native software. It’s the clearest mandate for the next decade.
Imagine dropping Einstein into a back-office job at a random Fortune 500 company. Despite his genius, if he didn’t know what the company does or how the role works, he wouldn’t be much help.
AI systems are rapidly improving at work tasks, like summarizing notes, writing queries, and updating slides. But they suffer from the same challenge: knowing how to do work is very different from actually working at a company.
As our models continue to get smarter, how do we get them to be better at doing real jobs?
Building AI automation for enterprises is so complicated because every company operates differently. Even two businesses in the same industry can have distinct processes, systems, and decision-making.
No matter how smart foundation models get, there is no way for them to address this. It’s not an intelligence issue; it requires knowledge of companies’ internal operations, which are proprietary, idiosyncratic, and often undocumented.
So how can we make AI automations work? We need some way to:
Understand how an enterprise works,
Deliver that knowledge to an AI system, and
Maintain and update that knowledge over time.
We believe this will create the first major new system of record in years: a Business Context Layer.
Today, a new customer support rep might be told to read a 100-page Standard Operating Procedure (SOP) during onboarding. The SOP includes instructions on how to run processes and handle exceptions: If a customer wants to change account information, always ask for verification. If they ask for a refund, consult these policy rules.
In most companies, these documents are incomplete, outdated, and even contradictory. This leads to teams building tribal knowledge on their own and following processes inconsistently.
When there are millions of AI agents performing complex tasks across the enterprise, we will need something much better: a living system of record that documents all of the written and unwritten rules for how a company operates, and delivers the right instructions to AI systems as they do work. We call this a Business Context Layer or BCL.
We think the key components of this platform are:
Automated context extraction/synthesis from operational data: Today’s SOPs are created manually via painstaking process mining, but a future BCL will need to do this largely automatically. This is a complex problem: a system needs to observe human activity (e.g. logs, tickets, chats, or screen recordings), infer a set of rules that describes the behavior, test those rules on real-world data, then iterate on them until they are as accurate as possible.
A retrieval system to deliver the right context to AI agents: At enterprise scale, a document describing all of the business rules would be far too large to pass wholesale to every AI agent: it would be prohibitively slow/expensive, and you would likely suffer decreased accuracy due to context rot. An enterprise-grade BCL will need to (1) index/store this data efficiently, (2) find and deliver the context required to complete a given task, and (3) track what context was used for which queries to inform future improvements. A toy sketch of this retrieval contract appears after this list.
An excellent interface for domain experts to maintain and update context: The BCL must be constantly updated and improved as processes change. Maintaining this knowledge base will become a primary job for humans. This is a complex product to build – it’s got elements of a source control platform like GitHub, experimentation like Statsig, and user-friendly collaboration like Figma. Product & UX will be a major differentiator in the space.
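No such platform exists yet, so any code here is necessarily hypothetical, but the retrieval contract from the second component might look something like the toy below: rules stay as human-editable text, retrieval finds the ones relevant to a task, and every lookup is logged for future improvement. The keyword scoring is a placeholder for whatever indexing a real BCL would use.

# Hypothetical sketch – the retrieval contract of a Business Context Layer
from dataclasses import dataclass, field

@dataclass
class BusinessRule:
    rule_id: str
    text: str                 # human-readable, human-editable rule
    tags: set[str]            # e.g. {"support", "account-access"}
    usage_log: list[str] = field(default_factory=list)

class ContextLayer:
    def __init__(self) -> None:
        self.rules: dict[str, BusinessRule] = {}

    def index(self, rule: BusinessRule) -> None:
        self.rules[rule.rule_id] = rule

    def retrieve(self, task: str, limit: int = 3) -> list[BusinessRule]:
        # Toy relevance: overlapping words; a real BCL would index far more carefully
        words = set(task.lower().split())
        ranked = sorted(self.rules.values(),
                        key=lambda r: len(words & set(r.text.lower().split())),
                        reverse=True)
        hits = ranked[:limit]
        for rule in hits:
            rule.usage_log.append(task)   # track which context was used for which query
        return hits

bcl = ContextLayer()
bcl.index(BusinessRule("r1", "Account access issues go to the Customer Verification team", {"support"}))
print([r.rule_id for r in bcl.retrieve("customer cannot access their account")])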
In parallel to this emerging context layer, many startups are building “digital twins” of enterprise software and systems. You can let millions of AI agents loose in these simulated environments, provide a goal (e.g., “resolve these support inquiries”), and they will learn how to make business decisions and operate tools via Reinforcement Learning (RL). We think this technique is powerful, but it solves a different problem than a BCL.
Any model fine-tuning comes with trade-offs. You need the expertise and capacity to run training jobs. You need to continuously evaluate and backtest models, because fine-tuning can impact them in unexpected ways. When new models come out, you have to do the whole thing over. RL has all these challenges and more: it is notoriously difficult/unstable to train and very hard to design the appropriate scoring/reward functions, often resulting in unexpected behaviors.
For enterprise workflow automation, there are two other major limitations of the RL approach:
It is a black-box: You don’t know what the model learned or why it made a certain decision. A BCL might show a simple learned rule in text: “Anyone having trouble accessing their account should be passed on to the Customer Verification team.” But with an RL system, these learnings will be hidden in model weights.
It is not easily modified: Companies are constantly changing their processes, and AI agents will need to, too. Say you want to modify your workflow to “Anyone having trouble accessing their account should first try to reauthenticate with our new portal. If that doesn’t work, then send them to the Customer Verification team.” With a BCL, this change could be made in a few minutes in plain text (then using a testing/eval harness to evaluate the impact). With RL, you might need to update the environment, design a new reward function, re-run training, and then evaluate the impact. That is a long and arduous process.
Our hypothesis is that RL environments will play an important role, but primarily serve large research labs. Using them, foundation models will get dramatically better at doing enterprise work generally: updating CRMs, processing tickets, writing messages, etc. Companies will then use a BCL to provide instructions on how these models should do work at their business – in a human-interpretable, easily-modifiable form.
To bring a BCL to market, you need to sell outcomes, not infrastructure. Outcomes are what drive executive urgency. And most enterprises will not be capable of building complex applications with this infrastructure, even if they, in theory, could create ROI.
The clearest value proposition for a BCL is operational process automation and augmentation. Whether an enterprise buys a workflow automation platform or tries to build one internally, out-of-the-box performance will likely be poor due to a lack of business context. A BCL solves this problem without requiring teams of consultants and engineers to hardcode information into prompts, and it can help automate a much larger proportion of tasks with more reliability and controllability.
There are additional value propositions from a BCL: visibility/management (providing leaders with insight into how the organization operates) and productivity (providing front-line workers with additional context or information to do their jobs better), but we think these are secondary to core automation.
Our key questions on the future of the BCL are about the packaging and delivery model:
Any AI automation needs organization-specific context to improve performance. Today’s ascendant platforms in customer support, ITSM, sales automation, etc., are already limited by a lack of context. Right now, they solve this primarily through forward-deployed resources, but they will likely try to productize this capability over time.
Will there be a standalone context layer, or will this just be an approach/feature of each enterprise AI app? We think businesses will benefit from a single shared context layer, versus having context siloed across many separate applications, but it remains an open question.
We think an effective context layer must be created and maintained mostly autonomously. However, building end-to-end automations could still be manual – you might need process discovery to figure out what you should automate in the first place, systems/data engineering to get a solution into production, and change management when a system is deployed.
Companies like Distyl.ai are trying to build a next-generation Palantir: selling complete, services-led solutions built on top of a central platform that can drive recurring revenue and expansion use cases. We think this approach is likely to dominate in the F100, but that there is an equally exciting opportunity to build a more scalable, product-led company for the rest of the enterprise and mid-market.
If you’re thinking about business context for AI systems, we’d love to chat! Send a note to at@theoryvc.com.