The internet would have you believe that AI for Data is a solved problem. The claim is usually stated as a corollary of ‘text-to-SQL’ being solved and ‘ripgrep is all you need’.
Unfortunately, the reality of the modern enterprise is not a simple data warehouse, a nice Google Drive, and a log store that makes sense. It is a sprawling, messy ecosystem of structured tables, cryptic server logs, and endless folders of PDFs.
Data analysis has always been one third investigative work, another third engineering tools, and a final third decoding the context hidden in questions. So while a tightly scoped question on a provided schema may give the impression that text-to-SQL was solved back in 2023, there is still work to be done in building AI assistants for ‘quick questions’.
If you want an LLM capability to exist in the next 6-12 months, your best bet is to build a benchmark – so we did. We have seen exceptional progress on chat and coding, but robust benchmarks for complex, multi-modal data analysis simply do not exist.
To test the limits of current agents, we didn't just want a Q&A dataset on a fixed schema; we needed a simulation of reality. We built a fake business called Retail Universe from the ground up to serve as the backdrop for a true "data mess." Creating this benchmark was a significant engineering effort in itself, consisting of:
On November 15, over 100 people gathered to throw their best agents at this wall of data. The goal? To see if AI can answer the kind of specific, high-stakes analytics questions that drive real businesses.
The challenge was to solve 63 data science and analytics questions about Retail Universe. Here’s an example:
“Our largest customer in 2022 should sign up for a recurring order program. Which customer, which item, and how many months did they order that item?”
Or another one:
“We ran a signup promotion #102 in Q1 2020 for $20 off their first order. Calculate its ROI based on 12-month value, assuming all signups are incremental.”
For readers who have worked as data scientists, these may feel familiar, but with one twist: the desired answers are extremely specific.
This competition was presented in four phases:

Contestants submitted solutions to the questions as CSV files with a column for each of the ‘checkpoints’ along the way to a correct answer. For example:
“Sales for an item spiked for a few weeks, but revenue is down. Figure out which item and calculate the lost revenue.”
Expected output:
{
"item_sk": 2,
"lost_revenue": 45770.69
}
Submissions were graded automatically, and the results for each key were returned. A point was awarded only if a team got every value for a question correct.
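For illustration, here is a minimal sketch of that grading logic in Python; the column names, answer key, and numeric tolerance are hypothetical stand-ins rather than our exact harness.

```python
import csv

# Hypothetical answer key: question id -> {checkpoint name: expected value}.
ANSWER_KEY = {
    "q17": {"item_sk": 2, "lost_revenue": 45770.69},
}

def grade_submission(path: str, tol: float = 0.01) -> dict:
    """Grade one team's CSV; a point requires every checkpoint of a question to match."""
    results = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            expected = ANSWER_KEY.get(row["question_id"], {})
            checks = {}
            for key, want in expected.items():
                got = (row.get(key) or "").strip()
                if isinstance(want, float):
                    try:
                        checks[key] = abs(float(got) - want) <= tol
                    except ValueError:
                        checks[key] = False
                else:
                    checks[key] = got == str(want)
            results[row["question_id"]] = {
                "checkpoints": checks,
                "point": bool(checks) and all(checks.values()),
            }
    return results
```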
Over six hours, teams battled for prominence in our simulated data analysis showdown.
But we had one final gimmick: a human data analyst with 20 years of experience but no AI tooling – a human baseline.
Our winners were:

Note: in the final minute, TableFlatteners tied the score with ChrisBerlin, and we had to go to a thrilling tiebreaker: I asked three questions, the two teams raced to solve each one, and best of three won.
Here’s how the teams performed, based only on questions where no answers were provided:

While the best-performing teams solved fewer than half of the challenges, a real evaluation of performance looks even less rosy:

You can see here that beyond the first 10 questions (which we gave the answers to), performance was quite poor on many questions, and over a third received no correct solutions.
If we relax the bar and look at all checkpoints, not just full solutions, we see a bit more spread across the challenge:

So our first insight is that most of the checkpoints had some correct answers.
The next thing we wondered was how the day unfolded for the top 10 teams:

Several teams would clear multiple evals at a time, and the majority of the action happened around minutes 300-330.
You can also see that once our top teams (and the human baseline) took the lead, they stayed there:

We looked at correlations of solutions across both contestants and questions, and while they made pretty heatmaps, there weren’t any impressive seriations worth talking about.
If you watch the interviews with the winners, you’ll notice they knew virtually nothing about Retail Universe. It would be nearly impossible to answer any questions as a data scientist at a real company with the lack of knowledge the participants had. Imagine attempting to explain sales phenomena at your organization while not knowing the number of stores you have, where they are, or even your most popular items.
The winning solutions, and many in the lower tier, didn’t build data agents; these teams simply used pre-existing agents like Claude Code, Cursor, and Codex, and asked them to write code to solve the problems. As part of the setup for the challenge, we gave them access to a MotherDuck MCP for easy use of the tabular data and LanceDB instructions for easy use of the PDF data.
We saw a lot of usage of the MotherDuck MCP, and the winning solutions even called out how useful it was for exploring the tabular data. You can see below how client popularity broke down amongst the teams using this MCP, and how many requests were made to it over just a few hours. Similarly, we saw several teams utilizing LanceDB to make searching over the unstructured data much simpler.
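To give a flavor of that workflow, here is a minimal sketch of the kind of exploration teams ran, using local DuckDB and LanceDB clients directly; the file paths, table names, and query are hypothetical, and the MotherDuck MCP wraps calls like these behind tool requests rather than exposing this code verbatim.

```python
import duckdb
import lancedb

# Tabular side: a quick aggregate over a (hypothetical) export of the Retail Universe sales data.
con = duckdb.connect()
top_items = con.sql("""
    SELECT item_sk, SUM(sales_price) AS revenue
    FROM 'store_sales.parquet'
    GROUP BY item_sk
    ORDER BY revenue DESC
    LIMIT 10
""").df()
print(top_items)

# Unstructured side: semantic search over PDF chunks previously embedded into a LanceDB table.
# Assumes the table was created with an embedding function, so text queries are embedded automatically.
db = lancedb.connect("./retail_universe_lancedb")   # hypothetical local database path
docs = db.open_table("pdf_chunks")                  # hypothetical table of embedded PDF text
hits = docs.search("signup promotion 102 terms").limit(5).to_pandas()
print(hits)
```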

This brings us to the first lesson:
Many data problems are coding problems if you squint hard enough.
The agents grepped and parsed and regexed and scanned as much data as they could. They built ephemeral assets to assist them, with a laser focus on the task at hand. And when they confidently landed on an answer, they spat it out. Unfortunately, most of the time, what they spat out was wrong.

Our second lesson:
If you give people unlimited tries, they won't build trustworthy agents.
We chose to give contestants as many submissions as they wanted, without penalty for incorrect answers. This was a mistake. We could have incentivized building trustworthy systems by limiting submissions to one incorrect attempt per checkpoint. Future iterations of this challenge will be graded in a smarter way.
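As a rough sketch, enforcing that limit is a few lines of bookkeeping on the grader's side; this is hypothetical, not something we ran on the day.

```python
from collections import defaultdict

MAX_INCORRECT = 1
# Hypothetical tracker: (team, question, checkpoint) -> number of incorrect attempts so far.
incorrect_attempts = defaultdict(int)

def can_submit(team: str, question: str, checkpoint: str) -> bool:
    """A checkpoint locks once a team has used its single incorrect attempt."""
    return incorrect_attempts[(team, question, checkpoint)] < MAX_INCORRECT

def record_result(team: str, question: str, checkpoint: str, is_correct: bool) -> None:
    if not is_correct:
        incorrect_attempts[(team, question, checkpoint)] += 1
```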
Our number one learning about this event:
If you ask people to compete on a set of explicit evals, hill-climbing won’t move towards a useful agent.
A useful data agent is similar to a useful data scientist: a trustworthy, curious, technical partner – one that helps frame the question that can guide a decision. While we learned a lot about the current capabilities of models for investigating data, we aren’t confident that we’ve seen a useful data agent just yet.
Look forward to follow-up details on this dataset as a benchmark! If you’re interested in Data Agents and would like to partner with us on evaluating yours, please email bryan@theoryvc.com.
The LLM operations space is crowded, and differentiation is fading fast. Platforms race to match each other's features, leaving users to wonder what really sets them apart.
During my internship, both because I needed to evaluate these capabilities and because I wanted to understand the landscape better, I ran a deeper comparison across the major evaluation platforms.
In my last post, I mentioned that we tested several tools and assessed their usefulness in my Board Deck Extraction project. To understand how the user experience differs across platforms, I built a small dataset of two mock board decks and designed a set of evals to judge each platform’s output. I gave myself around two hours with each platform to trace, evaluate, and refactor my prompt.
Helicone, Braintrust, Arize, LangSmith, and Weights & Biases Weave were the candidates for this test. Each platform was graded out of 30 on the integration process: ease of integration, setup time, and the amount of code that had to be refactored.
The quality of the features was also graded out of 30: 20 points for how well the platform does its job, and 10 for the look, feel, UI, and UX.
Finally, how much the platform helped with my actual issue was graded out of 10, for a final score out of 70.
At its core, Helicone is a platform for tracing LLM calls and reporting each call’s statistics. They aren’t really an evaluation platform, and even say so themselves. Even so, I tried using the platform as such for the purpose of this experiment.
The setup was relatively easy. I was able to configure traces for my LLM calls within 20 minutes. From there, I created my own evaluation scoring system and got my scores into the Helicone dashboard, where I could compare different requests’ prompts and outputs along with each score.
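Concretely, the integration is mostly a proxy swap plus an auth header; here is a minimal sketch of that setup, where the model name and property values are placeholders rather than my exact project code.

```python
import os
from openai import OpenAI

# Route OpenAI calls through Helicone's proxy so every request is traced automatically.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        # Custom property headers make it easy to slice requests in the dashboard (placeholder value).
        "Helicone-Property-Experiment": "board-deck-extraction",
    },
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Extract the ARR figure from this board deck excerpt: ..."}],
)
print(response.choices[0].message.content)
```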
I also wanted to try out the prompt playground, but at the time of recording the experiment, the playground was broken. A few other buttons didn’t work, like the “Get Started” button that pops up as soon as you create an account.
Otherwise, Helicone was a very positive experience. Setup was super easy, and not a lot of code had to be refactored to get Helicone integrated. There were a few bugs in the UI, but it got the job done despite not being an evals platform.
Braintrust is an evaluation platform, and I was able to use it as such. The integration into my codebase was very smooth.
For scoring each request’s output, I used heuristic evaluation against the expected output, which Braintrust’s SDK provides.
Within 40 minutes, I was able to get my evaluation workflow set up, and get Braintrust to send my prompt evaluation scores to the dashboard.
I was then able to add traces with one extra line of code.
Overall, the documentation was able to fully guide me through the process of integrating the Braintrust platform into my code. With my input, output, expected output, and evaluation scores all in one place, I was easily able to refactor my prompt.
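For a sense of scale, the whole workflow fits in a few lines; here is a minimal sketch along the lines of Braintrust's quickstart, where the project name, data, and task function are placeholders rather than my actual extraction pipeline.

```python
from braintrust import Eval
from autoevals import Levenshtein  # heuristic scorer shipped alongside the Braintrust SDK

def extract_metric(deck_excerpt: str) -> str:
    # Placeholder task; the real version calls the board-deck extraction prompt.
    return "$3.2M"

Eval(
    "board-deck-extraction",  # placeholder project name
    data=lambda: [
        {"input": "Slide 4: ARR grew to $3.2M ...", "expected": "$3.2M"},
    ],
    task=extract_metric,
    scores=[Levenshtein],  # heuristic comparison against the expected output
)
```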
Arize was the only platform that I struggled to set up. The documentation had broken links and a layout that made it hard to find what you were looking for.
The plan was to follow the quick start guide for traces. Once they were in the dashboard, I could then create an evals pipeline and start iterating on my prompt.
The reality was a mix of poor documentation, poor instruction, and a lack of coffee…
LangSmith felt very similar to Braintrust, which made the setup quick and easy.
LangSmith doesn’t include built-in heuristic evaluators. Instead, it allows custom code evaluators. It also supports composite, summary, and pairwise evaluations, which are useful for combining scores, averaging results, or comparing outputs.
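A custom code evaluator is just a function handed to the evaluate call; here is a minimal sketch, assuming a dataset already exists in LangSmith, with the dataset name, target function, and field names as placeholders (older SDK versions use a run/example evaluator signature instead).

```python
from langsmith import Client

client = Client()

def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    # Custom code evaluator: compare the model's answer to the stored reference output.
    return {"key": "exact_match", "score": outputs["answer"] == reference_outputs["answer"]}

def target(inputs: dict) -> dict:
    # Placeholder target; the real version calls the extraction prompt.
    return {"answer": inputs["deck_excerpt"].upper()}

results = client.evaluate(
    target,
    data="board-deck-extraction",  # placeholder dataset name in LangSmith
    evaluators=[exact_match],
)
```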
The documentation was clear, detailed, and included concept explanations and examples (a notable improvement over Arize). Integration was straightforward and user-friendly.
LangSmith is well-designed, easy to integrate, and effective for iterative prompt evaluation. I have no major complaints; it performs its role reliably and efficiently.
Weave’s evaluation system mirrors Braintrust and LangSmith: you create an evaluation object with a dataset and score, define a scoring function and model, then run evaluations.
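Here is a minimal sketch of that shape, following Weave's quickstart pattern; the project name, dataset rows, and scoring logic are placeholders.

```python
import asyncio
import weave

weave.init("board-deck-extraction")  # placeholder project name

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Scoring function: arguments are matched to dataset columns plus the model output
    # (the output argument is named model_output in older Weave versions).
    return {"correct": expected == output}

@weave.op()
def extract_metric(question: str) -> str:
    # Placeholder model; the real version calls the extraction prompt.
    return "$3.2M"

evaluation = weave.Evaluation(
    dataset=[{"question": "What ARR is reported on slide 4?", "expected": "$3.2M"}],
    scorers=[exact_match],
)
asyncio.run(evaluation.evaluate(extract_metric))
```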
Setup went smoothly, aside from a small logging bug quickly fixed via Cursor.
The platform is geared toward RAG and fine-tuning applications, with strong support for bias and hallucination detection and detailed setup guides. However, this focus means it was less aligned with my use case and required more setup steps.
Once running, customizability and usability were excellent. The dashboard nicely formats LLM outputs with collapsible attribute views.

All in all, this was just my short experience with each platform. This wasn’t a full in-depth analysis, so I encourage you to try each of them out yourself. If you are unsure which platform is right for you, here’s an overview:
Helicone excels at LLM observability for dev teams who want quick logging, cost tracking, and debugging with minimal instrumentation. They offer a quick and easy way to switch between LLM providers via their AI gateway. If you want fast, lightweight, open-source LLM logging and cost telemetry, go with Helicone.
Braintrust is for teams that want a structured workflow to evaluate the quality and consistency of their LLM application and to monitor its live performance. Braintrust bridges prompt/LLM development with quality control and monitoring.
Arize Phoenix is also open-source and allows for deep visibility into complex LLM-based flows. It’s very strong when it comes to diagnosing and iterating on your LLM application. If you are in the early stages of developing a complex LLM app and want an open-source option, Arize Phoenix is for you.
Like Braintrust, LangSmith is an all-in-one package. If you are actively debugging and improving your LLM application, especially if you're already using LangChain, LangSmith is a great option.
If your team is already using Weights & Biases, Weave becomes almost trivial to integrate. If you are doing ML experiments and model tuning with W&B, adding Weave gives you full observability and evaluation options for your LLM applications.
Currently, each platform has its own niche that makes it unique. However, I can see a future where each platform’s niche becomes adopted by every other platform. The user experience of each platform is an important differentiating factor, and all platforms should strive to constantly improve it!
Bryan previously reviewed three of these platforms for their evaluation capabilities in a Video Series: 'Mystery Data Science Theatre' with Hamel Husain and Shreya Shankar. View episodes 1, 2, and 3.
We're excited to announce our AI in Practice survey, in which we sought to map where adoption is happening, where the gaps are, and how teams are hardening and scaling AI in practice. Here's what we found.
Why We Built This Survey: A Market Map for Builders Building for Builders
AI infrastructure is moving faster than any of us can track. Every week brings a new agent framework, evaluation suite, orchestration layer, or open-source model. But the question that matters for founders is simple:
What are teams actually adopting in production — and where are the white spaces no one has solved yet?
To answer that, we went straight to the builders.
Not the people posting demos, not the vendors pitching abstractions — the practitioners responsible for shipping AI into real systems at startups, SMBs, and enterprises.
We surveyed 413 senior technical builders across company sizes, sectors, and geographies to understand how organizations are actually adopting AI components and what they’re trying next.
To make this most actionable for our founders, we wanted to provide the results in an interactive dataset which you can use to pressure-test GTM strategy, refine ICP hypotheses, and identify segments that are over-served, under-served, or completely unserved.
The result is a diagnostic to address:
We’ve shared our core findings here, and we invite founders to explore their own questions and gather specific results.