As the demand for data increases and the cost of storage decreases, most data pipelines have moved to an ELT paradigm: loading all data into a warehouse in unaltered form, then building a series of transformations for downstream consumers.
Historically, building these pipelines required the expertise of centralized data teams who could develop in analytics engines like Spark. But in recent years, we’ve seen the rise of open-source, easy-to-use frameworks that have democratized the development of data pipelines with SQL.
These tools work great at small scale, but as an organization grows – in terms of headcount, data volume, or number of models – they start to break. That’s because these platforms typically re-run all computations after any change to data or a model.
As a result, pipelines take hours to run and cloud costs balloon because huge volumes of data are processed unnecessarily. Developers don’t know if a change will break someone else’s model downstream, and must wait hours each time the pipeline runs. Maintaining development/staging and production environments is complex; each deployment typically involves a full re-run. Running backfills and forward-only changes requires custom infrastructure or manual workarounds.
Tobiko Data solves these problems with an open-source data transformation framework that is just as simple to develop in with SQL, while reducing costs significantly and bringing DevOps best practices at scale. We are thrilled to lead their $17.3 million Series A, joined by Unusual Ventures and angels including George Fraser, CEO of Fivetran, and Jordan Tigani, CEO of MotherDuck.
Tobiko Data: a transformation framework that understands your data
The journey of Tobiko started with an open-source project called SQLGlot, a no-dependency SQL parser, transpiler, optimizer, and engine. Co-founder Toby Mao, then at Netflix, realized how useful it would be to have a simple Python library that could parse and transpile across the various SQL dialects his team used.
Because SQLGlot allows you to understand the meaning of and build abstract syntax trees (ASTs) for any SQL code, Toby and co-founders Tyson Mao and Iaroslav Zeigerman realized it could be the foundation for a new type of data transformation framework built to address some of the large-scale data pipeline challenges they saw at Airbnb, Netflix, and Google.
The core problem with current data transformation frameworks is that they (1) don’t understand how the code you write corresponds to flows of data through the pipeline, and (2) are stateless. Because of this, the entire computational graph is re-run after any change to a model or data, unless a user implements complex custom logic and instrumentation.
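To make the contrast concrete, here is a minimal sketch of what a stateful planner could look like: it records a fingerprint for each model after a successful run, then re-runs only the models whose SQL changed plus everything downstream of them. This is an illustration under stated assumptions, not how any particular framework is implemented. The function names, the model names, and the dependency map are all hypothetical, and the fingerprint is a crude hash of whitespace-normalized text, where a real system would more likely compare parsed queries.

```python
import hashlib
from graphlib import TopologicalSorter


def fingerprint(sql: str) -> str:
    """Hash a model's SQL after collapsing whitespace (a crude stand-in for comparing ASTs)."""
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan(models: dict, upstream: dict, state: dict) -> list:
    """Return the models that must re-run, in dependency order.

    models:   model name -> SQL text
    upstream: model name -> set of model names it reads from
    state:    model name -> fingerprint recorded at the last successful run
    """
    graph = {name: set(upstream.get(name, ())) for name in models}
    # static_order() yields upstream models before the models that depend on them.
    order = [n for n in TopologicalSorter(graph).static_order() if n in models]
    stale = set()
    for name in order:
        changed = state.get(name) != fingerprint(models[name])
        if changed or any(up in stale for up in graph[name]):
            stale.add(name)
    return [n for n in order if n in stale]


# Hypothetical four-model pipeline.
models = {
    "raw_orders": "SELECT * FROM source.orders",
    "daily_revenue": "SELECT day, SUM(amount) AS revenue FROM raw_orders GROUP BY day",
    "top_customers": "SELECT customer_id FROM raw_orders ORDER BY amount DESC LIMIT 10",
    "revenue_report": "SELECT * FROM daily_revenue",
}
upstream = {
    "raw_orders": set(),
    "daily_revenue": {"raw_orders"},
    "top_customers": {"raw_orders"},
    "revenue_report": {"daily_revenue"},
}

# State saved after a full run, followed by an edit to a single model.
state = {name: fingerprint(sql) for name, sql in models.items()}
models["daily_revenue"] = (
    "SELECT day, SUM(amount) AS revenue FROM raw_orders WHERE amount > 0 GROUP BY day"
)
to_run = plan(models, upstream, state)
print(to_run)
```

In this sketch, only the edited model and the report that reads from it are scheduled. A stateless runner has no saved fingerprints to compare against, which corresponds to calling `plan` with an empty `state` and re-running everything.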
Enter SQLMesh – an open-source transformation framework based on a semantic understanding of SQL. This allows SQLMesh to keep track of data as it flows through pipelines and as your team updates models and data over time, which helps organizations develop more effectively and save huge amounts of time and money.
In addition to the core transformation framework, SQLMesh includes an orchestrator, CI/CD testing framework, and virtual data environments to manage the promotion of changes to production – all in the open-source library.
SQLMesh has a robust community and partnerships with companies including Harness, Fivetran, Pipe, Wealthsimple, Textio, and Dreamhaven.
For Harness, switching to SQLMesh reduced their cloud warehouse spend by 30-40% by avoiding unnecessary recalculations. It also made their developers more productive, reducing model build time by 80% and flagging breaking changes to allow for fast iteration.
This month, Tobiko is launching a managed version of SQLMesh called Tobiko Cloud. This will allow any organization to easily run SQLMesh without managing pipeline state, while all data processing remains on the customer’s own infrastructure.
Conclusion
As the importance of data continues to grow, the biggest challenge companies face is how to manage it effectively. How can data analysts build pipelines quickly without giving data platform engineers a headache? How can teams support larger and more complex pipelines without degrading performance or letting costs spiral?
We are so excited to partner with Tobiko Data, who are ending the historical tradeoff between data transformation usability and scalability to enable the next generation of great data companies.
Imagine a web with no websites. All your favorite apps are now just databases, queried for information by agents that zip around at your behest.
Bots made up nearly half of global web traffic in 2023. With LLM agents, this number will explode. It seems plausible that 90% or more of web traffic will be non-human in the near future.
Today’s products and websites are designed for humans. They render information visually via carefully crafted UIs. They erect captchas and other roadblocks to try to stop bots, which are associated with spam and fraud.
In a world primarily composed of bots, what happens to websites? Do agents learn to navigate them like a human? Do they go away entirely?
It’s tempting to think that websites are obsolete in a world full of agents. Websites are just visual interfaces to interact with data. Why make an AI agent click through a set of buttons when it could just query the application directly?
Even though some applications will build agent-friendly APIs, for the foreseeable future we believe the web will remain dual-use. This is because:
UIs are effective
The information bandwidth of human vision is over 1 million times higher than the bandwidth of reading or listening to language. Product designers take advantage of this with carefully crafted visual interfaces that efficiently convey complex information and help users take action on it.
Say you want to compare restaurant options. Most likely, a scrollable map with location pins, overlays with photos or reviews, and buttons for common filters will be more effective than typing all your criteria into a chat interface and reading the results one at a time.
There will certainly be embedded applets and content cards, but the best and most robust UIs will remain on native web apps themselves – more on that below.
Building good APIs is hard
There are many beautiful web apps but few amazing APIs. Even when web apps have one, it usually doesn't provide the same feature set as the app and often has poor DX/usability. This is partly because APIs are not prioritized and partly because building effective APIs is hard – especially if the application was originally architected around a UI.
For AI agents, even the best API might not be good enough. LLMs struggle to use tools reliably, and user flows that involve multiple API calls (e.g. “find a hotel and book it for me”) pose a challenge. Developers will likely need to enrich APIs with additional documentation, prompting, or validation to ensure that AI agents can use them reliably.
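As one illustration of the validation idea, the sketch below wraps a hypothetical `book_hotel` tool with a description that tells the agent which call should come first and a validator that returns plain-language errors the agent can use to correct itself. The tool name, its parameters, and the error wording are all assumptions made for this example; they do not describe any real booking API.

```python
from datetime import date

# Hypothetical tool description an agent would be shown; not a real API.
BOOK_HOTEL_SPEC = {
    "name": "book_hotel",
    "description": "Book a room. Call search_hotels first to obtain a hotel_id.",
    "params": {
        "hotel_id": str,
        "check_in": str,   # ISO date, e.g. "2025-03-01"
        "check_out": str,  # ISO date, must be after check_in
        "guests": int,
    },
}


def validate_book_hotel(args: dict) -> list:
    """Return a list of human-readable problems; an empty list means the call is valid."""
    params = BOOK_HOTEL_SPEC["params"]
    errors = []
    # Structural checks: required parameters, types, and unexpected extras.
    for name, expected in params.items():
        if name not in args:
            errors.append(f"missing parameter '{name}'")
        elif not isinstance(args[name], expected):
            errors.append(
                f"'{name}' must be {expected.__name__}, got {type(args[name]).__name__}"
            )
    for name in args:
        if name not in params:
            errors.append(f"unknown parameter '{name}'")
    if errors:
        return errors
    # Semantic checks that a type signature alone cannot express.
    try:
        check_in = date.fromisoformat(args["check_in"])
        check_out = date.fromisoformat(args["check_out"])
    except ValueError:
        return ["dates must be ISO formatted (YYYY-MM-DD)"]
    if check_out <= check_in:
        errors.append("check_out must be after check_in")
    if args["guests"] < 1:
        errors.append("guests must be at least 1")
    return errors


good_call = {"hotel_id": "h_123", "check_in": "2025-03-01",
             "check_out": "2025-03-03", "guests": 2}
bad_call = {"hotel_id": "h_123", "check_in": "2025-03-03",
            "check_out": "2025-03-01", "guests": "2"}
print(validate_book_hotel(good_call))
print(validate_book_hotel(bad_call))
```

Returning the errors as text, rather than raising an exception, is a deliberate choice in this sketch: the messages can be fed back to the model so it can retry the call with corrected arguments.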
Companies are incentivized to prioritize their surface over others
The 2010s demonstrated the impact demand-aggregating platforms like Facebook had on the publishing industry. These platforms boosted publishers’ traffic initially but had a damaging long-term impact as many consumers stopped going to the publishers’ sites entirely (depriving them of advertising dollars).
With AI assistants, many other companies face the same kind of disintermediation.
Airbnb wants you to think of its brand when planning a weekend trip. If you choose to go to Airbnb to initiate a search, you’re likely to complete a transaction on the platform.
Imagine you instead ask your AI assistant to plan the trip and get results from Airbnb, Vrbo, and Booking.com (perhaps without even seeing which is which). You’re much less likely to complete that transaction on Airbnb; the company also learns less about your behaviors and interests, has less ability to serve targeted recommendations, and loses brand equity.
There will be a spectrum of responses – some companies embracing bots, others attempting to block them entirely. But on the whole, we expect that most companies will prioritize their own websites, chatbots, and apps over working smoothly with third-party AI assistants.
For human users, we expect the dual-use web will look pretty similar to today. But behind the scenes, there are early signs of how internet tooling will change to support AI agents.
Browser + automation infrastructure
Outside of flashy demos, AI agents will not browse websites on actual computer screens. Instead, they will use headless browsers, which replicate the functionality of a regular browser but can be run in large numbers in the cloud.
Similarly, AI agents will not move around a cursor or type with a keyboard. They’ll use a browser automation framework that lets them interact with web pages programmatically using code.
There is a healthy ecosystem of headless browsers and automation frameworks today. Historically, they’ve been designed for web scraping and web app testing. These use cases share some common needs with AI agents, such as resource management, bot avoidance, and various tricks to improve reliability of actions (e.g. only clicking a button when it’s active).
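The "only click when active" trick can be sketched generically as a polling helper. The version below is a simplified illustration that is not tied to any real automation framework: `click_when_enabled` and `FakeButton` are hypothetical names, and the fake button stands in for a page element so the example runs without a browser.

```python
import time
from typing import Callable


def click_when_enabled(is_enabled: Callable[[], bool], click: Callable[[], None],
                       timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Poll until the element reports it is enabled, then click it once.

    Returns True if the click happened, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_enabled():
            click()
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeButton:
    """Stand-in for a page element that becomes active after a few polls."""

    def __init__(self, ready_after_polls: int):
        self.polls = 0
        self.ready_after_polls = ready_after_polls
        self.clicked = False

    def is_enabled(self) -> bool:
        self.polls += 1
        return self.polls > self.ready_after_polls

    def click(self) -> None:
        self.clicked = True


button = FakeButton(ready_after_polls=3)
clicked = click_when_enabled(button.is_enabled, button.click, timeout=2.0, interval=0.01)
print(clicked, button.polls)
```

With a real framework, the two callables would wrap that framework's own element checks; many frameworks already build this kind of waiting into their click actions.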
But AI agents will also have unique needs that set them apart from the scrapers of the past.
In addition to enabling AI agents, browser and automation infrastructure will support the collection of data for foundation model training, fine-tuning, and retrieval systems. Because LLMs can easily generate automation framework scripts, we expect there will be an explosion of AI-generated end-to-end tests for every application.
Combined, these will all dramatically increase the need for and value of next-generation browser and automation infrastructure. For more on this topic, check out this great memo from Paul Klein at Browserbase.
Authentication/authorization
Most websites today have defenses to try to identify and block bots. Accessing and actioning requests will be even more difficult for AI agents.
Many things you’d want an assistant to do require not just that it accesses a website, but that it takes actions on your behalf. Booking a flight, changing a dinner reservation, or sending a message all require the user to be signed in. How will AI agents do that?
In the short term, many agents will spoof human activity to take these actions. This is complex, requiring realistic human IP addresses, devices, and behaviors, as well as managing sensitive user credentials. It’s also relatively high-risk: a misstep could result in a user’s account being banned. But we expect that so long as the actions are reasonable and directed by a user, most platforms should be amenable (or at least turn a blind eye) to the activity.
In the longer term, we expect there will be separate pathways for machine authentication and authorization. These might look like consumer-directed service accounts. We think it will be a while before these are commonplace and might only be delivered with agent APIs instead of the web interface.
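One way to picture a consumer-directed service account is as a credential the user grants to a specific agent, limited to named actions and a time window. The sketch below is purely speculative and uses only Python's standard library: the token format, the scope strings, and the function names are invented for illustration and do not correspond to any existing standard or platform.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def issue_token(secret: bytes, user: str, agent: str, scopes: list,
                ttl_seconds: int, now: Optional[float] = None) -> str:
    """Platform side: sign a grant saying `agent` may act for `user` within `scopes`."""
    now = time.time() if now is None else now
    payload = {"user": user, "agent": agent,
               "scopes": sorted(scopes), "exp": now + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode("utf-8"))
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return f"{body.decode('ascii')}.{signature}"


def authorize(secret: bytes, token: str, required_scope: str,
              now: Optional[float] = None) -> bool:
    """Platform side: accept a request only if the token is authentic, unexpired, and in scope."""
    now = time.time() if now is None else now
    try:
        body, signature = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return False
    payload = json.loads(base64.urlsafe_b64decode(body))
    return now < payload["exp"] and required_scope in payload["scopes"]


secret = b"demo-secret-held-by-the-platform"
token = issue_token(secret, user="alice", agent="trip-planner",
                    scopes=["reservations:modify"], ttl_seconds=3600, now=1000.0)
print(authorize(secret, token, "reservations:modify", now=1500.0))
print(authorize(secret, token, "payments:send", now=1500.0))
```

The point of the sketch is the shape of the grant: the agent never holds the user's password, and the platform can tell an agent's request apart from a human session and limit what it may do.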
Website design/implementation
Today’s web applications are designed for visual consumption. The actual website code is often difficult to interpret, whether as a side effect of system design (e.g. calling external services) or because of deliberate obfuscation intended to stop scraping.
If they want to support AI agents, websites of the future might undergo redesigns that are invisible to the human eye but help agents navigate. You could imagine redesign or decoration of HTML elements with additional comments to help an agent understand what is what. Websites could even have invisible “agent sections” that are specifically designed for AI users.
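As a rough sketch of what such decoration could look like, the example below marks up a small form with hypothetical `data-agent-action` and `data-agent-description` attributes and shows how an agent might extract them with Python's built-in HTML parser. These attribute names are invented for illustration and are not part of any standard. Attributes are used here in place of comments only because they are simpler to attach to a specific element.

```python
from html.parser import HTMLParser


class AgentHintParser(HTMLParser):
    """Collect elements carrying the hypothetical data-agent-* annotations."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        action = attr_map.get("data-agent-action")
        if action:
            self.actions.append({
                "action": action,
                "tag": tag,
                "id": attr_map.get("id"),
                "description": attr_map.get("data-agent-description", ""),
            })


# A page with obfuscated class names that still renders the same for humans.
PAGE = """
<form id="search">
  <input id="q" name="q" data-agent-action="set_query"
         data-agent-description="Free-text destination search">
  <div class="x9f2">
    <button id="b1" data-agent-action="submit_search"
            data-agent-description="Run the search">Go</button>
  </div>
  <button id="b2" class="promo">Sign up for deals</button>
</form>
"""

parser = AgentHintParser()
parser.feed(PAGE)
action_names = [item["action"] for item in parser.actions]
print(action_names)
```

The unannotated promotional button is skipped, so the agent sees only the two actions the site chose to expose, regardless of how the surrounding markup is structured.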
We think dual-use websites are likely to remain important for a long time. But there are a few potential trends that could change their relevance.
AI-generated APIs: As discussed above, it’s hard to build and maintain useful APIs. As AI code generation matures, the effort required to create a full-featured, agent-friendly API could drop by an order of magnitude. If that’s the case, more companies might maintain both websites designed for humans and APIs designed for agents.
New business models: Most businesses today rely on direct traffic to make money. There are inherent disadvantages when a site’s usage comes through a third-party assistant rather than a direct visitor, but it’s possible more platforms will find a usage-based model, exclusive partnerships, or other business arrangements that incentivize them to prioritize agents.
Assistant adoption/demand aggregation: It’s possible that assistant platforms like ChatGPT, Gemini, and Meta AI become the predominant way many consumers engage with the web. Much of this change is behavioral – how many people enjoy browsing the web and will prefer to shop themselves versus having a virtual assistant provide 1-2 options? If a majority of people use assistants to make purchasing decisions, most companies will have no choice but to provide first-class support to AI agents (just as most brands have to sell on Amazon).