Adopting a 'horses for courses' approach to building your data platform
Today's data platforms too often start with an engine-first mindset, picking a compute engine and force-fitting data strategies around it. This approach can seem like the right short-term decision, but given the gravity data possesses, it ends up locking organizations into rigid architectures, inflating costs, and ultimately slowing innovation. Instead, we must flip the model: put open, interoperable data at the heart of the data platform, and select specialized engines as needed, e.g., Apache Flink for stream processing and Ray for machine learning. A 'horses for courses' approach acknowledges that no single engine is best for every workload, and embraces a modular, future-ready architecture from the ground up.
This talk will make the case for a radical but proven idea: treat your data as a first-class citizen, and treat compute engines as interchangeable tools. We'll explore real-world examples where decoupled data strategies have allowed companies like LinkedIn, Uber, and Netflix to evolve quickly across generations of technologies, and discuss practical strategies to avoid the endless migration treadmill. We will illustrate this using concrete comparisons of compute engines across key workloads, such as analytics, data science, machine learning, and stream processing.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Demetrios - 00:00:07
We are live Vinoth. I'm excited for your talk. Every time that I chat with you, I really enjoy everything that you have to say. I feel like you bring a different dimension and perspective. You've got the first keynote. Gonna give it up strong. I'll let you take over. I'm gonna bring your slides up onto the stage, and I'll be back in 20 minutes.
Vinoth Chandar - 00:00:34
Thanks for that, Demetrios, and thanks for having me. Hello everyone. Welcome to the first of many OpenX Data conferences. We have an exciting day in store for you. As Demetrios mentioned, I'm Vinoth, founder and CEO at Onehouse. I lead some major open source data projects, and I built large-scale data infrastructure at Uber and LinkedIn. Today, we are gonna kick off with a deceptively simple but very fundamental topic on open data: how to use the right tools to build your data platform.
Vinoth Chandar - 00:01:08
Let's start by reviewing the state of affairs for cloud data. There is a lot of time and money being spent in a very large market. Just the top two popular cloud data platforms have over a thousand companies spending more than a million dollars on their cloud data platforms. Companies predominantly start their data journeys on a single engine, typically a warehouse. Many of them, about a third now, are doing use cases beyond BI, like data science and ML. Your first engine starts as your foundation, but in many of these cases there's a feeling that it becomes a ceiling for what you can achieve with your data. As AI projects explode, lock-in remains a top-three concern across the board. Issues like privacy, data sovereignty, and the flexibility to bring the rapidly evolving set of new AI tools to your data remain top concerns.
Vinoth Chandar - 00:02:07
There's a general feeling that cloud data is expensive, lacks flexibility, and is riddled with lock-in. I think this is due to something I call the engine-first trap. The simple thing many companies do is focus too much on a single engine, without considering their data and the use cases they're going to have just a year or two out. Data has gravity. You start building a data warehouse and start storing data in it. This data gravity pulls queries and access to it. Before you know it, you have a lot of data in a single system and you're struggling to bring in new use cases and better engines, even if they exist in the market. At that point, it's pretty hard to consider a migration because the data has inertia.
Vinoth Chandar - 00:03:07
There's a typical pattern. You pick the engine and do whatever the vendor recommends on formats, tools, and whatnot. But the minute you try to bring in a new engine, you hit a lot of issues, because the data format is either incompatible or inefficient when you access it from the new engine. It's expensive to migrate data over. I know companies that spend millions, sometimes tens of millions, of dollars on a single compute platform, plan migration projects every year, and make maybe 20% progress. This is a big decision that shouldn't be made lightly upfront. But if you look at where the world is right now, it's actually in a different place and moving rapidly in a different direction. There's this thing called the Cambrian explosion.
Vinoth Chandar - 00:03:59
This was a period in Earth's history when a lot of new species and life forms emerged at a very rapid pace. We are living through something like that in the cloud data ecosystem. There are a lot of different specialized engines. They can all read data in open formats, and they are replacing the one-size-fits-all model we had for decades. In the database industry, open source software is significantly outpacing closed source databases; it officially crossed over a couple of years ago. The actual Cambrian explosion in nature was caused by changes in oxygen levels. Likewise, there are small but profound factors at play here that are furthering this shift. Let's examine what's pushing it.
Vinoth Chandar - 00:04:57
First, the cloud is now the de facto place for data storage. In the cloud, everything is on demand: on-demand storage, on-demand compute, and you pay for the two separately. You can have a world where you pay your ingestion and ETL vendors separately from your query engine. Data lakes and data warehouses are converging. This is one area where I spent a lot of my last seven or eight years, blurring the lines between these two main storage models. Cloud storage is getting faster and faster, unlocking new possibilities. And the rise of open table formats like Apache Iceberg finally opens up data warehouses to open formats. This eliminates the data islands you'd otherwise build and start with.
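A minimal sketch of what an open table format buys you, using the PyIceberg library; the REST catalog URI and table name are illustrative assumptions:

```python
# Minimal sketch: an open table format makes the same table readable by any
# Iceberg-aware engine or library, not just the warehouse that wrote it.
# The REST catalog URI and table name below are illustrative assumptions.
from pyiceberg.catalog import load_catalog  # pip install pyiceberg

# Connect to the catalog that tracks the table's metadata.
catalog = load_catalog("default", **{"type": "rest", "uri": "http://localhost:8181"})

# Load the table and materialize a scan as Arrow, outside any warehouse.
events = catalog.load_table("analytics.events")
print(events.scan().to_arrow().num_rows)
```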
Vinoth Chandar - 00:05:47
You no longer have to build an island for BI and then think about the broader ecosystem. You can start open. We ask this question at most conferences we go to: what do you use? From open source to commercial engines, there are plenty of companies using many engines in this market. Even within a single engine like Spark, people use it in different ways: they run it themselves, use EMR from cloud providers, or go with other vendors. No engine is best at every workload. We've done a lot of research comparing engines across different data use cases: analytics, data science, machine learning, and stream processing.
Vinoth Chandar - 00:06:44
You can check out the QR codes and blogs for deep dives. We compared engine design, vectorized processing, and push- versus pull-based processing for analytics engines, and GPU and Python support for ML and data science. There's a reason for all these engines to exist. They do certain things really well, and your data should be able to leverage that. Performance and total cost of ownership are key aspects. The engine you pick fundamentally affects your teams and your company's budget. For example, ETL pipelines are the lion's share of your cloud data costs. As a vendor, we focus on these EL workloads. We announced our own engine, Quanton, yesterday, which focuses on lowering TCO for these workloads.
Vinoth Chandar - 00:07:33
We generally distrust and frown upon benchmarks, for valid reasons: benchmarks have been misused by vendors in the past. But learning to benchmark for yourself, on your own workloads and data, is a very important skill, because cloud data workloads cost a lot of money. Engine choice matters. Now we've established strong reasons for thinking beyond a single engine, or at least for being mindful when you pick your first engine: you don't want to close any doors. You want to preserve modularity and optionality to bring other engines to your data. Hopefully that's clear. The fix is simple, but challenging to implement in practice.
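A minimal, engine-agnostic sketch of such a do-it-yourself harness; the run_query callables are placeholders for real engine clients, such as a Trino cursor or a Spark session wrapper:

```python
# Minimal sketch of benchmarking engines on your own workload rather than
# trusting vendor numbers. run_query is a placeholder for a real client call.
import statistics
import time

def benchmark(run_query, query, runs=5, warmups=1):
    """Return the median wall-clock seconds for a query, after warm-ups."""
    for _ in range(warmups):
        run_query(query)  # discard cold-start runs
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Usage sketch: the same query and data, two candidate engines.
# print("engine_a:", benchmark(run_on_engine_a, "SELECT ..."))
# print("engine_b:", benchmark(run_on_engine_b, "SELECT ..."))
```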
Vinoth Chandar - 00:08:25
The basic idea is you store data in lean, open data formats like Parquet or ORC on cloud storage, then try open source engines on top of your data. Understand your data, and evaluate your needs and gaps. See where it falls short, what access patterns you have, and the shape of your data. Take time to understand your data using open source engines; you still haven't made any commercial commitments yet. Once you know these gaps, you can move to commercial solutions where you need them. They exist for good reasons.
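A minimal sketch of that starting point in Python; the column names and local path are illustrative, and in practice the files would sit on cloud storage:

```python
# Write plain Parquet files, then point an open source engine at them.
# Any engine that speaks Parquet can query this same copy of the data.
import duckdb                # pip install duckdb pyarrow
import pyarrow as pa
import pyarrow.parquet as pq

# Land raw data as Parquet (locally here; an s3:// or gs:// path in practice).
events = pa.table({
    "user_id": [1, 2, 1, 3],
    "action": ["view", "click", "click", "view"],
})
pq.write_table(events, "events.parquet")

# Query the open files directly with an open source engine.
print(duckdb.sql("""
    SELECT action, count(*) AS cnt
    FROM 'events.parquet'
    GROUP BY action
    ORDER BY cnt DESC
""").fetchall())
```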
Vinoth Chandar - 00:09:13
You can upgrade fully understanding what open source cannot solve for you. Migrating data is extremely hard, but switching engines is relatively easy once you adopt this model. This is a tried and true approach. I worked at two-thirds of the companies shown in this table. The common pattern is you put data in open file formats on highly scalable storage, and use different engines for different use cases: ETL, warehousing, interactive analytics, and data science.
Vinoth Chandar - 00:10:09
Why is this not happening everywhere? Because it takes a village. These companies had deep engineering benches, with many engineers to build this. Not every company has the time or is organized the same way. This table shows, for a given cluster size, the amount of work needed to run in production. You can spin something up in dev to play with, but this is what it takes to go live. On top of being hard for users, this is not how the ecosystem works today. Every engine claims to be good at everything, so there's little incentive to make the multi-engine experience a first-class thing, even though it benefits users. Many managed services around open source are sold on these gaps.
Vinoth Chandar - 00:11:01
On the right, you see the open source core, then upgrades, managed catalogs, access control, and optimizations. These are the things you lack when trying to go to production easily on open source software, and they are typically what managed services sell. Warehouses still default to closed formats. They may let you start on an open format, but closed is the default and open is optional. This increases the cost of switching. There should be a core set of services that remain interoperable. Your file format, table format, table optimization, and catalogs should be switchable at any time.
Vinoth Chandar - 00:11:47
If you can achieve this, you can implement this model today. But what if we could make this even easier and more accessible? What if we had an open switch, moving from the engine-first model of limited choices and proprietary defaults to an open-data-first model: open formats, synced to multiple catalogs, queried with the engine of your choice? This is at the core of the company I founded, Onehouse. We recently built Open Engines to address this head-on.
Vinoth Chandar - 00:12:47
It gives you an easy way to spin up purpose-built engines on the same copy of data for different use cases. You can bring in data ingested by open source or managed tools, put it in open table formats, and it provides the essential services to go to production. For example, it can sync with multiple catalogs and maintain clusters with auto-scaling. We made it easy to get started on this open-data-first model even if you don't have many engineering resources. This should make your life easier.
Vinoth Chandar - 00:13:38
This flips the defaults to open. It's priced lower than self-managed open source. You use the exact same open source tools and get the same community support. Nothing changes, except it makes it easy to connect your data to open source engines and get something basic into production. It eliminates the lock-in points that exist around storage optimizations. Many engines bundle storage optimizations with the engine, which creates lock-in: if you move to a second engine that doesn't optimize the same way, query performance suffers. We automate that and made it work with all catalogs. Permissions translate when you switch engines and catalogs. You can seamlessly upgrade to commercial engines.
Vinoth Chandar - 00:14:13
This is not to say don't use commercial engines. We flip the model so you can go methodically, layer by layer. Here's how Onehouse can help. We build it together. We are open source contributors; we build and contribute to many open source projects. You can use these open source tools and build it yourself if you're interested in this data-first approach. Onehouse has managed services. Pairing the ingestion service and Open Engines together can help you put data in front of open source and commercial engines at the same time. You can benchmark side by side, compare apples to apples, and make an educated decision on which engine to pick. We are the most open cloud data platform, with broad interoperability across open source and commercial ecosystems. We can run many core workloads right away.
Vinoth Chandar - 00:15:01
Some final takeaways today: specialize deliberately; no single engine is good at everything. In today's world, we should match the engine to the specific workload, because data scale is high and we spend a lot of money. Open formats give you flexibility, combined with open services and portability across catalogs. This creates a good architecture model where you avoid data migration projects. Data has gravity, so choose wisely. Your early decisions are critical. Do regular assessments to keep your data architecture and stack up to date.
Vinoth Chandar - 00:16:24
Thank you all for being here for this talk. There are a lot of fun talks in the conference today. Be sure to tune into the panel on open data platforms. That's gonna be fun and will touch upon many of these aspects.
Demetrios - 00:16:24
Right on, my man, that was great. We've got a lot of questions coming through in the chat, and I'm sure people will keep asking away as time rolls on. Let me start with a question on many people's minds: yes, all of this is being recorded, and we will share the slides with you. We'll have the replay going for the next 24 hours and will package these talks on the OpenX Data website so you can watch at your leisure. Now to the real meat-and-bones questions for you. What is the ideal cost-effective analytics engine setup?
Vinoth Chandar - 00:17:21
Great question, and it's a little subjective. For analytics specifically, if you have small amounts of data, start with something like Postgres; you'll outgrow it at about a TB scale. Then get your data into an open format on cloud storage. Connect engines like open source Presto, Trino, or Starburst, and evaluate performance. You'll find great price-performance. If queries are complex, start trying warehouses and more specialized systems around these. If open source engines are good enough but you don't want to manage them, there are plenty of managed services that help with the operational aspects.
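A minimal sketch of that middle step, querying open-format data through an open source engine; the host, catalog, and table names are illustrative assumptions:

```python
# Minimal sketch: once data is in an open format on cloud storage, an open
# source SQL engine like Trino can query it in place. Host, port, user,
# and the lake.analytics.events table are illustrative assumptions.
import trino  # pip install trino

conn = trino.dbapi.connect(host="trino.internal", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT action, count(*) AS cnt
    FROM lake.analytics.events
    GROUP BY action
    ORDER BY cnt DESC
""")
print(cur.fetchall())
```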
Demetrios - 00:18:34
I just choked on my drink. That was a great answer. Let me compose myself here. Is Onehouse open source and able to be self-hosted?
Vinoth Chandar - 00:18:50
Good question. Onehouse is a cloud SaaS offering, but it's self-hosted in the sense that it's a BYOC serverless model we run in your VPC. It's almost like you are self-hosting it, but we handle all the cluster management and operations. The managed service is not open source, but we are built completely on open source technologies from the ground up. On our website, you can find the open source stack we use. What Onehouse provides on top is making it workload-aware: dynamically tuning for your specific workloads, adapting to lags in data pipelines, and so forth.
Demetrios - 00:19:51
Excellent. Are there any opportunities for integration between Onehouse and Dagster? I see Snowflake is a first-class citizen but not Onehouse.
Vinoth Chandar - 00:20:04
We allow integration across the ecosystem. The great part is we build on open technologies. You can use Dagster today and submit jobs; if you're doing dbt, Dagster integrates with dbt and Spark, and you can use those same integrations against the Spark and SQL clusters we announced yesterday. We don't have to build a special integration; you can use Dagster on top of us just like you can use Airflow. Based on that, I think it should work well.
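A rough sketch of that pattern; the asset name, paths, and Spark setup are illustrative assumptions, not a documented Onehouse integration:

```python
# Sketch: a generic orchestrator-plus-Spark setup needs no vendor-specific
# plugin. A Dagster asset runs a Spark job against open-format data; the
# paths and cluster configuration here are illustrative assumptions.
from dagster import Definitions, asset   # pip install dagster
from pyspark.sql import SparkSession     # pip install pyspark

@asset
def daily_events_summary():
    # In practice, builder config would point at your managed Spark cluster.
    spark = SparkSession.builder.appName("daily_events_summary").getOrCreate()
    events = spark.read.parquet("s3a://my-bucket/analytics/events/")
    (events.groupBy("action").count()
           .write.mode("overwrite")
           .parquet("s3a://my-bucket/analytics/events_summary/"))

defs = Definitions(assets=[daily_events_summary])
```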
Demetrios - 00:20:49
Last one for you: is Onehouse both an OLTP and an OLAP engine?
Vinoth Chandar - 00:20:59
No, we are not an operational system. Onehouse is for OLAP, or analytics. And while the data world is more complex than just OLAP, OLAP is predominantly BI. Onehouse unlocks use cases beyond analytics: data science, machine learning, stream processing, and more. But we are not in the OLTP, operational database category.
Demetrios - 00:21:37
Holy smokes. I said one more question and then five more came flying in. How is Onehouse used in real companies? What makes it different?
Vinoth Chandar - 00:21:49
Great question. The common pattern for Onehouse usage is you have a warehouse or a single vertical system, closed or open. People want to move towards a model where data is in an open format, horizontally: analysts want to stick with warehouses, data scientists want Spark notebooks, and interactive query engines power operational dashboards. That is the sweet spot. Onehouse uniquely unlocks that architecture while keeping engine flexibility intact. For ELT workloads, we run all kinds of pipelines at better cost and price-performance than other tools.
Demetrios - 00:22:49
Brian's asking an amazing question: any proven strategies for convincing non-data-first companies with limited data engineers that an open data architecture is better than closed compute engines? It's difficult to ask non-data engineers to choose complex OSS data tooling over a one-stop data warehouse.
Vinoth Chandar - 00:23:20
Great question. If you are a tech lead, senior IC, or manager looking to do this, point your management to the big five: the three clouds, Databricks, and Snowflake. Everyone talks about open table formats. Open lakehouses are mainstream; the industry agrees this is the way forward. There are new open source catalogs, table formats, and file formats emerging. Point to the vendor ecosystem moving towards an open-data-first architecture. Also point to what data-forward companies have built. There are countless examples of companies benefiting from building this way.
Vinoth Chandar - 00:24:14
The gap is in how to do it. That is why we built Open Engines. I was on teams at Uber and LinkedIn with the engineers to build this, but as a data vendor I see day in and day out that many customers face this gap, and it prevents them from making the leap.
Demetrios - 00:25:03
If everyone could be as privileged as the Ubers of the world with all those resources to throw at data teams. Great answer.