Adopting a 'horses for courses' approach to building your data platform

May 21, 2025
Speaker
Vinoth Chandar
CEO, Onehouse

Today's data platforms too often start with an engine-first mindset: pick a compute engine, then force-fit data strategies around it. This seems like the right short-term decision, but given the gravity data possesses, it ends up locking organizations into rigid architectures, inflating costs, and ultimately slowing innovation. Instead, we must flip the model: put open, interoperable data at the heart of the data platform, and select specialized engines as needed (e.g., Apache Flink for stream processing and Ray for machine learning). A 'horses for courses' approach acknowledges that no single engine is best for every workload, and embraces a modular, future-ready architecture from the ground up.

This talk will make the case for a radical but proven idea: treat your data as a first-class citizen, and treat compute engines as interchangeable tools. We'll explore real-world examples where decoupled data strategies have allowed companies like LinkedIn, Uber and Netflix to evolve quickly across generations of technologies, and discuss practical strategies to avoid the endless migration treadmill. We will illustrate this using real-world comparisons of compute engines across key workloads, such as analytics, data science, machine learning, and stream processing.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Speaker 1    00:00:07    
We are live, Vinoth. I'm excited for your talk. Every time I chat with you, I really enjoy everything you have to say; you bring a different dimension and perspective. You've got the first keynote, so bring it strong. I'll let you take over. I'm going to bring your slides up onto the stage, and I'll be back in 20 minutes.

Speaker 2    00:00:34    
Thanks for that, Dimitris, and thanks for having me. Hello everyone, and welcome to the first of many OpenX Data conferences. We have an exciting day in store for you. As <inaudible> mentioned, I'm Vinoth, founder and CEO at Onehouse. I lead some major open source data projects, and I built large-scale data infrastructure at Uber and LinkedIn before this. Today we're going to kick off with a deceptively simple but very fundamental topic on open data: how to use the right tools to build your data platform.

Speaker 2    00:01:08    
Let's start by reviewing the state of affairs for cloud data. There is a lot of time and effort spent in a very large market: just the two most popular cloud data platforms have over a thousand companies spending more than a million dollars each on their cloud data platforms. Companies predominantly start their data journeys on a single engine, typically a warehouse, but many of them, now about one-third, are doing use cases beyond BI, like data science and ML. Your first engine starts as your foundation, but in many cases there's a feeling that it becomes a ceiling for what you can achieve with your data. And as AI projects are exploding, lock-in remains a top-three concern across the board, along with issues like privacy, data sovereignty, and the flexibility to bring a rapidly evolving new set of AI tools to your data.

Speaker 2    00:02:07    
There's a general feeling that cloud data is expensive, lacks flexibility, and is prone to lock-in. I think this is due to something I call the engine-first trap. The simple thing many companies do is focus too much on a single engine, without considering their data and the use cases they're going to have just a year or two out. And data has immense gravity: you start with a data warehouse and store data into it, but then that data gravity pulls queries and access to it. Before you know it, you have a lot of data in a single system, and you're struggling to bring in new use cases and better engines even if they exist in the market. At that point it's pretty hard to consider a migration, because again, the data is siloed and not shared.

Speaker 2    00:03:07    
So then there's this typical pattern, double-clicking one level down. You picked the engine and essentially did whatever the vendor recommended on formats, tools, and whatnot. But the minute you try to bring in a new engine, you hit a lot of issues, because the data format is either incompatible or inefficient when accessed from the new engine, and it's expensive to migrate data over. I know of companies who spend millions, sometimes tens of millions of dollars, on single compute platforms, plan a migration project every year, and make maybe 20% progress. So this is a very big decision that shouldn't be made lightly upfront. But if you look at where the world is right now, it's actually in a different place, and it's moving rapidly in a different direction. There's this thing called the Cambrian explosion.

Speaker 2    00:03:59    
That was a period in Earth's history when a lot of new species and life forms emerged at a very rapid pace. We're living through something like that in the cloud data ecosystem. There are a lot of different specialized engines, often able to read data in open formats, and they are replacing the one-size-fits-all model we had for decades. As a whole in the database industry, open source software is outpacing closed source databases; it officially crossed over a couple of years ago. The actual Cambrian explosion in nature was caused by a change in oxygen levels. Likewise, this is not an accident: there are seemingly small but very profound factors at play here that are furthering this shift. Let's examine them.

Speaker 2    00:04:57    
First, the cloud is now the de facto data storage. In the cloud, everything is on demand: on-demand storage, on-demand compute, and you pay for the two separately. So you can have a world where you pay for your ingestion and ETL vendors separately from your query engine. As a technology, data lakes and data warehouses are converging; this is one area where I've spent a lot of my last seven or eight years, blurring the lines between these two main storage models. Cloud storage on the whole is getting faster and faster, which unlocks new possibilities. And with the rise of open table formats, data warehouses are finally opening up to open formats like Apache Iceberg and other table formats. This eliminates the islands that you build and start with.

Speaker 2    00:05:47    
So you no longer have to build an island for BI; you can think about the broader ecosystem and start open. We ask this question at most conferences we go to: hey, what engines do you use? From open source to commercial engines, you can see there are a lot of engines in use in this market. Even for a single engine, for example Spark, which is very popular, it's consumed in a bunch of different ways: people like to run it themselves, people like to use something like EMR provided by a cloud provider, or other vendors around it. And no engine is best at every workload. We've done a lot of research around this and compared engines across different data use cases: analytics, data science, machine learning, and stream processing.
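The "horses for courses" idea can be sketched as a simple routing table from workload to candidate engines. This is a hypothetical illustration of the matching exercise the talk describes, not a recommendation from the comparison blogs; the specific engine names in the mapping are my own examples.

```python
# Hypothetical workload-to-engine routing table illustrating the
# "match the engine to the workload" idea; entries are illustrative.
WORKLOAD_ENGINES = {
    "analytics": ["Trino", "StarRocks", "ClickHouse"],
    "etl": ["Apache Spark", "Apache Flink"],
    "stream_processing": ["Apache Flink"],
    "machine_learning": ["Ray", "Apache Spark"],
}

def candidate_engines(workload: str) -> list[str]:
    """Return candidate engines to benchmark for a given workload."""
    try:
        return WORKLOAD_ENGINES[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}")
```

In practice the shortlist for each workload would come from evaluating the criteria discussed next (vectorization, GPU support, Python support, and so on) against your own data.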

Speaker 2    00:06:44    
You can check out these QR codes and go read our blogs. We've gone very deep and compared, for example, engine design, vectorized processing, and push- versus pull-based processing for analytics engines; and for ML and data science, how well they support GPUs, Python, and things like that. There's a reason for all these engines to exist: they do certain things really well, and you should be able to leverage them on top of your data. The other thing is performance and TCO, total cost of ownership; this is another key aspect. The engine you pick fundamentally affects your team's and company's budgets, as simple as that. For example, ETL pipelines are the lion's share of your cloud data costs, and as a vendor we focus on ETL workloads.

Speaker 2    00:07:33    
And we announced our own engine yesterday, focused on lowering the TCO for these workloads. A general comment: we generally distrust and frown upon benchmarks, for valid reasons, since benchmarks have been misused by vendors in the past. But learning to benchmark for yourself, on your own workloads and your own data, is a very important skill, because cloud data workloads cost that much money. So again, engine choice matters. Now that we've established there are strong reasons to think beyond a single engine, or at least to be very mindful when you pick your first engine so you don't close any doors, you want to preserve the modularity and optionality to bring other engines to your data. Hopefully that's pretty clear. And the fix is actually very simple.
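"Benchmark for yourself" can be as simple as timing the same workload repeatedly and looking at the median rather than a single run. The sketch below is a minimal stdlib harness for that habit; in practice you would submit the same query to each candidate engine over your own data, and the in-memory aggregation here is only a stand-in.

```python
import statistics
import time

def benchmark(fn, *, runs: int = 5, warmup: int = 1) -> dict:
    """Time fn() several times; report median and spread in seconds.

    A minimal stand-in for benchmarking on your own workloads; swap
    fn for "run this SQL against engine X" when comparing engines.
    """
    for _ in range(warmup):          # warm caches before measuring
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "stdev_s": statistics.stdev(samples) if runs > 1 else 0.0,
    }

# Example: time one way of aggregating the same in-memory "table".
rows = [{"amount": i % 100} for i in range(50_000)]
result = benchmark(lambda: sum(r["amount"] for r in rows))
```

The point is the discipline, not the harness: same data, same query, repeated runs, a robust statistic, and your own cost numbers attached afterwards.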

Speaker 2    00:08:25    
It's not going to blow your mind; it's just super challenging to implement in practice, as we'll see in later slides. The basic idea is: store data in lean, open data formats, avoid any kind of closed data format, put it on cloud storage, and then try open source engines on top of your data. Understand your data, evaluate needs and gaps, and see where things fall short. What kind of access patterns do you have? What's the shape of your data? Take the time to actually understand your data using open source engines, and note that you still haven't made any commercial decisions yet. Once you know the gaps, you can then move to commercial solutions where you need them; they exist for good reasons.
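The "one open copy of data, many engines" idea reduces to: write once to shared storage in an open format, then let independent readers consume the same files. The sketch below uses newline-delimited JSON on local disk purely as a stand-in for Parquet/Iceberg/Hudi on cloud storage, and the two "engines" are just functions; it illustrates the decoupling, not any real engine's API.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for "one copy of data in an open format on cloud storage":
# newline-delimited JSON plays the role of an open table format here.
def write_table(path: Path, rows: list[dict]) -> None:
    with path.open("w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Two independent "engines" reading the same files; adding a second
# reader requires no migration, because the data was never locked in.
def engine_a_count(path: Path) -> int:
    """A row-count query, as a BI engine might run it."""
    return sum(1 for _ in path.open())

def engine_b_total(path: Path, column: str) -> int:
    """A column aggregation, as a second engine might run it."""
    return sum(json.loads(line)[column] for line in path.open())

table = Path(tempfile.mkdtemp()) / "events.jsonl"
write_table(table, [{"clicks": 3}, {"clicks": 7}])
```

The design choice being illustrated: the storage layer owns the data and its format, and every engine is a replaceable consumer of it.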

Speaker 2    00:09:13    
You can upgrade to them fully understanding what open source cannot solve for you. The simple principle here: migrating data is extremely hard, as we saw before, but switching engines is comparatively easy once you adopt this model. And this is a tried-and-true approach; I worked at two-thirds of the companies shown here in this table. The common pattern is that you put data in open file formats and table formats on highly scalable storage, and you use different engines for different use cases, from ETL to warehousing, interactive analytics, and data science. But why is this not happening everywhere? Because it does take a village: all these companies had a deep engineering bench, with a lot of engineers, to actually go build this.

Speaker 2    00:10:09    
But not every company has the time, or is organized the same way. This table shows, for a cluster of a given size, the amount of work you need to do to run something in production. You can obviously spin something up in your dev environment to play with, but this is what it takes to go live. And on top of being hard for users, this is not how the ecosystem works today: every engine claims to be good at everything, so vendors are less incentivized to make this multi-engine experience a primary thing, even though it benefits users a lot. And a lot of managed services for open source are sold on these very gaps.

Speaker 2    00:11:01    
For example, on the right you can see: you have an open source core, and then you sell upgrades such as managed catalogs, access control, and optimizations. The very things you lack to go to production easily on open source software are what you're typically sold in managed services. And if you look at warehouses, they still default to closed formats, so they aren't incentivizing the move to start on an open format; they default to closed, and open is an optional thing you can use if you want. This increases the cost of switching. There should be a core set of services that remain interoperable: you should make sure your file formats, table formats, table optimizations, and catalogs can be switched at any point in time.

Speaker 2    00:11:47    
If you can achieve this, then we can totally implement this model today. But what if we could make it even easier, even more accessible? What if we had an open switch, moving from the engine-first model of limited choices and proprietary pieces to a data-first model: open formats, synced to multiple catalogs, queried with the engine of your choice. This is at the core of the company I founded, Onehouse, and we recently built something called Open Engines to address this head-on. What is it? Essentially, it gives you a very easy way to spin up purpose-built engines on top of the same copy of data for different use cases. You can bring data ingested by open source or managed tools, put it in open table formats, and it provides all the essential services for you to go to production.

Speaker 2    00:12:47    
For example, it can sync with multiple catalogs, and it handles the basics like maintaining the clusters and auto-scaling. We just made it easy: if you're not one of those companies with a lot of engineering resources and you want to get started on this open, data-first model, this should make your life easier. And this is how we think it flips the defaults to open. First, it's consciously priced lower than self-managing open source, and it uses the exact same open source tools, so you get the exact same community support; nothing really changes. All it does is make it easy to connect your data to these open source engines and get something basic to go to production with. And it eliminates a couple of lock-in points around storage optimizations.
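Syncing one table to multiple catalogs is cheaper than it sounds, because a catalog entry is essentially a name-to-metadata-location pointer; keeping several catalogs in sync is a fan-out metadata write, not a data copy. The sketch below is a hypothetical toy model of that idea; the `Catalog` class and the catalog names are my own illustration, not any real catalog API.

```python
# Hypothetical sketch: each catalog maps table name -> metadata pointer,
# so "sync to multiple catalogs" is just writing the pointer everywhere.
class Catalog:
    def __init__(self, name: str):
        self.name = name
        self.tables: dict[str, str] = {}

    def register(self, table: str, metadata_location: str) -> None:
        """Record where a table's current metadata lives."""
        self.tables[table] = metadata_location

def sync_table(table: str, metadata_location: str,
               catalogs: list[Catalog]) -> None:
    """Fan the same pointer out to every catalog; no data is copied."""
    for catalog in catalogs:
        catalog.register(table, metadata_location)

# Illustrative catalog names; any engine consulting either catalog
# now resolves the same single copy of the data.
glue, unity = Catalog("glue"), Catalog("unity")
sync_table("sales.orders", "s3://bucket/orders/metadata.json", [glue, unity])
```

This is why catalog portability belongs in the interoperable core: if every catalog holds the same pointer, switching engines never requires moving the data.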

Speaker 2    00:13:38    
A lot of engines bundle storage optimizations with the engine, which is great, but it also creates a lock-in point: if you move to a second engine that doesn't optimize in the same way, query performance suffers. So we are automating that, and we made it work with all catalogs, so your permissions translate when you switch engines or switch catalogs, and you can seamlessly upgrade to commercial engines. This is not to say don't use commercial engines; we're basically flipping the model so you go methodically, layer by layer.

Speaker 2    00:14:13    
And here is how Onehouse generally can help. First of all, we are open source contributors: we build a bunch of open source projects and contribute to a bunch more, so you can find us in these communities, use these open source tools, and build it yourself if you're interested in this data-first approach. We can also help with managed services, like I just showed: pairing an ingestion service and Open Engines together can help you ingest some data and put it in front of open source engines and commercial engines at the same time. So you can benchmark side by side, compare apples to apples, and make a very educated decision about which engine to pick. And we are the most open cloud data platform.

Speaker 2    00:15:01    
We have broad interoperability across both open source and commercial ecosystems, and we can run a lot of your core workloads right away. With that, some final takeaways. First, specialize deliberately: no single engine is good at everything, and in today's world we should match engines to specific workloads, because data scale is high and we're spending a lot of money on these workloads. Second, open formats give you a lot of flexibility; combined with open services and portability across catalogs, they create a really good architectural model where you're not stuck running data migration projects. And third, data has gravity, so choose wisely: early decisions are critical, and do regular assessments to make sure your data architecture and stack stay up to date. All right, with that, thank you all for being here for this talk. There are a lot of fun talks in the conference today, and be sure to tune into the panel on open data platforms later; that's going to be fun, and we'll touch on a lot of these aspects there as well.

Speaker 1    00:16:24    
Right on, my man, that was great. Okay, so we've got a lot of questions coming through in the chat, and I'm sure people will keep asking as time rolls on. Let me start with the question on a lot of people's minds, which is: yes, all of this will be recorded, and we will be giving the slides to you. We'll have the replay going for the next 24 hours, and we will be packaging up these individual talks and putting them on the OpenX Data website so you can watch them at your leisure. Now to the real meat and bones and the questions for you: what is the ideal cost-effective analytics engine setup?

Speaker 2    00:17:21    
That's a great question, and it's also a little bit subjective, so I'll say my piece. For analytics specifically, if you have small amounts of data, honestly, start with something like Postgres. You'll outgrow something like that at around TB scale. Then get your data, like I was talking about, into an open format on top of cloud storage, connect open source engines such as Presto, Trino, or StarRocks (all the different engines in our analytics guide), and evaluate the performance. I think you'll find you get great price-performance on top of that. Then, if your queries are complex enough, start trying warehouses and more specialized systems. And somewhere in between, if the open source engines are good enough but you don't want to manage them, there are plenty of managed services that can help you with the operational aspects.
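The tiered answer above can be summarized as a simple decision rule. This is a rough sketch of the rule of thumb in the answer; the 1 TB threshold and the stack descriptions are illustrative, not hard cutoffs, and real decisions should also weigh query complexity and team capacity.

```python
def suggest_analytics_stack(data_tb: float, complex_queries: bool = False) -> str:
    """Rough, illustrative rule of thumb for an analytics starting point.

    Thresholds are assumptions for illustration; benchmark on your own
    data and workloads before committing to any tier.
    """
    if data_tb < 1:
        return "single-node database (e.g., Postgres)"
    if complex_queries:
        return "warehouse or specialized engine on top of open formats"
    return "open table format on cloud storage + open source query engine"
```

Note the middle tier the speaker mentions (managed services running the open source engines for you) changes who operates the stack, not which tier applies.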

Speaker 1    00:18:34    
I just choked on my drink <laugh>. That was a great answer <laugh>. Let me compose myself here. All right: is Onehouse open source and able to be self-hosted?

Speaker 2    00:18:50    
Good question. So Onehouse is a cloud SaaS service, yet it's self-hosted in the sense that it's a BYOC, serverless model: we run in your VPC. It's almost like you're self-hosting it, except we handle all of the management, cluster management, all of that; the operations are handled by us. The managed service is not open source, but we are completely built on open source technologies from the ground up. If you go to our website, you can find the open source stack we use. What Onehouse provides on top of these open source projects is making them workload-aware, dynamically tuning everything for your specific workloads. For example, it can adapt to lags in your data pipelines, things like that. But we are built completely on open source software.

Speaker 1    00:19:51    
Excellent. Now, are there any opportunities for integration between Onehouse and Dagster? I see that Snowflake is a first-class citizen, but not Onehouse.

Speaker 2    00:20:04    
Well, we aim to integrate across the ecosystem, and here's the great part, which goes back to my answer about being built on open technologies. You can use Dagster today: say you're doing dbt, and Dagster integrates with dbt and Spark. Mm-hmm <affirmative>. You can use the same exact integration against the Spark and SQL clusters we announced yesterday. We could build a special integration, but we don't really require one. You can use Dagster on top of us, like you can use Airflow, so I think that should work as well.

Speaker 1    00:20:49    
All right, last one for you: is Onehouse both an OLTP and an OLAP engine?

Speaker 2    00:20:59    
No, we are not an operational system. Onehouse is more for OLAP, meaning analytics. But while I say that, I'm pausing, because the world is way more complex than OLAP for data. OLAP is predominantly BI; that's what analytics means. What Onehouse unlocks is use cases beyond that: beyond analytics into data science, machine learning, stream processing, and much more. But we are not in the OLTP, operational database category.

Speaker 1    00:21:37    
Holy smokes. I said one more question, and then, like, five more came flying in. So how is Onehouse used in real companies? What makes it different?

Speaker 2    00:21:49    
What makes it different? Great. The common pattern we've seen for Onehouse usage is that you have a warehouse, or a single vertical system, closed or open, and people want to move toward this model of: I want my data in an open format, horizontally, and I still want to use my warehouse; my analysts want to stick with warehouses, my data scientists want Spark notebooks, and I want interactive query engines in between to power some operational dashboards. That is the sweet spot, and Onehouse uniquely unlocks that architecture while keeping engine flexibility intact. Also, when it comes to ELT/ETL workloads, we run all kinds of ELT/ETL pipelines, and we can do that at much better price-performance than other tools in the market.

Speaker 1    00:22:49  
All right. Brian's asking an amazing question here; I might need to give away some swag for this one. Any proven strategies for convincing non-data-first companies, with limited data engineers, that an open data architecture is better than closed compute engines? It's difficult to convince non-data engineers to choose a complex OSS data stack over a one-stop data warehouse.

Speaker 2    00:23:20
Yeah, great question. I assume you're, for example, a tech lead, a senior IC, or a manager looking to do this. Here's what you do. You point your management to the big five: the three clouds, Databricks, Snowflake. Everybody's talking about open table formats and open lakehouses; this is the mainstream, and the industry has agreed on it as the way forward. There's a lot of work happening: new open source catalogs, new table formats and file formats being born. So if you point at where the whole vendor ecosystem is going, it's clearly toward an open, data-first architecture. The other thing you point to is what all the data-forward companies have built.

Speaker 2    00:24:14
Like the table I showed; you can find it in our Open Engines blog. That table only includes three, but there are so many examples of companies who benefited from building that way. At that point, what you'll find is the gap is merely this: okay, this is the right thing to do, everybody's aligned, and lots of companies with deep engineering pockets have done it; the gap is how do we do it. That is basically why we built Open Engines, because we see this gap. I was on those teams at Uber and LinkedIn, and we had engineers to go build it, but I see this day in and day out in my life leading a data vendor: for a lot of the customers we work with, that gap is what prevents them from making the leap.

Speaker 1    00:25:03
Yeah, if only everyone could be as privileged as the Ubers of the world, with all of those resources to throw at their data teams, right? So yeah, great answer.