To Build or Buy: Key Considerations for a Production-grade Data Lakehouse Platform
The data lakehouse architecture has made big waves in recent years. But there are so many considerations. Which table formats should you start with? What file formats are the most performant? With which data catalogs and query engines do you need to integrate? To be honest, it can become a bit overwhelming.
But what data engineer doesn't like a good technical challenge? This is where it sometimes becomes a philosophical decision of build vs buy.
In this presentation, Onehouse VP of Product Kyle Weller will break down the pros and cons he has seen over nearly a decade of helping organizations implement their own data lakehouses and building the Universal Data Lakehouse at Onehouse. You'll learn about:
- The strengths of open table formats such as Apache Hudi™, Apache Iceberg™ and Delta Lake
- Interoperability via abstraction layers such as Apache XTable™ (incubating)
- Lakehouse optimizations for cost and performance via Apache Spark™-based runtimes
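For a flavor of what "open table formats" means in practice, here is a hedged sketch using one of them, Apache Hudi. The option keys below are real Hudi Spark datasource configs, but the table name, field names, and path are hypothetical placeholders, and the actual Spark write call is shown only as a comment:

```python
# Hedged sketch: the option names are real Apache Hudi Spark datasource
# configs; the table, key fields, and path are made-up placeholders.
hudi_options = {
    "hoodie.table.name": "trips",                           # hypothetical table
    "hoodie.datasource.write.recordkey.field": "trip_id",   # primary key column
    "hoodie.datasource.write.precombine.field": "ts",       # latest-wins ordering
    "hoodie.datasource.write.operation": "upsert",          # insert-or-update
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # or MERGE_ON_READ
}

# With a live SparkSession and the Hudi bundle on the classpath,
# writing would be roughly:
#   df.write.format("hudi").options(**hudi_options) \
#     .mode("append").save("s3://bucket/lake/trips")
```

Iceberg and Delta Lake expose analogous knobs through their own Spark integrations, which is what makes abstraction layers like Apache XTable possible: the underlying data files are shared, and only the format metadata differs.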
Transcript
AI-generated, accuracy is not 100% guaranteed.
Speaker 0 00:00:00
<silence>
Speaker 1 00:00:07
The next session we've got coming up is with Kyle from Onehouse. Where you at, Kyle? There he is.
Speaker 2 00:00:14
Demetrios! Good to see you, man.
Speaker 1 00:00:16
How you doing, dude? I know you got some slides to share with us, so go on and share your screen, get that rocking and rolling. I'll bring it up onto the stage and let you have your space, and then I'll be back to ask you a few questions.
Speaker 2 00:00:33
Okay.
Speaker 1 00:00:34
There it is. To build or buy, the eternal question.
Speaker 2 00:00:40
Yes. Awesome. And awesome song, by the way, Demetrios. I thought the interlude music getting set up for the conference and in between sessions was great, but man, you rocked it on that guitar. You need to play that between every session <laugh>.
Speaker 1 00:00:54
Great. All the questions that come through in the chat here, I will put into song form for you.
Speaker 2 00:01:00
Awesome <laugh>. Oh, in song form. Good, that's the only way I want to hear 'em. Great, thanks. Thanks for having me on. Really excited to be here. We're gonna talk about an exciting topic and an existential question, like you said: to build or buy. To introduce myself just a little bit, my name is Kyle Weller. I'm the VP of Product here at Onehouse, and I've been with the company since we started. I've been building data lakes, data platforms, and data products for about 12 years, so I've been on both sides of the fence: building data platforms with open source tools, but also building tools for data engineers to help make their lives easier. As we get started, I think of this build versus buy question a little bit like Batman versus Superman, in that it's going to be polarizing.
Speaker 2 00:01:48
People have a deep fan base for both sides and root for one character, and sometimes there isn't a fully right or wrong answer to this question of build versus buy. But one thing that is very certain is the undeniable trend and growth of open source. One chart I found very interesting, on the top left-hand side, shows how open source databases crossed over commercially licensed databases in popularity for the first time in about 2021, and you can see that trend of the rise of open source databases. This trend is prevalent across the industry and in every category you look at. Open source technologies and tools are the leaders, and they should be the de facto first choices for anyone considering which technologies to build into their data architecture and data stack.
Speaker 2 00:02:54
So here's a quick snapshot. A popular place to get a lay of the land for data as a whole is Matt Turck's MAD data landscape; look that up on Google. Here's just a snapshot of some of the open source components; that landscape map is so much bigger and wider than what you see on the screen here. But I hope you get the sense that the open source community is rocking it. There are so many great tools and so many great open source projects created every day; open source is really thriving. Now, when you first look at adopting open source, there are a couple of challenges or pains you should think about and be prepared for.
Speaker 2 00:03:46
First, integration pains. Open source projects like the ones on that previous slide are often solving a specific problem, and you see the diversity of projects out there: they were innovated, grown, and created by communities because they each solve a different type of problem, in their own niche way, even better than the previous project. So when you're architecting a full end-to-end data stack, you might have to put together a collection of open source projects. Be prepared for how you build with these components together and handle integration, library, and dependency management; there can be some integration pains in there. When it comes to ease of use, there is sometimes a learning curve for adopting these technologies. The world of open source moves fast, so there are always new projects: a learning curve to getting started, and a curve to keeping your engineering teams up to speed on the latest and greatest, upgrading versions, and things like that.
Speaker 2 00:05:00
Sometimes you're building mission-critical services, and sometimes you can have your engineering team trained and ready to support anything that might come your way. But when you think about that choice between build versus buy, you might want the security of knowing you have someone to call: that you can page someone 24/7 who also has deep knowledge of these open source technologies. Another misconception I see as prevalent in the industry is thinking that open source is completely, one hundred percent free. While the software, the license, and the actual code you get from GitHub might be free to download, there are costs associated with building with open source.
Speaker 2 00:06:04
I don't know if folks are into Marvel movies, but you know how Iron Man built his suit, a DIY build process, but chooses to buy his weapons? He mixed and matched between building and buying. Some of the costs that can sneak up on you are, first, opportunity costs: what could your teams rather be doing? Is what you're building the true differentiator for your company and something you need to spend your precious energy and time on? And as things start to scale and grow, maybe the initial setup and build of the stack was easy, but the maintenance and operations can get heavy. So it really comes down to, when you think about this question of, is open source really free?
Speaker 2 00:07:00
It comes down to how much you value your time. Is your time free? Do you have a deep bench of engineers ready to go, or is your time better spent on other activities? That's the core piece there. So when people ask how to choose, I would put it down these two lanes. You might look to build first if you have deeply custom requirements, an abundance of talent, or infrastructure costs that significantly outweigh the value of that engineering time. Those might be reasons to consider building. Choosing to buy is recommended if this isn't your differentiator and you don't need to focus on it, but also if speed and time matter: speed in time-to-value and how fast you can build these architectures.
Speaker 2 00:07:59
And time also as it relates to data latency, data arrival times, and performance management. You might also want to consider buying if you just want to offload the maintenance headaches of continuing to operate, maintain, and upgrade the versions of your infrastructure and stack. But between these two decisions of building versus buying, the key point you should be evaluating is this: choose open source-backed technologies no matter what, whether you build or buy, and ensure that your data stack is interoperable. If you do this, you have the opportunity to switch between build and buy. You might choose to build first, and if you do and you're using open source tools and technologies, there's usually a vendor or company behind those projects that can help back you up if that time comes.
Speaker 2 00:09:01
The same goes in the other direction: if you choose to buy first, as long as you make sure your data is in open table formats, that you're using open source catalogs, and that you're using open source compute engines, you're also portable enough to kick that vendor out of the way and go back to build mode if it's required. So I think this is the key point: make sure you have a composable and interoperable data stack across all the layers. I'll often hear people say that table formats are very popular; Apache Hudi, Iceberg, and Delta Lake are all the rage these days, and people will say, great, I've got an open table format, now I'm free.
Speaker 2 00:09:55
Everything's going to work from here on out, right? But no, there are other components of the stack you need to evaluate to make sure they're open, composable, and interoperable, and this is what Onehouse is designed to help deliver to you. We start from that universal storage layer, supporting any of the open table formats. We build on top of our specialized compute runtime, which is one hundred percent Spark-compatible, and we offer the services listed up on top, whether that's managed ingestion, full CDC replication, event streaming, or ingesting straight from S3 or other data sources. Now you can run SQL and Spark jobs directly on Onehouse; we announced that just yesterday, and I'll show you another slide about that in a second. There are also our table optimizations, where we've seen with our customers up to 30x performance improvements on the queries running on top of their tables after Onehouse has done its job.
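Table optimizations like the ones described here (clustering, cleaning) also exist as plain configs in open source Hudi, which is worth knowing for the build-versus-buy comparison. A hedged sketch of inline table services; the config keys are real Hudi settings, the values and column names are illustrative:

```python
# Hedged sketch: real Hudi table-service config keys, illustrative values.
table_service_options = {
    # Clustering: periodically rewrite small files, sorted by common
    # query columns, to speed up downstream reads.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",                # every 4 commits
    "hoodie.clustering.plan.strategy.sort.columns": "city,ts",  # hypothetical cols
    # Cleaning: bound how many old file versions are retained on storage.
    "hoodie.cleaner.commits.retained": "10",
}
```

A managed service differs mainly in deciding when and how aggressively to run these services, rather than in the knobs themselves.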
Speaker 2 00:10:49
We launched Open Engines just last month, where you can now deploy Flink, Trino, and Ray clusters as well, and we synchronize across all the catalogs in the industry, whether that's Unity Catalog, Snowflake, Glue, or others. So we make sure to build an open, interoperable, and composable data stack that makes it possible for you to mix and match. And hey, if Onehouse is doing its job, great; if not, kick us out the door and continue to use open source. That's what we're advocating for. I wanted to emphasize the announcement we made just yesterday: we've seen some amazing progress for the SQL and jobs platform we just launched, powered by our new Quanton query execution engine under the hood, and some amazing price-performance benchmarks that you can read about in our blogs online as well.
Speaker 2 00:11:41
Here's what our customers are saying; these are some anonymized quotes down at the bottom. Folks at an AI security company say we've got insanely good ingestion rates. A global telecom put us head to head and said Onehouse delivers 10x the performance of Databricks at half the cost; that's what they measured in their own usage. And a large financial institution said Onehouse helped them use Athena and Databricks seamlessly on the same data, so some really great interoperability as well. I want to show you a little bit of the platform before we turn it back over to Q&A, and I'm really excited to see the questions that are here, but I'll give you a fast tour of Onehouse. This is the Onehouse console, and what I'm showing on the page right now is a collection of all the data I have inside Onehouse. Right now I'm looking at a particular table.
Speaker 2 00:12:32
I can see metadata and details about that table, a preview, and different types of statistics. I have what I call a stream capture, an ingestion job that's reading and ingesting data into this table, but I also have table services like clustering, cleaning, and a catalog sync running on top of it. There are a lot more details in here that you can see in other demos we've recorded, but let me flip over to show you getting started. I might create a new cluster, and I've got different types: a managed cluster, which is what we use for ingestion, a SQL-based cluster, Spark, and even Open Engines, so I can immediately launch a cluster that has Trino, Ray, or Flink. Once these clusters are online, I can monitor their usage, how much data is being processed within them, and how they are auto-scaling over time
Speaker 2 00:13:24
according to the demands of the workloads inside them. I have jobs executing in near real time; let me find one I like to show folks, which is ingesting all of the data from publicly open GitHub projects online. I'm ingesting about 240,000 records roughly every minute into the table I was showing you earlier. All of this synchronizes to the catalogs of your choice, whether you need Unity Catalog, Snowflake, DataHub, Glue, you name it: you register the catalog, and we automatically sync a single copy of data across multiple catalogs simultaneously. So hopefully that little speed demo gives you a good tour and taste of what Onehouse offers. And I'll hit one more conclusion: if you want more details about our launch just yesterday of SQL and Spark jobs powered by the Quanton execution engine, we also shipped, alongside this launch, a custom Spark analysis tool.
Speaker 2 00:14:33
If you scan this QR code, you can download the tool. It's free, it's open source, and you can point it at your Spark History Server, and we'll provide you a customized breakdown of how much cost savings you could drive with Onehouse. As we like to advocate at Onehouse: never trust a benchmark, and never just read one and expect it will hold true for your individual workload. Use this tool, and you'll get something really customized and specific to what you're running on Spark and how Onehouse could run it differently. So with that, thanks for coming to the session, thanks for coming to OpenXData. I'm very excited for the other sessions coming along here and excited to see the questions we got. Demetrios, if you wanna sing them to me...
Speaker 1 00:15:19
<laugh> You're gonna hold me to that one, huh? No, no, no, okay <laugh>. First question coming through, and I'm sure more will trickle in: how does Onehouse stack up compared to Databricks' offering?
Speaker 2 00:15:37
Okay, great question: how do we stack up next to Databricks? I would say there are a couple of ways to look at that. One, you can look at Onehouse as a complementary component to Databricks, and we have many customers that do this; that's some of the quotes I shared. Onehouse can take care of rapid data ingestion, extracting data from CDC sources like Postgres, MySQL, and Mongo, or event streams like Kafka, and we offer a managed ingestion product. That's different from writing and authoring your own code and notebooks and then figuring out how to run them; we do this in a simpler, lower-code way, and we have industry-leading price-performance, ingestion latency, and scale for how you can run those workloads.
Speaker 2 00:16:21
So that's one area where we can rapidly deliver data to Databricks, synchronize to the Unity Catalog, and do a great job there. Lastly, I'd point back to the benchmark I shared on how we run Spark jobs. That's one place where you could look at us in a more head-to-head way: take the same Spark job, run it on Databricks, run it on Onehouse, and see which product does the job better for you.
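The CDC replication described here can be sketched in an engine-agnostic way: at its core, a managed ingestion service folds a stream of insert/update/delete change events into a keyed table. This is a toy model to show the semantics; the function and event shape are illustrative, not Onehouse's or Debezium's actual API:

```python
# Hedged, engine-agnostic toy model of a CDC merge: apply a stream of
# change events (insert/update/delete) to a key -> row mapping.
def apply_cdc(table: dict, events: list) -> dict:
    """Fold change events into the table; later events win for a key."""
    for e in events:
        if e["op"] in ("insert", "update"):
            table[e["key"]] = e["row"]    # upsert: latest row wins
        elif e["op"] == "delete":
            table.pop(e["key"], None)     # tombstone removes the key
    return table

events = [
    {"op": "insert", "key": 1, "row": {"name": "ada"}},
    {"op": "update", "key": 1, "row": {"name": "ada l."}},
    {"op": "insert", "key": 2, "row": {"name": "grace"}},
    {"op": "delete", "key": 2, "row": None},
]
state = apply_cdc({}, events)  # only key 1 survives, with its latest row
```

The hard production problems (ordering guarantees, schema changes, throughput) live around this loop, which is where managed offerings compete.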
Speaker 1 00:16:54
Excellent. Now cost, speaking of that beast <laugh>. Cost is shown as a main positive attribute of going with Onehouse. Is Onehouse open to protecting us from penetrating pricing? Do you understand that one, or should I try and rephrase what I'm understanding by penetrating pricing?
Speaker 2 00:17:28
I think I understand. I'm interpreting it as, and you tell me if you see it differently, penetrating meaning how your costs rise: how do we protect you from runaway costs? Okay, great. That's what I was thinking. A couple of ways.
Speaker 1 00:17:39
We'll see what Charan says in the chat, but that's how I was interpreting it.
Speaker 2 00:17:43
Okay, great. A couple of ways that we help protect you from runaway cloud costs. One, you can set budgets and maximum limits for all of your clusters across your projects, and we have advanced auto-scaling algorithms that will move the clusters up and down but stay within the budget you've specified, so we never exceed it. Secondly, what customers have really come to love about Onehouse is how our pricing scales with volume and with your data size, because we actually break from a lot of products out there that have either linear scaling or something closer to exponential as your data volumes grow. With Onehouse, we can get sublinear scaling with your data volumes and workloads. Some of this comes from technical differentiations that we have under the hood.
Speaker 2 00:18:39
Let me give you an example: our indexing subsystem, where we can apply primary and secondary indexes. As your data is being ingested, we can make very fast updates and mutations to that data, because we have index keys to look up, which gives us very fast, efficient jobs. We've also created some unique differentiation from open source, something we call vectorized columnar merging techniques. So yeah, we've got some amazing ways to help you scale in a sublinear way when it comes to volume and cost.
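The indexing argument can be made concrete with a toy model: a record-level index maps each key to the file group that holds it, so an upsert routes directly to the one file group that needs rewriting instead of scanning every file, which is one source of sublinear write costs. This is a deliberate simplification, not Hudi's or Onehouse's actual implementation:

```python
# Hedged toy model of a record-level index: key -> file group, so an
# upsert touches only the affected file group rather than scanning all.
record_index = {}                  # key -> file_group_id
file_groups = {0: {}, 1: {}}       # file_group_id -> {key: row}

def upsert(key, row):
    # Index hit: O(1) route to the single file group that needs rewriting.
    # Index miss (new key): assign a group (here, a trivial hash).
    group = record_index.get(key, key % len(file_groups))
    file_groups[group][key] = row
    record_index[key] = group
    return group                   # the only file group touched by this write

for k in [0, 1, 2, 3]:
    upsert(k, {"v": k})            # initial inserts spread across groups
upsert(2, {"v": 99})               # update routes straight to key 2's group
```

Without the index, the update to key 2 would require checking every file group for the key, and that scan cost grows with total data volume.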
Speaker 1 00:19:19
I'm struggling to understand the difference between this, Snowflake, and Fabric OneLake.
Speaker 2 00:19:28
Okay, great question. I would say all three of those are distinct and different; let me break it down for you. Snowflake is a great data platform that most folks will also look at as a data warehouse. It's best in class for analytics and concurrency management; man, Snowflake is killer for BI scenarios, and it's really great at many other things as well. Fabric OneLake, and I know Josh Caplan presented on this just earlier, is more of a storage layer that's ubiquitous and can even be cross-cloud; he talked about some amazing features like shortcuts inside that. So OneLake is a bit more like storage, while Fabric as a whole, as a platform, does a lot more as a full, complete package.
Speaker 2 00:20:15
But Fabric OneLake specifically is that storage layer. Now, Onehouse has a couple of components. I would say we start from the far left-hand side of the stack, and we have a unique focus and differentiation on data ingestion, data integration, and writing data into lakehouse table formats. Because we have deep expertise in these table formats, not just Hudi but Delta Lake and Iceberg as well, we'll write into these formats, optimize and manage them very well, and then synchronize to all of your catalogs. And now you can run Spark jobs and SQL jobs and those kinds of things. So we provide a data ETL platform that helps you start even before the analytical domain and ingest data rapidly from transactional databases. Hopefully that helps.
Speaker 1 00:21:06
You are the person to ask, 'cause I know you have a little history with Fabric and the love for it. So, dude, there's a lot more questions in the chat, but I gotta keep it rolling 'cause I just realized the time we're at. I would love it if you jump in the chat and answer some of those questions; that would be awesome, and I'm sure people will enjoy it too. For now, we will sign off, and I will see everyone in the next session with Pushkar.