September 12, 2023

Lakehouse or Warehouse? Part 2 of 2

Written by:

Floyd Smith

This blog post includes the second half of an important talk by Onehouse CEO Vinoth Chandar delivered at Data Council Austin in March 2022. Part 1, the first half, includes the introduction and architectural differences between the data lakehouse and the data warehouse, and a live link to the recorded talk.

This post, which is Part 2, compares the capabilities and price/performance characteristics of the architectures. It concludes by describing a future-proof, lakehouse-first architecture.

Comparing Data Warehouses and Data Lakehouses: Capabilities

It's impossible to look at every technical capability, but hopefully these slides will give you a picture and establish a pattern of the theme that you can observe.

For core reading and writing: at this point, both lakehouses and warehouses can support updates, deletes, transactions, statistics, sorting, bulk loads, audit, rollbacks, time travel, these kinds of basic functionality - you'll find these in all the stacks. You'll find that warehouses have more complete relational or database functionality like multi-table transactions, clustering, and these kinds of things.

On the lakehouse, most of this tech is delivered as libraries, as you know. And your mileage really varies with the libraries that you use. In Hudi, for example, we supported keys from the get-go, because the unique key constraint was a pretty hard requirement for us when we started out back at Uber.

“Lakehouse design choices reflect more the higher-scale data needs….”

Then you have space-filling curves. You'll find some are proprietary, some are open, and then there are some specific design choices around streaming updates and all these things. In general, you'll see that the lakehouse design choices reflect more the higher-scale data needs at this point, while warehouses try to stay close to the database origins that they started with.

The other set of things that you need is - it's not just writing a bunch of data and then walking away. You need some services that can, at the core, optimize your tables, ensure really good performance. And I've listed some of them here.

You can see a pattern here, like clustering or sorting for good performance. Space reclamation. As data becomes dirty, how is it managed? And there's a huge difference in terms of how automatically managed they are, or whether you need to take some of these libraries on the lakehouse and then run them yourself.

Something like file sizing is completely... Warehouses hide files, but they're still exposed on the lakehouse, and if you don't manage them well, they can affect performance. Some frameworks have automatic enforcement; some are DIY.

You'll find a mix of management gaps - even in Hudi, we have these capabilities, but you still need to self-manage it. Hudi can do clustering, but you need to run a clustering job yourself and operate it. While if you just go with a managed service, these are managed internally for you. Talking about caching - it's pretty transparent, and depending on the vendor, you'll either get caching or not on the lakehouse.

There are other things that you need to build a warehouse and declare it open for your company. Things like database change data capture (CDC), ingesting log events, and things like that. A lot of people who use cloud - the modern data stack or cloud data warehousing - have managed, zero-ops ingestion services that they prefer to use to get data in and build tables. But on the lake, people hand-roll their own Debezium or some other thing, tie it to a library like Hudi to build an end-to-end system, and then operate it.

So the big takeaway here is: if you want to build a lakehouse today and operationalize it, it does take some time, and you have to build it brick by brick by looking at all these different problems and problems spaces and putting solutions together. This gives you great flexibility, but it does take some time-to-market for you to operationalize this.

Comparing Data Warehouses and Data Lakehouses: Price/Performance

This is a very interesting topic: price/performance. First, let's try to think about the different workloads you have. Often I find that when people talk about price/performance, you just see TPC-DS wars and some numbers being thrown around.

In reality, you have at least these four different kinds of workloads, and not all of them need a premium engine; for instance, SQL optimizations aren't necessarily going to help you with ingestion. It's a very I/O-bound thing where you can afford to not run a premium engine. So we'll get into that a little bit.

Different workloads have different price/performance characteristics. That's what I'm trying to emphasize here. And locking yourself to one engine... On the lake, where you still have to do a lot of work to manage your tables, you have the option of picking the right engine for each use case. So for some, maybe you pick the open source version, some we go with the vendor offering. So the lake is good in the sense that you get that optionality, but with the warehouse, it's more or less priced in; all of these things are priced in for you.

This is about the famous TPC-DS wars that you always read about. We are going to do something different. We're going to try to understand the anatomy of a query.

What really happens? How does each component in a lakehouse or a warehouse react to each of these segments relating to performance, and where does the performance difference come from?

One is fetching metadata. You look at table metadata, schema, statistics, and whatever's needed for you to understand how to plan the query.

The lakehouse is stored today either in a metastore like Hive, or in flat files which store statistics about the table. And then warehouses predominantly use a database to store that. CS101 will tell you that databases will outperform flat files for a wide range of things when the data size is big. So the same laws apply. And I think it's still early days and we have to evolve the statistics management in the lakehouse beyond this.

If you go to query planning, then it pretty much depends on the engine and the vendor. This is where most of the engine open source versus vendor proprietary margins and all the competitive IP is built. Just going by the problem space, query optimization is a very hard problem, which gets better with the amount of time you spend in it. Just due to that, I would imagine there'll be much more mature optimizers out there; this is an evolving landscape.

Once you plan a query, when you execute it, if you look at a lakehouse, we employ open file formats, and on the warehouse, there are closed file formats. I don't think a lot of the performance differences come from the file format themselves, but more like how the runtime - if you do a group-by query, let's say you have to shuffle data around, how does the engine do it? And even within the lake engines, there is huge variance.

Something like Spark would be more resilient to shuffle failures, retries; but something like Presto wants to keep the interactive query performance and the lower latency. So probably it'll fail and it'll make different choices when it shuffles data and answers queries.

So the high-level answer is, it depends. TPC-DS is great, but I routinely found that, when people run their own workloads, they find that the gap isn't that big or it isn't that small. Always test with your own key table, take the top five queries or similar, and then try to test that, is what I would say.

This table gives you a good sense. So over the base EC2 cost, you have different engines pricing differently. This benefits you when for cost, price/performance, let's say your engine charges you 2X more but can finish that query 3X faster than the net, you gain in cost.

But that’s not the case everywhere. And underneath, you have to consider: does everything need to be fast? Because caches are pretty fast as well, for example, but they cost money. If you can afford to be slower in some cases, it can be cheaper as well.

“Benchmark your own workloads and understand which workload needs to go where.”

It's more nuanced; fastest isn't the only thing. Look at data warehouse pricing as well: some price by query, some price on different virtual instance types, which is hard to map to cloud. It's a little bit of a complicated thing out there. But the main thing to understand is, benchmark your own workloads and understand which workload needs to go where.

The main thing I want to stress is the case of the missing benchmark. Surprisingly, what I've noticed is ETL workloads, a lot of times, don't get that much attention - even though a lot of people spend a bunch of time on them. Because the problem of computing derived data sets out of all the data that your company's collecting every day is very different from trying to answer a more selective query where you're trying to figure out, "Okay, give me the sales for this region," or a dashboarding problem like that one.

And actually, there's a benchmark, TPC-DI, for data integration, but I rarely see anybody reporting anything around it. At least from our experience, what we've seen is if you look at the patterns on the right, they depict a fact table, an event table, and some dimension table being updated. You can see that the update patterns are really very different. We don't believe all these workloads are the same. So we need better standardization. And this is an issue that I regularly see people hitting, but there is not enough out there in terms of standardized benchmarks to compare all of these things on.

And finally, these data optimizations really do matter. We have very little here as well in terms of, "Okay, how does file sizing affect query performance? Do you take more hit on the write versus the read?"

As we've seen, these are studies of how clustering can improve performance. One on Hudi, one from AWS, one from our own blogs and open source. Yeah, you can carefully control data layout, but how do you do it across hundreds of tables without really having side effects? How do you globally optimize for cost and by data layout control? These are pretty open problems and a lot of people have to hand-tune and spend a lot of time today to squeeze value out of these techniques.

"...with the data lakehouse, things get cheaper, relatively speaking, as you scale."

I want to summarize and leave you with some food for thought on the data warehouse vs. the data lakehouse:

Openness. The data warehouse is typically closed; the data lakehouse is open (but conditions apply).
Use cases. The data warehouse was designed for BI; the data lakehouse has better data science, analytics, and ML/AI support.
Operations. The data warehouse tends to be fully managed; for the data lakehouse, setup is DIY and it’s difficult to manage. (Note: the Onehouse service and other managed services address this problem.)
Expense. With the data warehouse, the expense comes as you scale; with the data lakehouse, things get cheaper, relatively speaking, as you scale.

Toward a Lakehouse-First, Future-Proof Architecture

A question that I've been asked a lot is: when do people adopt a lakehouse?

Commonly, you start like this: you simply replicate your operational data; then you have some dashboards; and most people are happy there. Then, with even modest growth in data size, you start getting into doing some ETLs on the warehouse.

Eventually, you either hit cost constraints, or the company's big enough where you care about openness, open formats, all of the stuff that we talked about. And you form a data science team and they'll tell you, "Oh, let's go instrument the app with lots of events." I've seen this a lot from my time at Confluent and also with Kafka experience over the years.

You'll see that the event data lands on the lakehouse and then you start writing some ETLs which need to be fed some data from the warehouse. You also copy any ETLs that you offloaded from the warehouse, you start copying the data back as well. So most people are in this world where there isn't a single source for lineage for your data.

We feel like instead of polarizing the world into two camps, if we take a step back and think about from a user/organization standpoint, what would we need? I would distill it down to a few pillars, which I've seen companies eventually come to realize and adopt.

Interoperability

Lakehouses have to be interoperable with everything, even with warehouses. Why not? It’s just another engine that can query. So why shouldn't you be able to do that? Then you can keep the source data in the lakehouse.

Across all engines, you should be able to access that data in a crosscutting manner. For a given engine, you will often have a choice between open source; a managed service from a vendor; and/or a cloud provider offering.

There should be easy uptake for new tech. Today, vendors shouldn't be controlling a lot of the interoperability as much as they are, because these formats and these low-level pieces become anchor points to build some sort of lock-in.

Keeping Data on the Lake

By “on the lake,” I mean in open data formats. We've seen vendors rise and fall, but the data sits there.

In the last five years, I've seen that replicating your database changes or all your raw operational data directly onto the lakehouse in an open format sets you up really well for the future. Because these are usually the most voluminous data sets.

"...replicating your database changes or all your raw operational data directly onto the lakehouse in an open format sets you up really well for the future."

Even on the warehouse, what usually affects your cost is the raw ETL in the first layer of landing it into the warehouse and then running a dbt to get your tables. That's the ETL cost that I've seen.

So using the lakehouse sets you up really well for that. And you can always recompute anything. You can switch query engines, you can recompute, you get this really great flexibility in this architecture.

And most importantly, and I think this is very understated: this now feeds a single source of truth, meaning the same data feeds into both BI and your data science teams. Which means they can just worry about the data; they don't have a lot of these data quality surprises. They can just focus on how their analysis is wrong or the query is wrong rather than about whether we were fed different inputs.

Manage on the Lake

As much as possible, we can manage a lot on the lakehouse because it doesn't need premium compute. A lot of these management services are more IO-bound or dependent on external interactions. All the SQL optimizations don't really count here.

But we need a lot more here in terms of management. And again, we live in a world of specialized engines, so we really need to plan for that. And external tables on the warehouses today, I don't know if they're on par with their native formats. So that needs to improve.

We see three best practices to make the most of the interoperable lakehouse:

Keep big batch jobs in the lakehouse. Run scans over raw cloud storage.
Handle small ETL jobs where it’s convenient. These can run on the lakehouse or the warehouse.
More toward incremental updates. Move more toward an incremental model.

All of this gives you choices on price/performance, allowing you to have cost-effective ETL - dramatically reducing your costs, with no loss of performance where it matters.

When everything is done, it looks like this. You go directly to a lakehouse and it's open, it has all the interoperability, and you can write ETLs there in a cost-effective way. It's already ML/data engineering ready. If you can get the data in quickly, even your ETLs and the ETL data, out into the warehouse or directly query them, you're going to get really good performance for BI and reporting.

You can easily explore approaches like ML or streaming. A lot of these new emerging areas can be well integrated because again, it's an open format, and the ecosystem can evolve on its own, not beholden to vendors. And of course this is proven out to support data science and engineering. This is what I've seen a lot of companies choose.

Return to Part 1 or watch the video.

If you’re interested in learning more about Onehouse, visit https://www.onehouse.ai/product.

Authors

No items found.