October 20, 2023

It’s Time for the Universal Data Lakehouse

What I am about to propose in this blog post is not new. In fact, many organizations have already invested several years and the work of expensive data engineering teams to slowly build some version of this architecture. I know this because I have been one of those engineers before, at Uber and LinkedIn. I have also worked with hundreds of organizations building this in open-source communities, moving towards similar goals.

Back in 2011 at LinkedIn, we started off using a proprietary data warehouse. As data science and machine learning applications like “People you may know” were built, we steadily moved towards a data lake built on Apache Avro and accessed through Apache Pig and MapReduce, which became the source of truth for the analytics, reporting, machine learning, and data applications you see on the service today. Fast-forward a few years, and we faced the same challenge at Uber, this time with transactional data and a genuinely real-time business, where weather or traffic can instantly influence pricing or ETAs. By building Apache Hudi, we created a transactional data lake over Parquet, Presto, Spark, Flink, and Hive as the entry point for all our data, delivering the world’s first data lakehouse, even before the term was coined.

The architectural challenges organizations face today are not about picking the one right format or compute engine. The dominant formats and engines can change over time, but this underlying data architecture has stood the test of time by being simply universal across a variety of use cases, allowing users to pick the right choice for each. This blog post urges you to proactively consider this inevitable architecture as the foundation of your organization’s data strategy.

Today’s cloud data architecture is broken

In my experience, an organization’s cloud data journey follows a familiar plot today. The medallion architecture offers a good way to conceptualize this as data is transformed for different use cases. The typical “modern data stack” is born by replicating operational data into a “bronze” layer on a cloud data warehouse, using point-to-point data integration tools. This data is then cleaned, audited for quality, and prepared into a “silver” layer. Finally, a set of batch ETL jobs transforms this silver data into facts, dimensions, and other models to create a “gold” data layer, ready to power analytics and reporting.
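To make the medallion flow above concrete, here is a minimal PySpark sketch of a bronze-to-silver-to-gold batch pipeline. The table names (bronze.orders, silver.orders, gold.daily_revenue) and columns are hypothetical, chosen only for illustration.

```python
# Minimal sketch of a medallion-style batch pipeline (hypothetical table/column names).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw operational data replicated as-is from the source database.
bronze = spark.table("bronze.orders")

# Silver: cleaned, de-duplicated records ready for downstream consumers.
silver = (
    bronze
    .filter(F.col("order_id").isNotNull())        # basic quality filter
    .dropDuplicates(["order_id"])                  # drop replay duplicates
    .withColumn("order_ts", F.to_timestamp("order_ts"))
)
silver.write.mode("overwrite").saveAsTable("silver.orders")

# Gold: an aggregated model ready for BI and reporting.
gold = (
    silver
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_revenue"))
)
gold.write.mode("overwrite").saveAsTable("gold.daily_revenue")
```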

Organizations are also exploring newer use cases such as machine learning, data science, and emerging AI/LLM applications. These use cases often require massive amounts of data, so teams add new data sources like event streams (e.g., clickstream events, GPS logs) at 10-100x the scale of their existing database replication.

Supporting high-throughput event data introduces the need for inexpensive cloud storage and the massive horizontal compute scalability of the data lake. However, while the data lake supports append-only workloads (no merges), it has little to no support for handling database replication. And when it comes to high-throughput mutable data streams from NoSQL stores, document stores, or new-age relational databases, no current data infrastructure system offers adequate support.

Figure 1 : Data warehouses support low-throughput mutable workloads, while data lakes handle high-throughput append-only workloads

Since each approach has strengths specific to certain workload types, organizations end up maintaining both a data warehouse and a data lake. To consolidate data across sources, they periodically copy data between the data warehouse and the data lake. The data warehouse, with its fast queries, serves business intelligence (BI) and reporting use cases, while the data lake, with its support for unstructured storage and low-cost compute, serves use cases for data engineering, data science, and machine learning.

Figure 2: Typical hybrid architecture with a mix of a data warehouse and a data lake based on data sources

Sustaining an architecture like that shown in Figure 2 is challenging, expensive, and error-prone. Periodic data copies between the lake and warehouse lead to stale and inconsistent data. Governance becomes a headache for everyone involved, as access control is split between systems, and data deletion (think GDPR) must be managed on multiple copies of the data. Not to mention, teams are on the hook for each of these various pipelines, and ownership can quickly become murky.

This introduces the following challenges for an organization:

  1. Vendor lock-in: The source of truth for high-value operational data is often a proprietary data warehouse, which creates lock-in points. 
  2. Expensive ingestion and data prep: While data warehouses offer merge capabilities for mutable data, they perform poorly for fast, incremental ingestion of upstream databases or streaming data. The warehouse’s expensive, premium compute engines, optimized for gold-layer workloads such as SQL on star schemas, are employed even for the bronze (data ingestion) and silver (data preparation) layers, where they add little value. This typically results in ballooning costs for the bronze and silver layers as the organization scales. 
  3. Wasteful data duplication: As new use cases emerge, organizations duplicate their work, wasting storage and compute resources on redundant bronze and silver layers for each use case. For example, the same data is ingested and copied once for analytics and once for data science, wasting engineering and cloud resources. Considering that organizations also provision multiple environments like development, staging, and production, the compounded costs across the entire infrastructure can be staggering. Additionally, the costs of enforcing compliance regulations such as GDPR and CCPA, along with data optimizations, are incurred multiple times across multiple copies of the same data flowing in through various entry points.
  4. Poor data quality: Individual teams often reinvent the foundational data infrastructure for ingesting, optimizing, and preparing data in a piecemeal fashion. These efforts have frustratingly slow ROI or fail altogether due to a lack of resources, putting data quality at risk across the organization, as data quality is only as strong as the weakest data pipeline. 

Rise of the Data Lakehouse

During my time leading the data platform team at Uber, I felt the pain of this broken architecture firsthand. Large, slow batch jobs copying data between the lake and the warehouse pushed data latency beyond 24 hours, which slowed our entire business. Ultimately, the architecture could not scale efficiently as the business grew; we needed a better solution that could process data incrementally.

In 2016, my team and I created Apache Hudi, which finally allowed us to combine the low-cost, high-throughput storage and compute of a data lake with the merge capabilities of a warehouse. The data lakehouse - or the transactional data lake, as we called it at the time - was born.

Figure 3: Data lakehouses fill the gap left by warehouses and lakes by supporting high-throughput mutable data

The data lakehouse adds a transactional layer to the data lake in cloud storage, giving it functionality similar to a data warehouse while maintaining the scalability and cost profile of a data lake. Powerful capabilities are now possible, such as support for mutable data with upserts and deletes using primary keys, ACID transactions, optimizations for fast reads through data clustering and small-file handling, table rollbacks, and more.

Figure 4: The data lakehouse adds a transactional layer to the data lake
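To illustrate what mutable-data support looks like in practice, here is a minimal sketch of an Apache Hudi upsert in PySpark, assuming the Hudi Spark bundle is on the classpath; the table name, record key, and storage path are placeholders chosen for illustration.

```python
# Minimal sketch: upserting change records into a Hudi table (hypothetical names/paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Incoming change records, e.g. from a CDC feed (schema is illustrative).
changes = spark.createDataFrame(
    [("u123", "active", "2023-10-20 10:15:00")],
    ["user_id", "status", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "silver_users",
    "hoodie.datasource.write.recordkey.field": "user_id",      # primary key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest record wins
    "hoodie.datasource.write.operation": "upsert",             # merge into existing rows
}

(changes.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lakehouse/silver_users"))
```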

Most importantly, it finally makes it possible to store all your data in one central layer. The data lakehouse is capable of storing all data that previously lived in the warehouse and lake, eliminating the need to maintain multiple data copies. At Uber, this meant we could run fraud models without delay, enabling same-day payments to drivers. And we could track up-to-the-minute traffic and even weather patterns to update ETA predictions in real time.

However, achieving such powerful outcomes is not merely an exercise in picking table formats or writing jobs or SQL; it requires a well-balanced, well-thought-out data architectural pattern implemented with the future in mind. I call this architecture the “Universal Data Lakehouse”.

The Universal Data Lakehouse Architecture

The universal data lakehouse architecture puts a data lakehouse at the center of your data infrastructure, giving you a fast, open, and easy-to-manage source of truth for business intelligence, data science, and more.

Figure 5: The universal data lakehouse architecture

By adopting the universal data lakehouse architecture, organizations can overcome the previously insurmountable challenges of the disjoint architecture that continually copies data between the lake and the warehouse. Thousands of organizations already using both data lakes and data warehouses can reap these benefits by adopting this architecture:

Unifying Data

The universal data lakehouse architecture uses a data lakehouse as the source of truth inside your organization’s cloud accounts, with data stored in open source formats. Additionally, the lakehouse can handle data from complex distributed databases at a scale that was previously too cumbersome for the data warehouse.

Ensuring Data Quality

This universal layer of data provides a convenient entry point in the data flow to perform data quality checks, schematize semi-structured data, and enforce data contracts between data producers and consumers. Data quality issues can be contained and corrected within the bronze and silver layers, ensuring that downstream tables are always built on fresh, high-quality data. This streamlining of the data flow simplifies the architecture, reduces cost by moving workloads to cost-efficient compute, and eliminates duplicate compliance efforts like data deletion.
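As a rough sketch of the kind of quality gate this entry point enables, the example below rejects an incoming batch that violates a simple contract before it reaches the silver layer; the table and column names and the 1% threshold are assumptions, not a prescription.

```python
# Minimal sketch of a quality gate between bronze and silver (hypothetical columns/thresholds).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality-gate-sketch").getOrCreate()

batch = spark.table("bronze.events")

total = batch.count()
null_keys = batch.filter(F.col("event_id").isNull()).count()

# Contract: at most 1% of records may be missing the primary key.
if total > 0 and null_keys / total > 0.01:
    raise ValueError(f"Quality gate failed: {null_keys}/{total} records missing event_id")

# Only schema-conforming, non-null-key records flow into the silver layer.
(batch.filter(F.col("event_id").isNotNull())
      .write.mode("append")
      .saveAsTable("silver.events"))
```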

Reducing Costs

Since both operational data from databases and high-scale event data are stored and processed across a single bronze and silver layer, ingestion and data prep run just once, on low-cost compute. We have seen impressive examples of multi-million-dollar savings in cloud data warehouse costs from moving ELT workloads to this architecture on a data lakehouse.

Keeping data in open formats enables the costs of data optimization and management to be amortized across all three layers, bringing dramatic savings to your data platform.

Faster Performance

The universal data lakehouse improves performance in two ways. First, it’s designed for mutable data, rapidly absorbing updates from change data capture (CDC), streaming data, and other sources. Second, it opens the door to moving workloads away from big, bloated batch processing to an incremental model for speed and efficiency. Uber saved ~80% in overall compute cost by using Hudi for incremental ETL, while simultaneously improving performance, data quality, and observability.

Bringing Freedom to Choose Compute Engines

Unlike a decade ago, today’s data needs don’t stop at traditional analytics and reporting. Data science, machine learning, and streaming data are mainstream and ubiquitous across Fortune 500 companies and startups alike. Emerging use cases such as deep learning and LLMs are bringing a wide variety of new compute engines, each offering superior performance and experience for the workload it is optimized for. The conventional wisdom of picking one warehouse or lake engine upfront throws away all the advantages the cloud offers; the universal data lakehouse makes it easy to spin up the right compute engine on demand for each use case.

The universal data lakehouse architecture makes data accessible across all major data warehouses and data lake query engines and integrates with any catalog – a major shift from the prior approach of coupling data storage with one compute engine. This architecture enables you to seamlessly build specialized downstream “gold” layers across BI & reporting, machine learning, data science, and countless more use cases, using the engines that are the best fit for each unique job. For example, Spark is great for data science workloads, while data warehouses are battle-tested for traditional analytics and reporting. Beyond technical differences, pricing and the move to open source play a crucial role in which compute engines an organization adopts.

For example, Walmart built their lakehouse on Apache Hudi, ensuring they could easily leverage new technologies in the future by storing data in an open source format. They used the universal data lakehouse architecture to empower data consumers to query the lakehouse with a wide range of technologies, including Hive, Spark, Presto, Trino, BigQuery, and Flink.

Taking Back Ownership of Your Data

All the source-of-truth data is held in open source formats in the bronze and silver layers, within your organization’s cloud storage buckets.

Accessibility of data is dictated by you – not by an opaque third-party system with vendor lock-in. This architecture gives you the flexibility to run data services inside the organization’s cloud networks (rather than in vendors’ accounts), to tighten security and support highly regulated environments.

Additionally, you’ll be free to either manage data using open data services or to buy managed services, avoiding lock-in points on data services.

Simplifying Access Control

With data consumers operating on a single copy of the bronze and silver data within the lakehouse, access control becomes much easier to manage and enforce. The data lineage is clearly defined, and teams no longer need to manage separate permissions across multiple disjoint systems and copies of the data.

Choosing the right technology for the job

While the universal data lakehouse architecture is very promising, some key technology choices are crucial to realize its benefits in practice.

It’s imperative that ingested data is made available at the silver layer as quickly as possible, since any delay now impedes multiple use cases. To achieve the best combination of data freshness and efficiency, organizations should choose a data lakehouse technology that is well-suited for streaming and incremental processing. This helps handle tough write patterns, like random writes during ingest at the bronze layer, and makes it possible to leverage change streams to incrementally update silver tables without reprocessing the bronze layer again and again, as sketched below.
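Here is a minimal sketch of that incremental pattern using Apache Hudi’s incremental query in PySpark; the table path and checkpoint instant are placeholders, and in practice the checkpoint would be persisted between runs.

```python
# Minimal sketch: incrementally reading new commits from a bronze Hudi table (placeholder paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()

# Commit time up to which the silver job has already processed data
# (a placeholder; a real job would persist this checkpoint between runs).
last_processed_instant = "20231020000000"

incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_processed_instant)
    .load("s3://my-bucket/lakehouse/bronze_orders")
)

# Only the records changed since the checkpoint are returned; these would then be
# transformed and upserted into the silver table (as in the earlier upsert sketch).
incremental.createOrReplaceTempView("bronze_changes")
spark.sql("""
    SELECT order_id, amount, updated_at
    FROM bronze_changes
    WHERE amount IS NOT NULL
""").show()
```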

While I may hold some bias, my team and I built Apache Hudi around these universal data lakehouse principles. Hudi is battle-tested and generally regarded as the best fit for these workloads, while also providing a rich layer of open data services to preserve optionality between build and buy. Furthermore, Hudi unlocks the stream data processing model on top of a data lake, dramatically reducing the runtimes and cost of traditional batch ETL jobs. Down the road, I believe the universal data lakehouse architecture can also be built on future technologies that offer similar or better support for these requirements.

Finally, Onetable (soon to be made available in open source) is another building block for the universal data lakehouse architecture. It brings interoperability across major lakehouse table formats (Apache Hudi, Apache Iceberg, and Delta Lake) with easy catalog integrations, allowing you to set your data free across compute engines and build downstream gold layers in different formats. These benefits are already being validated by Fortune 10 enterprises like Walmart.

What’s next? 

In this blog, we introduced the universal data lakehouse as the new way cloud data infrastructure should be architected. In doing so, we simply gave a name to, and outlined, the data architecture that hundreds of organizations (including large enterprises like GE, TikTok, Amazon, Walmart, Disney, Twilio, Robinhood, and Zoom, across tech, retail, manufacturing, social networking, media, and other industries) have built using data lakehouse technologies such as Apache Hudi. This approach is simpler, faster, and far less expensive than the hybrid architectures many companies maintain today. It features true separation of storage and compute while enabling practical ways to employ best-of-breed compute engines across your data. In the coming years, we believe it will only grow more popular, driven by the growth of ML and AI, rising cloud costs, increasing complexity, and growing demands on data teams. For more background on this topic, see my recorded talk at Data Council in Austin or our related blog posts.

While I truly believe in the “right engine for the right workload on the same data” principle, it’s non-trivial to make that choice in an objective and scientific manner today, due to a lack of standardized feature comparisons and benchmarks, a lack of shared understanding of key workloads, and other factors. In future blog posts in this series, we will share how the universal data lakehouse handles different data transfer modalities - batch, CDC, and streaming - and how it works, in a “better together” fashion, with compute engines such as Amazon Redshift, Snowflake, BigQuery, and Databricks.

No prizes for guessing that Onehouse offers a managed cloud service that provides a turnkey experience to build the universal data lakehouse architecture outlined in this blog. Users like Apna have already improved data freshness from several hours to minutes and significantly reduced costs by cutting out their data integration tool and replacing the warehouse with Onehouse for their bronze and silver data. With the universal data lakehouse architecture, their analysts could continue using the warehouse to serve queries on the data stored in the lakehouse. 

And if you’d like to start implementing the universal data lakehouse in your organization today, contact us. You can also subscribe to our blog for more on the universal data lakehouse.
