What I am about to propose in this blog post is not new. In fact, many organizations have already invested several years and the work of expensive data engineering teams to slowly build some version of this architecture. I know this because I have been one of those engineers before, at Uber and LinkedIn. I have also worked with hundreds of organizations building this in open-source communities, moving towards similar goals.
Back in 2011 at LinkedIn, we started out on a proprietary data warehouse. As data science/machine learning applications like “People you may know” were built, we steadily moved towards a data lake on Apache Avro, processed with Apache Pig and MapReduce, as the source of truth for the analytics, reporting, machine learning and data applications you see on the service today. Fast-forward a few years, and we faced the same challenge at Uber, this time with transactional data and a genuinely real-time business, where weather or traffic can instantly influence pricing or ETAs. By building Apache Hudi, we created a transactional data lake over Parquet, Presto, Spark, Flink and Hive as the entry point for all our data, delivering the world’s first data lakehouse, even before the term was coined.
The architectural challenges organizations face today are not about picking the one right format or compute engine. The dominant formats and engines can change over time, but this underlying data architecture has stood the test of time, simply by being universal across a variety of use cases and allowing users to pick the right choice for each. This blog post urges the reader to proactively consider this inevitable architecture as the foundation of your organization’s data strategy.
In my experience, an organization’s cloud data journey follows a familiar plot today. The medallion architecture offers a good way to conceptualize this as data is transformed for different use cases. The typical “modern data stack” is born by replicating operational data into a “bronze” layer on a cloud data warehouse, using point-to-point data integration tools. This data is then subsequently cleaned, audited for quality and prepared into a “silver” layer. Then a set of batch ETL jobs transform this silver data into facts, dimensions and other models to ultimately create a “gold” data layer, ready to power analytics and reporting.
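To make the medallion flow concrete, here is a minimal Python sketch of the three layers. The record shapes and field names are invented for illustration; a real pipeline would run this on Spark, dbt, or similar, not in plain Python:

```python
# Minimal sketch of a medallion (bronze -> silver -> gold) flow.
# Field names and records are hypothetical.

raw_orders = [  # "bronze": raw replicated operational data, warts and all
    {"order_id": "1", "amount": "30.00", "country": "us"},
    {"order_id": "2", "amount": "12.50", "country": "US"},
    {"order_id": "2", "amount": "12.50", "country": "US"},  # duplicate from replication
]

def to_silver(rows):
    """Clean, deduplicate, and type bronze rows into a 'silver' table."""
    seen, silver = set(), []
    for r in rows:
        if r["order_id"] in seen:
            continue  # drop replication duplicates
        seen.add(r["order_id"])
        silver.append({"order_id": r["order_id"],
                       "amount": float(r["amount"]),     # schematize
                       "country": r["country"].upper()})  # standardize
    return silver

def to_gold(silver):
    """Aggregate silver rows into a 'gold' reporting model (revenue by country)."""
    totals = {}
    for r in silver:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

gold = to_gold(to_silver(raw_orders))
# gold == {"US": 42.5}
```

The point of the layering is exactly what the sketch shows: cleanup happens once, on the way into silver, so every gold model downstream starts from trusted data.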
Organizations are also exploring newer use cases such as machine learning, data science, and emerging AI/LLM applications. These use cases often require massive amounts of data, so teams will add new data sources like event streams (e.g., clickstream events, GPS logs, etc.) at 10-100x the scale of their existing database replication.
Supporting high-throughput event data introduces the need for inexpensive cloud storage and the massive horizontal compute scalability of the data lake. However, while the data lake supports append-only workloads (no merges), it has little to no support for handling database replication. When it comes to high-throughput mutable data streams from NoSQL stores, document stores, or new-age relational databases, no current data infrastructure system offers adequate support.
Since each approach has strengths specific to certain workload types, organizations end up maintaining both a data warehouse and a data lake. In order to consolidate data between sources, they will periodically copy data between the data warehouse and data lake. The data warehouse with its fast queries serves business intelligence (BI) and reporting use cases, while the data lake, with its support for unstructured storage and low-cost compute, serves use cases for data engineering, data science, and machine learning.
Sustaining an architecture like that shown in Figure 2 is challenging, expensive, and error-prone. Periodic data copies between the lake and warehouse lead to stale and inconsistent data. Governance becomes a headache for everyone involved, as access control is split between systems, and data deletion (think GDPR) must be managed on multiple copies of the data. Not to mention, teams are on the hook for each of these various pipelines, and ownership can quickly become murky.
During my time leading the data platform team at Uber, I felt the pain of this broken architecture firsthand. Large, slow batch jobs copying data between the lake and the warehouse delayed data by more than 24 hours, which slowed our entire business. Ultimately, the architecture could not scale efficiently as the business grew; we needed a better solution that could process data incrementally.
In 2016, my team and I created Apache Hudi, which finally allowed us to combine the low-cost, high-throughput storage and compute of a data lake with the merge capabilities of a warehouse. The data lakehouse - or the transactional data lake, as we called it at the time - was born.
The data lakehouse adds a transactional layer to the data lake in cloud storage, giving it functionality similar to a data warehouse while maintaining the scalability and cost profile of a data lake. Powerful capabilities are now possible, such as support for mutable data with upserts and deletes using primary keys, ACID transactions, optimizations for fast reads through data clustering and small-file handling, table rollbacks, and more.
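To make the merge capability concrete, here is a toy Python model of the upsert/delete-by-primary-key semantics a lakehouse table provides. This models the behavior only; it is not Apache Hudi's actual API, which performs these operations transactionally over cloud storage:

```python
# Sketch of primary-key upsert/delete semantics on a lakehouse table.
# A real lakehouse does this with ACID guarantees at massive scale.

table = {}  # primary key -> record

def upsert(records):
    """Insert new records, or overwrite existing ones with the same primary key."""
    for r in records:
        table[r["id"]] = r

def delete(keys):
    """Remove records by primary key (think GDPR deletion requests)."""
    for k in keys:
        table.pop(k, None)

upsert([{"id": 1, "city": "SF"}, {"id": 2, "city": "NYC"}])
upsert([{"id": 1, "city": "Seattle"}])  # an update, not a new row
delete([2])
# table == {1: {"id": 1, "city": "Seattle"}}
```

Plain data lakes can only append; it is precisely this merge-by-key behavior, layered over cheap object storage, that lets the lakehouse absorb database replication streams.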
Most importantly, it finally makes it possible to store all your data in one central layer. The data lakehouse is capable of storing all data that previously lived in the warehouse and lake, eliminating the need to maintain multiple data copies. At Uber, this meant we could run fraud models without delay, enabling same-day payments to drivers. And we could track up-to-the-minute traffic and even weather patterns to update ETA predictions in real time.
However, achieving such powerful outcomes is not merely an exercise in picking table formats or writing jobs or SQL; it requires a well-balanced, well-thought-out data architectural pattern implemented with the future in mind. I call this architecture the “Universal Data Lakehouse”.
The universal data lakehouse architecture puts a data lakehouse at the center of your data infrastructure, giving you a fast, open, and easy-to-manage source of truth for business intelligence, data science, and more.
By adopting the universal data lakehouse architecture, organizations can overcome the previously insurmountable challenges of the disjoint architecture that continually copies data between the lake and the warehouse. Thousands of organizations already using both data lakes and data warehouses can reap these benefits by adopting this architecture:
The universal data lakehouse architecture uses a data lakehouse as the source of truth inside your organization’s cloud accounts, with data stored in open source formats. Additionally, the lakehouse can handle data at the scale of complex distributed databases, which was previously too cumbersome for the data warehouse.
This universal layer of data provides a convenient entry point in the data flow to perform data quality checks, schematize semi-structured data and enforce any data contracts between data producers and consumers. Data quality issues can be contained and corrected within the bronze and silver layers, ensuring that downstream tables are always built on fresh, high-quality data. This streamlining of the data flow simplifies the architecture, reduces cost by moving workloads to cost-efficient compute and eliminates duplicate compliance efforts like data deletion.
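A data contract check at the bronze/silver boundary might look like the following sketch. The schema and quarantine behavior here are illustrative assumptions, not a prescribed implementation:

```python
# Sketch of a data-quality gate between the bronze and silver layers.
# Rows violating the contract are quarantined instead of flowing downstream.

SCHEMA = {"user_id": str, "event_type": str}  # hypothetical data contract

def enforce_contract(rows):
    """Split rows into contract-passing and quarantined sets."""
    good, quarantined = [], []
    for r in rows:
        ok = all(isinstance(r.get(field), typ) for field, typ in SCHEMA.items())
        (good if ok else quarantined).append(r)
    return good, quarantined

rows = [{"user_id": "u1", "event_type": "click"},
        {"user_id": None, "event_type": "view"}]  # violates the contract
good, bad = enforce_contract(rows)
# good has 1 row; bad has 1 row held for inspection and correction
```

Because every consumer reads from the same silver layer, a gate like this runs once and protects every downstream gold table at the same time.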
Since both operational data from databases and high-scale event data are stored and processed across a single bronze and silver layer, ingestion and data prep can run just once on low-cost compute. We have seen impressive examples of multi-million dollar savings in Cloud Data Warehouse costs by moving ELT workloads to this architecture on a data lakehouse.
Keeping data in open formats enables all data optimizations and management costs to be amortized across all three layers, bringing dramatic cost savings to your data platform.
The universal data lakehouse improves performance in two ways. First, it’s designed for mutable data, rapidly absorbing updates from change data capture (CDC), streaming data, and other sources. Second, it opens the door to move workloads away from big bloated batch processing to an incremental model for speed and efficiency. Uber saved ~80% in overall compute cost by using Hudi for incremental ETL. They simultaneously improved performance, data quality, and observability.
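The difference between the two models can be shown with a running aggregate. This is a deliberate simplification of what incremental ETL on a lakehouse does; the record shapes are invented:

```python
# Batch model: recompute the total from scratch every run.
# Cost grows with the full history, even if little changed.
def batch_total(all_records):
    return sum(r["amount"] for r in all_records)

# Incremental model: fold only the new records into the prior state.
# Cost grows only with the volume of change.
def incremental_total(prev_total, new_records):
    return prev_total + sum(r["amount"] for r in new_records)

history = [{"amount": 10}, {"amount": 20}]  # already processed; total was 30
new = [{"amount": 12}]                      # the only new arrivals

full = batch_total(history + new)   # touches every record, old and new
incr = incremental_total(30, new)   # touches only the one new record
# full == incr == 42
```

Both paths produce the same answer; the incremental one simply avoids rereading history on every run, which is where the large compute savings come from.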
Unlike a decade ago, today’s data needs don’t stop at traditional analytics and reporting. Data science, machine learning and streaming data are mainstream and ubiquitous across Fortune 500 companies and startups alike. Emerging data use-cases such as deep learning and LLMs are bringing a wide variety of new compute engines with superior performance/experience optimized for each workload independently. The conventional wisdom of picking one warehouse or lake engine upfront throws away all the advantages the cloud offers; the universal data lakehouse makes it easy to spin up the right compute engine on demand for each use case.
The universal data lakehouse architecture makes data accessible across all major data warehouses and data lake query engines and integrates with any catalog – a major shift from the prior approach of coupling data storage with one compute engine. This architecture enables you to seamlessly build specialized downstream “gold” layers across BI & reporting, machine learning, data science, and countless more use cases, using the engines that are the best fit for each unique job. For example, Spark is great for data science workloads, while data warehouses are battle-tested for traditional analytics and reporting. Beyond technical differences, pricing and the move to open source play a crucial role in which compute engines an organization adopts.
For example, Walmart built their lakehouse on Apache Hudi, ensuring they could easily leverage new technologies in the future by storing data in an open source format. They used the universal data lakehouse architecture to empower data consumers to query the lakehouse with a wide range of technologies, including Hive and Spark, Presto and Trino, BigQuery, and Flink.
All the source-of-truth data is held in open-source formats in the bronze and silver layers within your organization’s cloud storage buckets.
Accessibility of data is dictated by you – not by an opaque third-party system with vendor lock-in. This architecture gives you the flexibility to run data services inside the organization’s cloud networks (rather than in vendors’ accounts), to tighten security and support highly regulated environments.
Additionally, you’ll be free to either manage data using open data services or to buy managed services, avoiding lock-in points on data services.
With data consumers operating on a single copy of the bronze and silver data within the lakehouse, access control becomes much easier to manage and enforce. The data lineage is clearly defined, and teams no longer need to manage separate permissions across multiple disjoint systems and copies of the data.
While the universal data lakehouse architecture is very promising, some key technology choices are crucial to realize its benefits in practice.
It’s imperative that ingested data is made available at the silver layer as fast as possible, since any delays will now impede multiple use cases. To achieve the best combination of data freshness and efficiency, organizations should choose a data lakehouse technology that is well-suited for streaming and incremental processing. This helps handle tough write patterns like random writes during ingest at the bronze layer, as well as leveraging change streams to incrementally update silver tables without reprocessing the bronze layer again and again.
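Conceptually, incrementally updating a silver table means pulling only records committed since the last checkpoint, rather than rescanning bronze. The commit-time bookkeeping below is a toy stand-in for a real table format's timeline:

```python
# Toy incremental read: fetch only records committed after a checkpoint,
# instead of rescanning the whole bronze table each run.
# Commit times and records are invented for illustration.

bronze_commits = [
    (100, {"id": "a", "v": 1}),
    (200, {"id": "b", "v": 1}),
    (300, {"id": "a", "v": 2}),  # later update to the same key
]

def read_since(commits, checkpoint):
    """Return records committed strictly after `checkpoint`."""
    return [rec for t, rec in commits if t > checkpoint]

# First incremental run: consume everything, merging by key into silver.
silver = {}
for rec in read_since(bronze_commits, 0):
    silver[rec["id"]] = rec
checkpoint = bronze_commits[-1][0]  # remember how far we have read

# Next run: nothing new has been committed, so there is no work to do.
changes = read_since(bronze_commits, checkpoint)  # [] -> no reprocessing
```

The checkpoint is what turns a repeated full scan into a cheap tail-read, which is why streaming-friendly change streams matter so much at the silver layer.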
While I may hold some bias, my team and I built Apache Hudi around these universal data lakehouse principles. Hudi is battle-tested and generally regarded as the best fit for these workloads, while also providing a rich layer of open data services to preserve optionality for build vs buy. Furthermore, Hudi unlocks the stream data processing model on top of a data lake to dramatically reduce runtimes and the cost of traditional batch ETL jobs. Down the road, I believe that the universal data lakehouse architecture can also be built on future technologies that offer similar or better support for these requirements.
Finally, Onetable (soon to be made available in open source) is another building block for the universal data lakehouse architecture. It brings interoperability across major lakehouse table formats (Apache Hudi, Apache Iceberg, and Delta Lake) with easy catalog integrations, allowing you to set your data free across compute engines and build downstream gold layers in different formats. These benefits are already being validated by Fortune 10 enterprises like Walmart.
In this blog, we introduced the universal data lakehouse as the new way cloud data infrastructure should be architected. In doing so, we simply gave a name to and outlined the data architecture that hundreds of organizations (including large enterprises like GE, TikTok, Amazon, Walmart, Disney, Twilio, Robinhood, and Zoom across tech, retail, manufacturing, social networking, media and other industries) have built using data lakehouse technologies such as Apache Hudi. This approach is simpler, faster, and far less expensive than the hybrid architectures that many companies maintain today. It features true separation of storage and compute while enabling practical ways to employ best-of-breed compute engines across your data. In the coming years, we believe it will only grow more popular, driven by the growth of ML and AI, rising cloud costs, increasing architectural complexity, and growing demands on data teams. For more background on this topic, see my recorded talk at Data Council in Austin or our related blog posts.
While I truly believe in the “right engine for the right workload on the same data” principle, it’s non-trivial to make that choice in an objective and scientific manner today. This is due to a lack of standardized feature comparisons and benchmarks, lack of shared understanding of key workloads, and other factors. In future blog posts in this series, we will share how the Universal Data Lakehouse works across data transfer modalities - batch, CDC, and streaming - and how it works, in a “better together” fashion, with different compute engines such as Amazon Redshift, Snowflake, BigQuery, and Databricks.
No prizes for guessing that Onehouse offers a managed cloud service that provides a turnkey experience to build the universal data lakehouse architecture outlined in this blog. Users like Apna have already improved data freshness from several hours to minutes and significantly reduced costs by cutting out their data integration tool and replacing the warehouse with Onehouse for their bronze and silver data. With the universal data lakehouse architecture, their analysts could continue using the warehouse to serve queries on the data stored in the lakehouse.
And if you’d like to start implementing the universal data lakehouse in your organization today, contact us. You can also subscribe to our blog for more on the universal data lakehouse.