February 5, 2024

Enabling Walmart's Data Lakehouse With Apache Hudi

One of the most intriguing sessions at Open Source Data Summit was the presentation by Ankur Ranjan, Data Engineer III, alongside Ayush Bijawat, Senior Data Engineer, about their use of Apache Hudi at leading retailer Walmart. You can view the full presentation or check out the summary that follows.

In the talk, Ankur and Ayush shared their motivations and learnings from the strategic shift from a data lake to a data lakehouse architecture at Walmart, with a focus on the importance of the Apache Hudi lakehouse format in making this change.

Some of their key takeaways were the challenges that prompted the use of a data lakehouse and the benefits of adopting a universal data lakehouse architecture. These benefits, which combine the best of the data lake and data warehouse architectures, include faster row-level operations, strong schema enforcement and versioning, better transaction support, effective handling of duplicates, and more.

The evolution of data storage

Ankur kicked off the talk with a history of data storage methodologies, including the motivations, strengths, and weaknesses of each. Initially, he explained, data warehouses were the go-to solution for structured data, efficiently connecting with business intelligence (BI) tools to generate insights. However, their high operational costs and the complexity of maintaining them signaled a need for innovation.

Enter: the era of data lakes. Propelled by the evolution of Apache Hadoop and proliferation of cloud storage around 2012–2013, data lakes gained traction for their ability to handle not just structured but also voluminous semi-structured and unstructured data. Data lakes became a staple in large organizations for their scalability and versatility. Despite their advantages, data lakes posed notable challenges in maintaining data integrity and in preventing data from turning into a chaotic “data swamp.” The solution to the data swamp? According to Ankur, it needed to be a best-of-both-worlds approach — a data lakehouse. He explained, “...the data warehouse was great for management features, [and the] data lake was scalable and agile … we are combining [their benefits] and creating the data lakehouse.”

Figure 1. A visual overview of the evolution from data warehouse to data lake to data lakehouse.

Understanding Apache Hudi

With this natural evolution, the next step of Ankur and Ayush’s journey was picking the right data lakehouse architecture for Walmart. While there are three open table formats in mainstream use (Apache Hudi, Apache Iceberg, and Delta Lake), Walmart chose to go with Apache Hudi for two key reasons:

  1. It is the best at enabling both streaming and batch processing, which are critical to their use cases
  2. It has the best support for open source software (OSS) formats for streaming use cases

At the core of Apache Hudi, explained Ankur, is its innovative structure, which combines data files (stored in Parquet format) with metadata in a unique way. This design enables efficient data management and supports important features, such as record keys and precombine keys.

To explain precisely how Hudi works, Ankur first walked through the core concepts and terminology:

  • The Record key: The equivalent of the primary (or composite) key in any relational database management system (RDBMS).
  • The Precombine key: The field used to decide which version of a record wins when multiple records share the same record key during an upsert.
  • Indexes: Mappings between record keys and file groups or file IDs. These help Hudi locate and scan data as quickly as possible.
  • Timeline: The sequence of all actions performed on the table at different instants. This helps with creating time series views or explorations of the data.
  • Data files: The actual data files, stored in Parquet format.
Figure 2. How Apache Hudi is organized under the hood.

To help build some intuition around the system, Ankur described how it could work using a hypothetical database of students. In his example, the student ID acts as the primary key, the created column is the partition path, and an “update timestamp” on the record serves as the precombine key. 

With this setup, when an upsert (i.e., the operation to update a record, or insert it if it does not yet exist) arrives from a source for a student record, a few things happen. First, Hudi checks whether the incoming data has a greater value for the precombine key, the “update timestamp” field in our example. If it does, Hudi simply upserts the data. Because the precombine field gives Hudi a single value to compare against, the latest data lands in the target without needing to examine every other record, which significantly speeds up the operation.
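
To make this flow concrete, here is a minimal PySpark sketch of such an upsert. It is not from the talk: the table name, column names, and path are hypothetical, and it assumes a Spark session with the Apache Hudi Spark bundle on the classpath.

```python
from pyspark.sql import SparkSession

# The Kryo serializer is recommended for Hudi's Spark integration.
spark = (
    SparkSession.builder
    .appName("hudi-student-upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Incoming batch: a newer version of an existing student record.
incoming = spark.createDataFrame(
    [(101, "Asha", "2024-02-01", "2024-02-05 10:30:00")],
    ["student_id", "name", "created", "update_timestamp"],
)

hudi_options = {
    "hoodie.table.name": "students",
    "hoodie.datasource.write.recordkey.field": "student_id",         # record (primary) key
    "hoodie.datasource.write.partitionpath.field": "created",        # partition path
    "hoodie.datasource.write.precombine.field": "update_timestamp",  # precombine key
    "hoodie.datasource.write.operation": "upsert",
}

# Hudi keeps the row with the greatest precombine value for each record key,
# so only the latest version of student 101 lands in the target table.
(
    incoming.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/students")
)
```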

Hudi also supports two types of tables – 'Copy on Write' (CoW) and 'Merge on Read' (MoR). Copy on write is optimal for read-heavy environments, because it applies most operations during the data writing phase. In contrast, merge on read is suited to write-heavy scenarios, since it defers the merge work until the data is read or compacted.
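
In the Spark datasource API, the table type is just another write option. A hedged sketch, building on the hypothetical student example above:

```python
# Selecting the Hudi table type (illustrative; reuses hudi_options from the
# earlier sketch). COPY_ON_WRITE does the merge work at write time, which
# favors read-heavy workloads; MERGE_ON_READ writes log files and defers
# merging to reads/compaction, which favors write-heavy workloads.
hudi_options["hoodie.datasource.write.table.type"] = "COPY_ON_WRITE"  # or "MERGE_ON_READ"
```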

Enabling Apache Hudi in organizations

Given the working intuition of Apache Hudi that Ankur provided, Ayush dove into the actual enablement of Apache Hudi in organizations, addressing a question he gets a lot: “How easy is it to enable Hudi in my data lake architecture?” 

Fairly easy, it turns out. And that’s because of how Hudi interacts with downstream storage and upstream compute or query engines, Ayush explained. Since every data lake already uses some file system (S3 on AWS, etc.) with some file formats (Parquet, CSV, etc.) storing data on top of it, Hudi fits into the layer between the raw data formats and the compute engine. “[Hudi’s] compatibility with the compute engines, whether it's Spark, BigQuery, or Flink, is phenomenal, and we can simply continue to use our existing file system,” Ayush said.

Figure 3. Where Apache Hudi sits in an organization's data architecture.
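
As a rough sketch of what that layering looks like in practice (assuming the same Spark session and hypothetical table path as above), an existing engine reads a Hudi table like any other datasource sitting on top of the file system:

```python
# The data files still live on the existing storage layer (S3, GCS, HDFS,
# local disk); Hudi's metadata lets the engine treat them as a managed table.
students = spark.read.format("hudi").load("/tmp/hudi/students")
students.createOrReplaceTempView("students")
spark.sql("SELECT student_id, name, update_timestamp FROM students").show()
```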

The advantages of Hudi at Walmart

In summary, Hudi delivered a broad swath of benefits that Ayush, Ankur, and the team saw directly in their implementation at Walmart:

  • Significantly better support for row-level upsert and merge operations
  • Firm schema enforcement, evolution, and versioning (i.e., at the level that one would expect using an RDBMS)
  • Much better transaction (ACID) support than alternatives
  • Historical data and versioning, enabling data “time travel” with no additional overhead (illustrated in the sketch after this list)
  • Support for partial updates, removing the need for a separate NoSQL system to support the partial update use case
  • Built-in support for hard and soft deletes, removing an entire category of potential implementation errors
  • Support for more efficient indexes and clustering
  • Efficient duplicate handling using a combination of primary and deduplication keys
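
To make the time-travel and delete items above a bit more concrete, here is a hedged sketch (not from the talk) that reuses the hypothetical students table and hudi_options from the earlier example; the instant timestamp is illustrative.

```python
# Time travel: read the table as it existed at an earlier instant.
as_of = (
    spark.read.format("hudi")
    .option("as.of.instant", "2024-02-01 00:00:00")
    .load("/tmp/hudi/students")
)

# Hard delete: write a small frame of keys with the "delete" operation.
# The record key, partition path, and precombine fields should be present.
to_delete = spark.createDataFrame(
    [(101, "2024-02-01", "2024-02-05 11:00:00")],
    ["student_id", "created", "update_timestamp"],
)
delete_options = {**hudi_options, "hoodie.datasource.write.operation": "delete"}
(
    to_delete.write.format("hudi")
    .options(**delete_options)
    .mode("append")
    .save("/tmp/hudi/students")
)
```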

To provide better intuition for the improved upsert and merge operations they saw, Ayush explained how a librarian might organize physical library files under the data lake and the data lakehouse paradigms. In this comparison, our “librarian” is functionally our compute engine, which is doing the computational heavy-lifting in these scenarios.

In the data lake paradigm, a new batch of papers arrives to be filed amongst many loosely organized papers. Because the existing papers aren’t particularly organized, the librarian must check every single previous paper, combine them, and only then insert the new papers.

Figure 4. The data lake classic approach: read in all data, merge, and overwrite.

In the new data lakehouse paradigm, however, things can happen much more efficiently, because our loose papers are now a well-organized shelf of books. When a new batch of books comes in to be filed away, the enhanced organization lets our librarian interact with only the relevant spots on the bookshelves.

Figure 5. The data lakehouse modern approach: read in only required data, and modify only required data.

In actual implementation, there are some additional advantages to the lakehouse approach: reduced developer overhead and reduced data bifurcation. Reducing developer overhead matters across organizations because it minimizes potential error vectors and cost. One major load taken off developers in the lakehouse paradigm is the read-and-merge work (step 2 in Figure 4), which in the data lake falls entirely on their shoulders to implement and manage. Additionally, data deletion in the lake paradigm, where data is not clearly organized, can be a huge error vector: incorrect deletes across partitions and joins can easily lead to incorrect or out-of-date data.

The lakehouse reduces data bifurcation due to its partial update support (step 2 in Figure 5). Before, teams would often use a separate NoSQL database, such as MongoDB, to support this important use case. Hudi allows developers to instead keep this data in the filesystem as a single source of truth, while still enabling partial updates. This saves money and also keeps data clean and up-to-date by reducing duplication.
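
As one sketch of what that can look like in Spark (an assumption on my part rather than a detail from the talk, and the payload class and option names can vary across Hudi versions), a payload class such as OverwriteNonDefaultsWithLatestAvroPayload merges only the non-null fields of an incoming record into the stored one:

```python
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Hypothetical partial update: only update_timestamp changes; name is left
# null so the stored value is preserved by the payload class.
partial_update_options = {
    **hudi_options,
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload",
}

schema = StructType([
    StructField("student_id", LongType(), False),
    StructField("name", StringType(), True),
    StructField("created", StringType(), True),
    StructField("update_timestamp", StringType(), True),
])
partial = spark.createDataFrame(
    [(101, None, "2024-02-01", "2024-02-06 09:00:00")],
    schema,
)
(
    partial.write.format("hudi")
    .options(**partial_update_options)
    .mode("append")
    .save("/tmp/hudi/students")
)
```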

Wrapping Up

Through illustrative, layperson-friendly examples, Ayush and Ankur built clear intuition for the Apache Hudi data lakehouse, explained how the system works, and showed the substantial benefits it brought to Walmart’s data organization. To see all the insights they had to offer, check out their full talk from the conference.
