January 26, 2024

Apache Hudi™ 1.0 Preview: A Database Experience on the Data Lake

Written by:

Floyd Smith

Apache Hudi™ 1.0 Preview: A Database Experience on the Data Lake

‍This post is a summary of a well-attended talk from Open Source Data Summit 2023. The speakers, Bhavani Sudha Saktheeswaran and Sagar Sumit - well-known to the community as Sudha and Sagar - shared the story of Hudi, from its origins at Uber a decade ago to the ambitious new features planned for Hudi 1.0 and beyond. We invite you to view the Hudi 1.0 talk on the Open Source Data Summit site.

Apache Hudi is a top-level Apache open source project and community that reimagines what is possible on a data lake, bringing advanced data warehouse capabilities to lakes. From its inception at Uber in 2016 to the upcoming 1.0 release, Hudi has revolutionized what is possible with data lakes and become a cornerstone in modern data architecture.

Video 1. Watch this brief, four-minute video to learn about Hudi's background, success to date, and the goals for Hudi 1.0.

Hudi was started to meet the challenges of hypergrowth at Uber. The team wanted to implement data warehouse-type functionality on a data lake architecture, but constantly ran into performance and functionality issues. “We were dealing with not just scale and complexity… we also needed designs that would support transactional capabilities on the data lake”, Bhavani shared. Large datasets, like the set of all Uber trips taken, were especially challenging to work with, due to their sheer scale.

The Uber team found themselves bulk re-ingesting 120TB of data every 8 hours — even though only 500GB of data, well under 1% of the total, had actually changed. These re-computations were incredibly expensive and slow, resulting in an end-to-end data freshness of 24 hours. This was very limiting for a fast-growing, ambitious business like Uber.

So the Uber team designed what would become the Hudi project to more efficiently process data in the lake, dropping upsert time to 10 minutes and end-to-end data freshness to 1 hour. Hudi began to grow in popularity within Uber; eventually, the effort was made an open source project and donated to the Apache Software Foundation.

Figure 1. Hudi’s capabilities improved after it was put into open source and donated to the Apache Software Foundation.

Today, Hudi is widely used in industry by leading companies like Amazon, Robinhood, and Walmart, and comes pre-installed on five cloud providers, including AWS and GCP. The vibrant Hudi community has more than 400 contributors and over 3700 participants on Slack.

Hudi 0.14 was recently released, and Hudi 1.0 is currently in beta. Hudi 1.0 takes significant steps toward the aspiration of making Hudi the first transactional database for the data lake. Let’s take a look at the foundational components of Hudi that set the stage for future improvements.

Under the Hood

To understand how Hudi achieves its efficiency and flexibility, let's look at two core database components: tables and queries.

Hudi features two types of tables. The simpler type is Copy on Write, which creates a new versioned parquet file each time a write is performed. This ensures that the target parquet file is up to date, but leads to write amplification, with update activity much greater than the size of the changed data. This process can noticeably impact write performance, which may be a concern for write-heavy or ingest-heavy applications, such as in the Uber example above.

To solve these issues, Hudi has a second table type called Merge on Read, which uses change logs (in addition to the versioned parquet files) to manage writes and enable reads across previously merged and changelog files, periodically compacting the changelogs and previous files to produce the latest version. Together, the two approaches give operators a great deal of flexibility in managing updates for optimal performance.

Figure 2. Hudi’s two table types, Copy on Write and Merge on Read, differ in read/write latency and cost.

Hudi supports three types of queries:

Snapshot query: merge changes, then read everything
Read-optimized query: read the latest compacted data
Incremental query: read only the data that changed within an interval

Building on its robust foundation, future versions of Hudi will push the boundaries further, continuing to improve the capabilities and performance of data operations on data lakes.

Hudi 1.x

With Hudi 1.x, “the goal is to build the first transactional database for the lake,” Sagar shared. This is an ambitious goal, but one that the Hudi project has been trending toward for a decade now.

Video 2. Watch this brief video - under four minutes - for a deep dive on the breakthrough features in Hudi 1.0.

Upcoming versions will add more robust capabilities across many areas, including:

Deeper query engine integrations: Maintain Hudi connectors with Presto and Spark, while improving query planning and execution.
Generalized data model: Support a generalized relational model (i.e., don’t require a primary key), which is now easier because of advances in engines like Apache Spark™ and Flink.
Serverful & serverless architecture: Move to a hybrid architecture where Hudi employs server components for table metadata, while remaining serverless for data.
Unstructured data support: Support Hudi primitives, such as indexing, mutating, and capturing changes, for unstructured data blobs. This will help power ML/AI workloads around image and video processing.
Enhanced self-management: Provide diagnostics reports, cross-region logical replication, reverse streaming data, record-level TTL management, and other utility functions.

Hudi 1.x both broadens and deepens capabilities, providing more powerful transactional storage management features as well as improvements to query processing.

Figure 3. The left side shows a diagram from the seminal database paper that outlines the main components of a DBMS. The right side maps these components against the Hudi stack, showing existing components (green), new components (yellow), and external non-Hudi components (blue).

The advances in Hudi 1.x are set to further push the bounds of data lake technology, offering unprecedented efficiency, flexibility, and power in handling complex data workloads.

Beta highlights

Sagar shared some exciting new features in the Hudi 1.0 beta. For more information on all of these features, check out the beta release notes.

Introducing LSM trees (log-structured merge-trees) in the Hudi 1.0 beta represents a major step in optimizing timeline management, significantly improving the efficiency of write operations and non-concurrent data processing. Hudi maintains a transaction log, split into an active timeline (which has limited history to ensure fast access) and an archived timeline (which is expensive to access during reads/writes).

With LSM trees, Hudi now recursively compacts commits into a tree structure to create fewer, larger files that can be used to efficiently access different levels of the timeline by only loading the relevant portions of the timeline into memory. For a timeline with 1M commits — equivalent to a commit every 5 minutes for 10 years — Hudi can load the entire timeline (sans metadata) in only 367 ms.

The addition of functional indexes in the Hudi 1.0 beta opens up new possibilities for more efficient data querying, especially in scenarios requiring complex filtering and aggregation. The current Hudi indexing framework doesn’t support certain use cases, such as aggregating indexes or secondary indexes.

With Hudi’s new functional indexes, Hudi can now build indexes using functions or expressions. For example, you can index with the function `hour(timestamp)`, which will improve performance of queries that filter by hour (since they can skip entire partitions that don’t meet the filtering criteria). Hudi 1.0 also absorbs partitioning as part of the indexing system.

The Hudi 1.0 beta also features a new FileGroup reader, which substantially improves query performance. Merge On Read tables can already perform fast upserts using key-based merging of records. Now, Hudi provides first-class support for partial updates using position-based merging and page skipping. Sagar shared an example benchmark where the new reader was 5.7x faster for snapshot queries with 70x reduced write amplification.

The final beta feature highlighted is non-blocking concurrency control, enabling more simultaneous data operations. With long-running deletion jobs — such as jobs used for GDPR compliance — optimistic concurrency control may cause many transactions to be aborted and retried. Now, with non-blocking concurrency control, multiple writers can write simultaneously, allowing the reader and compactor to automatically resolve conflicts.

These beta highlights demonstrate that Hudi 1.0 is not just an incremental update, but a transformative step forward, as Hudi continues to redefine what’s possible on a data lake. You can view the Hudi 1.0 talk here.

As Hudi evolves, we invite you to join us on this journey — join the community on Slack, contribute on Github, or join us in Hudi community calls or events. To watch the full recording, click here.

Authors

No items found.