May 10, 2023

Top 3 Things You Can Do to Get Fast Upsert Performance in Apache Hudi™

Written by:

Nadine Farah

Top 3 Things You Can Do to Get Fast Upsert Performance in Apache Hudi™

The Apache Hudi community has been growing rapidly, with developers seeking ways to leverage its powerful capabilities for ingesting and managing large-scale datasets efficiently. Every week I receive common questions about a set of topics related to tips and tricks for maximizing their Hudi experience. One of the top questions I get is related to how Hudi can perform upserts, ensuring low-latency access to the latest data. That’s what we’ll cover in this blog!

Choose the right storage table type

One of the major considerations to factor in for fast upserts is choosing the right storage table type. Hudi supports two different storage table types - Copy-On-Write (COW) and Merge-On-Read (MOR). Each table type can impact upsert performance differently due to their distinct approaches to handling data updates.

COW table

COW tables are operational simpler compared to MOR tables because all updates are written to base files, which are in Apache Parquet format. You do not need to run a separate service like compaction to manage any log files to improve the read or storage efficiency. Updates are handled by entirely rewriting the file to generate a new version of the base file. Consequently, COW tables exhibit higher write amplification because there is a synchronous merge that occurs in order to create the new base file version. However, a key advantage of COW tables is their zero read amplification because all data is available in the base file, ready to read. The disk reads required for a query are minimal because they do not need to read multiple locations or merge data.

MOR table

In contrast to COW tables, MOR tables have more operational complexity. Rather than re-writing the entire file, MOR writes updates to separate log files, then these log files are merged with the base file into a new file version at a later time. If you’re familiar with Hudi, this is done with the compaction service. Compaction is needed to bound the growth of log files so query performance doesn’t deteriorate and storage is optimized.

Writing to log files directly avoids re-writing the entire base file multiple times lowering the write amplification - and if you are working with streaming data, this difference becomes apparent. This makes MOR tables to be write-optimized. However, MOR tables have a higher read amplification for snapshot queries, in between compactions, due to the need to read base files and log files and merge the data on the fly.

Considerations for COW and MOR tables

If you have a high update:insert ratio and are sensitive to ingestion latency, then MOR tables might be a good option for a table type. One example is with streaming sources– usually, you’ll want to act on insights faster to provide relevant and timely information to your users. However, if your workload is more insert-based and you can tolerate reasonable ingestion latencies, then COW tables are a good option.

Choose the right index type base on your record key

By leveraging indexes, Hudi avoids full-table scans when locating for records during upserts, which can be expensive in terms of time and resources. Hudi’s indexing layer maps record keys to their corresponding file locations. The indexing layer is pluggable and there are several index types to choose from. One thing to consider is that indexing latency is dependent on multiple factors like how much data is being ingested, how much data is in your table, whether you have partition or non-partitioned tables, type of index chosen, how update-heavy the workloads are and the record key’s temporal characteristics. Depending on the performance needed and uniqueness guarantees, Hudi provides different indexing strategies out-of-the-box that can be categorized to either global or non-global indexes.

Global vs. Non-Global indexes

Non-Global index: Hudi ensures a pair of partition path and record key is unique across the entire table. The index lookup performance is relatively proportional to the size of matching partitions among the incoming records being ingested. ‍
Global index: This index strategy enforces uniqueness of keys across all partitions of a table i.e., it guarantees that exactly one record exists in the table for a given record key. Global indexes offer stronger guarantees, but the update/delete cost grows with size of the table O(size of table).

One of the main considerations between global vs. non-global is related to index lookup latency due to the differences in uniqueness guarantees:

Non-global indexes only look up matched partitions: For example, if you have 100 partitions and the incoming batch has records for only for the last 2 partitions, only file groups belonging to those 2 partitions will be looked up. For upsert workloads at scale, you might want to consider non-global indexes, like non-global bloom, non-global-simple and bucket.

Global indexes look at all file groups in all partitions: For example, if you have 100 partitions and the incoming batch of records has records for the last 2 partitions, all the file groups in all 100 partitions will be looked up (since hudi has to guarantee there exists only one version of the record key across the entire table). This can cause increased latency for upsert workloads at scale.

Index types Hudi offers out-of-the-box

Bloom index: This is an indexing strategy to manage upserts and record lookups in file groups efficiently. This index leverages bloom filters, which is a probabilistic data structure that helps in determining if a given record key exists in a specific file group or not. This is available in global and non-global category.
Simple index: This is an indexing strategy that provides a straightforward approach to map record keys to their corresponding file groups. It performs a lean join of the incoming update/delete records against keys extracted from the table on storage. This is available in global and non-global category.
HBase index: This index strategy uses Hbase to store the index to map record keys and their corresponding file locations in file groups. This is available in global category.
Bucket index: This is an indexing strategy that uses hashing to route records to statically allocated file groups. This is available in non-global category. ‍
Bucket index with Consistent Hashing: This is an indexing strategy that is an advanced version of the bucket index. While the bucket index needs pre-allocation of file groups per partition, with the consistent hashing index, we can dynamically grow or shrink the file groups per partition based on the load. This is available in non-global category.

Index types to consider for update-heavy workloads

Bloom index: This is a good index strategy for update-heavy workloads if the record keys are ordered by some criteria (eg, timestamp-based) and updates are related to the recent set of data. For example, if the record keys are ordered based on timestamp and we’re updating data within the last few days.

Bloom index use case: Let’s say every 10 minutes a new batch of data is being ingested. We’re assuming the new batches contain data upserts within the last 3 days. Based on the Bloom index, Hudi identifies the records within the filegroup that are candidates for the updates and fetches the Bloom filter from the base file footer, and further trims down the records to be looked up in each file within the filegroup. If records are not found, they will be considered as inserts.‍

Simple index: This is a good index strategy for update-heavy workloads if you’re sporadically updating files across the entire span of the table and the records keys are random i.e., non-timestamp-based.

Simple index use case: If you have a dimension table where the record key is A trip ID (random UUID) and the partition is by city id. If we want to update 10000 trips spread across a range of cities, Hudi first identifies the relevant partition based on the incoming city Id. From there, Hudi efficiently locates the file containing the record by performing a lean join.

Bucket index: This is a good index strategy if the total amount of data stored per partition is similar across all partitions. The number of buckets (or file groups) per partition has to be defined upfront for a given table. Here’s an article Sivabalan wrote that talks more about it.

‍Bucket index example: In Hudi, after you define the number of buckets you want, Hudi applies a hash function to the record keys to distribute the records uniformly across the buckets. The hash function assigns each record ID to a bucket number. When an update occurs, Hudi applies the hash function to the record IDs and determines the corresponding bucket. Then, Hudi delegates the writes to the corresponding buckets (file groups).

Partition path granularity

Partitioning is a technique used to split a large dataset into smaller, manageable pieces based on certain attributes or columns in the dataset. This can greatly enhance query performance as only a subset of the data needs to be scanned during queries. However, the effectiveness of partitioning largely depends on the granularity of the partitions.

A common pitfall is setting partitions too granularly, such as dividing partitions by <city>/<day>/<hour>. Depending on your workload, there might not be enough data at the hourly granularity, resulting in many small files of only a few kilobytes. If you’re familiar with the small file problem, more small files cost more disks seeks and degrades query performance. Secondly, on the ingestion side, small files also impact the index lookup because it will take longer to prune irrelevant files. Depending on what index strategy you’re implementing, this may negatively affect write performance. I recommend users always start with a coarser partitioning scheme like <city>/<day> to avoid the pitfalls of small files. If you still feel the need to have granular partitions, I recommend re-evaluating your partitioning scheme based on query patterns and/or you can potentially take advantage of the clustering service to balance ingestion and query performance.

I hope this blog helps tour guide you into some specific parts of Hudi you can tune in order to get fast upserts. There is no specific formula to use- much of how you configure your applications and tune it will be specific to the workload types. If you have more specific questions about upsert performance related to your workload, you can find me in the Hudi community--especially in the community slack!

Authors

No items found.

Top 3 Things You Can Do to Get Fast Upsert Performance in Apache Hudi™

Choose the right storage table type

COW table

MOR table

Considerations for COW and MOR tables

Choose the right index type base on your record key

Global vs. Non-Global indexes

Index types Hudi offers out-of-the-box

Index types to consider for update-heavy workloads

Partition path granularity

Read More:

Announcing Apache Spark™ and SQL on the Onehouse Compute Runtime with Quanton

Measuring ETL Price-Performance On Cloud Data Platforms

Towards Open Data - Part 1: Cloud Warehouses Now Love Open Formats

Announcing Open Engines™: Flipping defaults to “open” for both data and compute

ClickHouse vs StarRocks vs Presto vs Trino vs Apache Spark™ — Comparing Analytics Engines

Ray vs Dask vs Apache Spark™ — Comparing Data Science & Machine Learning Engines

Top 3 Things You Can Do to Get Fast Upsert Performance in Apache Hudi™

Choose the right storage table type

COW table

MOR table

Considerations for COW and MOR tables

Choose the right index type base on your record key

Global vs. Non-Global indexes

Index types Hudi offers out-of-the-box

Index types to consider for update-heavy workloads

Partition path granularity

Read More:

Announcing Apache Spark™ and SQL on the Onehouse Compute Runtime with Quanton

Measuring ETL Price-Performance On Cloud Data Platforms

Towards Open Data - Part 1: Cloud Warehouses Now Love Open Formats

Announcing Open Engines™: Flipping defaults to “open” for both data and compute

ClickHouse vs StarRocks vs Presto vs Trino vs Apache Spark™ — Comparing Analytics Engines

Ray vs Dask vs Apache Spark™ — Comparing Data Science & Machine Learning Engines

Subscribe to the Blog