December 27, 2022

Comparing Apache Hudi's™ MOR and COW Tables: Use Cases from Uber and Shopee

Written by:

Andy Walner

Comparing Apache Hudi's™ MOR and COW Tables: Use Cases from Uber and Shopee

Overview

In recent years, many data lake technologies have revolutionized big data analytics by leveraging the Copy-On-Write (COW) storage/table format. Apache Hudi pioneered Merge-On-Read (MOR) for cases where COW may not be optimal. In this post, we will compare both table types and give suggestions for when to use each of these table types.

‍Background: Why do COW and MOR exist?

Before Hudi was created at Uber in 2015, Uber relied on Apache Spark™ jobs to periodically write datasets in Apache HDFS and to absorb upstream table inserts, updates, and deletes when a rider’s trip status changed. As Uber grew, the process of writing data became too inefficient and slow, so Hudi was born to provide faster, computationally efficient analytics at exabyte scale.

*Figure 1: Visualization of Uber’s data lake before and after Hudi. Before Hudi, Uber re-wrote the entire table with each update; with Hudi, updates only re-write the changed file.*

‍

Copy-On-Write (COW) was the first storage table type available upon Hudi’s creation. Compared to the old architecture with Apache Spark, Uber saw over 100 times more efficient writes.

*Figure 2: Writing to and reading from COW tables.*

‍

As the big data analytics community started embracing data lake technologies, their requirements expanded from purely batch processing to include stream processing which works best with minute-level latency. COW rewrites the entire file for even a single modified record. In the case of streaming data with COW, more table updates meant more file versions and increased file count. This led to inefficient writes and less fresh data.

Merge-On-Read (MOR) was the second storage table type created for Hudi to reduce the write amplification in COW tables with heavy updates. Rather than re-writing the entire file, MOR writes updates to separate changelog files, then these changelogs are merged into new file versions at a later time configured by the user. Grouping these smaller changelog files together avoids re-writing the entire file multiple times.

*Figure 3: Writing to and reading from MOR tables.*

Comparing COW & MOR‍

COW and MOR are two Hudi table types that each solve different needs for your company:

COW

⬆️ Great for fast query performance/reads
⬇️ Less efficient than MOR for table updates

Uber uses COW to store append-only data (never updated) like event logs which track user interactions in the Uber app (eg. when a user taps a button). Query performance is faster to serve data.

MOR

⬆️ Great for frequent table updates
⬇️ Slower query performance/reads than COW before compaction, but same speed after compaction

MOR is optimized for frequent table updates (ie. changing existing records). This storage type avoids unnecessary rewrites to data files, which reduces cost and enables low-latency writes. Shopee, a leading eCommerce brand, uses MOR tables to store analytics data from their site because the data is updated frequently, like when shoppers add/remove items from their cart.

Here’s a breakdown of the tradeoffs of each Hudi table type:

‍*Figure 4: Tradeoffs between COW and MOR tables*

* we’ll get into more details of compaction in a future blog

** COW cost may be higher than MOR for update-heavy workloads

COW in action: Uber uses Hudi COW tables for append-only writes

💡COW is best for read-heavy tables where you need performant reads of the latest data.

As the largest ride-sharing company in the world, Uber ingests more than 500 billion records per day into their data lake at the scale of hundreds of petabytes. Uber tracks many events throughout the lifecycle of one ride, including when the rider opens the app, calls the ride, reaches their destination, and rates their ride.

‍

Uber uses Hudi COW tables for append-only data like event streams which track user interactions in the Uber app. This data is append-only because Uber tracks a historical log of user interactions in the app, which would not change retroactively. Append-only data has no expensive table updates. In Uber’s case, this storage type met the data and query latency requirements. For example, Uber’s analysts were able to have performant dashboards.

‍

Learn more in this blog post from the Uber Engineering team.

MOR in action: Shopee uses Hudi MOR tables for update-heavy workloads

💡 MOR is best for update-heavy tables where you want faster and efficient writes.

Shopee is the largest online shopping platform in Southeast Asia, serving 343 million monthly visitors. Shopee’s product centers around a mobile app which brings in users across Asia, Europe, and Latin America. With millions of users regularly interacting with their site, Shopee has processed hundreds of petabytes of data.

‍

Shopee built their real-time data platform on Hudi to leverage Hudi’s powerful capabilities for streaming real-time data. The integration was straightforward for Shopee’s data team since Hudi natively supports Shopee’s existing computing engines like Flink and Spark; storage protocols like S3 and HDFS; query engines like Presto.

‍

Now, let’s explore an example from Shopee’s site:

‍

*Figure 6: Shopping experience on Shopee’s site.*

‍

Shopee offers a fun, gamified experience for their users with features like limited-time “flash deals” that reset multiple times per day.

Shopee might want to stream data into a Hudi data lake to track when users click on a deal, make a purchase, or allow a limited-time deal to expire. Different from Uber’s use case of keeping an event log, Shopee wants to track the current state of users in real-time.

Since the data is changing as users interact with the app, Shopee needs to update the records associated with each user in real-time. These updates would be costly with COW tables that re-write files on every update. MOR tables optimize Shopee’s updates to reduce the number of file rewrites and save on cost.

‍

Learn more in this presentation by Feng Jian, technical expert at Shopee.

Conclusion

Now you’ve seen some real-world examples that can help you choose the Hudi table type(s) that best fit your own use cases.

In the next blog post of this series, we’ll dive into the technical details of how COW and MOR work behind the scenes.

‍

----------

Special thanks to peer reviewers on the blog - Nadine Farah, Siva Narayanan, Alexey Kudinkin, and Balaji Varadarajan 👏

Authors

No items found.

Comparing Apache Hudi's™ MOR and COW Tables: Use Cases from Uber and Shopee

Overview

‍Background: Why do COW and MOR exist?

Comparing COW & MOR‍

COW in action: Uber uses Hudi COW tables for append-only writes

MOR in action: Shopee uses Hudi MOR tables for update-heavy workloads

Conclusion

Read More:

From the trenches: Managing Apache Iceberg metadata for near-real-time workloads

Announcing Apache Spark™ and SQL on the Onehouse Compute Runtime with Quanton

Measuring ETL Price-Performance On Cloud Data Platforms

Towards Open Data - Part 1: Cloud Warehouses Now Love Open Formats

Announcing Open Engines™: Flipping defaults to “open” for both data and compute

ClickHouse vs StarRocks vs Presto vs Trino vs Apache Spark™ — Comparing Analytics Engines

Comparing Apache Hudi's™ MOR and COW Tables: Use Cases from Uber and Shopee

Overview

‍Background: Why do COW and MOR exist?

Comparing COW & MOR‍

COW in action: Uber uses Hudi COW tables for append-only writes

MOR in action: Shopee uses Hudi MOR tables for update-heavy workloads

Conclusion

Read More:

From the trenches: Managing Apache Iceberg metadata for near-real-time workloads

Announcing Apache Spark™ and SQL on the Onehouse Compute Runtime with Quanton

Measuring ETL Price-Performance On Cloud Data Platforms

Towards Open Data - Part 1: Cloud Warehouses Now Love Open Formats

Announcing Open Engines™: Flipping defaults to “open” for both data and compute

ClickHouse vs StarRocks vs Presto vs Trino vs Apache Spark™ — Comparing Analytics Engines

Subscribe to the Blog