In recent years, many data lake technologies have revolutionized big data analytics by leveraging the Copy-On-Write (COW) storage/table format. Apache Hudi pioneered Merge-On-Read (MOR) for cases where COW may not be optimal. In this post, we will compare both table types and suggest when to use each.
Before Hudi was created at Uber in 2016, Uber relied on Apache Spark jobs to periodically write datasets to HDFS and to absorb upstream table inserts, updates, and deletes, such as when a rider’s trip status changed. As Uber grew, this write process became too slow and inefficient, so Hudi was born to provide faster, computationally efficient analytics at exabyte scale.
Copy-On-Write (COW) was the first storage table type available upon Hudi’s creation. Compared to the old architecture with Apache Spark, Uber saw over 100 times more efficient writes.
As the big data analytics community started embracing data lake technologies, their requirements expanded from purely batch processing to include stream processing, which works best with minute-level latency. COW rewrites the entire file for even a single modified record. With streaming data on COW, more table updates meant more file versions and a higher file count, which led to inefficient writes and less fresh data.
Merge-On-Read (MOR) was the second storage table type created for Hudi, designed to reduce write amplification in COW tables with heavy updates. Rather than rewriting the entire file, MOR writes updates to separate changelog files. These changelogs are merged into new file versions later, on a schedule configured by the user. Batching the smaller changelog files together avoids rewriting the entire file multiple times.
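To make the difference in write amplification concrete, here is a minimal toy simulation. It is only a sketch: the `CowFile` and `MorFile` classes, the `trip-N` keys, and the record counts are hypothetical and do not reflect Hudi’s actual file layout or APIs. It simply counts how many records get written when the same ten updates hit a 1,000-record file under each strategy.

```python
from dataclasses import dataclass, field


@dataclass
class CowFile:
    """Toy copy-on-write file: every update rewrites all records."""
    records: dict
    versions: int = 1
    records_written: int = 0

    def update(self, key, value):
        # Copy every record into a new file version with the change applied.
        self.records = {**self.records, key: value}
        self.versions += 1
        self.records_written += len(self.records)


@dataclass
class MorFile:
    """Toy merge-on-read file: updates go to a log and are merged later."""
    base: dict
    log: list = field(default_factory=list)
    versions: int = 1
    records_written: int = 0

    def update(self, key, value):
        # Append only the changed record to the log.
        self.log.append((key, value))
        self.records_written += 1

    def read(self):
        # Snapshot read: merge the base file with the log on the fly.
        merged = dict(self.base)
        for key, value in self.log:
            merged[key] = value
        return merged

    def compact(self):
        # Merge the log into a single new base file version.
        self.base = self.read()
        self.log = []
        self.versions += 1
        self.records_written += len(self.base)


base = {f"trip-{i}": "requested" for i in range(1000)}
cow = CowFile(dict(base))
mor = MorFile(dict(base))

for i in range(10):
    cow.update(f"trip-{i}", "completed")
    mor.update(f"trip-{i}", "completed")
mor.compact()

print("COW:", cow.records_written, "records written,", cow.versions, "file versions")
print("MOR:", mor.records_written, "records written,", mor.versions, "file versions")
```

In this toy model, COW writes 10,000 records across 11 file versions, while MOR writes 1,010 records (10 log entries plus one compaction) across 2 file versions. Both end up with identical data. The price MOR pays is the extra merge work in `read()` until compaction runs.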
COW and MOR are two Hudi table types that solve different needs:
COW suits read-heavy workloads. Queries are faster because they read data files that are already fully merged. Uber uses COW to store append-only data (never updated), like event logs that track user interactions in the Uber app (e.g., when a user taps a button).
MOR is optimized for frequent table updates (i.e., changing existing records). This storage type avoids unnecessary rewrites to data files, which reduces cost and enables low-latency writes. Shopee, a leading eCommerce brand, uses MOR tables to store analytics data from their site because the data is updated frequently, like when shoppers add or remove items from their cart.
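In practice, the table type is chosen through a Hudi write option. The sketch below builds an options dictionary for each kind of workload. The helper `hudi_write_options` and the table and field names (`app_events`, `event_id`, `cart_state`, `user_id`) are hypothetical, chosen only for illustration. The option keys and the `COPY_ON_WRITE` / `MERGE_ON_READ` values are standard Hudi Spark datasource settings, but check them against the documentation for the Hudi version you run.

```python
def hudi_write_options(table_name: str, record_key: str, update_heavy: bool) -> dict:
    """Build Hudi writer options, picking the table type from the workload."""
    table_type = "MERGE_ON_READ" if update_heavy else "COPY_ON_WRITE"
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.recordkey.field": record_key,
        # Upsert when records change; plain insert for append-only data.
        "hoodie.datasource.write.operation": "upsert" if update_heavy else "insert",
    }


# Hypothetical append-only event log (the Uber-style case).
event_log_opts = hudi_write_options("app_events", "event_id", update_heavy=False)

# Hypothetical frequently updated state table (the Shopee-style case).
cart_state_opts = hudi_write_options("cart_state", "user_id", update_heavy=True)

print(event_log_opts["hoodie.datasource.write.table.type"])
print(cart_state_opts["hoodie.datasource.write.table.type"])

# With a Spark DataFrame `df` and the Hudi Spark bundle available, the options
# would be passed to the writer roughly like this (not executed here):
# df.write.format("hudi").options(**cart_state_opts).mode("append").save(base_path)
```

The boolean `update_heavy` flag is a deliberate simplification. Real workloads often sit somewhere in between, and the choice also depends on read latency requirements.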
Here’s a breakdown of the tradeoffs of each Hudi table type:
* We’ll go into more detail on compaction in a future blog post
** COW cost may be higher than MOR for update-heavy workloads
💡 COW is best for read-heavy tables where you need performant reads of the latest data.
As the largest ride-sharing company in the world, Uber ingests more than 500 billion records per day into their data lake at the scale of hundreds of petabytes. Uber tracks many events throughout the lifecycle of one ride, including when the rider opens the app, calls the ride, reaches their destination, and rates their ride.
Uber uses Hudi COW tables for append-only data like event streams that track user interactions in the Uber app. This data is append-only because Uber keeps a historical log of user interactions, which does not change retroactively. Append-only data incurs no expensive table updates. In Uber’s case, this storage type met the data and query latency requirements. For example, it gave Uber’s analysts performant dashboards.
Learn more in this blog post from the Uber Engineering team.
💡 MOR is best for update-heavy tables where you want faster, more efficient writes.
Shopee is the largest online shopping platform in Southeast Asia, serving 343 million monthly visitors. Shopee’s product centers on a mobile app that brings in users across Asia, Europe, and Latin America. With millions of users regularly interacting with their site, Shopee has processed hundreds of petabytes of data.
Shopee built their real-time data platform on Hudi to take advantage of its capabilities for streaming real-time data. The integration was straightforward for Shopee’s data team since Hudi natively supports Shopee’s existing computing engines like Flink and Spark, storage protocols like S3 and HDFS, and query engines like Presto.
Now, let’s explore an example from Shopee’s site:
Shopee offers a fun, gamified experience for their users with features like limited-time “flash deals” that reset multiple times per day.
Shopee might want to stream data into a Hudi data lake to track when users click on a deal, make a purchase, or allow a limited-time deal to expire. Unlike Uber’s use case of keeping an event log, Shopee wants to track the current state of users in real time.
Since the data changes as users interact with the app, Shopee needs to update the records associated with each user in real time. These updates would be costly with COW tables, which rewrite files on every update. MOR tables optimize Shopee’s updates by reducing the number of file rewrites, which saves on cost.
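The contrast between an append-only event log and a current-state table can be sketched in a few lines. Everything here is hypothetical: the event fields (`user`, `deal`, `status`, `ts`) and their values are made up for illustration and are not Shopee’s actual schema. The point is only that the same event stream produces zero updates when logged, but repeated updates when folded into one row per key.

```python
# Hypothetical stream of flash-deal events, ordered by timestamp.
events = [
    {"user": "u1", "deal": "d9", "status": "clicked", "ts": 1},
    {"user": "u2", "deal": "d9", "status": "clicked", "ts": 2},
    {"user": "u1", "deal": "d9", "status": "purchased", "ts": 3},
    {"user": "u2", "deal": "d9", "status": "expired", "ts": 4},
]

# Append-only event log: every event becomes a new row; nothing is updated.
event_log = list(events)

# Current-state table: one row per (user, deal), upserted by record key.
current_state = {}
updates = 0
for event in events:
    key = (event["user"], event["deal"])
    existing = current_state.get(key)
    if existing is not None:
        updates += 1  # this write modifies an existing record
    # Keep the most recent event for each key.
    if existing is None or event["ts"] >= existing["ts"]:
        current_state[key] = event

print(len(event_log), "rows in the event log, 0 updates")
print(len(current_state), "rows in the state table,", updates, "updates")
```

Here half of the writes to the state table are updates to existing records. On a COW table, each of those would trigger a file rewrite, which is the cost MOR is designed to defer.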
Learn more in this presentation by Feng Jian, technical expert at Shopee.
Now you’ve seen some real-world examples that can help you choose the Hudi table type(s) that best fit your own use cases.
In the next blog post of this series, we’ll dive into the technical details of how COW and MOR work behind the scenes.