June 29, 2022

Apache Hudi vs Delta Lake - Transparent TPC-DS Data Lakehouse Performance Benchmarks

Intro

In recent weeks, there has been growing interest in comparing the performance of Apache Hudi, Delta Lake, and Apache Iceberg, the open source projects for the data lakehouse. We feel the community deserves a more transparent and reproducible analysis, so we want to add our perspective on how these benchmarks should be executed and presented, what value they bring, and how we should interpret them.

What are the issues with existing approaches? 

Recently, Databeans published a blog comparing the performance of Hudi, Delta, and Iceberg head-to-head using a TPC-DS benchmark. While it’s fantastic to see the community coming forward and taking action to improve awareness of the current state of the art in the industry, we identified a few issues with the way the experiments were conducted and the results were reported, which we want to share and discuss more broadly today.

As a community, we should strive to add more rigor when publishing benchmarks. We believe these are crucial tenets of any benchmarking efforts:

  1. Reproducible: If results are not reproducible, the reader has no choice but to take them at face value. Instead, the benchmark should be documented well enough that anyone can achieve the same results using the same instruments.
  2. Open: To achieve the same results, it’s vital that the tooling used for benchmarking is openly accessible, so that its correctness can be reviewed.
  3. Fair: With the complexity of the technologies under test constantly growing, the benchmark setup needs to ensure all contenders use documented configurations for the workloads under test.

With respect to these fundamental issues, we believe that the Databeans blog, unfortunately, fell short of sharing the complete picture of what was measured and how the results were achieved. For example:

  • The benchmarked EMR runtime configuration was not fully disclosed: it’s unclear, for example, whether Spark’s dynamic allocation feature was disabled, which has the potential to affect the measurements unpredictably.
  • The code used for benchmarking is an extension of Delta’s benchmarking framework, which is also, unfortunately, not shared publicly, making it impossible to review or replay the same experiment.
  • Not having access to the code also prevents analyzing the configurations applied to Hudi/Delta/Iceberg, which makes it challenging to assess fairness.

How we suggest running benchmarks

We routinely run performance benchmarks to make sure that Hudi’s rich feature set is delivered along with the best possible performance for the exabytes of Hudi-powered data lakes out there. Our team has extensive experience benchmarking complex distributed systems such as Apache Kafka and Apache Pulsar, true to the principles outlined above.

To make sure the published benchmarks comply with these principles:

  1. We switch off Spark’s dynamic allocation feature to make sure we run the benchmark in a stable environment, eliminating any jitter in the results caused by the cluster scaling up or down. We use the EMR 6.6.0 release, with Spark 3.2.0 and Hive 3.1.2 (for HMS), with the following configuration (specified in the EMR UI upon cluster creation). For more details on how HMS should be set up, please follow the instructions in the README:
[{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.dynamicAllocation.enabled": "false"
  }
}, {
  "Classification": "spark",
  "Properties": {
    "maximizeResourceAllocation": "true"
  }
}, {
  "Classification": "hive-site",
  "Properties": {
    "javax.jdo.option.ConnectionURL": < hive_metastore_url > ,
    "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": < username > ,
    "javax.jdo.option.ConnectionPassword": < password >
  }
}]
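
As a quick sanity check once the cluster is up, the setting can be confirmed from a Spark shell. This is a trivial sketch, assuming an active SparkSession named `spark`:

// Confirm dynamic allocation is actually disabled before benchmarking;
// jitter from executor scaling would otherwise skew the measurements.
println(spark.conf.get("spark.dynamicAllocation.enabled")) // expect "false"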
  2. We have publicly shared our modifications to Delta’s benchmarking framework to support creating Hudi tables through either Spark Datasource or Spark SQL (see the table definition below and the Datasource sketch that follows it); this can be switched dynamically within the benchmark definition.
  3. TPC-DS loads do not involve updates. The Databeans configuration of the Hudi load used an inappropriate write operation, `upsert`, while it is clearly documented that Hudi’s `bulk_insert` is the recommended write operation for this use case. Additionally, we adjusted Hudi’s Parquet file size settings to match Delta Lake’s defaults:
CREATE TABLE ...
USING HUDI
OPTIONS (
 type = 'cow',
 primaryKey = '...',
 precombineField = '',
 'hoodie.datasource.write.hive_style_partitioning' = 'true',
 -- Disable Hudi's record-level metadata (used for updates, incremental processing, etc.)
 'hoodie.populate.meta.fields' = 'false',
 -- Use the "bulk_insert" write operation instead of the default "upsert"
 'hoodie.sql.insert.mode' = 'non-strict',
 'hoodie.sql.bulk.insert.enable' = 'true',
 -- Perform bulk_insert without sorting or automatic file-sizing
 'hoodie.bulkinsert.sort.mode' = 'NONE',
 -- Increase the target file size to match Delta's setting
 'hoodie.parquet.max.file.size' = '141557760',
 'hoodie.parquet.block.size' = '141557760',
 'hoodie.parquet.compression.codec' = 'snappy',
 -- All TPC-DS tables are relatively small and don't require the metadata table (S3 file listing is sufficient)
 'hoodie.metadata.enable' = 'false',
 'hoodie.parquet.writelegacyformat.enabled' = 'false'
)
LOCATION '...'
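
For reference, below is a minimal sketch of the equivalent Spark Datasource write path. This is not the exact benchmark code: the input and output paths, as well as the record key, are hypothetical placeholders, but the Hudi options mirror the Spark SQL table definition above.

import org.apache.spark.sql.SaveMode

// Hypothetical input: one raw TPC-DS table, read back from Parquet
val df = spark.read.parquet("s3://<bucket>/raw/store_sales")

df.write.format("hudi")
  .option("hoodie.table.name", "store_sales")
  .option("hoodie.datasource.write.operation", "bulk_insert")         // not the default "upsert"
  .option("hoodie.datasource.write.recordkey.field", "<primary_key>") // placeholder, mirrors primaryKey above
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option("hoodie.populate.meta.fields", "false")                     // skip record-level meta-fields
  .option("hoodie.bulkinsert.sort.mode", "NONE")                      // no sorting or automatic file-sizing
  .option("hoodie.parquet.max.file.size", "141557760")                // match Delta's target file size
  .option("hoodie.parquet.block.size", "141557760")
  .option("hoodie.parquet.compression.codec", "snappy")
  .option("hoodie.metadata.enable", "false")                          // S3 file listing is sufficient here
  .mode(SaveMode.Overwrite)
  .save("s3://<bucket>/tpcds/store_sales")                            // hypothetical table location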

Hudi’s origins are rooted in incremental data processing, turning old-school batch jobs incremental. Thus, Hudi’s default configs are geared towards incremental upserts and generating change streams for incremental ETL pipelines, treating the initial load as a rare, one-time activity. Closer attention therefore needs to be paid to the configuration for the load times to be comparable with Delta.

Running the benchmarks

Loading

(Chart: TPC-DS load performance, Delta vs. Hudi 0.11.1 vs. Hudi master)

As can be clearly seen, Delta and Hudi are within 6% of each other for the 0.11.1 release and within 5% for Hudi’s current master* (*we additionally benchmarked against Hudi’s master branch, since we recently discovered a bug in the Parquet encoding configuration that has since been promptly resolved).

To power the rich feature set that Hudi provides on top of raw Parquet tables (such as record-level updates and deletes, incremental processing, and many more), Hudi internally stores a set of additional metadata along with every record, called meta-fields. Since TPC-DS is primarily concerned with snapshot queries, such fields have been disabled (and not computed) in this particular experiment; Hudi still persists them as nulls, which makes it possible to turn them on in the future without the need to evolve the schema. Adding five such fields as nulls carries low, but still non-negligible, overhead.
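
For illustration, these are the five record-level meta-fields that Hudi normally persists alongside the data columns; the table path below is a hypothetical placeholder:

// Inspect Hudi's five record-level meta-fields on a snapshot read
// (in this benchmark they were disabled and are persisted as nulls)
spark.read.format("hudi").load("s3://<bucket>/tpcds/store_sales")
  .select(
    "_hoodie_commit_time",    // commit instant that wrote the record
    "_hoodie_commit_seqno",   // sequence number within that commit
    "_hoodie_record_key",     // record key used for updates and indexing
    "_hoodie_partition_path", // partition the record belongs to
    "_hoodie_file_name"       // data file containing the record
  )
  .show(5)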

Queries

(Chart: TPC-DS query performance, Delta 1.2.0 vs. Hudi 0.11.1 vs. Hudi master)

As we can see, there’s practically no difference between Hudi 0.11.1 and Delta 1.2.0 performance, and Hudi’s current master is very slightly faster (~5%).

You can find the raw logs in this directory on Google Drive.

To reproduce the results above, please use our branch in Delta’s benchmark repo and follow the steps in the README.

Conclusion

Summing up, we wanted to underscore the importance of openness and reproducibility in an area as sensitive and sophisticated as performance benchmarking. As we have repeatedly seen, obtaining reliable and trustworthy benchmarking results is tedious and challenging, requiring dedication, diligence, and rigor to back it up.

Going forward, we’re planning to release more internal benchmarks that highlight how Hudi’s rich feature-set reaches unmatched performance levels in other common industry workloads. Stay tuned!
