As we wrap up 2022 I want to take the opportunity to reflect on and highlight the incredible progress of the Apache Hudi project and, most importantly, the community. First and foremost, I want to thank all of the contributors who have made 2022 the best year ever for the project. There were over 2,200 PRs created (+38% YoY) and more than 600 users engaged on GitHub. The Apache Hudi community Slack channel has grown to more than 2,600 users (+100% YoY growth), averaging nearly 200 messages per month! The most impressive stat is that even with this volume growth, the median response time to questions is ~3h. Come join the community where people are sharing and helping each other!
While there are too many features added in 2022 to list them all, take a look at some of the exciting highlights:
Multi-Modal Index is a first-of-its-kind high-performance indexing subsystem for the Lakehouse. It improves metadata lookup performance by up to 100x and reduces overall query latency by up to 30x. Two new indexes were added to the metadata table: a Bloom filter index that enables faster upserts, and a column stats index that, together with data skipping, speeds up queries dramatically.
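As a rough sketch, enabling these metadata table indexes is a matter of a few writer-side configs (names shown as of the 0.12 line; verify them against the config reference for your Hudi version):

```properties
# Enable the metadata table (on by default in recent releases)
hoodie.metadata.enable=true
# Build the Bloom filter and column stats indexes in the metadata table
hoodie.metadata.index.bloom.filter.enable=true
hoodie.metadata.index.column.stats.enable=true
# Readers opt into data skipping backed by the column stats index
hoodie.enable.data.skipping=true
```

The same `hoodie.enable.data.skipping` flag is set on the query side so the engine can prune files using the column stats before reading any data.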
Hudi added support for asynchronous indexing, so such indexes can be built without blocking ingestion and regular writers don’t need to scale up resources for one-off spikes.
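A minimal sketch of how this is wired up: the writer opts into async indexing, and the index is then built out of band by the `HoodieIndexer` utility (flag names below follow the 0.11+ docs; double-check them for your release):

```properties
# Writer side: schedule index building asynchronously instead of inline
hoodie.metadata.index.async=true
```

```shell
# Separate process: schedule and execute index building for the table.
# Paths and table name here are illustrative placeholders.
spark-submit \
  --class org.apache.hudi.utilities.HoodieIndexer \
  hudi-utilities-bundle.jar \
  --mode scheduleAndExecute \
  --base-path /data/hudi/my_table \
  --table-name my_table \
  --index-types COLUMN_STATS
```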
A new index type called the Bucket Index was introduced this year. It can be a game changer for deterministic workloads on partitioned datasets: it is very lightweight and distributes records to buckets using a hash function.
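To illustrate the idea (this is a simplified sketch, not Hudi's internal implementation), a bucket index deterministically maps each record key to one of N buckets via a stable hash, so locating a record's file group never requires probing a per-file index:

```python
# Sketch: deterministic key-to-bucket routing, the core idea behind a
# bucket index. Hudi's actual hashing and file-group layout differ.
import hashlib

def bucket_for(record_key: str, num_buckets: int) -> int:
    # Use a stable hash so the mapping is identical across writer runs;
    # Python's built-in hash() is salted per process and would not be.
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Every update for the same key lands in the same bucket, which is why
# this works best when the number of buckets can be fixed up front.
keys = [f"order-{i}" for i in range(1000)]
buckets = [bucket_for(k, 16) for k in keys]
assert all(0 <= b < 16 for b in buckets)
```

The trade-off this sketch highlights is that the bucket count is part of the layout, which is why the feature suits workloads whose volume per partition is predictable.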
Filesystem-based Lock Provider - This implementation avoids the need for an external system by leveraging the underlying filesystem to provide the locks required for optimistic concurrency control with multiple writers. Please check the lock configurations for details.
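As a hedged sketch, a multi-writer setup with the filesystem-based provider looks roughly like this (config names per the 0.12 docs; confirm against the lock configuration reference):

```properties
# Enable optimistic concurrency control for multiple writers
hoodie.write.concurrency.mode=optimistic_concurrency_control
# Use the filesystem itself for locking instead of ZooKeeper/Hive/DynamoDB
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider
# Recommended with multi-writer: clean up failed writes lazily
hoodie.cleaner.policy.failed.writes=LAZY
```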
Deltastreamer Graceful Completion - Users can now configure a post-write termination strategy in Deltastreamer’s continuous mode for graceful shutdown.
Schema on read has been supported as an experimental feature since 0.11.0, allowing users to leverage Spark SQL DDL for evolving schema needs (drop, rename, etc.). We also added support for many CALL commands to invoke an array of actions on Hudi tables.
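For a flavor of what this enables, here is a sketch against a hypothetical `hudi_table` (it assumes `hoodie.schema.on.read.enable=true` and the Hudi Spark session extension are configured; exact DDL support varies by Spark and Hudi version):

```sql
-- Schema evolution via Spark SQL DDL
ALTER TABLE hudi_table RENAME COLUMN price TO amount;
ALTER TABLE hudi_table DROP COLUMN legacy_flag;

-- CALL procedures invoke table services and inspection actions
CALL show_commits(table => 'hudi_table', limit => 5);
```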
You can now encrypt the data you store with Apache Hudi.
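Since Hudi base files are Parquet, this builds on Parquet modular encryption. A rough sketch of the Spark-side configs (the KMS client class and key names below are illustrative placeholders; see the Parquet encryption docs for the full contract):

```properties
parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
# Plug in your own KMS client implementation here
parquet.encryption.kms.client.class=com.example.MyKmsClient
# Encrypt specific columns with a named master key, plus the footer
parquet.encryption.column.keys=columnKey:ssn,address
parquet.encryption.footer.key=footerKey
```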
Pulsar Write Commit Callback - Users can now be notified via Pulsar when new commits land on a Hudi table.
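A sketch of turning this on as a writer config (the callback class path and Pulsar-specific keys should be verified against your Hudi version's commit callback configuration reference):

```properties
# Fire a callback after each successful commit
hoodie.write.commit.callback.on=true
hoodie.write.commit.callback.class=org.apache.hudi.utilities.callback.pulsar.HoodieWriteCommitPulsarCallback
```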
Flink Enhancements: We added metadata table support, async clustering, data skipping, and a bucket index for write paths. We also extended Flink support to versions 1.13.x, 1.14.x, and 1.15.x.
Presto Hudi integration: In addition to the Hive connector we have had for a long time, we added a native Presto Hudi connector. This gives users faster access to advanced Hudi features. Users can now leverage the metadata table to reduce file-listing costs. We also added support for querying clustered datasets this year.
Trino Hudi integration: We also added a native Trino Hudi connector for querying Hudi tables via the Trino engine. Users can now leverage the metadata table to make their queries more performant.
Performance enhancements: The community landed many performance optimizations throughout the year to keep Hudi on par with or ahead of the competition. Check out this TPC-DS benchmark comparing Hudi and Delta Lake.
Long Term Support: We have begun maintaining 0.12 as a Long Term Support release for users to migrate to and stay on for a longer duration. In line with that, we made the 0.12.1 and 0.12.2 releases, which come packed with stability and bug fixes.
Apache Hudi is a global community, and thankfully we live in a world today that empowers virtual collaboration and productivity. In addition to connecting virtually this year, we have seen the Apache Hudi community gather at many in-person events: re:Invent, Data+AI Summit, Flink Forward, Alluxio Day, Data Council, PrestoCon, Confluent Current, DBT Coalesce, Cinco de Trino, Data Platform Summit, and many more.
A wide diversity of organizations around the globe use Apache Hudi as the foundation of their production data platforms. Over 800 organizations have engaged with Hudi (up 60% YoY). Here are a few highlights of content written by the community sharing their experiences, designs, and best practices:
Thanks to the strength of the community, Apache Hudi has a bright future for 2023. Check out this recording from our Re:Invent meetup where Vinoth Chandar talks about exciting new features to expect in 2023.
0.13.0 will be the next major release, packed with exciting new features. Here are a few highlights:
Record-key-based index to speed up lookups of records for UUID-based updates and deletes, well tested with 10+ TB of index data for hundreds of billions of records at Uber.
The long-term vision of Apache Hudi is to make the streaming data lake mainstream, achieving sub-minute commit SLAs with stellar query performance and incremental ETLs. We plan to harden the indexing subsystem with Table APIs for easy integration with query engines and access to Hudi metadata and indexes, Indexing Functions and a Federated Storage Layer to eliminate the notion of partitions and reduce I/O, and new secondary indexes. To realize fast queries, we will provide the option of a standalone MetaServer serving Hudi metadata to plan queries in milliseconds, and a Hudi-aware lake cache that speeds up the read performance of MOR tables along with fast writes for updates. Incremental and streaming SQL will be enhanced in Spark and Flink. For Hudi on Flink, we plan to make the multi-modal indexing production-ready, bring read and write compatibility between the Flink and Spark engines, and harden the streaming capabilities, including CDC, streaming ETL semantics, pre-aggregation models, and materialized views.