June 16, 2023

Exploring New Frontiers: How Apache Flink, Apache Hudi and Presto Power New Insights at Scale

Exploring New Frontiers: How Apache Flink, Apache Hudi and Presto Power New Insights at Scale

PrestoCon Day 2023 was a day event filled with extraordinary talks ranging from how Adobe uses Presto, to query performance optimizations with Alibaba, to a panel discussion consisting of companies who use Presto at scale to power their analytics. The talks and panels brought together Presto enthusiasts and experts, where everyone can learn the best practices of using Presto to build data applications. For me, I left inspired and excited about the future of Presto. 

Throughout the day, a diverse lineup of speakers took the stage, delivering insightful presentations and thought-provoking discussions. They shared their experiences, best practices, and innovative use cases, showcasing the versatility and power of Presto across various industries and applications. One of those talks came from our very own Sagar Sumit and Danny Chan

Danny kicked off the talk and went over the challenges of changelog stream ingestion, changelog stream materialization and incremental ETL. Let’s go over each of these challenges:

Challenge 1: Changelog stream ingestion

Ingesting changelog stream into a filesystem has been challenging for many reasons. Firstly, files in the filesystem and the chosen file format (such as Parquet or ORC) are typically immutable. This means that all changes must be carefully applied to these files because they cannot be updated in place. Secondly, files that are frequently committed from streaming ingestion jobs can create a massive amount of small files stored. This will inevitably lead to performance degradation and resource overhead if not addressed. Small files can be combated with services like clustering and compaction. Typically, it’s up to the users to manage these services and file operations manually – which can be a huge operational overhead.

Changelog stream ingestion challenges

Challenge 2: Changelog stream materialization

Flink, known for its stateful streaming capabilities, utilizes  Dynamic Tables as a first-class citizen. These tables allow for the accumulation of records through running computations and then propagate the results as another dynamic table. However, there are certain limitations to consider for Flink’s Dynamic Tables:

  • The Dynamic Table is exclusively handled by the Flink engine and cannot be accessed by other engines directly. If another engine needs to query the dynamic table, it must first synchronize the data into external storage.
  • Dynamic Tables cannot be shared among multiple jobs, posing a challenge when dealing with computation operators like SQL JOIN that require significant resources. Sharing the intermediate result set would be beneficial in terms of cost-saving.
  • There is no history snapshot view of the dynamic table, and schema evolution is not supported. This means that there is no built-in capability to access previous versions or historical states of the dynamic table, nor is there a straightforward mechanism for handling changes in the table’s schema over time.
Changelog stream materialization challenges

Challenge 3: Incremental ETL

Building a medallion architecture with streaming pipelines is no small feat. Flink is best suited to handle a continuous stream of data in real time and at scale. However, there are other challenges that need to be solved to build towards an incremental processing architecture. Flink can be paired with other technologies to overcome the challenges. Here’s some considerations when using Flink to build an incremental ETL architecture:

  • Maintaining the sequence of events is crucial for accurate streaming computation. However, when processing changelog records in Flink, there is currently no built-in support for event time sequencing. Changelog records represent incremental changes to the data state but do not inherently include an event_time concept. Unlike some other database systems where `event_time` can be utilized to compare and update the latest records, the Dynamic Table in Flink does not have a notion of event_time. As a result, it is more suitable for handling continuous data streams rather than managing a complete historical representation or state of the data. To ensure correctness, it is important to consume records in the same sequence in which they are generated. It’s worth noting that in Flink’s current state, if you encounter late arriving data, it will be processed as it arrives, potentially leading to updates that are not in the correct time order.
  • Both the writer and reader jobs must ensure exactly-once semantics to maintain correctness. This means that data must be processed and consumed without duplication or loss, guaranteeing reliability and accuracy.
  • The reader job requires a reliable approach to track the consumption offset, allowing for efficient recovery in a lightweight manner. This enables the reader to resume processing from the correct position after any potential failures or interruptions.

To overcome these challenges, it’s up to the user to implement custom mechanisms or leverage external tools. Again, this results in operational overhead.

Incremental ETL challenges

Rising above Challenges: Hudi and Flink integration

By combining Hudi with Flink, the challenges mentioned above can be overcome. 

Solution 1: Changelog stream ingestion

Hudi has a pluggable index layer where new records are indexed. Each record written into a Hudi table would be first tagged with the location information. Records are sent to a file slice. A group of file slices is called a file group

In Hudi, records with the same key are always routed to the same FileGroup. This mapping ensures that changes related to a particular record key are stored together within the same FileGroup. By organizing data at the FileGroup level, Hudi optimizes the storage layout and reduces the number of small files. Hudi tries to binpack new inserts into smaller FileGroups, allowing for efficient data storage and minimizing the generation of unnecessary small files. 

In addition, Hudi provides a clustering and compaction table service to help with file sizing. Compaction is specifically used for Hudi’s Merge-On-Read (MOR)  tables to compact log files into base files. Clustering is used for both Copy-On-Write (COW) and MOR tables to help with a few things:

  1.  decouple file sizing from ingestion where smaller files are combined into larger files without impacting write speeds and data freshness
  2. improve query performance by reorganizing the data layout via techniques such as sorting data on different columns

I won’t go into too much detail about the compaction and clustering table service here- but you can refer to the documentation linked to learn more.

Hudi's index layer and file layout

Solution 2: Changelog stream materialization

By leveraging Hudi, the challenges surrounding event time sequencing and persistent change operations can be effectively handled. Hudi also has more features compared to Flink’s Dynamic Table:

  • A Hudi table can be exposed as a snapshot view that can be queried by Presto and other ad-hoc engines.
  • A Hudi table can perform time-travel and incremental queries where you can query from a specific timestamp. This capability enables efficient analysis of data changes over time and facilitates historical data exploration.
  • Hudi tables can even function as a message queue, allowing streaming readers to consume and replay all changes. This capability enables computations and propagation of changes downstream, providing a comprehensive solution for real-time data processing.

Hudi supports schema evolution to handle changes in the schema. As the data schema changes over time, write or query operations won’t fail. You can learn more about schema evolution in the Hudi documentation.

Materialize the Dyanmic Table as a Hudi table

Solution 3: Incremental ETL

Hudi is able to maintain the event sequence through the use of a timeline and how it organizes data in its file structure.

  • Hudi’s file layout structure: Earlier in Solution 1, I mentioned Hudi’s FileGroup layout. Hudi groups changes together base on the record key. All changes associated with that key are routed to the same FileGroup. Hudi’s file layout helps for quick data accessibility and processing.
  • Hudi’s active timeline: Each Hudi commit results from a Flink checkpoint. Hudi has an internal timeline where all commits to a Hudi table are recorded. The commit timestamp represents the changes that were applied to the data- i.e. an update or delete operation. Since each update or delete has a unique timestamp, Hudi can maintain and ensure proper event ordering. Hudi also has built-in capabilities that helps with key deduplication and records merges to assist in when records should be updated or deleted. Hudi acts as a persistent Dynamic Table that supports updates and change streams for incremental consumption. The Flink stream reader monitors the Hudi’s commits and accesses the timeline, and thus incrementally consume from Hudi. Here’s an example diagram of this architecture:

Incremental ETL with Flink and Hudi

Fresher insights on streaming data with Presto using Hudi’s metadata table

Once Hudi ingests streaming data from Flink, you can power Presto to query the data for near real-time analytics. Presto is able to query data magnitudes faster due to Hudi’s metadata indexing table service

Hudi’s metadata table is where each partition stores information about the metadata about the file. For example, the column stats partition will store information about the column min max values, null counts and so on. The metadata table also stores files partitions index, which stores information about what files are located within, which partition.  

The metadata is a central view of all the indexes, like col_stats (column stats index), bloom filter index and file index. Presto can access the metadata table and prune irrelevant files faster, resulting in near real-time analytics. 

Hudi's multi-modal indexing sub-system

There was so much more covered during their session. Sagar went pretty deep into how Presto takes advantage of the metadata table. You can catch presentation and demo on Presto's YouTube:

Presentation:

Demo:

All-in-all, the event showcased the growth and momentum of the Presto community. With an increasing number of organizations adopting and contributing to Presto, it was evident that Presto has become a cornerstone technology for modern data analytics. We extend our sincere appreciation to the organizers, speakers, sponsors, and attendees for making PrestoCon Day a resounding success.

Subscribe to the Blog

Be the first to read new posts

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
We are hiring diverse, world-class talent — join us in building the future