PrestoCon Day 2023 was a day-long event filled with extraordinary talks, ranging from how Adobe uses Presto, to query performance optimizations at Alibaba, to a panel discussion with companies that run Presto at scale to power their analytics. The talks and panels brought together Presto enthusiasts and experts, and everyone could pick up best practices for using Presto to build data applications. I left inspired and excited about the future of Presto.
Throughout the day, a diverse lineup of speakers took the stage, delivering insightful presentations and thought-provoking discussions. They shared their experiences, best practices, and innovative use cases, showcasing the versatility and power of Presto across various industries and applications. One of those talks came from our very own Sagar Sumit and Danny Chan.
Danny kicked off the talk by covering the challenges of changelog stream ingestion, changelog stream materialization, and incremental ETL. Let’s go over each of these challenges:
Ingesting a changelog stream into a filesystem has been challenging for several reasons. First, files in the filesystem and the chosen file format (such as Parquet or ORC) are typically immutable, so changes cannot be applied in place and must be carefully rewritten into new files. Second, frequent commits from streaming ingestion jobs can produce a massive number of small files, which inevitably leads to performance degradation and resource overhead if not addressed. Small files can be combated with table services like clustering and compaction, but it’s typically up to users to manage these services and file operations manually – which can be a huge operational overhead.
Flink, known for its stateful streaming capabilities, treats Dynamic Tables as first-class citizens. These tables accumulate records through continuously running computations and propagate the results as another dynamic table. However, there are certain limitations to consider with Flink’s Dynamic Tables:
Building a medallion architecture with streaming pipelines is no small feat. Flink is best suited to handling a continuous stream of data in real time and at scale, but there are other challenges to solve on the way to an incremental processing architecture, and Flink can be paired with other technologies to overcome them. Here are some considerations when using Flink to build an incremental ETL architecture:
To overcome these challenges, it’s up to the user to implement custom mechanisms or leverage external tools. Again, this results in operational overhead.
By combining Hudi with Flink, the challenges mentioned above can be overcome.
Hudi has a pluggable index layer where new records are indexed: each record written into a Hudi table is first tagged with its location information and then routed to a file slice. A group of file slices is called a FileGroup.
In Hudi, records with the same key are always routed to the same FileGroup. This mapping ensures that all changes for a particular record key are stored together within the same FileGroup. By organizing data at the FileGroup level, Hudi optimizes the storage layout and reduces the number of small files: it tries to bin-pack new inserts into smaller FileGroups, allowing for efficient data storage and minimizing the generation of unnecessary small files.
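To make the key-to-FileGroup routing concrete, here is a minimal sketch of declaring a changelog stream as a Hudi table from Flink SQL. It assumes the Hudi Flink bundle is on the classpath; the table, column, and path names are made up for illustration, and the record key option is what drives the routing described above.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiChangelogIngest {
    public static void main(String[] args) {
        // Streaming TableEnvironment; assumes the hudi-flink bundle is on the classpath.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hypothetical changelog table. The record key ('order_id') is what Hudi's
        // index layer uses to tag each incoming record with its location and route
        // every change for that key to the same FileGroup.
        tEnv.executeSql(
                "CREATE TABLE orders_hudi (" +
                "  order_id STRING," +
                "  status   STRING," +
                "  amount   DOUBLE," +
                "  ts       TIMESTAMP(3)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 'file:///tmp/hudi/orders'," +
                "  'table.type' = 'MERGE_ON_READ'," +
                "  'hoodie.datasource.write.recordkey.field' = 'order_id'" +
                ")");
    }
}
```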
In addition, Hudi provides clustering and compaction table services to help with file sizing. Compaction is specific to Hudi’s Merge-On-Read (MOR) tables and compacts log files into base files. Clustering is used for both Copy-On-Write (COW) and MOR tables and helps with a few things:
I won’t go into too much detail about the compaction and clustering table services here, but you can refer to the linked documentation to learn more; a minimal configuration sketch follows below.
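For reference, these table services are largely driven by table options. The sketch below reuses the hypothetical table from the earlier snippet and turns on async compaction and clustering through the Hudi Flink connector’s options; the option names follow the Hudi Flink connector documentation, but verify the exact names and defaults against the docs for your Hudi version.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiTableServicesConfig {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Same hypothetical MOR table as before, now with table-service options added.
        tEnv.executeSql(
                "CREATE TABLE orders_hudi (" +
                "  order_id STRING," +
                "  status   STRING," +
                "  amount   DOUBLE," +
                "  ts       TIMESTAMP(3)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 'file:///tmp/hudi/orders'," +
                "  'table.type' = 'MERGE_ON_READ'," +
                "  'compaction.async.enabled' = 'true'," +    // compact log files into base files without stopping the job
                "  'compaction.delta_commits' = '5'," +       // schedule compaction every 5 delta commits
                "  'clustering.schedule.enabled' = 'true'," + // periodically rewrite/bin-pack small files
                "  'clustering.async.enabled' = 'true'" +
                ")");
    }
}
```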
By leveraging Hudi, the challenges around event-time sequencing and persisting change operations can be handled effectively. Hudi also offers more features than Flink’s Dynamic Tables:
Hudi supports schema evolution to handle changes to the schema: as the data schema changes over time, write and query operations won’t fail. You can learn more about schema evolution in the Hudi documentation.
Hudi is able to maintain event ordering through its timeline and the way it organizes data in its file structure.
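As a rough mental model (this is plain Java, not Hudi code), the ordering/precombine field behaves like a merge rule: when two changes share a record key, the one with the larger ordering value wins, so a late-arriving older change cannot overwrite a newer one.

```java
import java.util.HashMap;
import java.util.Map;

public class OrderingFieldSketch {
    // Simplified model of Hudi's ordering/precombine semantics, not Hudi code.
    // Requires Java 16+ for records.
    record Change(String key, String status, long orderingValue) {}

    public static void main(String[] args) {
        Map<String, Change> table = new HashMap<>();
        upsert(table, new Change("order-1", "CREATED", 100L));
        upsert(table, new Change("order-1", "SHIPPED", 300L));
        upsert(table, new Change("order-1", "PAID",    200L)); // arrives late, loses to orderingValue=300
        System.out.println(table.get("order-1").status());     // prints SHIPPED
    }

    static void upsert(Map<String, Change> table, Change incoming) {
        // Keep the change with the larger ordering value for a given key.
        table.merge(incoming.key(), incoming,
                (current, next) -> next.orderingValue() >= current.orderingValue() ? next : current);
    }
}
```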
Once Hudi ingests streaming data from Flink, you can use Presto to query the data for near real-time analytics. Presto can query the data orders of magnitude faster thanks to Hudi’s metadata table service and the indexes it maintains.
Hudi’s metadata table contains partitions that each store a different kind of metadata about the table’s files. For example, the column stats partition stores per-column min/max values, null counts, and so on, while the files partition records which data files belong to which table partition.
The metadata table provides a central view of all the indexes, such as col_stats (the column stats index), the bloom filter index, and the files index. Presto can read the metadata table to prune irrelevant files faster, resulting in near real-time analytics.
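To make that concrete, here is a hedged sketch of querying the Flink-written table from Presto over JDBC. The coordinator URL, catalog name ('hudi'), schema, and table are assumptions, and file pruning via the metadata table requires that the metadata table be enabled on the Hudi table and supported by the configured Presto connector.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PrestoHudiQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical coordinator URL and catalog/schema; assumes the Presto JDBC
        // driver is on the classpath and a catalog named 'hudi' points at the table.
        String url = "jdbc:presto://localhost:8080/hudi/default";
        Properties props = new Properties();
        props.setProperty("user", "analyst");

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             // With column stats from the metadata table, Presto can skip files
             // whose ts min/max ranges fall entirely outside the filter.
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, count(*) AS cnt " +
                     "FROM orders_hudi " +
                     "WHERE ts >= timestamp '2023-06-01 00:00:00' " +
                     "GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString("status") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}
```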
There was so much more covered during their session. Sagar went pretty deep into how Presto takes advantage of the metadata table. You can catch the presentation and demo on Presto’s YouTube channel:
Presentation:
Demo:
All-in-all, the event showcased the growth and momentum of the Presto community. With an increasing number of organizations adopting and contributing to Presto, it was evident that Presto has become a cornerstone technology for modern data analytics. We extend our sincere appreciation to the organizers, speakers, sponsors, and attendees for making PrestoCon Day a resounding success.