Hudi-Presto Workshop

Building an Open Data Lakehouse on AWS S3 with Apache Hudi™ & Presto

April 24th, 2024 | 9 AM PST | 12 PM EST

Prerequisites

Basic understanding of data lake file/table formats

Programming knowledge: Python, SQL

AWS service: S3

Technology Stack

Data lake storage - AWS S3

File format - Parquet

Table format - Apache Hudi

Compute engine - Presto

Metastore - Hive

Dataset

The workshop will use the 10 GB TPC-DS dataset to demonstrate read and write capabilities with Hudi and Presto. The dataset will be available at a shared S3 location accessible to all workshop attendees.

Environment Details

All required open-source software and its dependencies will be pre-installed for this workshop session. Attendees will use Jupyter Notebooks to run read and write queries on Apache Hudi using Presto and Spark SQL. Attendees will also have access to the Spark UI and Presto UI for additional analysis and debugging.

Description

The lakehouse architecture combines the flexibility, scalability, and cost-efficiency of data lakes with the robust data management features of data warehouses. This workshop is designed to give data engineers and architects a comprehensive understanding of Apache Hudi and to show how to use it to build an open lakehouse architecture on AWS S3, with Presto as the engine for fast, interactive queries.

Attendees will learn:

The open lakehouse architecture stack, with Hudi as the transactional layer and Presto as the compute engine.
Hudi's table optimization services, clustering and the metadata table, and how they help improve query performance.
Practical exercises on creating different Hudi table types (CoW, MoR) on S3, ingesting data, performing upserts/deletes, and syncing with catalogs such as Hive Metastore.
Various ways of querying data using Presto, including snapshot and read-optimized queries.
How to apply the clustering table service and the metadata table, and observe firsthand the improvements in query speed on the Presto side.
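
As a rough preview of the hands-on exercises, the sketch below shows how a Hudi write from PySpark might be configured for a CoW or MoR table with Hive Metastore sync, and which table names a Presto query would then typically target. It is a minimal illustration under stated assumptions, not the workshop's actual notebook code: the helper functions, the table name, the column names, and the S3 path are hypothetical placeholders, and only the `hoodie.*` option keys and the `_ro`/`_rt` suffix convention come from Hudi itself.

```python
# Minimal sketch: Hudi write options for CoW/MoR tables with Hive sync.
# All helper names, table names, columns, and paths here are hypothetical.

TABLE_TYPES = ("COPY_ON_WRITE", "MERGE_ON_READ")


def hudi_write_options(table_name, table_type, *, record_key,
                       precombine_field, partition_field,
                       hive_database="default", operation="upsert"):
    """Build the option map passed to Spark's Hudi data source."""
    if table_type not in TABLE_TYPES:
        raise ValueError(f"table_type must be one of {TABLE_TYPES}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        # "upsert" updates existing keys and inserts new ones;
        # "delete" removes the records whose keys are in the input.
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Register the table in Hive Metastore so Presto can find it.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": hive_database,
        "hoodie.datasource.hive_sync.table": table_name,
    }


def presto_table_names(table_name, table_type):
    """Table names that Hive sync typically registers for each query type.

    A MoR table is usually exposed as two tables: `<name>_ro` for
    read-optimized queries and `<name>_rt` for snapshot queries.
    A CoW table is exposed under its own name (snapshot only).
    """
    if table_type == "MERGE_ON_READ":
        return {"read_optimized": f"{table_name}_ro",
                "snapshot": f"{table_name}_rt"}
    return {"snapshot": table_name}


def write_to_hudi(df, options, base_path):
    """Write a Spark DataFrame to a Hudi table (needs a Spark session
    with the Hudi bundle on the classpath; not executed in this sketch)."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


if __name__ == "__main__":
    opts = hudi_write_options(
        "customer_mor", "MERGE_ON_READ",
        record_key="c_customer_sk",      # placeholder key column
        precombine_field="updated_at",   # hypothetical ordering column
        partition_field="c_birth_country",
    )
    for key, value in sorted(opts.items()):
        print(f"{key}={value}")
    print(presto_table_names("customer_mor", "MERGE_ON_READ"))
```

With a Spark DataFrame `df` in hand, the write itself would then be something like `write_to_hudi(df, opts, "s3a://<your-bucket>/hudi/customer_mor")`, and a read-optimized query in Presto would select from the `customer_mor_ro` table in the synced Hive database. Clustering and the metadata table are controlled by further `hoodie.*` options that are not shown here.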

Featured Speakers

Dipankar Mazumdar

Staff Developer Advocate, Apache Hudi Contributor

Kiersten Stokes

Open Source Developer, Presto Contributor