Hudi-Presto Workshop

Building an Open Data Lakehouse on AWS S3 with Apache Hudi & Presto


April 24th, 2024 | 9 AM PT | 12 PM ET


Prerequisites

Basic understanding of data lake file/table formats
Programming knowledge: Python, SQL
AWS service: S3

Technology Stack

Data lake storage - AWS S3
File format - Parquet
Table format - Apache Hudi
Compute engine - Presto
Metastore - Hive

Dataset

The workshop uses the TPC-DS dataset at a volume of 10 GB to demonstrate the read and write capabilities of Hudi and Presto. The dataset will be available at a common S3 location accessible to workshop attendees.

Environment Details

All required open source software and its dependencies will be pre-installed for the workshop session. Attendees will use Jupyter notebooks to run read and write queries on Apache Hudi tables using Presto and Spark SQL. They will also have access to the Spark UI and Presto UI for additional analysis and debugging.

Workshop Architecture picture

Description

The lakehouse architecture combines the flexibility, scalability, and cost-efficiency of data lakes with the robust data management features of data warehouses. This workshop is designed to give data engineers and architects a comprehensive understanding of Apache Hudi and to show how to use it to build an open lakehouse architecture on AWS S3, with Presto as the engine for fast, interactive queries.

Attendees will learn:

  • The open lakehouse architecture stack, with Hudi as the transactional layer and Presto as the compute engine.
  • Hudi's table optimization features, clustering and the metadata table, and how they help improve query performance.
  • Practical exercises on creating different Hudi table types (CoW, MoR) on S3, ingesting data, performing upserts and deletes, and syncing with catalogs such as Hive Metastore.
  • Various ways of querying data using Presto, including snapshot and read-optimized queries.
  • How to apply the clustering table service and the metadata table, and observe firsthand the improvements in query speed on the Presto side.
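
As a rough illustration of the exercises listed above, the sketch below assembles the Hudi write options a Spark job might pass when creating a CoW or MoR table with Hive Metastore sync and inline clustering, and picks the table name a Presto query would target. It is not the workshop's notebook code. The helper names `hudi_write_options` and `presto_table_name`, the `tpcds` database, and the `updated_at` column are assumptions made for illustration; the `_ro`/`_rt` suffixes reflect Hudi's usual Hive sync naming for MoR tables, which can vary with the Hudi version and sync settings.

```python
from typing import Dict, Optional

TABLE_TYPES = ("COPY_ON_WRITE", "MERGE_ON_READ")
OPERATIONS = ("insert", "upsert", "delete", "bulk_insert")


def hudi_write_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    partition_field: str,
    table_type: str = "COPY_ON_WRITE",
    operation: str = "upsert",
    hive_database: Optional[str] = None,
    clustering_sort_columns: Optional[str] = None,
) -> Dict[str, str]:
    """Build a dict of Hudi datasource options (hypothetical helper)."""
    if table_type not in TABLE_TYPES:
        raise ValueError(f"unknown table type: {table_type}")
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")

    opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Metadata table, used to speed up file listing.
        "hoodie.metadata.enable": "true",
    }
    if hive_database is not None:
        # Register the table in Hive Metastore so Presto can see it.
        opts.update({
            "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.mode": "hms",
            "hoodie.datasource.hive_sync.database": hive_database,
            "hoodie.datasource.hive_sync.table": table_name,
        })
    if clustering_sort_columns is not None:
        # Inline clustering: rewrite files sorted by the given columns.
        opts.update({
            "hoodie.clustering.inline": "true",
            "hoodie.clustering.inline.max.commits": "4",
            "hoodie.clustering.plan.strategy.sort.columns": clustering_sort_columns,
        })
    return opts


def presto_table_name(table_name: str, table_type: str,
                      query_type: str = "snapshot") -> str:
    """Pick the synced table to query from Presto (assumed naming)."""
    if query_type not in ("snapshot", "read_optimized"):
        raise ValueError(f"unknown query type: {query_type}")
    if table_type == "COPY_ON_WRITE":
        # CoW tables have only base files, so one table serves both.
        return table_name
    if table_type == "MERGE_ON_READ":
        suffix = "_rt" if query_type == "snapshot" else "_ro"
        return table_name + suffix
    raise ValueError(f"unknown table type: {table_type}")


if __name__ == "__main__":
    # Illustrative values only; 'updated_at' is a hypothetical column.
    opts = hudi_write_options(
        "customer", "c_customer_sk", "updated_at", "c_birth_country",
        table_type="MERGE_ON_READ", hive_database="tpcds",
        clustering_sort_columns="c_customer_sk",
    )
    target = presto_table_name("customer", "MERGE_ON_READ", "read_optimized")
    print(f"SELECT count(*) FROM tpcds.{target}")
```

In a Spark notebook, a dict like this would typically be passed as `df.write.format("hudi").options(**opts).mode("append").save(path)`, where `path` is the table's S3 location. Keeping the options in one place makes it easy to switch between CoW and MoR, or to toggle clustering, and compare query times in Presto.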

Featured Speakers

Staff Developer Advocate, Apache Hudi Contributor
Open Source Developer, Presto Contributor