Hudi-Presto Workshop

Building an Open Data Lakehouse on AWS S3 with Apache Hudi & Presto

April 24th, 2024 | 9 AM PT | 12 PM ET

Prerequisites

Basic understanding of data lake file/table formats
Programming knowledge - Python, SQL
AWS service - S3

Technology Stack

Data lake storage - AWS S3
File format - Parquet
Table format - Apache Hudi
Compute engine - Presto
Metastore - Hive

Dataset

The workshop uses the 10 GB TPC-DS dataset to demonstrate the read and write capabilities of Hudi and Presto. The dataset will be made available at a common S3 location accessible to workshop attendees.

Environment Details

All the required open source software and its dependencies will be pre-installed for this workshop session. Attendees will use Jupyter Notebooks to run read and write queries on Apache Hudi tables using Presto and Spark SQL. Attendees will also have access to the Spark UI and Presto UI for additional analysis and debugging.

Description

The lakehouse architecture combines the flexibility, scalability, and cost-efficiency of data lakes with the robust data management features of data warehouses. This workshop is designed to give data engineers and architects a comprehensive understanding of Apache Hudi and to show how to use it to build an open lakehouse architecture on AWS S3, with Presto as the engine for fast, interactive queries.

Attendees will learn:

  • The open lakehouse architecture stack, with Hudi as the transactional layer and Presto as the compute engine.
  • Hudi’s table optimization services, clustering and the metadata table, which help improve query performance.
  • Practical exercises on creating both Hudi table types, Copy-on-Write (CoW) and Merge-on-Read (MoR), on S3, ingesting data, performing upserts/deletes, and syncing with catalogs such as Hive Metastore.
  • Various ways of querying data using Presto, including snapshot and read-optimized queries.
  • Applying the clustering table service and the metadata table to observe firsthand the improvement in query speed on the Presto side.
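
As a rough preview of the hands-on exercises, the sketch below shows how the write-side choices in the list above (table type, upsert/delete operation, Hive Metastore sync, metadata table, clustering) are commonly expressed as Hudi Spark datasource options, and which table names Presto typically sees after Hive sync. It is not the workshop's actual notebook code. The option keys are standard Hudi configuration keys; the helper functions, the `workshop` database, the `updated_at` precombine column, and the example sort column are assumptions made for illustration.

```python
# Illustrative sketch only. The option keys are standard Hudi Spark datasource
# configs; the helper names, database name, and column names are hypothetical.

TABLE_TYPES = ("COPY_ON_WRITE", "MERGE_ON_READ")
OPERATIONS = ("insert", "upsert", "bulk_insert", "delete")


def hudi_write_options(table, record_key, precombine,
                       table_type="COPY_ON_WRITE", operation="upsert",
                       sort_columns=None):
    """Build the option dict for a Spark DataFrame write in Hudi format."""
    if table_type not in TABLE_TYPES:
        raise ValueError(f"unknown table type: {table_type}")
    if operation not in OPERATIONS:
        raise ValueError(f"unknown write operation: {operation}")

    opts = {
        "hoodie.table.name": table,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        # Register the table in Hive Metastore so Presto can query it.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": "workshop",  # hypothetical
        "hoodie.datasource.hive_sync.table": table,
        # Metadata table: avoids expensive file listings on S3.
        "hoodie.metadata.enable": "true",
    }
    if sort_columns:
        # Inline clustering: rewrite files sorted by commonly filtered columns.
        opts.update({
            "hoodie.clustering.inline": "true",
            "hoodie.clustering.inline.max.commits": "4",
            "hoodie.clustering.plan.strategy.sort.columns": ",".join(sort_columns),
        })
    return opts


def presto_table_names(table, table_type):
    """Table names Hive sync typically registers for each query type."""
    if table_type == "MERGE_ON_READ":
        # _ro reads only compacted base files; _rt merges in log files.
        return {"read_optimized": f"{table}_ro", "snapshot": f"{table}_rt"}
    return {"snapshot": table}


# Example: options for upserting into a MoR copy of the TPC-DS customer table.
opts = hudi_write_options("customer", "c_customer_sk", "updated_at",
                          table_type="MERGE_ON_READ",
                          sort_columns=["c_birth_year"])
# In a Spark session with the Hudi bundle on the classpath, a write would
# look roughly like:
#   df.write.format("hudi").options(**opts).mode("append").save("<s3 path>")
```

For a MoR table, querying the `_ro` name from Presto trades freshness for speed, while the `_rt` name returns the latest snapshot; a CoW table has a single name because every commit rewrites base files. Exact registration behavior can vary by Hudi version, so treat the naming as a default rather than a guarantee.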

Featured Speakers

Staff Developer Advocate, Apache Hudi Contributor
Open Source Developer, Presto Contributor

About this Webinar:

The data lakehouse is seeing growing adoption, but building your own is challenging. Onehouse's Universal Data Lakehouse™ is a fully managed service built on open source technology. It offers interoperability with the leading lakehouse formats and compatibility with leading data stores and query engines such as Snowflake, Databricks, and Amazon Athena.

Our live webinar includes an overview of Onehouse from Founder and CEO Vinoth Chandar and a live demo of the managed lakehouse.

Your Presenters:

Vinoth Chandar, CEO and Founder
Andy Walner, Product Manager