April 23, 2024

Dremio Lakehouse Analytics with Hudi and Iceberg using XTable

Dipankar Mazumdar is Staff Data Engineering Advocate at Onehouse
Alex Merced is Sr. Technical Evangelist at Dremio

Data lakehouse architectures are becoming ubiquitous today, with organizations increasingly adopting open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake for their data platforms. These formats have been instrumental in establishing an open, flexible architectural foundation that empowers organizations to choose the most appropriate compute engine for their specific workloads, without locking data into the storage formats of a proprietary database or data warehouse. 

This approach to openness and flexibility has enabled a transformation in the way data is being stored and used. Today, customers can choose to store data in open table formats in cloud object stores such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. The data is wholly owned and managed by the data owner and kept in their secure virtual private cloud (VPC) account. The user can bring the right type of query engine(s) for their workload(s), without data copying. This creates a future-proof architecture where you can add new tools to your stack whenever required. 

Despite these advantages, one hindrance remains: the need to choose a single table format. This is a significant challenge because each format offers unique features and integration benefits. Additionally, with newer workloads, organizations increasingly require the formats to be fully interoperable, so data is universally queryable. Without interoperability, an organization is bound to a single format, forcing it to adopt one-off migration strategies or make full data copies - often repeatedly - to work with other formats.

Figure 1: An open data lakehouse with interoperability between table formats

The Apache XTable (Incubating) project, an open-source initiative launched last year, addresses this challenge by focusing on interoperability among these different lakehouse table formats. XTable acts as a lightweight translation layer, allowing seamless metadata translation between source and target table formats without rewriting or duplicating the actual data files. Thus, irrespective of which table format you initially choose for writing data, you can read it using the format and compute engine of your choice.

In this blog, we will walk through a hypothetical but practical scenario that is becoming increasingly common in today’s analytical workloads within organizations.

Scenario

This scenario begins with two analytics teams as part of the Market Analysis group in an organization. These teams are responsible for analyzing market trends and consumer preferences for the products of various superstores. Most of their data lands in an S3 data lake. For this specific exercise, we have used publicly available data from Kaggle.

Team A: Uses Apache Hudi as the table format with Spark

Team A uses Apache Hudi to manage some of their most critical, low-latency data pipelines. Hudi's strength lies in its ability to support incremental data processing, providing faster upserts and deletes in data lakes. Additionally, robust indexing and automatic table management features in Hudi enable Team A to maintain high levels of efficiency and performance in their data ingestion processes, primarily executed via Apache Spark. Their Hudi table contains sales data for ‘Tesco’ over a specific period.

Team B: Analytics with Dremio and Iceberg

On the other side, Team B focuses on ad hoc analysis, BI, and reporting, leveraging Dremio's robust compute engine and the reliability of Apache Iceberg tables. Iceberg’s capabilities, such as hidden partitioning and data versioning, pair seamlessly with Dremio's query acceleration for analytical workloads. This combination allows Team B to perform complex analyses and generate BI reports quickly and easily. Team B stores sales data for the supermarket ‘Aldi’ as an Iceberg table.

The Challenge: Unifying Data from Hudi & Iceberg Tables

To perform a detailed comparative analysis for a special marketing campaign in the organization, Team B wants to understand the category-wise product sales of both ‘Tesco’ and ‘Aldi’ superstores. To do so, Team B wants to use the dataset generated by Team A (stored as a Hudi table) and combine it with their dataset (the Iceberg table). Given that they use Dremio as the compute engine for analysis and reporting, this would traditionally pose a significant barrier, because Dremio doesn’t natively support Hudi tables. 

The Solution: Apache XTable for Interoperability

In scenarios such as this one, Apache XTable provides a straightforward solution. Using XTable, Team B exposes the source Hudi table ('Tesco' data) as an Iceberg table by translating the metadata from Hudi to Iceberg, without rewriting or duplicating the actual data. The translation process is efficient and uses the same S3 bucket to store the translated metadata for the target table.

Figure 2: Team B uses XTable to translate the ‘Tesco’ Hudi dataset to Iceberg format to use it with Dremio for analytics

Once Team A’s data (Hudi) is presented as an Iceberg table, Team B can work on it as though it were originally written in Iceberg format. They can take advantage of operations such as joins and unions with Dremio’s compute, creating a new dataset that combines data from both teams. XTable requires no costly data rewrites or cumbersome migration efforts, allowing for quick analysis, and it makes data more universally available, so the organization can work seamlessly with multiple table formats.

Now that we have established a solid understanding of the problem and the solution provided by Apache XTable, let's dive into the practical aspects and see how interoperability works in action for the scenario described above.

Hands-on Use Case

Team A:

Team A uses Apache Spark to ingest sales data from the ‘Tesco’ superstore into a Hudi table stored in an S3 data lake. 

Let’s start with creating the Hudi Table. Here is all the configuration we will need to use PySpark with Apache Hudi.

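A minimal sketch of that setup is shown below, assuming Spark 3.4 and the Hudi 0.14 bundle (adjust the package coordinates to match your Spark and Scala versions).

from pyspark.sql import SparkSession

# Minimal Spark session for writing Hudi tables; the bundle version below is
# illustrative and should match your Spark/Scala versions.
spark = (
    SparkSession.builder
    .appName("hudi-tesco-ingestion")
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
    .getOrCreate()
)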

Now, we ingest the sales records into the Hudi table.
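The ingestion is a standard Hudi datasource write. A sketch is shown below; the raw file location and the columns used for the record key, precombine field, and partition path (order_id, order_date, category) are assumptions about the dataset.

# Read the raw 'Tesco' sales data and write it as a Hudi table at the base path
# referenced later in the XTable config. Paths and key columns are illustrative.
tesco_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://diplakehouse/raw/tesco_sales.csv")
)

hudi_options = {
    "hoodie.table.name": "retail_data",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "order_date",
    "hoodie.datasource.write.partitionpath.field": "category",
    "hoodie.datasource.write.operation": "upsert",
}

(
    tesco_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("overwrite")
    .save("s3a://diplakehouse/hudi_tables/")
)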

Let’s quickly check the Hudi table files in the S3 file system.

Here’s how the data looks (queried with Spark SQL).
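One way to do that is to load the table back and register it as a temporary view, for example:

# Load the Hudi table back and inspect a few rows with Spark SQL.
spark.read.format("hudi").load("s3a://diplakehouse/hudi_tables/") \
    .createOrReplaceTempView("retail_data")

spark.sql("SELECT * FROM retail_data LIMIT 10").show()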

Team B:

Next, the ingestion for the ‘Aldi’ superstore is performed using Spark, with the dataset stored as an Iceberg table (retail_ice) in the S3 data lake. This step simulates a typical workflow where data engineering teams are responsible for data preparation and ingestion.

If you want to use a local Spark and Dremio environment to try out this use case, follow the directions on this repo to create a local lakehouse environment.

Let’s first configure Apache Iceberg with PySpark and a Hadoop catalog and create the Iceberg table.
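A sketch of that setup follows. The Iceberg runtime version, warehouse location, catalog name (lakehouse), namespace (sales), and table schema are illustrative.

from pyspark.sql import SparkSession

# Spark session with an Iceberg Hadoop catalog backed by an S3 warehouse path.
spark = (
    SparkSession.builder
    .appName("iceberg-aldi-ingestion")
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.3")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hadoop")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3a://diplakehouse/iceberg_tables/")
    .getOrCreate()
)

# Create the Iceberg table for the 'Aldi' sales data (schema is assumed).
spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.retail_ice (
        order_id   STRING,
        order_date DATE,
        category   STRING,
        sales      DOUBLE
    ) USING iceberg
""")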

And then ingest the sales data for ‘Aldi’.
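The write itself can be a straightforward append; the raw file path and column names are again illustrative.

from pyspark.sql.functions import col

# Read the raw 'Aldi' sales data and append it to the Iceberg table,
# casting columns to match the schema created above.
aldi_df = (
    spark.read
    .option("header", "true")
    .csv("s3a://diplakehouse/raw/aldi_sales.csv")
    .select(
        col("order_id"),
        col("order_date").cast("date"),
        col("category"),
        col("sales").cast("double"),
    )
)

aldi_df.writeTo("lakehouse.sales.retail_ice").append()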

After the data is written as an Iceberg table in the S3 data lake, data analysts can use Dremio’s lakehouse platform to connect to the lake and start querying the data.

Here’s a simple query.
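For illustration, assuming the S3 source is registered in Dremio as s3lake and the table folder has been promoted, a query might look like this:

-- Source and path names depend on how the S3 source is configured in Dremio;
-- they are illustrative here.
SELECT *
FROM s3lake.diplakehouse."iceberg_tables"."sales"."retail_ice"
LIMIT 10;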

Translate Hudi dataset ('Tesco') to Iceberg

So, with both teams having their data stored in two different table formats, we now introduce Apache XTable to solve the interoperability challenge. 

XTable will be used to translate the metadata from the Hudi table ('Tesco') to Iceberg format, enabling the data to be accessible and queryable in Iceberg format using Dremio on the Team B side. This doesn’t modify or duplicate the original dataset’s Parquet base files. 

To get started with Apache XTable, we will clone the GitHub repository to our local environment and compile the necessary jars using Maven. The following command initiates the build:

mvn clean package

For more details around installation, follow the official docs.

After the build completes successfully, we will use the utilities-0.1.0-SNAPSHOT-bundled.jar to initiate the metadata translation process.

The next step is to set up a configuration file, my_config.yaml, within the XTable directory we've cloned, to define the translation details. The configuration should look something like this:

my_config.yaml:

sourceFormat: HUDI
targetFormats:
  - ICEBERG
datasets:
  - tableBasePath: s3://diplakehouse/hudi_tables/
    tableName: retail_data

The configuration outlines the source format (Hudi), the target format (Iceberg), and the table-specific details: the table's base path in S3 and its name.

To kick off the translation process, we will execute the following command.

java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml

After the sync process completes successfully, we will see the output, as shown in the snippet below. 

If we now inspect the S3 location, we’ll see the Iceberg metadata files, which include details such as schema definitions, commit history, partitioning information, and column statistics. Looking at the metadata folder in S3, we can see that the Iceberg metadata sits in the same /hudi_tables directory as the Hudi table.

Now that the original Hudi table ('Tesco' dataset) has been translated to an Iceberg table in our S3 data lake, we can seamlessly use Dremio's compute engine to query the data and perform further operations.

Without a lightweight translation layer such as Apache XTable, accessing Hudi tables from Dremio would not be straightforward. Alternatives would involve cumbersome migration processes, costly data rewrites, and the potential loss of historical data versions.

 Let’s go ahead and query this new dataset from Dremio.

Now, for the next part, Team B wants to combine both datasets ('Tesco' and 'Aldi') together into one view and build a BI report using the data.

We will use a simple UNION on both these tables, as shown below, to achieve this.
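Under the same illustrative Dremio source and path names used earlier, the UNION might look like this; the superstore column and the projected column names are assumptions, added so the report can distinguish the two datasets.

-- 'Tesco' data: the Hudi table translated to Iceberg by XTable, promoted from
-- the hudi_tables folder. 'Aldi' data: the native Iceberg table.
SELECT order_id, category, sales, 'Tesco' AS superstore
FROM s3lake.diplakehouse."hudi_tables"
UNION
SELECT order_id, category, sales, 'Aldi' AS superstore
FROM s3lake.diplakehouse."iceberg_tables"."sales"."retail_ice";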

Dremio also allows saving this as a view in a particular space (layer) in our environment so it can be used by specific teams. We will save the combined dataset as Universal_dataset_superstore.

Great! So, this combined dataset (the Hudi-translated table and the native Iceberg table) will now be used by Team B to perform their category-wise product sales analysis for both ‘Tesco’ and ‘Aldi’ superstores.

To do so, analysts can use the “Analyze With” button in Dremio to build a BI report in Tableau using this new combined dataset.

Here’s the final report in Tableau that integrates datasets from two distinct table formats to perform a category-wise product sales analysis. The 'Tesco' data (in green) was originally stored in Hudi format and has now been translated to Iceberg using XTable, while the 'Aldi' data (in yellow) was natively stored as an Iceberg table.

This use case underscores the benefits brought by XTable's translation capabilities. Team B’s analysts were able to work with Tesco's data as if it were always an Iceberg table, without the need for any changes on their part during analysis. The flexibility provided by XTable allows Dremio to read and perform analytics on the Tesco dataset without any distinction from the native Iceberg format. The ability to interoperate between open formats saves money, improves performance, simplifies the analytical workflow, and ensures that data is universally accessible.

