July 14, 2022

Apache Hudi on Microsoft Azure

Apache Hudi is a popular open source lakehouse technology that is rapidly growing in the big data community. If you have built data lakes or data engineering platforms on AWS, you have likely already heard of or used Apache Hudi, since AWS natively integrates and supports it as a first-class technology across many of its data services, including EMR, Redshift, Athena, and Glue.

When considering building a data lake or a lakehouse on Microsoft Azure, most people are already familiar with Delta Lake from Databricks, but some are now discovering that Apache Hudi is a viable alternative that works with Azure-native services like Azure Synapse Analytics, HDInsight, ADLS Gen2, Event Hubs, and many others, including Azure Databricks. With limited documentation available, the goal of this blog is to show how you can easily kickstart and use Hudi on Azure.

Apache Hudi is an open source lakehouse technology that brings transactions, concurrency, upserts, and advanced storage performance optimizations to your data lakes on Azure Data Lake Storage (ADLS). Apache Hudi offers remarkable performance advantages for your workloads and ensures that your data is never locked into any one vendor. With Apache Hudi, all of your data stays in open file formats like Parquet, and Hudi can be used with any of the popular query engines like Apache Spark, Flink, Presto, Trino, and Hive.

Hudi and Synapse Setup

Using Apache Hudi on Azure is as simple as loading a few libraries into Azure Synapse Analytics. If you already have a Synapse workspace, a Spark pool, and an ADLS storage account, you can skip some of the prerequisite steps below.

Prereq 1 - Synapse Workspace

First, if you don't have one already, create a Synapse workspace. No special configuration is required, but you will want to remember the ADLS account name and file system name for later.

Prereq 2 - Serverless Spark Pool

It is very easy to use Hudi with Apache Spark, so let's create a Synapse serverless Spark pool that we can use to read and write data. First, open the Synapse workspace that you just created and launch it in Synapse Studio, then follow this quickstart to create a Spark pool.

There are no Hudi-specific settings you have to set here, but take note of the Spark runtime version you select and make sure you pick the matching Hudi version. For this tutorial, I picked Spark 3.1 in Synapse, which uses Scala 2.12.10 and Java 1.8.

Prereq 3 - Add Hudi Package to Synapse

While Apache Hudi libraries come pre-installed by default on most comparable data services on AWS and GCP, there is one extra step you need to perform to get Hudi purring on Azure. See the Synapse docs for how to install packages into your workspace.

With Spark 3.1, I could use Hudi 0.10+. I downloaded the latest Hudi 0.11.1 jar to my local machine from Maven Central here:

https://search.maven.org/search?q=a:hudi-spark3.1-bundle_2.12

Upload the .jar package to your Synapse workspace from Synapse Studio:

Now that the package is uploaded to the workspace, you need to add it to your Spark pool. Navigate to your pool and open the three-dots (More) menu to find the Packages option:

Click “Select from workspace packages” and select the Hudi .jar you uploaded to your workspace.

Prereq 4 - Create a Synapse Notebook

Create a notebook in Synapse Studio. For this tutorial I am using Scala, but feel free to use Python as well.

Let’s Go!

Now with the Hudi libraries installed, you are ready to start using Hudi on Azure! 

Hudi and Synapse Quickstart

This simple quickstart will just get your feet wet and introduce you to the basic setup. See additional resources below for more in-depth material. If you don't want to copy/paste code in the rest of the tutorial, you can download this notebook from my GitHub and then import it into your Synapse workspace.

Import sample data

For this simple quickstart I am using the classic NYC Taxi dataset from the AML open datasets.
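Since the notebook is in Scala, one simple way to pull this data in is to read the public Azure Open Datasets Parquet files directly. A minimal sketch, assuming the public `azureopendatastorage` path for the yellow taxi data (adjust the path, year, and month to whatever slice you want):

```scala
// Read a small slice of the public NYC yellow taxi data (Azure Open Datasets).
// The wasbs path below is an assumption -- swap in your own copy of the data
// or the azureml-opendatasets helpers if you prefer.
val taxiSourcePath = "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/"

val df = spark.read.parquet(taxiSourcePath)
  .filter("puYear = 2019 AND puMonth = 1") // one month keeps the quickstart small
  .limit(10000)

df.printSchema()
df.show(5)
```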

Set Hudi write configs

Choose a Hudi base path and set basic write configs.

  • Read about Hudi record keys, precombine keys, and other configs in the Hudi docs.
  • Read about Hudi write operations in the Hudi docs.
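Here is a minimal sketch of the write configs. The base path, record key, precombine field, and partition column below are assumptions for illustration; swap in your own ADLS account, file system, and key columns.

```scala
// Basic Hudi write configs. The taxi data has no natural unique id, so add a
// synthetic record key column for this example.
import org.apache.spark.sql.functions.expr

val basePath  = "abfss://<file-system>@<adls-account>.dfs.core.windows.net/hudi/nyc_taxi" // placeholder
val tableName = "nyc_taxi_hudi"

val taxiDf = df.withColumn("rowId", expr("uuid()")) // synthetic record key

val hudiOptions = Map(
  "hoodie.table.name"                           -> tableName,
  "hoodie.datasource.write.table.name"          -> tableName,
  "hoodie.datasource.write.recordkey.field"     -> "rowId",
  "hoodie.datasource.write.precombine.field"    -> "tpepPickupDateTime",
  "hoodie.datasource.write.partitionpath.field" -> "puYear",
  "hoodie.datasource.write.operation"           -> "upsert"
)
```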

Write the sample dataset to ADLS G2 as a Hudi table

All it needs is a single keyword swap: `spark.write.format("hudi")`.
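A minimal sketch of the write, using the DataFrame and options defined above:

```scala
// Write the sample data to ADLS Gen2 as a Hudi table -- the only change from a
// plain Parquet write is format("hudi") plus the Hudi options.
import org.apache.spark.sql.SaveMode

taxiDf.write
  .format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Overwrite)
  .save(basePath)
```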

Create a SQL Table

You can create a managed or external shared Spark table using the `hudi` keyword:

https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table#shared-spark-tables 
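For example, a sketch of registering an external table over the Hudi base path (the table name here is just a placeholder):

```scala
// Register an external Spark SQL table that points at the Hudi base path
spark.sql(s"""
  CREATE TABLE IF NOT EXISTS nyc_taxi_hudi
  USING hudi
  LOCATION '$basePath'
""")
```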

You can now take full advantage of basic SQL on your Hudi table:
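For example (column names assume the taxi schema used above):

```scala
// Basic SQL against the registered Hudi table
spark.sql("SELECT COUNT(*) AS trips FROM nyc_taxi_hudi").show()
spark.sql("SELECT rowId, tipAmount, totalAmount FROM nyc_taxi_hudi LIMIT 10").show()
```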

Upserts/Merges

Apache Hudi offers a first-of-its-kind high-performance indexing subsystem on a data lake. With record-level indexes and ACID transactions, Hudi makes upserts and merges fast and efficient.

Let's say that a rider decides to change their tip hours or days after a taxi trip has completed.

For this example, let's grab a single record and inspect the original `tipAmount`:
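Something along these lines, using the synthetic `rowId` key from the write configs above:

```scala
// Pick one record key and look at its current tipAmount
val snapshotDf = spark.read.format("hudi").load(basePath)
val sampleKey  = snapshotDf.select("rowId").first().getString(0)

snapshotDf.filter(s"rowId = '$sampleKey'")
  .select("rowId", "tipAmount")
  .show()
```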

Now set the Hudi write operation to `upsert`, change the `tipAmount` to $5.23, and write the updated record as an `append`:
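A sketch of the upsert, reusing the options from the initial write:

```scala
// Change the tip on that record and upsert it back into the table
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit

val updatedTrip = snapshotDf
  .filter(s"rowId = '$sampleKey'")
  .drop(snapshotDf.columns.filter(_.startsWith("_hoodie")): _*) // drop Hudi meta columns before writing back
  .withColumn("tipAmount", lit(5.23))

updatedTrip.write
  .format("hudi")
  .options(hudiOptions)                                   // same configs as the initial write
  .option("hoodie.datasource.write.operation", "upsert")  // upsert on the record key
  .mode(SaveMode.Append)                                  // upserts use append mode
  .save(basePath)
```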

You can see that the original value is now updated:
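A quick re-read of the table confirms it (same placeholder names as above):

```scala
// Re-read the table and confirm the tip was updated in place
spark.read.format("hudi").load(basePath)
  .filter(s"rowId = '$sampleKey'")
  .select("rowId", "tipAmount")
  .show()
```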

Time Travel

With Apache Hudi you can write time travel queries to reproduce what a dataset looked like at a point in time. You can specify the point in time with a commit instant, or a timestamp: https://hudi.apache.org/docs/quick-start-guide#time-travel-query

There are several ways to find a commit instant (query the table details, use the Hudi CLI, or inspect storage). For simplicity, I opened the ADLS browser inside Synapse Studio and navigated to the folder where my data is saved. Inside it is a .hoodie folder containing the commits, named [commit instant].commit. I picked the earliest one for this example.

Here is the query:
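The exact instant will differ in your environment, so treat the value below as a placeholder. A minimal sketch of the time travel read:

```scala
// Time travel read as of a commit instant -- replace the placeholder with the
// earliest [commit instant].commit file name from your .hoodie folder
val firstCommit = "20220714075530111" // placeholder

spark.read.format("hudi")
  .option("as.of.instant", firstCommit)
  .load(basePath)
  .filter(s"rowId = '$sampleKey'")
  .select("rowId", "tipAmount")
  .show() // shows the original tipAmount, before the upsert
```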

Incremental Queries

Apache Hudi enables you to replace your old-school batch data pipelines with efficient incremental pipelines. I can specify an `incremental` query type and ask Hudi for all the records that are new or updated after a given commit or timestamp. (For this example it is just the one row we updated, but feel free to play around with it more.)
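A sketch of that incremental read, reusing the commit instant from the time travel example:

```scala
// Incremental query: everything written after the given commit instant
val incrementalDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", firstCommit)
  .load(basePath)

incrementalDf.select("_hoodie_commit_time", "rowId", "tipAmount").show()
```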

Deletes

A challenging task for most data lakes is handling deletes, especially when dealing with GDPR compliance regulations. Apache Hudi processes deletes quickly and efficiently while offering advanced concurrency control configurations.

First, query the records you want to delete:
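Here I simply reuse the same sample trip from the upsert example:

```scala
// Query the record to delete
val toDelete = spark.read.format("hudi").load(basePath)
  .filter(s"rowId = '$sampleKey'")

toDelete.select("rowId", "tipAmount").show()
```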

Next, set the Hudi write operation to `delete` and write those records as an append:
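A sketch of the delete write, again reusing the options from the initial write:

```scala
// Issue the delete: same append-mode write, but with the delete operation
import org.apache.spark.sql.SaveMode

toDelete
  .drop(toDelete.columns.filter(_.startsWith("_hoodie")): _*) // drop Hudi meta columns
  .write
  .format("hudi")
  .options(hudiOptions)
  .option("hoodie.datasource.write.operation", "delete")  // delete instead of upsert
  .mode(SaveMode.Append)
  .save(basePath)
```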

You can confirm the record was deleted:
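For example:

```scala
// Confirm the record is gone -- this count should return 0
spark.read.format("hudi").load(basePath)
  .filter(s"rowId = '$sampleKey'")
  .count()
```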

Conclusion

Hopefully this provides a guide to getting started with Apache Hudi on Azure. While I focused on Azure Synapse Analytics, it is not the only product in the Azure portfolio where you can use Apache Hudi. Here are some architecture patterns you can consider when developing your data platform on Azure with Apache Hudi:

If this looks exciting and you want to learn more, here are some more materials to get started:

Read More:
