June 18, 2024

How to use Apache Hudi with Databricks

How to use Apache Hudi with Databricks

At Onehouse, we help organizations rapidly implement a Universal Data Lakehouse, unlocking previously untapped potential for analytics and AI. 

Here’s one example of the flexibility that the Universal Data Lakehouse model provides: The continuous ingestion services offered by Onehouse write data in your S3/GCS buckets as Apache Hudi, Delta Lake, and/or Apache Iceberg tables. Onehouse can even simultaneously write in all three formats - thanks to Apache XTable (Incubating), an open source project we originally launched as OneTable with Google and Microsoft.

When a data engineering team evaluates Apache Hudi, some of the unique characteristics that stand out include: 

  • the multi-modal indexing subsystem
  • the incremental processing framework
  • advanced merge-on-read configurations
  • the self-managing timeline, which runs all your table maintenance hands-free

However, over the past few years, there was a period where developers’ table format choices were constrained by their query / processing engine choices. 

Now, though, developers realize they can easily mix and match table formats with query engines using the likes of Apache XTable. So a question we hear more of is, “Can I use Apache Hudi with Databricks?” Some are even pleasantly surprised to hear that the answer is, “Yes, it’s easy!” 

In this blog, we’ll guide you through the steps to use Apache Hudi on Databricks and demonstrate how to catalog the tables within Databricks’ Unity Catalog via Apache XTable. 

Running Hudi on Databricks

Let’s walk through the straightforward process of setting up and configuring Hudi within your Databricks environment. All you need is your Databricks account and you're ready to follow along. We'll cover everything from creating compute instances to installing the necessary libraries, ensuring that you're set up for success.

We'll also explore the essential configurations needed to leverage Hudi tables effectively. With these insights, you'll be equipped to tackle a wide range of data management tasks with ease, all within the familiar Databricks environment.

Prerequisites

  1. A Databricks account. Note that you can work with Databricks community edition to follow along with this blog; you don’t have to have a paid account.

Setup

  1. From your Databricks console, click Compute and then Create Compute
  2. Choose a relevant compute name, such as hudi-on-databricks
  3. Choose a runtime version that is compatible with the Hudi version you are planning to use. For example, Databricks’ 13.3 runtime supports Spark 3.4.1, which is currently supported by Hudi.
  1. Once the compute is created, head into the Libraries tab inside hudi-on-databricks and click Install new.
  2. In the pop-up, choose Maven, then provide the Hudi-Spark maven coordinates - such as org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1.
  1. You need to provide the Databricks runtime environment with the minimal configuration required to work with Hudi tables with Spark. This can be written to the Spark tab under Configuration inside hudi-on-databricks. Paste in the following configurations:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.kryo.registrator org.apache.spark.HoodieSparkKryoRegistrar

  1. Now you have to Confirm the changes, which restarts the compute. Now, any notebook which uses this compute will be able to read and write Hudi tables.

Putting It All Together

Let’s create a notebook based on the example provided in the Hudi-Spark quickstart page. You can also download the notebook from this repository and upload it to Databricks for ease of use.

As we have already updated the compute with the Hudi-Spark bundled jar, it’s automatically added to the classpath. You can start reading and writing Hudi tables as you would in any other Spark environment.

# pyspark
from pyspark.sql.functions import lit, col

tableName = "trips_table"
basePath = "file:///tmp/trips_table"

columns = ["ts","uuid","rider","driver","fare","city"]
data
=[(1695159649087,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A"
,"driver-K",19.10,"san_francisco"),
      (1695091554788,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","
driver-M",27.70 ,"san_francisco"),
      (1695046462179,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","
driver-L",33.90 ,"san_francisco"),
      (1695516137016,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","
driver-P",34.15,"sao_paulo"),
      (1695115999911,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","
driver-T",17.85,"chennai")]
inserts = spark.createDataFrame(data).toDF(*columns)
inserts.show()

While this example demonstrates writing data to the local path inside Databricks, typically, for production, users would write to S3/GCS/ADLS. This pattern requires an additional configuration in Databricks. For steps to enable Amazon S3 reads/writes you can follow this documentation.

hudi_options = {
   'hoodie.table.name': tableName
}

inserts.write.format("hudi"). \
   options(**hudi_options). \
   mode("overwrite"). \
   save(basePath)

Note: Pay close attention to hoodie.file.index.enable being set to false. This enables the use of Spark file index implementation for Hudi; this speeds up listing of large tables, and is required if you're using Databricks to read Hudi tables.

tripsDF = spark.read.format("hudi").option("hoodie.file.index.enable", "false").load(basePath
tripsDF.show()

You can also turn this into a view, to enable you to run SQL queries from the same notebook.

tripsDF.createOrReplaceTempView("trips_table")
%sql
SELECT uuid, fare, ts, rider, driver, city FROM  trips_table WHERE fare > 20.0

(Optional) Using Unity Catalog with Apache XTable

As Databricks’ Unity Catalog supports only Delta Lake table format currently, if there is a need to use Unity Catalog, you have to translate the Hudi table metadata to Delta Lake format using Apache XTable.

In production scenarios, when you’re working with cloud object stores, you can translate the tables and add them to Unity Catalog as mentioned in the XTable documentation.

Conclusion

In conclusion, integrating Apache Hudi into your Databricks workflow offers a streamlined approach to data lake management and processing. By following the steps outlined in this blog, you've learned how to seamlessly configure Hudi within your Databricks environment, empowering you to handle needed data operations with efficiency and confidence.

To learn more about how to mix-and-match table formats and query engines, and the value of doing so, you can download the Universal Data Lakehouse whitepaper today for free.

Authors
No items found.

Read More:

Introducing Onehouse LakeView
Introducing Onehouse Table Optimizer
How to use Apache Hudi with Databricks
Managing AI Vector Embeddings with Onehouse

Subscribe to the Blog

Be the first to read new posts

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
We are hiring diverse, world-class talent — join us in building the future