May 1, 2024

Onehouse Custom Transformations QuickStart

Data transformations, used in both extract, transform, and load (ETL) and extract, load, and transform (ELT) processes, are a pivotal component in building data pipelines within the contemporary data stack. As clients migrate their data from its point of origin to successive layers within the data lakehouse stack, the need arises to enact transformations that optimize data for downstream value generation and swift insights. To facilitate this process, Onehouse has developed an array of no-code transformations, addressing commonplace requirements ranging from struct flattening and array explosion to field masking and datetime inference.

However, situations may well arise where Onehouse users need to apply custom business logic to their data transformations. So Onehouse supports the creation and integration of custom transformers into Onehouse stream captures. This enables any Onehouse stream capture to serve as a comprehensive data pipeline, executing cost-effective and high-performance data transformations seamlessly within the Onehouse-managed lakehouse.

This blog post serves as a QuickStart guide for utilizing these Custom Transformers and explores the myriad advantages they offer to organizations' data pipelines.

What Are Onehouse Custom Transformers?

Onehouse custom transformers allow customers to author their own data transformations as Java code built on the Spark framework, producing new tables with the data structured exactly as they want. These custom transformers execute as part of Onehouse stream captures, which incrementally process data from a source to a destination. As a result, each transformer runs its operations incrementally, operating only on the slice of data currently flowing through the stream capture. A diagram of this can be seen in the figure below.

How to Write Custom Transformers


To write your own transformers, you will need:

  1. Java 8 installed in your development environment
    1. The Onehouse transformers are built with Java 8, which (along with the project's dependencies) is required to compile them
  2. A Java IDE or code editor, such as IntelliJ or VSCode
  3. Gradle or Maven installed in your development environment

Clone this repository to your development environment. This repo provides a template with the Gradle build setup configured and the needed dependencies already set. 

From here, navigate to the /src/main/java/onehouse/ directory. This is where your transformer classes go. An example transformer is already provided there, performing a simple select and group-by operation on the dataset. As the example shows, to create a new transformer you extend the Transformer base class and implement the "apply" method with the given parameters. The parameters are as follows:

  • javaSparkContext - the Java wrapper around your Spark context
  • sparkSession - the entry point to Spark
  • df - the input data from the stream source
  • typedProperties - an extension of the Java Properties class that carries configuration values

From there, you can author any transformation operations you wish inside the apply() function, which returns the transformed dataset as the output of the transformation.
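For illustration, a minimal transformer along the lines of the repo's example (a select followed by a group-by) might look like the sketch below. The class name and the column names ("category", "amount") are hypothetical, and the package and base-class details are assumed to mirror Apache Hudi's Transformer interface, which this style of stream-capture transformer is built around; check the example in the repo for the exact signature.

```java
// Sketch of a custom transformer, assuming the template's Transformer
// base class mirrors Apache Hudi's Transformer interface. The class
// name and column names ("category", "amount") are hypothetical.
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

public class SumByCategoryTransformer implements Transformer {

    @Override
    public Dataset<Row> apply(JavaSparkContext javaSparkContext,
                              SparkSession sparkSession,
                              Dataset<Row> df,
                              TypedProperties typedProperties) {
        // df holds only the incremental slice of data currently moving
        // through the stream capture, so this aggregation is per-batch.
        return df.select(col("category"), col("amount"))
                 .groupBy(col("category"))
                 .agg(sum(col("amount")).as("total_amount"));
    }
}
```

Once the jar containing this class is built and uploaded, the transformer becomes selectable in the Onehouse console like any other transformer in the jar.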

Tests can be written in test files under /src/test/. Here, you can create sample datasets or schemas in the resources folder, and your custom transformer implementations can be exercised following the sample test function shown there. You can also author new test classes for any additional transformations you create.
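A unit test can follow the same pattern as the sample test in the repo: spin up a local-mode SparkSession, build a small in-memory dataset, run the transformation, and assert on the result. A hedged sketch follows; the dataset contents and column names are hypothetical, and for brevity the select/group-by logic is inlined rather than invoked through a transformer class (in practice you would call your transformer's apply()).

```java
// Sketch of a unit test for transformation logic using a local-mode
// SparkSession; dataset contents and column names are hypothetical.
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

public class TransformerLogicTest {

    public static void main(String[] args) {
        // Local mode: no cluster required for unit tests.
        SparkSession spark = SparkSession.builder()
                .appName("transformer-test")
                .master("local[1]")
                .getOrCreate();

        StructType schema = new StructType()
                .add("category", DataTypes.StringType)
                .add("amount", DataTypes.LongType);

        List<Row> rows = Arrays.asList(
                RowFactory.create("a", 1L),
                RowFactory.create("a", 2L),
                RowFactory.create("b", 5L));

        Dataset<Row> input = spark.createDataFrame(rows, schema);

        // The transformation under test: select + group-by, as in the
        // repo's example transformer.
        Dataset<Row> output = input.select(col("category"), col("amount"))
                .groupBy(col("category"))
                .agg(sum(col("amount")).as("total_amount"));

        // Two distinct categories should yield two aggregated rows.
        if (output.count() != 2) {
            throw new AssertionError("expected 2 aggregated rows");
        }

        spark.stop();
    }
}
```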

Building and Using the Transformers

You can use a Gradle- or Maven-based build environment to build the custom transformer code and upload it to the Onehouse console for use in your stream captures. The example transformer in the linked GitHub repository uses Gradle. To build, install Gradle using the process linked here, then navigate to the root directory of the project and run the following command:

./gradlew clean build

This will create a build directory in the root of the project, with the jar file located at build/libs/OnehouseCustomTransformersQuickStart-1.0-SNAPSHOT.jar.

To upload this jar file for use by Onehouse, navigate to Settings -> Integrations in the Onehouse console and click "Manage Jars".

Here, you can upload the jar (either from your computer or from a location in an S3/GCS bucket). Any transformers that are in that jar file will now be available in the Onehouse Console. 

Executing Transformations

To execute Onehouse custom transformations, add your transformers to an Onehouse stream capture, as follows:

  1. Create a new stream capture via the UI
  2. When setting up the new stream capture, click “Configure”
  3. You will see a section called Transformations. Find the desired transformation function in the dropdown and click the "+ Add Transform" button
  4. Click “Start Capturing” to start the stream capture

Now, the transformation will automatically execute as a part of the stream capture process. 

Cost Savings

Onehouse custom transformers allow customers to run their ETL/ELT transformations on the same shared compute as their ingestion and data optimization workloads. Transformations execute inside the specially bin-packed Spark jobs that Onehouse has developed: ingestion, transformation, and optimization jobs are packed together so that no compute sits idle. An illustration of this process is shown in the figure below.


In this blog, we saw how Onehouse customers can use custom transformers to execute business logic in a way that is easy, performant, and cost-effective. In addition to these transformers, Onehouse also offers a robust set of no-code transformers, including CDC materialization, JSON parsing, and many others. Together, these capabilities provide end-to-end ETL/ELT on top of the Onehouse Universal Lakehouse. If you are interested in learning more, please reach out to us, or sign up for a free trial with $1,000 in credits.
