July 7, 2023

Integrating Onehouse with the Amazon Sagemaker Machine Learning Ecosystem


Introduction 

Companies around the world are leveraging machine learning (ML) to enhance their business capabilities. Many rely on ML ecosystems provided by software vendors to accelerate the training and development work of their data scientists and machine learning engineers. Popular ecosystems include Amazon Sagemaker, Azure ML, Google Cloud Vertex AI, and IBM Watson, among others. However, to use these ecosystem services effectively, firms must make their large-scale data accessible to them.

Onehouse is a solution to this problem. It is a lakehouse-as-a-service offering that takes only an hour or two to launch. Onehouse is open, seamlessly interoperable with all lakehouse platforms, fast, and resource-efficient. This saves engineering time, speeds the start and completion of projects, and reduces project costs. Using Onehouse also allows teams to stay out of the competitive back-and-forth between major vendors that is currently pulling the industry in different directions.

In this blog, we will explore how customers can connect data stored in Onehouse to the Amazon Sagemaker ecosystem. Customers can use this model to have Onehouse serve as the data bedrock of their machine learning needs while simultaneously taking advantage of the powerful computing capabilities and ML services offered by Sagemaker.

Onehouse Managed Lakehouse

The Onehouse solution is designed to help customers build, implement, and run a modern data lakehouse in a fast and cost-effective way. It turns data lakes into fully managed lakehouses by bringing data warehouse functionality to the data lake (and all of the user’s data stores). Onehouse also automates data management services like ingestion, performance optimization, indexing, observability and data quality management. Apache Hudi is the original, ground-breaking open-source Lakehouse technology that powers Onehouse, and has been chosen by AWS and others for their own use. For more detailed information, see: https://www.onehouse.ai/product.

Amazon Sagemaker

Amazon Sagemaker is AWS’s machine learning ecosystem. It allows data scientists and ML engineers to build, train, and deploy ML models for any use case on top of fully managed infrastructure, tools, and workflows. For this blog, we will take advantage of Sagemaker notebooks. These are Jupyter notebooks that run on top of AWS’s machine learning-optimized compute instances. Sagemaker also offers a wide array of additional features that you can explore beyond the example provided in this blog.

Solution Architecture

In this example use case, we will take a timeseries dataset of electric production data loaded into a data lakehouse powered by Onehouse and connect it to an Amazon Sagemaker Notebook Instance. From there, we will train a forecasting model that can be used to forecast future electricity production needs. Our goal is to show how a small dataset can be connected from the Onehouse lakehouse into Amazon’s Sagemaker ecosystem for training and testing machine learning models. Using this example as a blueprint, customers can extend their own ML capabilities to the wide array of datasets and models that will unlock business value at their respective organizations.

Prerequisites:

  1. An active AWS account
  2. An Amazon Sagemaker Jupyter Notebook
  3. Onehouse Managed Lakehouse configured for AWS
  4. Data for use in ML model training loaded into your Onehouse lakehouse

Architecture diagram: Onehouse integration with Sagemaker

In this solution, raw electric production data is ingested into the Onehouse data lakehouse via Onehouse’s managed S3 ingestion capability. Once loaded into Apache Hudi tables, the data is available for query by Amazon Athena, which serves it to our Amazon Sagemaker notebook. Sagemaker is then used for training and testing the machine learning models.

In the ingestion step, we loaded our electric production data into a Hudi table named ml_table via Onehouse’s S3 managed ingestion. This is synced with the AWS Glue Data Catalog to provide query capabilities.

Onehouse Hudi Table - ML Data

Integrating Onehouse with the Sagemaker notebooks

The open nature of Apache Hudi tables means that Hudi is supported natively by a wide variety of query engines available today. In this blog, we chose Amazon Athena, which is serverless and easy to use, as our query layer. Onehouse provides built-in support for the AWS Glue Data Catalog, enabling seamless integration with Amazon Athena for powerful query capabilities on your data.

Sagemaker can use the data returned by Athena queries to build dataframes, which are then used to train machine learning (ML) models. The steps below show how we executed this using the following tools:

  1. Sagemaker Jupyter Notebook
  2. Python Boto3 API for Amazon Athena

Code for querying Hudi tables with Athena

The code above shows us using Athena to query our Hudi tables and loading the data from this query into a dataframe (df). Once loaded into a pandas dataframe, we can begin the process of preparing and training our forecast models on this data.
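The original code appears as a screenshot in the post; as a reference, a minimal sketch of the same pattern using the boto3 Athena client and pandas might look like the following. The Glue database name, Athena results bucket, and column names are assumptions for illustration, so substitute your own values (reading the result CSV directly from S3 with pandas requires the s3fs package).

import time

import boto3
import pandas as pd

# Illustrative values - replace with your own Glue database and S3 results location.
ATHENA_DATABASE = "onehouse_db"
ATHENA_OUTPUT = "s3://your-athena-results-bucket/queries/"

athena = boto3.client("athena")

# Submit a query against the Hudi table synced to the Glue Data Catalog.
response = athena.start_query_execution(
    QueryString="SELECT * FROM ml_table",
    QueryExecutionContext={"Database": ATHENA_DATABASE},
    ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Athena writes query results as a CSV to the output location; load it into pandas.
df = pd.read_csv(f"{ATHENA_OUTPUT}{query_id}.csv")
df.head()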

Training timeseries data - ARIMA model

To better understand the data we are working with, we start by plotting it to observe any potential trends or seasonality.
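As a sketch, assuming the dataframe df from the query above contains a date column and an electric production column (names here are illustrative), the plot can be produced with pandas and matplotlib:

import matplotlib.pyplot as plt
import pandas as pd

# Column names are assumptions - adjust them to match your table schema.
df["record_date"] = pd.to_datetime(df["record_date"])
series = df.set_index("record_date")["electric_production"].sort_index()

series.plot(figsize=(12, 4), title="Electric production over time")
plt.show()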

Dataset Trend Plot

The above graph shows an evident upward trend with annual seasonality, indicating a correlation between electricity generation and the varying demand in each season. As a result, we will use a Seasonal Autoregressive Integrated Moving Average (SARIMA) model, because it provides an accurate timeseries forecast while also taking into account seasonal changes in the data.

Since our dataset records the electric production on the first day of every month, we will set the seasonal period to 3, indicating that there are 3 months in every season. To run an ARIMA model, we will have to determine the following hyperparameters:

p is the number of autoregressive terms

d is the amount of differencing that needs to be done to stationarize the dataset

q is the number of lagged forecast errors in the prediction equation

Determine d:

This is the number of differences needed to stationarize the dataset; that is, to remove any long-term variation in statistical properties, so that we can focus on other trends within the data. We run the below code to determine this parameter.
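A minimal sketch of that step, assuming the series prepared above and using statsmodels for an Augmented Dickey-Fuller stationarity check, might look like:

import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller

# Plot the original series alongside its first and second differences.
fig, axes = plt.subplots(3, 1, figsize=(12, 8), sharex=True)
series.plot(ax=axes[0], title="Original series")
series.diff().plot(ax=axes[1], title="1st order differencing")
series.diff().diff().plot(ax=axes[2], title="2nd order differencing")
plt.tight_layout()
plt.show()

# An ADF p-value below 0.05 indicates the differenced series is stationary.
print("ADF p-value, 1st difference:", adfuller(series.diff().dropna())[1])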

Differencing plots

We can see that the data stationarizes after 1 order of differencing, hence d=1. Next, we will determine the p and q values. Below are the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots; the PACF suggests the autoregressive order (p) and the ACF suggests the moving-average order (q).
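These plots can be generated with statsmodels on the first-differenced series (a sketch, assuming the series from the earlier steps):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

differenced = series.diff().dropna()

# ACF and PACF of the stationarized series guide the choice of q and p.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(differenced, ax=axes[0])
plot_pacf(differenced, ax=axes[1])
plt.show()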

Autocorrelation (ACF) and Partial Autocorrelation (PACF) Plots

Examining the partial autocorrelation function (PACF) plot, a prominent spike is observed at lag 1, indicating a starting autoregressive order of p = 1. Similarly, the gradual decay in the autocorrelation function (ACF) plot suggests starting with a non-zero moving-average order, which in this case is set to q = 1. These values provide the foundation for selecting the appropriate parameters in our model.

Now, we will begin training our SARIMA model using the following hyperparameters: 

p, P = 1
q, Q = 1
d, D = 1
m = 3

The below code will be used to train the model:

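The original code is shown as a screenshot in the post; a minimal equivalent sketch using statsmodels' SARIMAX, fitted on the series prepared earlier, could look like this:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# SARIMA(1, 1, 1) x (1, 1, 1, 3): the (p, d, q) and seasonal (P, D, Q, m)
# hyperparameters chosen above.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 3))
results = model.fit(disp=False)
print(results.summary())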

We will now take a look at the residuals of the model after training:

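A sketch of this residual check, again using statsmodels and assuming the fitted results object from the training step above:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# ACF and PACF of the model residuals; remaining structure here would suggest
# the model has not captured all of the signal in the data.
residuals = results.resid.dropna()
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(residuals, ax=axes[0])
plot_pacf(residuals, ax=axes[1])
plt.show()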
Model Residuals ACF and PACF Plots

It is evident that no significant autocorrelation remains in the residuals, in either the autocorrelation or partial autocorrelation plots. With this observation, data scientists can proceed with model testing, further refinement, and tuning techniques to optimize their models for production readiness.

Conclusion 

We are excited to showcase the features of Onehouse that accelerate the machine learning (ML) capabilities of our customers. By connecting data from the Onehouse product to the Amazon Sagemaker Machine Learning Ecosystem, data scientists can use the lakehouse as a centralized store for training and testing data. Having a single source of truth for machine learning data means that models can improve continuously from the near real-time data inflow offered by Onehouse, a powerful addition to the ML models and AI capabilities that our customers are already building.

Through advances in cost optimization, openness, time to value, and now AI/ML integration, we take great pride in being able to deliver significant business value to Onehouse customers on their journey to a unified data lakehouse. 

If you want to learn more about Onehouse and would like to give it a try, please visit the Onehouse listing on the AWS Marketplace or contact gtm@onehouse.ai.
