Crafting a solid foundation for a data lakehouse begins with acknowledging its roots in cloud storage: a sprawling collection of files. Without proper file optimization, tasks such as retrieving specific datasets, applying filters, or joining dimensional data may require inefficient, repetitive scans through files, like searching for a needle in a haystack.
Ensuring optimal file sizes, clustering related data, strategically partitioning data, and other data locality techniques are not merely technical minutiae; they hold the key to unlocking the true performance potential of a Universal Data Lakehouse. Lakehouse metadata layers such as Apache Hudi, Apache Iceberg, or Delta Lake help by adding abstraction layers around files, making some of these optimizations possible.
If you are using Hudi, Iceberg, or Delta, you have likely spent countless hours monitoring and trial-and-error tuning a wide array of configurations. Even after you complete a point-in-time optimization, your teams still bear the ongoing operational burden of scheduling and running these optimizations, diagnosing failures, and managing the underlying infrastructure.
At Onehouse, we have decades of collective experience pioneering and operating these advanced optimizations at scale for some of the largest data lakes on the planet. We are excited to gather many of these innovations into a brand-new product: Onehouse Table Optimizer.
Table Optimizer is a solution designed to streamline Apache Hudi table management. It empowers data engineers to focus on core business logic while ensuring optimal table performance and reliability.
Table Optimizer runs alongside the pipelines you already have in place, writing to the same tables and applying optimizations that improve performance and reduce cost. This blog post describes some of the engineering design and implementation challenges we faced in creating Table Optimizer, and how we solved them.
Data lakehouse architecture combines the best features of a data lake and a data warehouse. The term "table services" represents a set of functionalities and tools that manage and optimize the data stored in the data lakehouse. Some examples of these functionalities include data compaction, data clustering, data cleanup, and metadata sync. (For more depth on these and other Hudi topics, see our free ebook, Apache Hudi: From Zero to One.) Hudi offers a complete spectrum of table services for Hudi tables.
Hudi also offers different ways to deploy these table services, ranging from a completely inline model, to just scheduling inline (async mode), to a fully standalone model. Most users in the open source community deploy these table services inline, to keep things simple and to avoid the overhead of managing complicated multi-writer concurrency control methods.
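As a concrete illustration, here is a minimal sketch of how these deployment modes are selected through Hudi writer options in Spark; the table name, operation, and path are placeholders. Leaving the inline options false while setting the scheduling options true corresponds to the "just scheduling inline" model, with a separate process executing the plans later:

```scala
// A minimal sketch: schedule compaction/clustering inline, execute elsewhere,
// and run cleaning asynchronously. Table name and path are placeholders.
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeWithAsyncServices(df: DataFrame, basePath: String): Unit = {
  df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.operation", "upsert")
    // Do not execute compaction/clustering inside the ingestion writer...
    .option("hoodie.compact.inline", "false")
    .option("hoodie.clustering.inline", "false")
    // ...but do create the plans inline, for a separate process to execute.
    .option("hoodie.compact.schedule.inline", "true")
    .option("hoodie.clustering.schedule.inline", "true")
    // Run cleaning asynchronously instead of blocking each commit.
    .option("hoodie.clean.async", "true")
    .mode(SaveMode.Append)
    .save(basePath)
}
```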
Hudi users face a few challenges when using inline/async table services:

- Table services that run inline add to ingestion latency and make it unpredictable.
- A failure in a table service task, such as an out-of-memory error, can block the ingestion pipeline.
- Packing ingestion and table services onto the same resources leads to poor utilization and higher costs.
Offloading table services onto a separate set of resources mitigates all of the above challenges at once. For example, clustering is normally a time-consuming task. By running clustering tasks outside the ingestion pipeline, ingestion latency can be dramatically reduced and made more predictable. Meanwhile, any issue with a clustering task, such as a memory problem, will not block ingestion. As a result, more ingestion jobs can be scheduled concurrently to fully utilize the available resources, reducing costs.
At Onehouse, we have developed an independent table service management component that schedules and executes table services asynchronously, without blocking the ingestion writers. This frees pipeline owners to focus on data ingestion. (When integrated with ingestion services within the Onehouse Cloud product, this component also frees pipeline owners from tuning the performance of these table services for different workloads.) This component is Table Optimizer.
Table Optimizer is executed inside a Spark application, deployed through Kubernetes on the customer's AWS EKS or GCP GKE cluster. The main logic of Table Optimizer runs inside the Spark driver, and each individual table service task is executed in Spark executors. Using Spark supports our goal of delivering system scalability and reliability.
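For example, driver-side code can schedule and execute a compaction through Hudi's Spark write client: the plan is created on the driver, while the actual file rewriting runs as Spark jobs on the executors. A rough sketch follows; exact APIs vary across Hudi releases, and the path and table name here are placeholders:

```scala
// Sketch of driver-side orchestration of a Hudi table service. The calls run
// on the driver, but they launch Spark jobs that do the work on executors.
import org.apache.hudi.client.SparkRDDWriteClient
import org.apache.hudi.client.common.HoodieSparkEngineContext
import org.apache.hudi.common.util.{Option => HOption}
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

def compactOnce(spark: SparkSession, basePath: String): Unit = {
  val ctx    = new HoodieSparkEngineContext(new JavaSparkContext(spark.sparkContext))
  val config = HoodieWriteConfig.newBuilder().withPath(basePath).forTable("events").build()
  val client = new SparkRDDWriteClient(ctx, config)
  try {
    // Schedule a compaction plan (driver-side), then execute it (executor-side).
    val instant: HOption[String] = client.scheduleCompaction(HOption.empty())
    if (instant.isPresent) client.compact(instant.get())
  } finally client.close()
}
```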
Table Optimizer serves as the orchestration layer for Hudi table service tasks. Table service tasks for one table are independent of those for other tables. Meanwhile, due to the thoughtful design of Hudi, table service tasks for a specific Hudi table can be executed concurrently. Therefore, the architecture of Table Optimizer is essentially a task-dispatcher system, which can easily achieve high throughput and reliability. The following diagram shows the architecture of Table Optimizer.
A metastore stores the table specifications. Whether a table is under the management of Table Optimizer is recorded in a manifest file stored in a distributed file system; customers update the manifest file through the Onehouse Cloud user interface.
Table Service Manager: Table Service Manager maintains a set of handles to the tables under management; each handle contains the information needed to schedule and execute table services, such as write configurations, table status, and task status. Table Service Manager periodically refreshes these handles by re-fetching the manifest file. It is also responsible for selecting active tables and scheduling tasks for each table, based on business logic and Hudi internal constraints.
Coordinator Task: The Coordinator task processes active tables and schedules specific table service tasks for each of them. Before each scheduling operation, the Hudi write client is refreshed with the latest table configurations and timeline information.
Table Service Tasks: The major table service categories include compaction, clustering, cleaning, timeline archive, and metadata sync. More categories can be added easily.
Resource Manager: Resource Manager consolidates the allocation and release of all resources used by Table Optimizer, such as thread pools and the Hudi timeline service, preventing resource leaks at runtime. The sketch after this list illustrates one way these components could be modeled.
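To make the component responsibilities concrete, here is a hypothetical sketch of the data model such a system might use; all names are illustrative, not Onehouse's actual classes:

```scala
// Illustrative component model; names are hypothetical placeholders.
import java.util.concurrent.{ExecutorService, Executors, TimeUnit}

// Table status, refreshed from the manifest file.
sealed trait TableStatus
case object Active  extends TableStatus
case object Paused  extends TableStatus
case object Stopped extends TableStatus

// Table Service Manager keeps one handle per managed table.
final case class TableHandle(
  tableId: String,
  writeConfigs: Map[String, String], // Hudi write configurations
  status: TableStatus,
  runningTaskIds: Set[String]        // task status tracking
)

// The major table service categories; new categories are easy to add.
sealed trait TableServiceTask { def tableId: String }
final case class CompactionTask(tableId: String, instantTime: String) extends TableServiceTask
final case class ClusteringTask(tableId: String, instantTime: String) extends TableServiceTask
final case class CleaningTask(tableId: String) extends TableServiceTask
final case class TimelineArchiveTask(tableId: String) extends TableServiceTask
final case class MetadataSyncTask(tableId: String, catalog: String) extends TableServiceTask

// Resource Manager idea: a loan pattern guarantees that pooled resources
// are released even when a task throws.
def withTaskPool[T](threads: Int)(body: ExecutorService => T): T = {
  val pool = Executors.newFixedThreadPool(threads)
  try body(pool)
  finally {
    pool.shutdown()
    pool.awaitTermination(30, TimeUnit.SECONDS)
  }
}
```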
After Resource Manager allocates resources, Table Service Manager creates or updates its table handles by fetching the latest manifest file. For active tables, it submits a Coordinator task instance into the task pool.
Table Optimizer ensures that only one Coordinator task runs for a given table at any time, so that all table service tasks for that table are scheduled in a proper order and data inconsistency issues are avoided. Generally, the Coordinator task first checks whether any compaction or clustering tasks have failed and, if so, reschedules them. It then checks whether any new compaction, clustering, cleaning/archival, and/or metadata sync tasks need to be scheduled and, if so, schedules them in the proper order.
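A minimal sketch of that per-table coordination loop might look like the following; the helper methods are hypothetical stand-ins for the real scheduling logic, stubbed out so the sketch compiles:

```scala
// Hypothetical coordinator loop; helper bodies are stubbed placeholders.
import java.util.concurrent.ConcurrentHashMap

object Coordinator {
  private val running = ConcurrentHashMap.newKeySet[String]()

  private def refreshWriteClient(tableId: String): Unit = ()                   // latest configs + timeline
  private def rescheduleFailedCompactionAndClustering(tableId: String): Unit = ()
  private def scheduleNewServicesInOrder(tableId: String): Unit = ()           // compaction, clustering,
                                                                               // cleaning/archival, metadata sync
  def coordinate(tableId: String): Unit = {
    // Single-flight guard: at most one Coordinator task per table at any time.
    if (!running.add(tableId)) return
    try {
      refreshWriteClient(tableId)
      rescheduleFailedCompactionAndClustering(tableId) // failed tasks first
      scheduleNewServicesInOrder(tableId)              // then any new tasks
    } finally running.remove(tableId)
  }
}
```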
All scheduled table service tasks are executed asynchronously. A distributed locking mechanism and Hudi's internal concurrency control work together to ensure the expected system behavior.
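In Hudi, this is configured through optimistic concurrency control plus a pluggable lock provider. Here is a minimal sketch of the relevant writer options, assuming a ZooKeeper-based lock; the host, port, key, and paths are placeholders:

```scala
// Multi-writer safety: optimistic concurrency control + an external lock.
// The ZooKeeper connection values below are placeholders.
val concurrencyOpts = Map(
  "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
  "hoodie.cleaner.policy.failed.writes" -> "LAZY", // required for multi-writer setups
  "hoodie.write.lock.provider" ->
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
  "hoodie.write.lock.zookeeper.url" -> "zk-host",
  "hoodie.write.lock.zookeeper.port" -> "2181",
  "hoodie.write.lock.zookeeper.lock_key" -> "events",
  "hoodie.write.lock.zookeeper.base_path" -> "/hudi/locks"
)
```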
For tables whose table services are paused, Table Service Manager removes their scheduled tasks and kills their running tasks. Tables whose status is set to stopped are removed from the manifest file.
Beyond common implementation challenges such as scalability and reliability, here are a couple of the interesting challenges we encountered while implementing and deploying Table Optimizer:
We continually seek to deliver a number of concrete benefits to those who use Table Optimizer (and the parallel functionality in Onehouse Cloud):
Table Optimizer is available today, serving Hudi users who want a managed service to handle time-consuming tasks such as data compaction, clustering, and cleanup. Interoperability with Apache Iceberg and Delta Lake is available via Apache XTable (Incubating).
For a free trial of Table Optimizer, or to learn more, contact Onehouse.