July 20, 2023

Apply Pre-Commit Validation for Data Quality in Apache Hudi 

Data quality is the process of ensuring that data is accurate, consistent, complete, and reliable. It is essential in data engineering to ensure that the data used for analysis and decision-making can be trusted. Organizations utilize various techniques such as data validation, cleansing, profiling, and monitoring to identify and resolve issues, thereby maintaining data integrity and achieving high-quality data.

In this blog, we will explore how Apache Hudi's pre-commit validation feature can be leveraged to ensure data quality by validating data before it is committed to the storage system.

Importance of data quality in a data lakehouse

  • Data quality is crucial for accurate decision-making, trust, and credibility in data and insights, operational efficiency, cost savings, compliance, customer satisfaction, data integration, and long-term data value.
  • High-quality data enables informed and confident decision-making, leading to better outcomes and insights.
  • Poor data quality introduces inefficiencies, inaccuracies, and additional costs throughout the project lifecycle.
  • Compliance with regulations and data protection requires accurate and reliable data, ensuring legal compliance and mitigating risks.
  • Customer satisfaction is directly influenced by data quality, as it affects the accuracy of interactions and personalized experiences.
  • Data integration and interoperability rely on high-quality data to ensure smooth data exchange and sharing between different systems, platforms, and stakeholders.

Data quality using pre-commit validators in Apache Hudi

Apache Hudi, originally developed at Uber in 2016 and open-sourced in 2017, is a data lakehouse technology that is now an Apache open source project. It has gained widespread adoption and contributions from major enterprises like Uber, Amazon, Walmart, GE, and others. Hudi brings database-like features to data lakes, paving the way for the data lakehouse paradigm. With Hudi, users can perform updates and deletions on their data lakes with transactional consistency and high performance.

Apache Hudi provides pre-commit validators that allow users to validate their data against specific data quality expectations during the writing process using DeltaStreamer or Spark Datasource writers. These validators ensure the integrity and quality of the data being written.

To configure these validators, users can use the hoodie.precommit.validators setting, which accepts a comma-separated list of validator class names. This configuration provides flexibility for customizing the data quality validation process according to specific requirements.

The pre-commit validators in Apache Hudi enable users to enforce data quality checks such as uniqueness constraints, schema compliance, data consistency, and adherence to business rules. By performing these checks before committing the data to a new version, only high-quality data is written to the storage system.

By leveraging this feature, users can choose from a range of built-in validators or create their own custom validators to meet their specific data quality needs. This ensures that the written data meets the desired quality standards and enhances the reliability and accuracy of the overall data lake or data warehousing system.

To apply multiple validators, list their class names separated by commas. For example, setting hoodie.precommit.validators to ValidatorClass1,ValidatorClass2 specifies two validator classes, ValidatorClass1 and ValidatorClass2, both of which are executed during the pre-commit phase.
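As a minimal sketch of that setting, the Python helper below builds the option from a list of class names. The helper name validators_option is hypothetical, and ValidatorClass1 and ValidatorClass2 are the placeholder names used above; a real configuration would use fully qualified validator class names.

```python
def validators_option(*class_names: str) -> dict:
    """Build the Hudi write option that lists pre-commit validators."""
    # Drop surrounding whitespace and ignore empty entries.
    cleaned = [name.strip() for name in class_names if name.strip()]
    if not cleaned:
        raise ValueError("at least one validator class name is required")
    # Hudi expects a single comma-separated string of class names.
    return {"hoodie.precommit.validators": ",".join(cleaned)}


# Placeholder class names; real ones would be fully qualified.
opts = validators_option("ValidatorClass1", "ValidatorClass2")
print(opts)
# {'hoodie.precommit.validators': 'ValidatorClass1,ValidatorClass2'}
```

The resulting dictionary can be merged into the other write options passed to the writer.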

SQL-based pre-commit validation

Apache Hudi is bundled with a pre-implemented validator called SqlQuerySingleResultPreCommitValidator, which validates that a SQL query on a table produces a specific single value result. Users can provide multiple queries, separated by a ';' delimiter, and the expected result is included as part of each query, separated by '#'. This feature allows users to verify specific conditions or expectations on the data. Ensure that the queries used for validation are not terminated by a semicolon, as the semicolon is used as a separation character for multiple queries.
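The query format can be sketched with a small Python helper that joins (query, expected result) pairs into one string. This is an illustration under assumptions: the helper name single_value_queries is hypothetical, the option key hoodie.precommit.validators.single.value.sql.queries and the <TABLE_NAME> placeholder follow Hudi's configuration documentation as I understand it, and the second query is an invented example.

```python
QUERIES_KEY = "hoodie.precommit.validators.single.value.sql.queries"


def single_value_queries(checks):
    """Join (query, expected_result) pairs into 'query#expected;query#expected'."""
    parts = []
    for query, expected in checks:
        query = query.strip()
        if ";" in query:
            # ';' separates queries, so it must not appear inside or after one.
            raise ValueError(f"query must not contain ';': {query!r}")
        parts.append(f"{query}#{expected}")
    return {QUERIES_KEY: ";".join(parts)}


option = single_value_queries([
    ("select count(*) from <TABLE_NAME> where name is null", 0),
    ("select count(*) from <TABLE_NAME> where id is null", 0),
])
print(option[QUERIES_KEY])
# select count(*) from <TABLE_NAME> where name is null#0;select count(*) from <TABLE_NAME> where id is null#0
```

Rejecting any semicolon inside a query keeps the helper from producing a string that would be split in the wrong place.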

For example, a validation can restrict the insertion of null values for the "name" column by using a Spark SQL query that counts the null values in that column and expects a result of 0.
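A minimal sketch of the write options for such a check, assuming a PySpark Datasource write. The table name, record key field, and precombine field are hypothetical, and the write call appears only as a comment because it needs a Spark session with the Hudi bundle available.

```python
# Hypothetical table and field names, used only for illustration.
TABLE_NAME = "customers"

hudi_options = {
    "hoodie.table.name": TABLE_NAME,
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Built-in validator that compares a query's single result to an expected value.
    "hoodie.precommit.validators":
        "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator",
    # The count of rows with a null "name" must be 0; no trailing semicolon.
    "hoodie.precommit.validators.single.value.sql.queries":
        "select count(*) from <TABLE_NAME> where name is null#0",
}

# With a DataFrame `df` and a target path `base_path` (both assumed), the
# write would look roughly like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

If the batch being written contains a row whose "name" is null, the validator query returns a non-zero count and the commit is rejected.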

In this example, a write that includes a row with a null "name" fails with the error message "At least one pre-commit validation failed." The validator query counts the null values in the "name" column and returns 1, which does not match the expected result of 0.

The error shows that the validation condition is not met, so the insertion of null values in the "name" column is rejected as intended.
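The comparison itself can be reproduced outside Hudi. The sketch below emulates the single-result check with Python's built-in sqlite3 module purely for illustration; it is not Hudi's implementation, and the table, rows, helper name, and error text are hypothetical.

```python
import sqlite3


def check_single_result(conn, table, query_with_expected):
    """Run 'query#expected' against `table` and raise if the result differs."""
    query, _, expected = query_with_expected.rpartition("#")
    sql = query.replace("<TABLE_NAME>", table)
    actual = conn.execute(sql).fetchone()[0]
    if str(actual) != expected.strip():
        raise ValueError(f"validation failed: expected {expected}, got {actual}")


# In-memory stand-in for the table state after the write; one row has a null name.
conn = sqlite3.connect(":memory:")
conn.execute("create table customers (id integer, name text)")
conn.executemany("insert into customers values (?, ?)", [(1, "Alice"), (2, None)])

try:
    check_single_result(
        conn, "customers", "select count(*) from <TABLE_NAME> where name is null#0"
    )
    outcome = "passed"
except ValueError as err:
    outcome = f"failed: {err}"

print(outcome)
# failed: validation failed: expected 0, got 1
```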

Here's another comprehensive example that implements the following rules:

  • The "Name" field, if present, must start with a capital letter.
  • The "PhoneNumber" field should be mandatory and contain exactly 10 characters and only numbers.
  • The "Email" field is not mandatory but must contain the "@" symbol and a dot.
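The three rules above can be sketched as plain Python predicates. This illustrates the rule logic only and is not Hudi's validator API; the function names are hypothetical, and "capital letter" is assumed to mean an ASCII letter from A to Z.

```python
import re

NAME_RE = re.compile(r"[A-Z]")        # applied at the start of the string
PHONE_RE = re.compile(r"[0-9]{10}")   # applied to the whole string


def name_ok(name):
    # Optional field: a missing value passes; otherwise it must start with A-Z.
    return name is None or NAME_RE.match(name) is not None


def phone_ok(phone):
    # Mandatory field: exactly 10 characters, all of them digits.
    return phone is not None and PHONE_RE.fullmatch(phone) is not None


def email_ok(email):
    # Optional field: if present, it must contain both "@" and ".".
    return email is None or ("@" in email and "." in email)


def violations(record):
    """Return the names of the fields in `record` that break a rule."""
    checks = [("Name", name_ok), ("PhoneNumber", phone_ok), ("Email", email_ok)]
    return [field for field, ok in checks if not ok(record.get(field))]


print(violations({"Name": "Alice", "PhoneNumber": "1234567890", "Email": "a@b.com"}))
# []
print(violations({"Name": "bob", "PhoneNumber": "12345", "Email": "nope"}))
# ['Name', 'PhoneNumber', 'Email']
```

In a Hudi validator, the same conditions would be expressed over the written dataset and any violation would fail the commit.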

Defining a custom pre-commit validator

Apache Hudi offers the SparkPreCommitValidator class, which users can extend to define custom pre-validators based on their specific use cases. By extending this class, users can create their own validation logic to ensure data quality before the commit operation.

The SparkPreCommitValidator class serves as an abstract class that users can subclass to implement their custom pre-validators. The key method to override is validateRecordsBeforeAndAfter, which takes two Dataset<Row> parameters representing the data before and after the commit, and a Set<String> parameter indicating the affected partitions.

With the ability to extend the SparkPreCommitValidator class, users have the flexibility to incorporate their own data quality checks, apply custom business rules, and implement specific validation conditions. This empowers users to create pre-validators that align with their unique requirements.

To utilize the custom pre-validator, users can reference their implementation in the configuration by specifying the class name when configuring pre-commit validators using the hoodie.precommit.validators property.

By leveraging the SparkPreCommitValidator class, users can define custom pre-validators that cover data quality checks the built-in validators do not, so that the validation process matches their specific use cases and quality expectations.

Conclusion

By configuring and utilizing these pre-commit validators in Apache Hudi, users can enforce data quality expectations and ensure the integrity, consistency, and correctness of the data being written to their data lakehouses.

These pre-commit validators provide users with powerful tools to validate their data and ensure it meets the desired quality standards, enhancing the reliability and accuracy of their data-driven applications and analytics processes.

We hope that this blog will assist users in implementing pre-commit validators for data quality checks in their lakehouse data pipelines. We value your feedback and encourage you to share your thoughts with us. Feel free to engage with us in the Hudi community and join our Slack channel.
