In Hudi 0.10 we've introduced support for advanced data layout optimization techniques such as Z-order and Hilbert Space Filling curves (as new clustering algorithms) that unlocked the power of data skipping even in complex scenarios where a large table is frequently queried with filters on multiple columns rather than a single one.
But what is data skipping actually?
Data Skipping as a technique rose in popularity along with the scale of the data being stored in data lakes. Data Skipping is essentially a common term for various types of indexes that enable query engines to effectively skip data irrelevant to the query being executed, reducing the amount of data scanned and processed and (potentially) significantly improving execution time. Let's take as an example a simple non-partitioned Parquet table "sales" storing records with a schema like the following:
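The original schema isn't reproduced here; the sketch below is a hypothetical stand-in for what such a "sales" table might look like (all field names are illustrative, not from the original post):

```python
# Hypothetical schema for the example "sales" table.
# Field names and values are made up for illustration only.
from dataclasses import dataclass
from datetime import date

@dataclass
class SaleRecord:
    order_id: int      # natural ordering: integers
    customer: str      # natural ordering: lexicographic strings
    sale_date: date    # natural ordering: dates
    amount: float

records = [
    SaleRecord(1, "alice", date(2022, 1, 3), 19.99),
    SaleRecord(2, "bob",   date(2022, 1, 4), 5.49),
]
```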
Each Parquet file of this table will naturally cover a range of values in each respective column, corresponding to the set of records stored in that particular file. For every column, Parquet follows either its natural ordering (for example strings, dates, ints, etc.) or an induced one (for example, for composite data types Parquet orders them lexicographically, which also matches the ordering of their binary representation).
But if there's an ordering and a range... there are also min and max values! This means that every column of every Parquet file now has well-defined min and max values (which could be null as well). Min/max values are examples of so-called column statistics -- metrics characterizing the range of values stored in a single column of a columnar file format (like Parquet). Other common examples are value counts and null counts.
Equipped with the column statistics characterizing the range of values stored in each individual column in every file, let's now collate the following table: each row corresponds to a pair of file name and column, and for each such pair we write out the corresponding statistics: min, max, count, null-count:
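The table described above can be sketched in a few lines of Python. The file names and column values below are made up for illustration; this is not Hudi's implementation, just the idea:

```python
# Sketch: deriving column-stats rows (file, column, min, max, count, null_count)
# from per-file column values. Files and values are illustrative.
files = {
    "file01": {"order_id": [1, 2, None], "amount": [19.99, 5.49, 3.20]},
    "file02": {"order_id": [7, 9],       "amount": [100.0, None]},
}

index = []  # one row per (file, column) pair
for fname, columns in files.items():
    for col, values in columns.items():
        present = [v for v in values if v is not None]
        index.append((
            fname, col,
            min(present), max(present),      # min / max over non-null values
            len(values),                     # total value count
            len(values) - len(present),      # null count
        ))
```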
This essentially is a Column Stats Index!
For convenience, let's transpose it such that each row corresponds to a single file, while every statistic column bifurcates into its own copy for every data column:
Such transposed representation makes a very clear case for Data Skipping:
For a query Q with predicates P1, P2, ... on columns C1, C2, ... that are indexed by the column stats index, we can evaluate those predicates against the column statistics stored in the index for every corresponding file of the table, to understand whether a particular file "file01", "file02", etc. could potentially contain values matching the predicates. This is exactly what Spark, Hive and other engines do, for example, when reading data from Parquet files: each individual Parquet file stores its own column statistics (for every column), and with predicate filters pushed down to the Parquet reader, it can evaluate whether the query in question might be satisfied by the data stored in that column (in that file), avoiding unnecessary fetching, decompression and decoding of the data in cases when the file does not contain any data matching the query predicates.
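The pruning step above can be sketched as a simple range-overlap check against per-file min/max statistics. The stats below are hypothetical:

```python
# Sketch: pruning files using min/max column statistics.
# stats[file][column] = (min, max); values are illustrative.
stats = {
    "file01": {"amount": (3.20, 19.99)},
    "file02": {"amount": (100.0, 250.0)},
}

def may_contain(file, column, lo, hi):
    """Can `column BETWEEN lo AND hi` possibly match any row in `file`?"""
    cmin, cmax = stats[file][column]
    return cmin <= hi and cmax >= lo  # the two ranges overlap

# Query: WHERE amount BETWEEN 50 AND 150 -- only file02 survives pruning,
# since file01's whole range [3.20, 19.99] falls below the predicate.
candidates = [f for f in stats if may_contain(f, "amount", 50.0, 150.0)]
```

Note that the check is conservative: a surviving file *may* contain matching rows, while a pruned file provably cannot.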
But if Parquet already stores column statistics what's the point of creating an additional index?
Each Parquet file individually stores just a single row from the index we've composed above. The obvious drawback of such an approach is that to understand which files could potentially contain the data a query is looking for, the query engine has to read the Parquet footer of every Parquet file in the table, hurting query performance (and potentially even resulting in throttling from the cloud storage), as compared to a dedicated index represented in a more compact format.
In Hudi 0.10 we introduced a stop-gap implementation of a very simple column stats index (stored as just a simple Parquet table) to support the very first version of the data skipping implementation in Hudi, showcasing the power of Z-order and Hilbert space-filling curves as advanced layout optimization techniques.
In Hudi 0.11, we're introducing multi-modal indexes, such as the bloom-filter index and the column stats index, into the Metadata Table, both of which are implemented as dedicated partitions within the Metadata Table ("column_stats" and "bloom_filters" respectively). While these new indexes are still in an experimental phase, moving the column stats index into the Metadata Table means more:
Here we’re going to cover some crucial aspects of the design of the new column stats index. If you’re interested, we’d recommend you check out RFC-27 for more details.
The column stats index is persisted as a standalone partition within the Metadata Table (designated as "column_stats"). To be able to keep up with the scale of the largest tables out there while staying flexible, the index can be configured to be sharded into a number of file groups, with individual records hashed into one of them based on their key value. To configure the number of file groups, please use the following configuration (which bears a default value of 2):
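A sketch of the relevant write option (the key below follows Hudi 0.11's config naming; please verify it against the docs for your version):

```properties
# Number of file groups the column stats index partition is sharded into
hoodie.metadata.index.column.stats.file.group.count=2
```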
As was already called out, the Metadata Table uses HFile as its storage file format (a very efficient sorted binary key-value format) to be able to
To explain how this is being used in column stats index, let’s take a look at its record’s key’s composition:
Prefixing index records’ keys with the column name is not accidental and is motivated by the following observations
To better exemplify it, let’s take a look at a query Q filtering on a column C2:
We can simply read a contiguous block of records without the need to either a) read the whole index (which could be large) or b) do random seeks to cherry-pick the records we’re interested in. This allows us to achieve substantial performance improvements for very large tables, as will be shown in the section below.
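The idea of a prefix scan over sorted keys can be sketched as follows. The key encoding below (`"<column>.<file>"`) is purely illustrative, not Hudi's actual key format:

```python
# Sketch: column-name-prefixed keys in a sorted key-value file mean that all
# entries for one column form a single contiguous block.
import bisect

# Sorted key layout, as in an HFile; keys are illustrative.
keys = sorted([
    "C1.file01", "C1.file02",
    "C2.file01", "C2.file02", "C2.file03",
    "C3.file01",
])

def prefix_scan(sorted_keys, prefix):
    """Return the contiguous slice of sorted keys starting with `prefix`."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\xff")  # just past the range
    return sorted_keys[lo:hi]

# A query filtering on column C2 touches only one contiguous block of keys:
c2_entries = prefix_scan(keys, "C2.")
```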
To demo the column stats index and data skipping features in full swing, we will be using the commonly known Amazon Reviews dataset (taking up only 50 GB of storage) so that our results are easily reproducible by anybody, but with a slightly uncommon ingestion configuration, to showcase how the efficiencies brought about by the column stats index and data skipping scale with the number of files in the dataset.
To ingest Amazon Reviews dataset into Hudi table we're using this gist.
Please note that you have to specify the following configuration properties to make sure the column stats index is being built synchronously during ingestion:
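A sketch of the write-side options (keys follow Hudi 0.11's config naming; please double-check against the docs for your version):

```properties
# Enable the Metadata Table (default on the writer side)
hoodie.metadata.enable=true
# Build the column stats index partition synchronously during ingestion
hoodie.metadata.index.column.stats.enable=true
```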
However, if you want to run an experiment on an already existing table which doesn't currently have a column stats index, you can leverage the async indexer feature to backfill the index for that table.
Note that, to see data skipping in action, the following is required:
For that, the following 2 properties have to be specified either as Spark or Hudi options:
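A sketch of the read-side options (keys follow Hudi 0.11's config naming; please verify against the docs for your version):

```properties
# Enable the Metadata Table on the read path
hoodie.metadata.enable=true
# Enable data skipping based on the column stats index
hoodie.enable.data.skipping=true
```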
*The Metadata Table is enabled by default only on the writer side; readers still have to specify the corresponding config explicitly if they want to leverage the Metadata Table on the read path:
Please check out this gist to see how to query previously ingested dataset.
All of the tests are performed on a small EMR cluster with the following configuration, making it easy for you to reproduce the same results should you opt to do so.
Nodes: m5.xlarge (1 master / 3 executors)
Spark: OSS 3.2.1 (Hadoop 3.2)
Run: Non-partitioned COW Table
Note that we deliberately squeeze the file size down to generate a meaningfully large number of files, given that the dataset is only 50 GB.
As can easily be seen from the table above, data skipping powered by the new column stats index in Hudi 0.11 brings substantial improvements in query execution performance (proportional to the queries' pruning potential), reducing execution runtime as well as saving crucial compute resources, which directly translates into cost savings for cloud-based lakes and lakehouses built on Hudi.
Even though the power of the column stats index and data skipping is already available to Hudi users today, there’s more work to be done that is currently in progress:
If you would like to follow the work currently in progress or chime in on some features you feel are the most valuable please check out HUDI-1822 and leave your comments.
If you have any questions don’t hesitate to reach out on Hudi’s official Slack channel.