June 9, 2022

Hudi’s Column Stats Index and Data Skipping feature help speed up queries by an orders of magnitude!

Hudi’s Column Stats Index and Data Skipping feature help speed up queries by an orders of magnitude!

Introduction

In Hudi 0.10 we've introduced support for advanced data layout optimization techniques such as Z-order and Hilbert Space Filling curves (as new clustering algorithms) that unlocked the power of data skipping even in complex scenarios where a large table is frequently queried with filters on multiple columns rather than a single one.

But what is data skipping actually?

Data Skipping as a technique rose in popularity along with the scale of the data being stored in data lakes. Data Skipping is essentially a common term for various types of indexes enabling query engines to effectively skip the data, that is irrelevant to the query it's currently executing to reduce amount of scanned and processed data, saving on the amount of data scanned as well as (potentially) significantly improving execution time. Let's take as an example a simple non-partitioned parquet table “sales” storing records with the schema like the following:

Each parquet file of this table will naturally have a range of values stored in each respective column corresponding to the set of records stored in this particular file and for every column parquet would follow either natural ordering (for ex, strings, dates, ints, etc) or induced one (for ex for, composite data types parquet orders them lexicographically, which also matches ordering of its binary representation).

But if there're an ordering and a range... there're also min and max values! Meaning now that every column for every Parquet file has well-defined min and max values (which could be null as well). Min/max values are examples of so called column statistics -- metrics characterizing the range of the values stored in a single column of the columnar file-format (like Parquet). Another examples usually are

  • Total number of values
  • Number of null values (along with total, could yield number of non-null values for the column)
  • Total size in bytes of all values in the column (dependent on the used encoding, compression, etc)

Equipped with the column statistics characterizing a range of values stored in each individual column in every file, let's now collate a following table: each row will be correspondent to a pair of filename & column, and for each such pair we would write out corresponding statistic: min, max, count, null-count:


This essentially is a Column Stats Index!

For convenience, let's transpose it such that each row will be correspondent to a single file, while every statistic column will bifurcate into its own copy for every data column:

Such transposed representation makes a very clear case for Data Skipping

For query Q with predicates P1, P2, ... on the columns C1, C2, ..., which are indexed by the column stats index, we can evaluate those predicates P1, P2, etc. against column statistics stored in the index for every correspondent file of the table to understand whether particular file “file01”, “file02”, etc could potentially contain the values matching the predicates. This approach is exactly what is done by Spark/Hive and other engines for example, when they are reading data from Parquet files -- each individual Parquet file stores its own column statistics (for every column) and with predicate filters being pushed down to Parquet reader it's able to evaluate whether the query in question might be satisfied with the data stored in the column (in the file) allowing to avoid unnecessary fetching, decompressing and decoding of the data in cases when file does not contain any data matching query predicates.

But if Parquet already stores column statistics what's the point of creating an additional index?

Each Parquet file stores individually just a single row from the Index we've composed above. Obvious drawback to such approach is that to understand which files could potentially contain the data query is looking for, query engine will have to read the Parquet footers of every Parquet file in the table affecting query performance (could even potentially result in throttling from the cloud storage) as compared to an dedicated Index represented in a more compact format.

Column Stats Index and Data Skipping in Hudi 0.11

In Hudi 0.10 we've introduced stop-gap implementation of the very simple column stats index (being stored as just a simple Parquet table) to support very first version of the data skipping implementation in Hudi to showcase the power of Z-order and Hilbert space-filling curves as advanced Layout Optimization techniques.

In Hudi 0.11, we're introducing Multi-modal Indexes such as bloom-filter index and column stats index into Metadata Table, both of which are implemented as dedicated partitions w/in Metadata Table (“column_stats” and “bloom_filters” respectively). While these new Indexes are still in experimental phase, moving column stats index into Metadata Table means more:

  1. Robust support: Column Stats Index (CSI) now also enjoys consistency guarantees of the Metadata Table
  2. Efficient implementation: Metadata Table uses HFile as both base- and log-files format facilitating fast key-based lookups (sorted key-value storage). Practically meaning that for large tables with large number of columns we don't need to read the whole column stats index and can simply project its portion by looking up the columns referenced in the query.

Design

Here we’re going to cover some crucial aspects of the design of the new column stats index. If you’re interested in more details we’d recommend you check out RFC-27 for more details.

Column stats index is persisted as standalone partition w/in the Metadata Table (designated as “column_stats”). To be able to keep up with the scale of the largest tables out there while staying flexible, index could be configured to be sharded into a number of file-groups with individual records being hashed into either of these based on their key value. To configure the number of file-groups please use the following configuration (bearing default value of 2):

As was already called out, Metadata Table uses HFile as its storage file format (which is very efficient sorted binary key-values format) to be able to 

  • Efficiently lookup records based on their keys as well as 
  • Efficiently scan record ranges based on their keys’ prefixes

To explain how this is being used in column stats index, let’s take a look at its record’s key’s composition:

Prefixing index records’ keys with column is not random and is motivated by following observations

  1. With HFile storing all key-value pairs sorted, such key composition provides for a nice property of locality of all records pertaining to particular column C, and
  2. Any given query for the original table, usually is filtering on just a handful of columns, entailing that we can seek efficiencies by avoiding reading the full index and instead simply projecting its contiguous slice specific to the columns C1, C2, etc query might be filtering on. 

To better exemplify it, let’s take a look at a query Q filtering on a column C2:

We can simply read a contiguous block of records without a need to either a) read the whole index (which could be large) nor b) do random seeks to cherry-pick records we’re interested in. This allows us to reach substantial performance improvements for very large tables as going to be shown in the section below.

Benchmarking

To demo column stats index and data skipping features in full swing we will be using commonly known Amazon Reviews dataset (taking up only 50Gb on storage) so that our results are easily reproducible by anybody, but with slightly uncommon ingestion configuration to showcase how efficiencies brought about by column stats index and data skipping scale with the number of files in the dataset.

Ingesting

To ingest Amazon Reviews dataset into Hudi table we're using this gist.

Please note, that you have to specify following configuration properties to make sure column stats index is being built synchronously during ingestion:

However, If you want to run an experiment on an already existing table which doesn't have column stats index currently, you can leverage the async indexer feature to backfill the index for an already existing table.

Querying

Note that, to see data skipping in action following is required:

  1. Make sure Metadata Table is enabled on the read-path*
  2. Data Skipping feature is enabled

For that following 2 properties have to be specified either as Spark or Hudi options:

*Metadata Table is enabled by default only on the writer side and readers still have to specify corresponding config explicitly if they're willing to leverage Metadata Table on the read path:

Please check out this gist to see how to query previously ingested dataset.

EMR Config

All of the tests are performed on a small EMR cluster with the following configuration, making it easy for you to reproduce the same results would you opt in to do so.

Nodes: m5.xlarge (1 master / 3 executors)

Spark: OSS 3.2.1 (Hadoop 3.2)

Run: Non-partitioned COW Table

Note that, we deliberately squeeze the file-size out to generate a meaningfully large number of files, given that the dataset is only 50Gb.

  • Dataset: Amazon Reviews (~50Gb uncompressed)
  • Records: 161M (~160 bytes)
  • Table Type: COW (non-partitioned)
  • File size: 1Mb
  • Number of files: ~39k (total size ~47Gb, compressed, zstd)
  • Column Stats: 21 columns (~847k records, ~63 Mb)
  • Warmup: No (cold caches, shell is restarted every time to flush any caching)

As could be easily seen from the table above, data skipping powered by new Column Stats Index in Hudi 0.11 are bringing in substantial improvements in execution performance for queries (proportional to their pruning potential), reducing the execution runtime as well as saving crucial compute resources directly translating into cost-savings for Cloud-based Lakes and Lakehouses built on Hudi.

Even though power of column stats index and data skipping are already available to Hudi users today, there’s more work to be done that is currently in progress: 

  • Support data skipping in Merge-On-Read tables
  • Add caching for column stats index queries
  • Profile and optimize column stats index performance even further

If you would like to follow the work currently in progress or chime in on some features you feel are the most valuable please check out HUDI-1822 and leave your comments.

If you have any questions don’t hesitate to reach out on Hudi’s official Slack channel.

Authors
No items found.

Read More:

Announcing: AI Vector Embeddings Generator for the Lakehouse
Onehouse Analytics Engine Guide
Onehouse Achieves PCI Compliance Certification

Subscribe to the Blog

Be the first to read new posts

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
We are hiring diverse, world-class talent — join us in building the future