Copy on write (CoW)

Copy on write (CoW) is a write strategy in which an update creates a new version of only the files that contain changed records. For instance, when using a column-based data storage format such as Apache Parquet, only the Parquet files in the partitions that need to change are rewritten. This avoids rewriting the entire table (across all partitions) for a single write operation. Queries read the most recently updated version of each file.
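The behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular lakehouse project implements it: the `CowTable` class, its method names, and the use of tuples to stand in for immutable Parquet files are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CowTable:
    """Toy copy-on-write table (hypothetical): one immutable file per partition."""

    # partition key -> tuple of (key, value) rows; a tuple stands in for a Parquet file
    files: dict = field(default_factory=dict)
    # number of partition files written so far, including rewrites
    files_written: int = 0

    def write(self, partition, rows):
        """Create the first file version for a partition."""
        self.files[partition] = tuple(rows)
        self.files_written += 1

    def update(self, partition, key, new_value):
        """Copy the affected file with the change applied, then swap it in."""
        old_file = self.files[partition]
        new_file = tuple(
            (k, new_value) if k == key else (k, v) for k, v in old_file
        )
        self.files[partition] = new_file
        self.files_written += 1

    def read(self, partition):
        """Readers see the latest file version; no merge step is needed."""
        return list(self.files[partition])


table = CowTable()
table.write("2024-01-01", [("a", 1), ("b", 2)])
table.write("2024-01-02", [("c", 3)])
untouched = table.files["2024-01-02"]

# Only the file for partition 2024-01-01 is rewritten by this update.
table.update("2024-01-01", "b", 20)
print(table.read("2024-01-01"))
print(table.files["2024-01-02"] is untouched)
```

The update rewrites one partition file and leaves the other exactly as it was, which is the saving that copy on write offers over rewriting the whole table.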

Copy on write can be used alongside merge on read (MoR), so that each workload uses the strategy that suits it best. Copy on write fits use cases in which it's important that queries always read the most recent available data.

All data lakehouse projects support copy on write, which applies updates directly to columnar files at write time, so queries can read those files efficiently. Some lakehouse projects use a combination of Apache Parquet and Avro files to reduce how much data each write has to rewrite.
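For contrast, the sketch below shows one way a columnar file and a row-based file could be combined so that updates are not applied to the columnar file on every write. It is an assumption-laden illustration only: the `MorFileGroup` class, its `compact` method, and the use of a dict and a list to stand in for a Parquet base file and an Avro log file are hypothetical and do not reflect any specific project's API.

```python
class MorFileGroup:
    """Toy merge-on-read file group (hypothetical): base file plus update log."""

    def __init__(self, base_rows):
        self.base = dict(base_rows)  # stands in for a columnar (Parquet) base file
        self.log = []                # stands in for a row-based (Avro) log file
        self.base_rewrites = 0       # how many times the base file was rewritten

    def update(self, key, value):
        """Append the change to the log; the base file is left untouched."""
        self.log.append((key, value))

    def read(self):
        """Merge the base file with logged changes at query time."""
        merged = dict(self.base)
        for key, value in self.log:
            merged[key] = value
        return merged

    def compact(self):
        """Fold the log into a new base file, as a copy-on-write update would."""
        self.base = self.read()
        self.log = []
        self.base_rewrites += 1


group = MorFileGroup([("a", 1), ("b", 2)])
group.update("b", 20)
group.update("a", 5)
rewrites_before_compaction = group.base_rewrites  # still 0: writes only hit the log
latest = group.read()                             # the merge cost is paid by the reader
group.compact()
```

In this sketch the two writes cost no base-file rewrite, while the reader pays for the merge. Copy on write makes the opposite trade: each write rewrites the affected file, and reads need no merge.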
