A
ACID transactions refer to database inserts, updates, and deletions that have four characteristics: Atomicity - each statement in a transaction (to read, write, update, or delete data) is treated as a single unit; that is, either the entire...
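As a rough illustration, the sketch below uses Python's built-in sqlite3 module; the accounts table and the amounts are invented for the example. The two updates in the transfer either commit together or roll back together, which is atomicity in action.

    import sqlite3

    # Illustrative only: a transfer between two invented accounts, where both
    # updates either commit together (atomicity) or roll back together.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])

    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
            conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    except sqlite3.Error:
        pass  # a rollback leaves both balances unchanged

    print(conn.execute("SELECT * FROM accounts").fetchall())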
Apache Hudi was originally developed at Uber and was released as an open source project in 2017. Hudi is considered to be the first data lakehouse project and is today one of the three leading data lakehouse projects. Hudi was originally...
Apache Kafka was originally developed at LinkedIn and was released as an open source project in 2011. Kafka has many capabilities, but it is best known as scalable software for real-time data streaming. Kafka is fully distributed. Streaming...
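As a rough sketch (not part of the definition), the snippet below uses the third-party kafka-python package to publish one message to a stream; the broker address and the topic name "events" are assumptions made for the example.

    # Assumes a Kafka broker is reachable at localhost:9092.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", key=b"user-123", value=b'{"action": "click"}')
    producer.flush()  # block until the buffered message is delivered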
Apache Parquet is an open source file format that stores data in a column-based (columnar) layout, which makes it well suited to many analytics operations. This is in contrast to row-based formats, such as Avro, which are easier...
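For illustration, the sketch below uses the pyarrow library to write a small table to Parquet and then read back a single column; the file and column names are invented. Reading only the columns a query needs is what makes the columnar layout attractive for analytics.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a tiny, invented table to a Parquet file.
    table = pa.table({"user_id": [1, 2, 3], "amount": [9.5, 3.0, 7.25]})
    pq.write_table(table, "orders.parquet")

    # The columnar layout lets a reader pull just the columns a query needs.
    amounts_only = pq.read_table("orders.parquet", columns=["amount"])
    print(amounts_only)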
C
Database change data capture (CDC) enables changes to a database to be identified, tracked, and sent as updates in real time. This allows downstream processes and systems to act on each change. A common use case for CDC is to keep a...
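As a toy illustration (no real CDC tool is involved), the sketch below replays a list of invented change events against a downstream copy to keep it in sync with the source.

    # Invented downstream copy of a source table, keyed by primary key.
    downstream = {1: {"name": "Ada"}, 2: {"name": "Grace"}}

    # Invented change events captured from the source table.
    change_events = [
        {"op": "update", "id": 2, "row": {"name": "Grace H."}},
        {"op": "insert", "id": 3, "row": {"name": "Edsger"}},
        {"op": "delete", "id": 1},
    ]

    for event in change_events:
        if event["op"] == "delete":
            downstream.pop(event["id"], None)
        else:  # insert or update
            downstream[event["id"]] = event["row"]

    print(downstream)  # {2: {'name': 'Grace H.'}, 3: {'name': 'Edsger'}}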
A specialized framework, system, or platform designed to efficiently process and analyze large volumes of data.
Copy on write (CoW) refers to writing data incrementally; for instance, when using a column-based data storage format such as Apache Parquet, the data is updated only in the Parquet partitions that need to be changed. This avoids rewriting...
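As a toy illustration of the idea, the sketch below rewrites only the partition touched by an update and reuses every other partition unchanged; the partition keys and rows are invented.

    # Invented partitioned table: partition key -> list of rows.
    partitions = {
        "date=2024-01-01": [{"id": 1, "amount": 10}],
        "date=2024-01-02": [{"id": 2, "amount": 20}],
    }

    def apply_update(parts, partition_key, updated_rows):
        new_parts = dict(parts)                  # untouched partitions are reused as-is
        new_parts[partition_key] = updated_rows  # only the affected partition is rewritten
        return new_parts

    updated = apply_update(partitions, "date=2024-01-02", [{"id": 2, "amount": 25}])
    print(updated["date=2024-01-01"] is partitions["date=2024-01-01"])  # True: not rewritten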
D
A data lake is a repository for storing structured, semi-structured, and unstructured data, at any scale, without changing the structure of the data. The data lake uses object storage, which is very efficient for a wide range of data types and...
A data lakehouse provides storage of unstructured, semi-structured, and structured data, like the data lake. However, the data lakehouse adds core functionality and services that enable the data lakehouse to rival the data warehouse...
A data warehouse is a repository originally designed in the 1980s for storing structured (relational) data for use in reporting and business intelligence. It has been widely used for that purpose ever since. A process called extract, transform...
E
ELT is a variant of ETL. Whereas ETL stands for extract, transform, and load, ELT reverses the last two steps: it stands for extract, load, and transform. In ELT, data is loaded unchanged into the analytics system, such as a...
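As a minimal sketch, the example below uses Python's sqlite3 module as a stand-in for the analytics system: raw rows are loaded unchanged first, and the transformation then runs as SQL inside the target. The table and column names are invented.

    import sqlite3

    raw_rows = [("2024-01-01", " 19.99 "), ("2024-01-02", "5.00")]  # extracted as-is

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_date TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_rows)   # load, unchanged

    # Transform inside the target system, using its own SQL engine.
    conn.execute(
        "CREATE TABLE orders AS "
        "SELECT order_date, CAST(TRIM(amount) AS REAL) AS amount FROM raw_orders"
    )
    print(conn.execute("SELECT * FROM orders").fetchall())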
ETL stands for extract, transform, and load. The term is very widely used in data management and data engineering. ETL was defined decades ago to describe the process of moving data from transactional systems, such as...
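As a minimal sketch, the example below transforms invented rows in plain Python before loading only the cleaned result into a sqlite3 table standing in for the target warehouse; compare it with the ELT sketch above, where the raw rows are loaded first.

    import sqlite3

    extracted = [("2024-01-01", " 19.99 "), ("2024-01-02", "5.00")]   # extract
    transformed = [(d, float(a.strip())) for d, a in extracted]       # transform in flight

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", transformed)  # load only clean rows
    print(conn.execute("SELECT * FROM orders").fetchall())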
I
Incremental updates avoid rewriting an entire table for each update. Instead, the data management system stores updates in separate change tables, and uses both base tables and change tables in routine operations.
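As a toy illustration, the sketch below produces the current view of a table by overlaying a small change table on the base table, so the base table itself is never rewritten; the rows are invented.

    base_table = {1: {"status": "active"}, 2: {"status": "active"}}
    change_table = {2: {"status": "closed"}}   # only the changed rows are stored

    def current_view(base, changes):
        merged = dict(base)
        merged.update(changes)   # change rows win over base rows
        return merged

    print(current_view(base_table, change_table))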
Ingest, or data ingestion, is the process of extracting data from various sources and writing it into the data warehouse, data lake, data lakehouse, or other data store. Ingested data may be highly processed data from an operational...
M
The medallion architecture is a framework for describing a set of data transformations within a data lakehouse or data warehouse. In the medallion architecture, data moves through several steps - Bronze: ingestion of raw...
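As a toy illustration of the flow, the sketch below moves invented records through Bronze (raw), Silver (cleaned), and Gold (aggregated) stages in plain Python.

    bronze = [                     # Bronze: raw records as ingested
        {"user": " ada ", "amount": "10.0"},
        {"user": "ada", "amount": "5.5"},
        {"user": None, "amount": "3.0"},
    ]

    silver = [                     # Silver: cleaned and validated
        {"user": r["user"].strip(), "amount": float(r["amount"])}
        for r in bronze if r["user"]
    ]

    gold = {}                      # Gold: aggregated, ready for reporting
    for row in silver:
        gold[row["user"]] = gold.get(row["user"], 0.0) + row["amount"]

    print(gold)  # {'ada': 15.5}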
Merge on read (MoR) stores upserts for file groups into a row-based delta log as they arrive, so write performance is high. Queries then check the delta log as well as the base file, which causes a small hit to query performance. A compactor...
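As a toy illustration, the sketch below appends invented upserts to a delta log, merges the log with the base file at read time, and then folds the log into a new base file to mimic compaction.

    base_file = {1: {"amount": 10}, 2: {"amount": 20}}
    delta_log = [(2, {"amount": 25}), (3, {"amount": 30})]   # upserts as they arrived

    def read(base, log):
        merged = dict(base)
        for key, row in log:        # queries pay a small cost to replay the log
            merged[key] = row
        return merged

    def compact(base, log):
        return read(base, log), []  # new base file plus an empty delta log

    print(read(base_file, delta_log))
    base_file, delta_log = compact(base_file, delta_log)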
Metadata is literally “data about data.” The column header names in a row-based data table are metadata, as are the data types associated with each column (text, date, integer, etc.). Summary data is often included in data...
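For illustration, the sketch below uses the pyarrow library to write a tiny Parquet file and then read back its schema and row count, both of which are metadata stored alongside the data; the file and column names are invented.

    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table({"order_date": ["2024-01-01"], "amount": [19.99]}),
                   "orders.parquet")

    parquet_file = pq.ParquetFile("orders.parquet")
    print(parquet_file.schema_arrow)        # column names and data types
    print(parquet_file.metadata.num_rows)   # summary information kept as metadata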