A
ACID transactions refer to database inserts, updates, and deletions that have four characteristics: Atomicity - each statement in a transaction (to read, write, update, or delete data) is treated as a single unit; that is, either the entire...
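As a rough illustration, the sketch below uses Python's built-in sqlite3 module; the accounts table and the amounts are invented for the example. The two updates in the transfer either commit together or roll back together, which is atomicity in action.

    import sqlite3

    # Illustrative only: a transfer between two invented accounts, where both
    # updates either commit together (atomicity) or roll back together.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])

    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
            conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    except sqlite3.Error:
        pass  # a rollback leaves both balances unchanged

    print(conn.execute("SELECT * FROM accounts").fetchall())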
Apache Hudi was originally developed at Uber and was released as an open source project in 2017. Hudi is considered to be the first data lakehouse project and is today one of the three leading data lakehouse projects. Hudi was originally...
Apache Kafka was originally developed at LinkedIn and was released as an open source project in 2011. Kafka has many capabilities, but it is best known as scalable software for real-time data streaming. Kafka is fully distributed. Streaming...
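As a rough sketch (not part of the definition), the snippet below uses the third-party kafka-python package to publish one message to a stream; the broker address and the topic name "events" are assumptions made for the example.

    # Assumes a Kafka broker is reachable at localhost:9092.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", key=b"user-123", value=b'{"action": "click"}')
    producer.flush()  # block until the buffered message is delivered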
Apache Parquet is an open source file format that stores data in a column-based (columnar) layout, which makes it well suited to many analytics operations. This is in contrast to row-based formats, such as Avro, which are easier...
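For illustration, the sketch below uses the pyarrow library to write a small table to Parquet and then read back a single column; the file and column names are invented. Reading only the columns a query needs is what makes the columnar layout attractive for analytics.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a tiny, invented table to a Parquet file.
    table = pa.table({"user_id": [1, 2, 3], "amount": [9.5, 3.0, 7.25]})
    pq.write_table(table, "orders.parquet")

    # The columnar layout lets a reader pull just the columns a query needs.
    amounts_only = pq.read_table("orders.parquet", columns=["amount"])
    print(amounts_only)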
C
Database change data capture (CDC) enables changes to a database to be identified, tracked, and sent as updates in real time. This allows downstream processes and systems to act on each change. A common use case for CDC is to keep a...
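As a toy illustration (no real CDC tool is involved), the sketch below replays a list of invented change events against a downstream copy to keep it in sync with the source.

    # Invented downstream copy of a source table, keyed by primary key.
    downstream = {1: {"name": "Ada"}, 2: {"name": "Grace"}}

    # Invented change events captured from the source table.
    change_events = [
        {"op": "update", "id": 2, "row": {"name": "Grace H."}},
        {"op": "insert", "id": 3, "row": {"name": "Edsger"}},
        {"op": "delete", "id": 1},
    ]

    for event in change_events:
        if event["op"] == "delete":
            downstream.pop(event["id"], None)
        else:  # insert or update
            downstream[event["id"]] = event["row"]

    print(downstream)  # {2: {'name': 'Grace H.'}, 3: {'name': 'Edsger'}}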
A specialized framework, system, or platform designed to efficiently process and analyze large volumes of data.
Copy on write (CoW) refers to writing data incrementally; for instance, when using a column-based data storage format such as Apache Parquet, the data is updated only in the Parquet partitions that need to be changed. This avoids rewriting...
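As a toy illustration of the idea, the sketch below rewrites only the partition touched by an update and reuses every other partition unchanged; the partition keys and rows are invented.

    # Invented partitioned table: partition key -> list of rows.
    partitions = {
        "date=2024-01-01": [{"id": 1, "amount": 10}],
        "date=2024-01-02": [{"id": 2, "amount": 20}],
    }

    def apply_update(parts, partition_key, updated_rows):
        new_parts = dict(parts)                  # untouched partitions are reused as-is
        new_parts[partition_key] = updated_rows  # only the affected partition is rewritten
        return new_parts

    updated = apply_update(partitions, "date=2024-01-02", [{"id": 2, "amount": 25}])
    print(updated["date=2024-01-01"] is partitions["date=2024-01-01"])  # True: not rewritten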
D
A data lake is a repository for storing structured, semi-structured, and unstructured data, at any scale, without changing the structure of the data. The data lake uses object storage, which is very efficient for a wide range of data types and...
A data lakehouse provides storage of unstructured, semi-structured, and structured data, like the data lake. However, the data lakehouse adds core functionality and services that enable the data lakehouse to rival the data warehouse...
A data warehouse is a repository originally designed in the 1980s for storing structured (relational) data for use in reporting and business intelligence. It has been widely used for that purpose ever since. A process called extract, transform...
E
ELT is a variant of ETL. Whereas ETL stands for extract, transform, and load, ELT reverses the last two steps: it stands for extract, load, and transform. In ELT, data is loaded unchanged into the analytics system, such as a...
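As a minimal sketch, the example below uses Python's sqlite3 module as a stand-in for the analytics system: raw rows are loaded unchanged first, and the transformation then runs as SQL inside the target. The table and column names are invented.

    import sqlite3

    raw_rows = [("2024-01-01", " 19.99 "), ("2024-01-02", "5.00")]  # extracted as-is

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_date TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_rows)   # load, unchanged

    # Transform inside the target system, using its own SQL engine.
    conn.execute(
        "CREATE TABLE orders AS "
        "SELECT order_date, CAST(TRIM(amount) AS REAL) AS amount FROM raw_orders"
    )
    print(conn.execute("SELECT * FROM orders").fetchall())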
ETL stands for extract, transform, and load. The term is very widely used in data management and data engineering. ETL was defined decades ago to describe the process of moving data from transactional systems, such as...
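As a minimal sketch, the example below transforms invented rows in plain Python before loading only the cleaned result into a sqlite3 table standing in for the target warehouse; compare it with the ELT sketch above, where the raw rows are loaded first.

    import sqlite3

    extracted = [("2024-01-01", " 19.99 "), ("2024-01-02", "5.00")]   # extract
    transformed = [(d, float(a.strip())) for d, a in extracted]       # transform in flight

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", transformed)  # load only clean rows
    print(conn.execute("SELECT * FROM orders").fetchall())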
I
Incremental updates avoid rewriting an entire table for each update. Instead, the data management system stores updates in separate change tables, and uses both base tables and change tables in routine operations.
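As a toy illustration, the sketch below produces the current view of a table by overlaying a small change table on the base table, so the base table itself is never rewritten; the rows are invented.

    base_table = {1: {"status": "active"}, 2: {"status": "active"}}
    change_table = {2: {"status": "closed"}}   # only the changed rows are stored

    def current_view(base, changes):
        merged = dict(base)
        merged.update(changes)   # change rows win over base rows
        return merged

    print(current_view(base_table, change_table))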
Ingest, or data ingestion, is the process of extracting data from various sources and writing it into the data warehouse, data lake, data lakehouse, or other data store. Ingested data may be highly processed data from an operational...
M
The medallion architecture is a framework for describing a set of data transformations within a data lakehouse or data warehouse. In the medallion architecture, data moves through several steps - Bronze: ingestion of raw...
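As a toy illustration of the flow, the sketch below moves invented records through Bronze (raw), Silver (cleaned), and Gold (aggregated) stages in plain Python.

    bronze = [                     # Bronze: raw records as ingested
        {"user": " ada ", "amount": "10.0"},
        {"user": "ada", "amount": "5.5"},
        {"user": None, "amount": "3.0"},
    ]

    silver = [                     # Silver: cleaned and validated
        {"user": r["user"].strip(), "amount": float(r["amount"])}
        for r in bronze if r["user"]
    ]

    gold = {}                      # Gold: aggregated, ready for reporting
    for row in silver:
        gold[row["user"]] = gold.get(row["user"], 0.0) + row["amount"]

    print(gold)  # {'ada': 15.5}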
Merge on read (MoR) stores upserts for file groups into a row-based delta log as they arrive, so write performance is high. Queries then check the delta log as well as the base file, which causes a small hit to query performance. A compactor...
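As a toy illustration, the sketch below appends invented upserts to a delta log, merges the log with the base file at read time, and then folds the log into a new base file to mimic compaction.

    base_file = {1: {"amount": 10}, 2: {"amount": 20}}
    delta_log = [(2, {"amount": 25}), (3, {"amount": 30})]   # upserts as they arrived

    def read(base, log):
        merged = dict(base)
        for key, row in log:        # queries pay a small cost to replay the log
            merged[key] = row
        return merged

    def compact(base, log):
        return read(base, log), []  # new base file plus an empty delta log

    print(read(base_file, delta_log))
    base_file, delta_log = compact(base_file, delta_log)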
Metadata is literally “data about data.” The column header names in a row-based data table are metadata, as are the data types associated with each column (text, date, integer, etc.). Summary data is often included in data...
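For illustration, the sketch below uses the pyarrow library to write a tiny Parquet file and then read back its schema and row count, both of which are metadata stored alongside the data; the file and column names are invented.

    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table({"order_date": ["2024-01-01"], "amount": [19.99]}),
                   "orders.parquet")

    parquet_file = pq.ParquetFile("orders.parquet")
    print(parquet_file.schema_arrow)        # column names and data types
    print(parquet_file.metadata.num_rows)   # summary information kept as metadata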