While data lakes have long embraced openness, cloud data warehouses have evolved differently. Their evolutionary arc has been primarily driven by the need to overcome the scaling challenges and rigid architectures of on-prem data warehouses, where cost and performance inefficiencies made it difficult to adapt to growing analytical workloads. Designed for performance and ease of use, cloud warehouses introduced managed, elastic compute but retained proprietary (closed) storage formats and tightly integrated query engines.
Recent trends, particularly the rise of open table formats, have begun to reshape this landscape. Table formats such as Apache Hudi™, Apache Iceberg™, and Delta Lake, when combined with open file formats such as Apache Parquet™, enable an independent data layer that introduces modularity and flexibility, allowing organizations to plug in any compute engine as needed. By supporting these formats, cloud warehouses have a path to address some of their long-standing limitations with closed storage architectures. But how open are cloud warehouses, really? In this series, we discuss various important topics on our path towards open data.
Over the years, data lakes and cloud data warehouses have represented two contrasting approaches to managing analytical data. Data lakes emerged as an open ecosystem (starting with the Hadoop era), built on open storage formats (e.g., Parquet, ORC, and Avro) and designed to work with various compute engines such as Apache Spark™, Apache Flink™, and Presto/Trino. Even as early as the 2000s, data lakes decoupled compute from storage, allowing organizations to scale each component independently and optimize both resource management and costs. This also meant organizations were not tied to a specific vendor’s compute engine for their processing needs - a model that has since been replicated widely across the industry.
Another enabler of this openness was metadata management. Catalogs like the Hive Metastore (HMS) provided a pluggable metadata layer, allowing users to register, manage, and query datasets using standard SQL engines while retaining the flexibility to switch between compute engines. When new data was written to storage, it could be registered in the Metastore immediately by calling the Metastore API from any data application, ETL pipeline, or orchestration tool with HMS support, making the data instantly discoverable for querying.
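As a minimal sketch of that workflow, the HiveQL below registers an existing Parquet dataset with HMS so that any HMS-aware engine can query the same files in place (the database, table, schema, and S3 path are illustrative):

```sql
-- Register an existing Parquet dataset in the Hive Metastore (HiveQL).
-- No data is copied; the table simply points at the files in the lake.
CREATE EXTERNAL TABLE analytics.page_views (
  user_id   BIGINT,
  url       STRING,
  viewed_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/page_views/';

-- Once registered, any HMS-aware engine (Spark, Trino, Flink, ...) can
-- query the files directly:
SELECT url, COUNT(*) AS views
FROM analytics.page_views
GROUP BY url;
```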
Cloud data warehouses, on the other hand, have traditionally followed a closed ecosystem model, where storage, compute, and metadata management (catalog) are tightly integrated. Platforms such as Snowflake, Amazon Redshift, and Google BigQuery have offered a fully managed, highly optimized experience, abstracting away infrastructure complexity and providing seamless performance. However, this tight integration also meant that users were bound to the warehouse’s native storage format and compute layers, making it difficult to interoperate with external systems.
When Snowflake, BigQuery, and Redshift emerged as cloud-native warehouses, they were designed to overcome the scalability and performance constraints of on-premises systems while providing a fully managed experience. Before cloud warehouses, most enterprises relied on on-prem warehouse solutions such as Teradata, Vertica, and Netezza, where storage and compute were tightly coupled and optimized for specific hardware (e.g., Oracle Exadata and Teradata appliances). Customers were required to purchase compute resources in proportion to storage, even when they needed additional capacity in only one of the two. This led to high costs, inefficient scaling, and rigid architectures. Cloud warehouses addressed this problem by decoupling storage and compute, allowing users to scale each component independently and intelligently via autoscaling.
However, despite this design improvement, storage and compute still remained tightly integrated at the warehouse vendor level. Cloud warehouses continued to rely on proprietary storage formats, meaning data had to be ingested into the warehouse’s internal storage before it could be queried efficiently. Unlike data lakes, where data remains open and accessible to various compute engines, cloud data warehouses behaved as monolithic systems.
Initially, cloud warehouses offered little to no support for directly querying Apache Parquet or ORC files stored externally. Companies that used both a data lake and a cloud warehouse had to duplicate data - maintaining raw files in object storage while re-ingesting transformed versions into the warehouse for analysis. This increased costs and introduced data silos.
Over the past few years, the data lakehouse paradigm has emerged as a way to combine the best of both worlds, offering the performance optimizations and better data management capabilities of a warehouse while preserving the flexibility of a data lake. This shift has been fueled by open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake. Recognizing the growing popularity of these formats, cloud data warehouses have begun supporting them as external tables or managed versions of open formats. This theoretically allows organizations to query open table formats without first ingesting data into the warehouse’s native storage format.
This is a significant shift as it signals that cloud warehouses acknowledge the industry’s demand for openness. However, it’s critical to examine what this means in practice. The real question is: Does supporting open table formats as external formats mean cloud warehouses are genuinely “open”?
The lakehouse architecture has fundamentally transformed data management by demonstrating that warehouses weren’t the only way to bring structure and performance to data lakes. Open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake introduce ACID transactions, schema evolution, and time travel capabilities traditionally found in warehouses, while keeping data in an open and accessible format. These formats also optimize storage and query performance through warehouse-style techniques such as clustering, compaction, cleaning, and (in Hudi’s case) indexing. Together, these capabilities closed the gap that data lakes had long carried.
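To make the ACID claim concrete, here is a minimal Spark SQL sketch of a transactional upsert on an Iceberg or Delta table; the catalog, table, and `updates` source are hypothetical names used for illustration:

```sql
-- Atomic upsert into a lakehouse table. Readers never observe a
-- partially applied merge; the operation commits as a single snapshot.
MERGE INTO lake.analytics.orders AS t
USING updates AS u            -- a staged table or view of new rows
ON t.order_id = u.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```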
As organizations sought more cost-efficient and flexible architectures, there was a demand for better ways to access data without forcing ingestion into proprietary storage formats. This need led to the introduction of external tables, which enable cloud warehouses to query data stored in external object storage (such as S3) in file formats such as Parquet, ORC, or Avro without physically moving the data into the warehouse. It is important to note that prior to 2021, cloud warehouse vendors such as Snowflake had very limited external table support for Parquet, requiring workarounds for accessing data in data lakes.
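BigQuery’s external table DDL is one illustration of the pattern; this is a minimal, hedged sketch (the dataset, table, and bucket names are made up, and Snowflake and Redshift Spectrum have their own equivalents built on stages and Glue catalogs):

```sql
-- Define an external table over Parquet files in object storage.
-- The files stay in the bucket; the warehouse reads them at query time.
CREATE EXTERNAL TABLE lake_analytics.page_views_ext
OPTIONS (
  format = 'PARQUET',
  uris   = ['gs://my-data-lake/page_views/*.parquet']
);

SELECT COUNT(*) FROM lake_analytics.page_views_ext;
```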
In parallel, another approach has emerged: some warehouse platforms have chosen to adopt and integrate a specific open table format more deeply into their proprietary stack. This model replaces the warehouse’s native storage format with an open one while continuing to manage it through closed services such as metadata catalogs, optimization layers, and internal lifecycle tools. We refer to this as a vendor-managed open table model.
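Snowflake-managed Iceberg tables are one concrete instance of this model. The sketch below is illustrative only (the external volume, schema, and names are assumptions, and exact options vary by version): the data lands as Iceberg/Parquet on customer-supplied storage, but the catalog and table maintenance remain inside Snowflake.

```sql
-- Snowflake-managed Iceberg table: open files on customer storage,
-- closed catalog and lifecycle management.
CREATE ICEBERG TABLE analytics.orders (
  order_id  BIGINT,
  amount    DECIMAL(10,2),
  placed_at TIMESTAMP
)
  CATALOG         = 'SNOWFLAKE'      -- Snowflake acts as the Iceberg catalog
  EXTERNAL_VOLUME = 'lake_volume'    -- pre-configured pointer to e.g. S3
  BASE_LOCATION   = 'warehouse/orders/';
```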
One of the most compelling reasons customers adopt a lakehouse architecture is its ‘openness’ - the ability to store data in open file and table formats as an independent storage tier, enabling any compute engine to be layered on top based on workload requirements. As lakehouse adoption grew, enterprises sought greater flexibility to use openly stored data in a lakehouse format rather than being locked into proprietary systems. In response to these demands for flexibility, customer control, reduced vendor lock-in, and cost efficiency, cloud warehouses introduced support for lakehouse formats such as Apache Hudi, Apache Iceberg, and Delta Lake as external tables and vendor-managed open formats.
The introduction of open table format support in cloud warehouses signals a positive industry shift, but there are nuances to how this model is implemented. While some cloud warehouses now allow querying formats as external tables or offer a managed version of the format, every major cloud warehouse still defaults to its own proprietary storage format and reserves full platform support for its native tables. These implementations, though promising, surface some fundamental limitations in how openness is delivered today.
To understand these limitations more concretely, let’s examine how open table formats are currently implemented across widely used cloud warehouses such as Snowflake, Redshift, and BigQuery. While support is evolving, several functional gaps still remain - especially when compared to the capabilities available with native tables.
External or managed table support is primarily read-only, with inconsistent or missing support for core DML operations (INSERT, UPDATE, DELETE) across cloud warehouses.
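As an illustrative sketch of this gap (reusing the hypothetical BigQuery external table from earlier; the error text is paraphrased and exact behavior varies by platform and version):

```sql
-- Reads over an external table generally work:
SELECT COUNT(*) FROM lake_analytics.page_views_ext;

-- ...but writing back through the warehouse is typically rejected,
-- because external tables are read-only on most platforms:
UPDATE lake_analytics.page_views_ext
SET url = 'redacted'
WHERE user_id = 42;
-- error (paraphrased): DML statements are not supported over external tables
```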
Takeaway: While cloud warehouses have introduced support for open table formats, their external and managed table implementations remain constrained, offering limited or no write support, enforcing platform-specific constraints, and restricting interoperability with external engines.
Many of the key features that make Iceberg, Delta, and Hudi powerful, such as schema evolution, time travel, partitioning, and query optimizations, remain either unsupported or inconsistently implemented across cloud warehouses.
Schema Evolution & Metadata Limitations:
Schema evolution is one of the most important features of open table formats, yet support varies significantly.
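For reference, this is what schema evolution looks like when run directly through an open engine: a minimal Spark SQL sketch on a hypothetical Iceberg table (the `discount`, `amount`, and `quantity` columns are assumptions). The same DDL may be unsupported, or behave differently, when issued through a warehouse’s integration.

```sql
-- Metadata-only schema changes on an Iceberg table; no data rewrite.
ALTER TABLE lake.analytics.orders ADD COLUMN discount DECIMAL(10,2);
ALTER TABLE lake.analytics.orders RENAME COLUMN amount TO gross_amount;
ALTER TABLE lake.analytics.orders ALTER COLUMN quantity TYPE BIGINT;  -- safe int -> bigint widening
```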
Time Travel & Versioning:
Time travel and rollback capabilities - a core advantage of open table formats - are also fragmented in cloud warehouse implementations.
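As a hedged illustration, compare the open-engine syntax with a warehouse-native flavor (table names and the snapshot id are made up; the availability of each form depends on the platform and table type):

```sql
-- Iceberg time travel via Spark SQL:
SELECT * FROM lake.analytics.orders TIMESTAMP AS OF '2024-06-01 00:00:00';
SELECT * FROM lake.analytics.orders VERSION AS OF 4348298019380;  -- snapshot id

-- Snowflake exposes its own, non-portable syntax instead:
SELECT * FROM analytics.orders AT (OFFSET => -3600);              -- one hour ago
```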
Partition, Clustering, and Query Optimizations:
Performance optimizations such as partition pruning, clustering, and indexing are critical for scaling open table formats, yet cloud warehouse implementations are incomplete.
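For contrast, here is a minimal Spark SQL sketch of Iceberg’s hidden partitioning on a hypothetical events table; whether a warehouse’s planner actually prunes on such transforms is exactly the kind of gap in question.

```sql
-- Hidden partitioning: the table is laid out by day without exposing
-- a separate partition column to queries.
CREATE TABLE lake.analytics.events (
  event_id BIGINT,
  event_ts TIMESTAMP,
  payload  STRING
)
USING iceberg
PARTITIONED BY (days(event_ts));

-- Engines that understand the transform prune to a few daily partitions:
SELECT COUNT(*)
FROM lake.analytics.events
WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00';
```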
Takeaway: Cloud warehouses do not provide the same level of advanced capabilities available to their native tables. Schema evolution, time travel, and query optimizations - key advantages of open table formats - are either missing or inconsistently implemented. Users relying on these formats in cloud warehouses may have to navigate fragmented support, manual interventions, and external service dependencies.
Here’s a comparative analysis of feature support for open table formats across the three cloud data warehouses.
Note: The feature support for open table formats across cloud warehouse platforms is evolving. The information in this table reflects the current state at the time of writing, but capabilities may change as vendors continue to enhance their support. Please refer to the official documentation for the most up-to-date details.
Apart from these functional gaps, cloud warehouses also lack extensibility compared to open source platforms and engines. One of the defining advantages of open source compute engines such as Apache Spark, Flink, or Ray is their extensibility: developers can plug in new table formats, extend read/write capabilities, or build custom connectors to suit their workload needs. This level of control is not available in cloud warehouses, where the execution engine is tightly controlled and closed to external extension.
For example, you cannot write a new format plugin or extend the compute layer in Snowflake to support a table format that isn’t natively supported, whereas in OSS frameworks you can integrate any format that implements an open spec and APIs. This restricts innovation and flexibility, and reinforces a vendor-defined boundary around what’s possible with open table formats.
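In Spark, for instance, that extension point is the DataSource API: any source on the classpath can be addressed by name in SQL. A minimal sketch with Apache Hudi (names and location are illustrative, and the Hudi Spark bundle is assumed to be installed):

```sql
-- The USING clause binds the table to whatever pluggable source
-- implements the format - built-in, third-party, or your own.
CREATE TABLE hudi_orders (
  order_id BIGINT,
  amount   DECIMAL(10,2),
  ts       TIMESTAMP
)
USING hudi
TBLPROPERTIES (primaryKey = 'order_id', preCombineField = 'ts')
LOCATION 's3://my-data-lake/hudi_orders/';
```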
While cloud warehouses now support external tables and offer managed versions of open table formats, these approaches still introduce constraints that limit the full potential of an open data architecture. Vendors often paint a picture in which adopting an open table format automatically makes the architecture open, enabling multiple compute engines to interact with the same set of tables without restrictions or performance implications. As we have seen, that is not the case today with external tables or managed flavors of open table formats.
As highlighted in the previous sections, external tables and warehouse-managed versions of open table formats still impose critical limitations that go against the principles of an open data architecture. In this section, we go over why all of this matters.
Major cloud warehouses have acknowledged the importance of the openness enabled by lakehouse table formats - driven by growing customer demand for open and interoperable architectures that allow multiple workloads to operate on a single source of truth. While some support has emerged, the conversation now needs to move beyond basic enablement toward standardization and true interoperability. Real progress will come when open table formats are no longer treated as exceptions or add-ons, but as fully integrated, first-class layers within modern data architectures. To reach that future, a few critical points must be addressed.
First, open table formats are currently supported as exceptions, not defaults, and they often lack critical capabilities that are available for native warehouse tables.
Second, for open formats to become truly useful in enterprise environments, they must be supported with the same depth and operational tooling as native tables, without forcing customers to trade off openness for performance.
Third, today’s restrictions break the core tenet of openness, where any compatible engine should be able to read and write to the same dataset. To move forward, cloud warehouses must enable multi-engine interoperability - and this extends to other layers as well, such as catalogs and storage engines (for table optimizations).
The road to open data architecture doesn’t end at open format support - it begins with interoperability. We’ll explore these topics in greater depth in future parts of this series.