April 28, 2025

Towards Open Data - Part 1: Cloud Warehouses Now Love Open Formats
TL;DR: This blog explores how cloud data warehouses have started supporting open table formats (Iceberg, Hudi, Delta) — a notable shift from their traditionally closed ecosystems. But it argues that current support remains limited and inconsistent, often falling short of true openness in terms of interoperability and feature parity.

While data lakes have long embraced openness, cloud data warehouses have evolved differently. Their evolutionary arc has been primarily driven by the need to overcome the scaling challenges and rigid architectures of on-prem data warehouses, where cost and performance inefficiencies made it difficult to adapt to growing analytical workloads. Designed for performance and ease of use, cloud warehouses introduced managed, elastic compute but retained proprietary (closed) storage formats and tightly integrated query engines.

Recent trends, particularly the rise of open table formats, have begun to reshape this landscape. Table formats such as Apache Hudi™, Apache Iceberg™, and Delta Lake, when combined with open file formats such as Apache Parquet™, enable an independent data layer that introduces modularity and flexibility, allowing organizations to plug in any compute engine as needed. By supporting these formats, cloud warehouses have a path to address some of their long-standing limitations with closed storage architectures. But how open are cloud warehouses, really? In this series, we discuss various important topics on our path towards open data.

The Tale of Two Datastores

Over the years, data lakes and cloud data warehouses have represented two contrasting approaches to managing analytical data. Data lakes emerged as an open ecosystem (starting with the Hadoop era), built on open storage formats (e.g., Parquet, ORC, and Avro) and designed to work with various compute engines such as Apache Spark™, Apache Flink™, and Presto/Trino. Even as early as the 2000s, data lakes decoupled compute from storage, allowing organizations to scale each component independently, optimizing both resource management and costs. This also provided the advantage of not being tied to a specific vendor’s compute engine for processing needs. This model has since been adopted widely across organizations.

Another enabler of this openness was metadata management. Catalogs like Hive Metastore (HMS) provided a pluggable metadata layer, allowing users to register, manage, and query datasets using standard SQL engines while maintaining the flexibility to switch between different compute engines. When new data was written to the storage, it could be immediately registered in Hive Metastore by calling the Metastore API from any data application, ETL pipeline, or orchestration tool with HMS support, making the data instantly discoverable for querying.
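To make the registration pattern concrete, here is a minimal, hedged sketch of the Hive-Metastore-style workflow described above: any writer registers a dataset's location and schema in a shared catalog, and any engine with catalog access can then discover it. This uses an in-memory stand-in class, not the real HMS Thrift API.

```python
# Minimal in-memory stand-in for a Hive-Metastore-style catalog.
# Illustrative only; the real HMS exposes a Thrift API with far more detail.

class Metastore:
    def __init__(self):
        self._tables = {}  # (database, table) -> table metadata

    def create_table(self, db, name, location, schema):
        # A writer registers where the data lives and what it looks like.
        self._tables[(db, name)] = {"location": location, "schema": schema}

    def get_table(self, db, name):
        # Any engine resolves the same metadata for querying.
        return self._tables[(db, name)]

# An ETL job lands files in object storage, then registers them...
ms = Metastore()
ms.create_table("sales", "orders", "s3://lake/sales/orders/",
                {"order_id": "bigint", "amount": "double"})

# ...and the dataset is instantly discoverable by any engine with catalog access.
print(ms.get_table("sales", "orders")["location"])
```

The key design point is that the catalog is pluggable and engine-agnostic: the writer and the reader need only agree on the metastore API, not on a shared compute engine.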

Cloud data warehouses, on the other hand, have traditionally followed a closed ecosystem model, where storage, compute, and metadata management (catalog) are tightly integrated. Platforms such as Snowflake, Amazon Redshift, and Google BigQuery have offered a fully managed, highly optimized experience, abstracting away infrastructure complexity and providing seamless performance. However, this tight integration also meant that users were bound to the warehouse’s native storage format and compute layers, making it difficult to interoperate with external systems.

When Snowflake, BigQuery, and Redshift emerged as cloud-native warehouses, they were designed to overcome the scalability and performance constraints of on-premises systems while providing a fully managed experience. Before cloud warehouses, most enterprises relied on on-prem warehouse solutions such as Teradata, Vertica, and Netezza, where storage and compute were tightly coupled and optimized for specific hardware (e.g., Oracle Exalogic, Teradata Appliances, etc.). Customers were required to purchase compute resources in proportion to storage, even if they only needed additional capacity for one. This led to high costs, inefficient scaling, and rigid architectures. Cloud warehouses addressed this problem by decoupling storage and compute, allowing users to scale each component independently and intelligently via autoscaling.

However, despite this design improvement, storage and compute still remained tightly integrated at the warehouse vendor level. Cloud warehouses continued to rely on proprietary storage formats, meaning data had to be ingested into the warehouse's internal storage before it could be queried efficiently. Unlike data lakes, where data remains open and accessible to various compute engines, cloud data warehouses behaved as monolithic systems where:

  • Data had to be moved into the warehouse's internal storage.
  • Data inside the warehouse was inaccessible to external compute engines such as Spark, Flink, or Trino.
  • Workloads were tightly coupled to the vendor’s query engine, limiting interoperability and increasing the potential for lock-in.

Initially, cloud warehouses offered little to no support for directly querying Apache Parquet or ORC files stored externally. Companies that used both a data lake and a cloud warehouse had to duplicate data - maintaining raw files in object storage while re-ingesting transformed versions into the warehouse for analysis. This increased costs and introduced data silos.

Over the past few years, the data lakehouse paradigm has emerged as a way to combine the best of both worlds, offering the performance optimizations and better data management capabilities of a warehouse while preserving the flexibility of a data lake. This shift has been fueled by open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake. Recognizing the growing popularity of these formats, cloud data warehouses have begun supporting them as external tables or managed versions of open formats. This theoretically allows organizations to query open table formats without first ingesting data into the warehouse’s native storage format. 

This is a significant shift as it signals that cloud warehouses acknowledge the industry’s demand for openness. However, it’s critical to examine what this means in practice. The real question is: Does supporting open table formats as external formats mean cloud warehouses are genuinely “open”?

External Tables & Managed Formats

The lakehouse architecture has fundamentally transformed data management by demonstrating that warehouses weren’t the only way to bring structure and performance to data lakes. Open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake introduce ACID transactions, schema evolution, and time travel capabilities traditionally found in warehouses, while keeping data in an open and accessible format. Additionally, these table formats optimize storage and query performance through techniques similar to those used in data warehouses, such as clustering, compaction, cleaning, and (in the case of Hudi) indexing. This closed a long-standing capability gap in data lakes.
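The mechanics behind these capabilities can be sketched briefly: open table formats layer transactional commits over immutable data files, where each commit produces a snapshot listing the live files, and readers choose a snapshot (latest, or a past one) for a consistent view. The sketch below is a simplified illustration, not how Iceberg, Hudi, or Delta implement it in detail.

```python
# Simplified model of snapshot-based commits and time travel in an
# open table format. Real formats differ in metadata layout and
# concurrency control; this is illustrative only.

class Table:
    def __init__(self):
        self.snapshots = []  # ordered, append-only commit history

    def commit(self, files):
        # Each commit atomically records the full set of live data files.
        snapshot_id = len(self.snapshots)
        self.snapshots.append({"id": snapshot_id, "files": list(files)})
        return snapshot_id

    def scan(self, as_of=None):
        # Readers get a consistent view: latest snapshot, or a past one
        # (time travel) if as_of is given.
        snap = self.snapshots[-1] if as_of is None else self.snapshots[as_of]
        return snap["files"]

t = Table()
v0 = t.commit(["part-0.parquet"])
v1 = t.commit(["part-0.parquet", "part-1.parquet"])
print(t.scan())          # latest view: both files
print(t.scan(as_of=v0))  # time travel: the table as of the first commit
```

Because snapshots are immutable and data files are never rewritten in place, readers and writers can operate concurrently without blocking each other, which is what makes multi-engine access possible in principle.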

As organizations sought more cost-efficient and flexible architectures, there was a demand for better ways to access data without forcing ingestion into proprietary storage formats. This need led to the introduction of external tables that enable cloud warehouses to query data stored in external object stores (such as S3) in file formats such as Parquet, ORC, or Avro without physically moving the data into the warehouse. It is important to note that prior to 2021, cloud warehouse vendors such as Snowflake had very limited external table support for Parquet, requiring workarounds for accessing data in data lakes.

In parallel, another approach has also emerged: some warehouse platforms have chosen to adopt and integrate a specific open table format more deeply into their proprietary stack. This model replaces the warehouse’s native storage format with an open one while continuing to manage it through closed services such as metadata catalogs, optimization layers, and internal lifecycle tools. We refer to this as a vendor-managed open table model.

One of the most compelling reasons customers adopt a lakehouse architecture is its ‘openness’ - the ability to store data in open file and table formats as an independent storage tier, enabling any compute engine to be layered on top based on workload requirements. As lakehouse adoption grew, enterprises sought greater flexibility to use openly stored data in a lakehouse format rather than being locked into proprietary systems. In response to these demands for flexibility, customer control, reduced vendor lock-in, and cost efficiency, cloud warehouses introduced support for lakehouse formats such as Apache Hudi, Apache Iceberg, and Delta Lake as external tables and vendor-managed open formats.

The Reality Check: How "Open" Are Cloud Warehouses Now?

The introduction of open table format support in cloud warehouses signals a positive industry shift, but there are nuances to how this model is implemented. While some cloud warehouses now allow querying formats as external tables or offer a managed version of the format, every major cloud warehouse still defaults to its own proprietary storage format and reserves full platform support for its native tables. These implementations, though promising, surface some fundamental limitations in how openness is delivered today:

  • Support is limited to external tables & vendor-managed formats: Cloud warehouses today support open formats either as external tables or through vendor-managed format models. While both approaches enable some level of support for open formats, they introduce platform-specific dependencies and limit cross-platform interoperability.
  • External table support is not yet first-class: Open table formats in cloud warehouses still operate in a constrained model. External tables often lack full support for critical features like ACID guarantees, time travel, schema evolution, and write operations. Vendor-managed formats offer partial improvements but come with platform lock-in and don’t yet match the capabilities of native tables.
  • True openness demands interoperability: Openness isn’t just about supporting table formats - it’s about enabling flexibility across compute, storage, and metadata. A truly open lakehouse lets organizations mix and match components freely. Today, cloud warehouses retain varying degrees of control, restricting critical operations and limiting external interoperability.

To understand these limitations more concretely, let’s examine how open table formats are currently implemented across widely used cloud warehouses - Snowflake, Redshift, and BigQuery. While support is evolving, several functional gaps still remain, especially when compared to the capabilities available with native tables.

External/Managed Open Tables Do Not Offer Parity with Native Tables

External or managed table support is primarily read-only, with inconsistent or missing support for core DML operations (INSERT, UPDATE, DELETE) across cloud warehouses.

  • Snowflake offers support for Iceberg tables via two approaches - Snowflake-managed Iceberg tables, where the catalog and table lifecycle is managed by Snowflake, and externally-managed Iceberg tables, where Snowflake connects to an external Iceberg catalog (such as AWS Glue or an Open Catalog).
  • Snowflake allows writes only for Snowflake-managed Iceberg tables, where it controls metadata and table lifecycle. However, externally managed Iceberg tables (such as those registered in AWS Glue or Open Catalog) are read-only, preventing updates from Snowflake. Support for other formats such as Delta Lake is strictly read-only, and there is no native Apache Hudi support (this would be possible with support for Apache XTable, which abstracts the translation of lakehouse table format metadata).
  • Redshift supports external table querying via Redshift Spectrum and also has support for Iceberg tables in Redshift Serverless. With Redshift Serverless, Iceberg support is read-only. There is no write support for Iceberg, and users must rely on other services such as AWS Athena or Apache Spark on EMR for ingestion and updates. Redshift Spectrum offers read-only support for Apache Hudi and Delta Lake tables as external tables, with key limitations.
  • BigQuery also has two different levels of support for open table formats. It allows using BigLake tables to query external open formats including Apache Hudi, Apache Iceberg, and Delta Lake, but it also provides BigQuery-managed Iceberg tables. BigLake Iceberg tables are entirely read-only, and while BigQuery-managed Iceberg tables allow writes, modifications must be performed exclusively within BigQuery. External modifications (such as appending files via another engine) can result in query failures or data loss.

Takeaway: While cloud warehouses have introduced support for open table formats, their external and managed table implementations remain constrained, offering limited or no write support, enforcing platform-specific constraints, and restricting interoperability with external engines.
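The danger with out-of-band writes follows directly from how managed tables work: the warehouse's catalog holds the authoritative pointer to the current table metadata, so files appended directly to storage without a catalog commit are invisible to queries and may even be garbage-collected later. The sketch below illustrates this with a deliberately simplified single-pointer catalog; it is not any vendor's actual implementation.

```python
# Why external modifications break vendor-managed open tables: the
# warehouse only "sees" files recorded by a commit through its closed
# catalog. Simplified and illustrative only.

storage_files = {"part-0.parquet"}                 # what's in object storage
catalog = {"committed_files": {"part-0.parquet"}}  # the warehouse's view

def query_via_warehouse():
    # The warehouse reads only files recorded by a catalog commit.
    return catalog["committed_files"]

# An external engine appends a file directly to storage, but it cannot
# commit through the warehouse's proprietary catalog:
storage_files.add("part-1.parquet")

# The new file exists in storage yet is invisible to queries (and a
# cleanup job trusting the catalog could delete it as "orphaned").
orphaned = storage_files - query_via_warehouse()
print(orphaned)
```

This is the mechanical reason vendors warn that external writes can cause query failures or data loss: correctness depends on every writer committing through one closed catalog, which is precisely the interoperability gap discussed above.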

Key Capabilities Are Unavailable or Inconsistently Implemented

Many of the key features that make Iceberg, Delta, and Hudi powerful, such as schema evolution, time travel, partitioning, and query optimizations remain either unsupported or inconsistently implemented across cloud warehouses.

Schema Evolution & Metadata Limitations: 

Schema evolution is one of the most important features of open table formats, yet support varies significantly.

  • Snowflake has limited support for schema evolution for Snowflake-managed Iceberg tables, while externally managed Iceberg tables lack full metadata tracking and schema evolution flexibility. Both of these approaches do not have the full schema evolution capability as with native tables. Additionally, Snowflake does not support altering data file locations or snapshot metadata for externally managed Iceberg tables.
  • Neither Redshift Serverless nor Redshift Spectrum supports schema evolution for Iceberg tables; AWS recommends using AWS Glue (a separate service) to perform it.
  • BigQuery restricts schema evolution for Iceberg tables - modifications such as nested field additions via SQL DDL or INT to FLOAT conversions are unsupported.
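A schema-evolution check of the kind these engines apply can be sketched in a few lines, in the spirit of the Iceberg spec's type-promotion rules (int to long and float to double are safe, lossless widenings; int to float is not). The specific rule set below is illustrative; as the bullets above show, each warehouse accepts a different subset in practice.

```python
# Hedged sketch of a type-promotion compatibility check, loosely modeled
# on the Iceberg spec's safe widenings. Engines differ in what they accept.

SAFE_PROMOTIONS = {
    ("int", "long"),      # lossless integer widening
    ("float", "double"),  # lossless floating-point widening
}

def can_evolve(old_type, new_type):
    # A column change is allowed if the type is unchanged or the
    # promotion is a known-safe widening.
    return old_type == new_type or (old_type, new_type) in SAFE_PROMOTIONS

print(can_evolve("int", "long"))   # allowed: safe widening
print(can_evolve("int", "float"))  # rejected (e.g., by BigQuery)
```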

Time Travel & Versioning:

Time travel and rollback capabilities - a core advantage of open table formats - are also fragmented in cloud warehouse implementations.

  • Snowflake supports time travel only for Snowflake-managed Iceberg tables; externally managed Iceberg tables must be manually refreshed before snapshot expiration to retain version history.
  • Redshift does not support time travel for Iceberg tables at all, meaning users cannot query historical snapshots.
  • BigQuery also does not have support for time travel for Iceberg tables (either via BigLake or BigQuery-managed Iceberg).

Partition, Clustering, and Query Optimizations:

Performance optimizations such as partition pruning, clustering, and indexing are critical for scaling open table formats, yet cloud warehouse implementations are incomplete.

  • Snowflake only supports clustering for Snowflake-managed Iceberg tables, while externally managed Iceberg tables do not support clustering.
  • Redshift does not offer automatic materialized views or query rewriting for external Iceberg tables, making external queries less performant than native Redshift tables.
  • BigQuery does not support partitioning for Iceberg tables; only clustering is available.
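Partition pruning, one of the optimizations several warehouses withhold from external tables, is conceptually simple: file-level partition metadata lets the planner discard whole files before reading a single byte of data. Here is a minimal sketch under that assumption; real planners prune on richer statistics (min/max values, bloom filters, and so on).

```python
# Minimal illustration of partition pruning over file-level metadata.
# Real table formats track much richer per-file statistics.

files = [
    {"path": "dt=2025-01-01/part-0.parquet", "dt": "2025-01-01"},
    {"path": "dt=2025-01-02/part-1.parquet", "dt": "2025-01-02"},
    {"path": "dt=2025-01-03/part-2.parquet", "dt": "2025-01-03"},
]

def prune(files, dt_min):
    # Keep only files whose partition value can satisfy the predicate
    # dt >= dt_min; the rest are skipped without being opened.
    return [f["path"] for f in files if f["dt"] >= dt_min]

print(prune(files, "2025-01-02"))  # scans 2 of 3 files
```

When a warehouse skips this step for external tables, every query degenerates to a full scan of the file listing, which is a large part of why external queries benchmark worse than native ones.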

Takeaway: Cloud warehouses do not provide the same level of advanced capabilities available to their native tables. Schema evolution, time travel, and query optimizations - key advantages of open table formats - are either missing or inconsistently implemented. Users relying on these formats in cloud warehouses may have to navigate fragmented support, manual interventions, and external service dependencies.

Here’s a comparative analysis of feature support for open table format across the three cloud data warehouses.

| Feature | Snowflake | Redshift | BigQuery |
|---|---|---|---|
| Support for Iceberg | Snowflake-managed & externally managed | Redshift Spectrum & Redshift Serverless | BigLake external & BigQuery-managed Iceberg |
| Support for Delta | Very basic support | Read-only support via Spectrum | Read-only support via BigLake external |
| Support for Hudi | No native support | Read-only support via Spectrum | Read-only support via BigLake external |
| Write support | Only for Snowflake-managed Iceberg tables | No write support | Only for BigQuery-managed Iceberg tables |
| Schema evolution support | Limited (only Snowflake-managed Iceberg) | No schema evolution | Limited (no INT to FLOAT coercion, no nested field addition via SQL DDL) |
| Time travel support | Good support for Snowflake-managed Iceberg tables; limited for externally managed Iceberg (requires manual refresh) | No time travel support | No time travel support |
| Partitioning support | Partial support for Snowflake-managed Iceberg (not micro-partitioned) | Supported | No partitioning |
| Clustering support | Only for Snowflake-managed Iceberg tables | Not supported | Supported |
| Query performance optimizations (materialized views, rewrites) | Limited optimizations for external & managed Iceberg tables | Supports materialized views for Iceberg, but no auto refresh or auto query rewrites | No materialized views or automatic query rewrites |
| External modification allowed | Not allowed for Snowflake-managed Iceberg tables | No external modifications allowed | No external modifications allowed (may cause data loss) |
| Vendor-managed open table format option | Snowflake-managed Iceberg | Redshift Serverless Iceberg | BigQuery-managed Iceberg |

Note: The feature support for open table formats across cloud warehouse platforms is evolving. The information in this table reflects the current state at the time of writing, but capabilities may change as vendors continue to enhance their support. Please refer to the official documentation for the most up-to-date details.

Lack of Extensibility Compared to Open Source Engines

Apart from the functional gaps, cloud warehouses also lack extensibility compared to open source platforms and engines. One of the defining advantages of open source compute engines such as Apache Spark, Flink, or Ray is their extensibility: developers can plug in new table formats, extend read/write capabilities, or build custom connectors to suit their workload needs. This level of control is not available in cloud warehouses, where the execution engine is tightly controlled and closed to external extension.

For example, you cannot write a new format plugin or extend the compute layer in Snowflake to support a table format that isn’t natively supported, whereas in OSS frameworks, you can integrate any format that follows an open spec and APIs. This restricts innovation and flexibility, and reinforces a vendor-defined boundary around what’s possible with open table formats.
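The plugin pattern OSS engines rely on can be sketched with a small registry: any reader class that conforms to an agreed interface can be registered under a format name, without touching the engine core. The names below (`register_format`, `MyFormatReader`) are hypothetical illustrations, not a real Spark or Flink API.

```python
# Sketch of an extensible format-reader registry, the kind of plug-in
# point OSS engines expose and closed warehouses do not. All names here
# are hypothetical.

READERS = {}

def register_format(name):
    # Decorator that registers a reader class under a format name.
    def wrap(cls):
        READERS[name] = cls
        return cls
    return wrap

@register_format("hudi")
class HudiReader:
    def read(self, path):
        return f"reading Hudi table at {path}"

# A third party can add a brand-new format without modifying the engine:
@register_format("myformat")
class MyFormatReader:
    def read(self, path):
        return f"reading MyFormat table at {path}"

print(READERS["myformat"]().read("s3://lake/t1"))
```

In a closed warehouse there is no equivalent registration point: the set of supported formats is fixed by the vendor, which is exactly the boundary the paragraph above describes.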

The Bigger Picture: Why Does This Matter?

While cloud warehouses now support external tables and offer managed versions of open table formats, these approaches still introduce constraints that limit the full potential of an open data architecture. A common picture that is usually painted by vendors is that using an open table format model suddenly makes the architecture open, enabling multiple compute engines to interact with the same set of tables without any restrictions or performance implications. This isn’t the case today (as we saw) with external tables or managed flavors of open table formats. 

As highlighted in the previous sections, external formats or warehouse-managed versions of open table formats still impose critical limitations that go against the principles of an open data architecture. In this section, we go over why all of this matters.

  • Limited multi-engine interoperability: Managed open table formats in warehouses often restrict writes from external tools. For example, Snowflake does not allow third-party clients to modify Snowflake-managed Iceberg tables, and BigQuery warns that modifying Iceberg tables externally can cause query failures or even data loss. This directly contradicts the promise of open table formats, which are supposed to be engine-agnostic and allow seamless collaboration across different compute engines.
  • Partial functionality compared to native tables: As seen in our comparison, external table support remains a second-class citizen in cloud warehouses, with missing features such as schema evolution, time travel, and performance optimizations. External tables today lack full capabilities, reinforcing the warehouse’s preference for proprietary, native storage.
  • New lock-ins with proprietary catalogs & table services: In cases where warehouses offer a managed open table format (e.g., Snowflake-managed Iceberg), users mostly have to depend on the warehouse’s catalog and table management services to get the best of open table formats’ capabilities. There are also no ways for customers to use open table management services (e.g., Spark’s procedure to clean Iceberg snapshots or Hudi’s open clustering table service) inside these platforms. The reliance on proprietary catalogs and table management services creates a new, hidden layer of lock-in, making it difficult to use the same data and services seamlessly across multiple platforms.
  • External table performance trade-offs reinforce warehouse-native storage: At its current state, performance optimizations (e.g., clustering, materialized views, query rewriting, and indexing) are primarily available for native warehouse tables. External tables often lack these enhancements, while there is very limited support for these in the managed-versions of open table formats. As a result, users may be incentivized to migrate data into proprietary storage for better performance, which again defeats the purpose of an open architecture.
  • Inconsistent implementations make standardization difficult: As seen in the functional gap analysis, each cloud warehouse implements managed versions of open table formats differently, meaning organizations cannot simply "adopt Iceberg or Hudi or Delta" and expect the same experience across Snowflake, Redshift, and BigQuery. This forces organizations to make warehouse-specific decisions, which can lead to fragmentation and operational complexity, further reinforcing vendor dependence.

Looking Forward

Major cloud warehouses have acknowledged the importance of the openness enabled by lakehouse table formats - driven by growing customer demand for open and interoperable architectures that allow multiple workloads to operate on a single source of truth. While some support has emerged, the conversation now needs to move beyond basic enablement toward standardization and true interoperability. Real progress will come when open table formats are no longer treated as exceptions or add-ons, but as fully integrated, first-class layers within modern data architectures. To reach that future, a few critical points must be addressed:

  1. Standardized implementation of open table formats: One of the biggest challenges today is the inconsistent implementation of open formats such as Iceberg, Delta, and Hudi across different cloud warehouses. Even though these formats have well-defined specifications and APIs, warehouse platforms often implement them selectively or in their own ‘internal way’ - sometimes skipping key features or layering proprietary behaviors on top.

    Going forward, cloud warehouses must work toward standards-based compatibility, i.e. adhering more closely to the format’s reference specs and read/write APIs. Doing so would ensure that an open table behaves the same regardless of whether it’s queried from a particular vendor or an open source engine, removing friction and increasing interoperability.
  2. Closing the gaps between native and open table formats: Today, open table formats in cloud warehouses are not treated on par with native tables. Most platforms recommend defaulting to proprietary storage formats for performance, feature completeness, and tighter integration. For example, Snowflake highlights that Iceberg is not a replacement for native Snowflake tables and is only recommended when:
    1. Data cannot be moved into Snowflake-managed storage,
    2. External engines need to access the same data,
    3. Or customers have already committed to Iceberg for other workloads.

This reinforces that open table formats are currently supported as exceptions, not defaults, and they often lack critical capabilities that are available for native warehouse tables.

For open formats to become truly useful in enterprise environments, they must be supported with the same depth and operational tooling as native tables, without forcing customers to trade off openness for performance.

  3. Interoperability must be a first-class concern: The core value proposition of an open lakehouse is modular interoperability - where organizations can mix and match compute, storage, and catalog components to suit different use cases. But as we've discussed, today’s external tables and vendor-managed open formats impose restrictions on how third-party tools can interact with the data, with some explicitly warning against using other compute engines to modify tables.

These restrictions break the core tenet of openness, where any compatible engine should be able to read and write to the same dataset. To move forward, cloud warehouses must enable multi-engine interoperability. This also extends to interoperability in other layers, such as with catalogs and storage engines (for table optimizations).

The road to open data architecture doesn’t end at open format support - it begins with interoperability. We’ll explore these topics in greater depth in the future parts of this series.

Authors
Dipankar Mazumdar
Staff Developer Advocate

Dipankar is currently a Staff Developer Advocate at Onehouse, where he focuses on open-source projects such as Apache Hudi & XTable to help engineering teams build robust data platforms. Before this, he contributed to other critical projects such as Apache Iceberg & Apache Arrow. For most of his career, Dipankar worked at the intersection of Data Engineering and Machine Learning. He is also the author of the book "Engineering Lakehouses using Open Table Formats" and has been a speaker at numerous conferences such as Data+AI, ApacheCon, Scale By the Bay, Data Day Texas among others.

Vinoth Chandar
CEO/Founder

Onehouse founder/CEO; Original creator and PMC Chair of Apache Hudi. Experience includes Confluent, Uber, Box, LinkedIn, Oracle. Education: Anna University / MIT; UT Austin. Onehouse author and speaker.