Two of the key pillars of the data lakehouse architectural pattern are openness and interoperability. Building a data lakehouse on cloud storage systems such as S3, GCS, or ADLS and storing your data in open formats provides a ubiquitous foundation that almost every other data service in your stack can leverage.
At the core of this architecture are the data lakehouse table formats Apache Hudi, Apache Iceberg, and Delta Lake. Each of these projects has unique technical differentiators and a strong, growing community, which makes it increasingly difficult to choose the format that best fits a particular use case.
In February, Onehouse announced OneTable and invited any interested parties to collaborate and build bridges between these projects. Two partners in particular, Microsoft and Google, expressed interest, driven by demand from their own customers. After a few months of collaboration and extensive testing of an MVP, today we are excited to announce that OneTable is now open source and publicly available for the rest of the community on GitHub: https://github.com/onetable-io/onetable.
Learn more about OneTable at the official website: https://onetable.dev
Watch this presentation from Onehouse, Microsoft, and Google describing how OneTable works and showing demos across Spark, Trino, Microsoft Fabric, and Google BigQuery and BigLake: https://opensourcedatasummit.com/
To understand what OneTable is and how it works, let’s first cover the basics of the data lakehouse table formats Apache Hudi, Apache Iceberg, and Delta Lake. Each of these projects provides a special metadata layer on top of Apache Parquet files. Hudi uses a metadata timeline, Iceberg uses Avro manifest files, and Delta uses JSON transaction logs, but the common denominator across these formats is the actual data, which is stored in Parquet files.
OneTable is not a new table format. Instead, OneTable provides the tools and abstractions necessary to seamlessly convert Hudi, Delta, and Iceberg metadata in an omni-directional way. Omni-directional means you can start from any format and convert to any other format, and you can round-trip or round-robin across them in any combination you need. The performance overhead is very low because no data is ever copied or rewritten, only a small amount of metadata. When using OneTable, the metadata layers from all three projects can be stored side-by-side in the same directory, making the same “table” available to be queried as a native Delta, Hudi, or Iceberg table.
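To make the side-by-side layout concrete, here is a minimal sketch that checks which formats' metadata directories are present under a table's base path. It is not part of OneTable: the function name detect_table_formats and the FORMAT_MARKERS mapping are hypothetical, and the sketch assumes a local filesystem and the conventional default locations (.hoodie for Hudi, _delta_log for Delta, metadata for Iceberg).

```python
import tempfile
from pathlib import Path

# Conventional metadata directory names relative to the table base path.
# Assumption: default layouts; a table configured differently would not match.
FORMAT_MARKERS = {
    "HUDI": ".hoodie",
    "DELTA": "_delta_log",
    "ICEBERG": "metadata",
}


def detect_table_formats(base_path):
    """Return the sorted names of formats whose metadata directory exists."""
    base = Path(base_path)
    return sorted(
        name for name, marker in FORMAT_MARKERS.items()
        if (base / marker).is_dir()
    )


if __name__ == "__main__":
    # Simulate a table whose Parquet files are shared by three metadata layers.
    with tempfile.TemporaryDirectory() as tmp:
        for marker in FORMAT_MARKERS.values():
            (Path(tmp) / marker).mkdir()
        print(detect_table_formats(tmp))
```

A check like this only confirms that a metadata layer exists; it says nothing about whether the layers are in sync with each other.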
The metadata conversion is achieved with lightweight abstraction layers built around a common in-memory model of a table. This common model can represent everything from the schema and partitioning information to file metadata such as column-level statistics, row count, and size. Alongside the model, there are interfaces for sources and targets, which are responsible for translating to and from this model respectively. These interfaces are designed so that users can extend and evolve the functionality OneTable provides today for the three major table formats. For example, a developer could write an Apache Paimon implementation of the source interface and immediately be able to expose those tables as Iceberg, Hudi, and Delta to gain compatibility with existing tools and products in the data lake ecosystem. See more details at the GitHub repo: https://github.com/onetable-io/onetable
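The shape of that design can be sketched in a few lines. This is an illustration only, not OneTable's actual API (the project itself is written in Java): every name here (DataFile, TableSnapshot, SourceReader, TargetWriter, sync, InMemorySource, RecordingTarget) is hypothetical, and the toy source and target stand in for real format-specific implementations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol, Tuple


@dataclass(frozen=True)
class DataFile:
    """File-level metadata; the Parquet file itself is never touched."""
    path: str
    row_count: int
    size_bytes: int
    # column name -> (min, max); hypothetical shape for column statistics
    column_stats: Dict[str, Tuple[object, object]] = field(default_factory=dict)


@dataclass(frozen=True)
class TableSnapshot:
    """Format-neutral, in-memory description of a table."""
    name: str
    schema: Dict[str, str]
    partition_fields: List[str]
    files: List[DataFile]


class SourceReader(Protocol):
    """Translates one format's metadata INTO the common model."""
    def read_snapshot(self) -> TableSnapshot: ...


class TargetWriter(Protocol):
    """Translates the common model OUT to one format's metadata."""
    def write_snapshot(self, snapshot: TableSnapshot) -> None: ...


def sync(source: SourceReader, targets: List[TargetWriter]) -> TableSnapshot:
    """Read the source once, then hand the same snapshot to every target."""
    snapshot = source.read_snapshot()
    for target in targets:
        target.write_snapshot(snapshot)
    return snapshot


class InMemorySource:
    """Toy source that returns a prebuilt snapshot."""
    def __init__(self, snapshot: TableSnapshot) -> None:
        self._snapshot = snapshot

    def read_snapshot(self) -> TableSnapshot:
        return self._snapshot


class RecordingTarget:
    """Toy target that records what it was asked to write."""
    def __init__(self, format_name: str) -> None:
        self.format_name = format_name
        self.received: List[TableSnapshot] = []

    def write_snapshot(self, snapshot: TableSnapshot) -> None:
        self.received.append(snapshot)


if __name__ == "__main__":
    table = TableSnapshot(
        name="trips",
        schema={"id": "long", "city": "string"},
        partition_fields=["city"],
        files=[DataFile("city=sf/part-0.parquet", 100, 2048, {"id": (1, 100)})],
    )
    targets = [RecordingTarget("DELTA"), RecordingTarget("ICEBERG")]
    sync(InMemorySource(table), targets)
    for t in targets:
        print(t.format_name, [f.path for f in t.received[0].files])
```

Under this shape, supporting a new format means writing one reader or one writer against the common model, rather than one converter per pair of formats.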
How organizations use OneTable today
Onehouse supports several customers using OneTable in production today. Some customers want their data available in both Databricks Delta and Snowflake’s private preview Iceberg tables. Some users need the fast ingestion and incremental processing of Hudi, but they also want to take advantage of some of the special caching layers inside BigQuery’s support for Iceberg tables. Some users only need one format, but they want the assurance of being future-proof, and Onehouse gives them all three simultaneously.
Watch the video from Open Source Data Summit to see a fun demo of Microsoft Fabric joining three tables from Hudi, Delta, and Iceberg into analytics in a single Power BI dashboard.
The Road Ahead
Today’s announcement that OneTable is open source is only the beginning of the journey. The project currently offers the basic foundation and support for omni-directional interoperability, but plenty of exciting things still remain to be designed and built together in the community. The roadmap below contains a rough outline of some of the advancements we want to build in the coming year and beyond.
A foundational element of this project's success is that it is neutral and governed by strong community principles. We are starting day zero from a very strong position of diverse community support. Beyond the initial code contributors, we have mentors and advocates who are supporting the project development across Microsoft, Google, Cloudera, Netflix, Apple, Adobe, Amazon, LinkedIn, and more. To ensure this foundation is prioritized, we are also excited to announce our intent to submit the project for incubation into the Apache Software Foundation: https://cwiki.apache.org/confluence/display/INCUBATOR/OneTable+Proposal
What makes the difference between a good open source project and a great open source project is the community. So today we eagerly invite you to join us! Come to the GitHub repo, try out the quickstart, add a little star, open an issue, start a discussion, or send in your PRs and become one of the early committers. If you have ideas, questions, or want to chat with someone directly, please reach out to any of the current GitHub contributors, who would be happy to talk more.