Unity Catalog and Open Table Format Unification
Open table formats, including Delta Lake, Apache Hudi™, and Apache Iceberg™, have become the leading industry standards for Lakehouse storage. As these formats have grown in popularity, so has the importance of catalogs, which are responsible for managing reads and writes to tables. In this session, we will cover how new data silos have emerged around these two foundational components of a lakehouse. We will show you how Unity Catalog breaks data silos and how new features in OSS are unifying the Lakehouse ecosystem.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Demetrios - 00:00:06
I am here to welcome our second keynote of the day. Jonathan, where are you at? Let's bring you on to the stage. Hello, sir.
Jonathan Brito - 00:00:15
Hey folks. Great to have you here.
Demetrios - 00:00:19
Yes. So I am excited about your talk. I'm gonna jump off the stage. I'll share your screen, and I'll be back in 20 minutes to ask you a few questions.
Jonathan Brito - 00:00:29
Yeah, no worries at all. Awesome. So first and foremost, wanted to introduce myself. My name's Jonathan Brito. I'm a product manager at Databricks. I particularly work on Lakehouse storage. My focus area is really around Iceberg in particular. I lead a lot of our Iceberg related products, a handful of ones that we'll be talking about today, our managed Iceberg and foreign Iceberg tables, and then broadly, some of the efforts that we're trying to do in the open source communities for both Delta and Iceberg to really bring the ecosystem together.
The real crux of this talk is the ways that we can unify the Lakehouse. I'll start by talking about what we see as the big silos that have emerged in the Lakehouse, and the ways that we see both Databricks and the different communities working together to close those gaps for folks and resolve some incompatibilities. So let's get started.
Jonathan Brito - 00:01:28
First and foremost, a really common silo that folks bump into is what we call table format silos. This is kind of an inadvertent outcome of all the excitement around the open lakehouse architecture, in which open table formats have really taken the industry by storm, where a lot of folks have been rushing to adopt these formats, both customers and practitioners as well as platforms.
The problem, though, is that as all this grew, things weren't necessarily the most coordinated in terms of how folks adopted the different formats. So what you end up seeing is a world in which the Lakehouse became split primarily between Delta Lake and Iceberg, in which you had some platforms that lean heavily into Delta Lake, which you're seeing on the left hand side of this page.
Jonathan Brito - 00:02:18
So, obviously Databricks, which has some of the original creators of Delta Lake, as well as Fabric, which I think you just heard from earlier in the talk. Then on the right hand side is the Iceberg camp, particularly the AWS ecosystem and Snowflake, and you could probably add BigQuery to this as well, that lean a little more towards Iceberg. So you end up in a world where if you are building out a lakehouse, you have this really tough decision to make, which is which format to use. If you pick one, you get half the ecosystem; if you pick the other, you get the other half. But most customers want both. So how do you actually make that vision a reality? That's what we'll talk about a little bit today.
Jonathan Brito - 00:02:58
A lesser known issue that folks are starting to run into now, as they move more towards open formats and even manage to standardize on one format, is what we call catalog silos. The issue here is that there are a variety of different catalogs, and within the context of an open table format, catalogs actually play a much bigger role than they have in the past.
Formats like Iceberg, for instance, actually require a catalog, and Delta Lake is actually moving towards being more of a catalog based format itself. As a result, the catalog becomes the way that you actually access the data. It's how you manage commits, it's how you read these tables. It's also how you run optimizations on a lot of these tables to make sure they're performing and healthy over time.
Jonathan Brito - 00:03:45
So the catalog has a really important role of connecting your engine to your storage layer. What we've seen in the market is that different catalogs can connect to different engines. Even if you're able to get your data into a single format and you're happy and excited, picking the actual catalog that you use to manage your data can then shrink your ecosystem to a certain set of engines.
What you're seeing here is basically which catalogs can talk to which engines, which ones have full support, which ones have some limited support, and other ones that have no access at all. The real downside for customers is that this creates a scenario where your data governance and discovery can be really separated across disparate catalogs that can't necessarily talk to each other. So you have new system silos that are created.
Jonathan Brito - 00:04:38
Our solution for a lot of this is Unity Catalog, the catalog that's built into Databricks and is also available in open source as Unity Catalog. What we'll talk about today supports a whole variety of table formats. You're seeing this at the bottom of the page: not just Delta Lake, you can actually have Iceberg tables registered, Parquet, as well as other legacy file formats. From an objects perspective, you can manage really any object within Unity Catalog: tables, AI models, files, et cetera.
What's most important is what's on the outside of this diamond that you're seeing here, which is the ability to connect to other platforms and sources, which we'll spend a lot of time talking about today. This, I think, is a big differentiator for Unity Catalog because it really provides this openness and interoperability that breaks these data silos that we were talking about, both at the format level and at the catalog level.
Jonathan Brito - 00:05:38
All right, so here's the exciting part. Let's talk about what's coming and what's new. On the left hand side, this is a vision that we have for Unity Catalog going forward and how we'll be able to break these silos I just referenced.
The top half of the page is really the ability to connect to any client. So if you have any client, whether it be Delta or Iceberg, you can create a managed table within Unity Catalog. The big change here is that we're actually implementing the ability to write to Unity Catalog via the Iceberg REST Catalog APIs. If you're not familiar with those, they're standard interfaces within Iceberg that any catalog backend can implement, so that you have a standard interface for interacting with your Iceberg tables.
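For context, the REST Catalog interface mentioned here is published as an OpenAPI definition in the Iceberg project. A minimal sketch of its route structure looks like the following; the base URL and catalog prefix are hypothetical, and a real catalog advertises its prefix to clients via the config endpoint.

```python
# Sketch of the route structure defined by the Iceberg REST Catalog
# OpenAPI spec. The base URL below is hypothetical; a real catalog
# (Unity Catalog included) advertises its settings via GET /v1/config.

BASE = "https://catalog.example.com/api"  # hypothetical endpoint

def config_route() -> str:
    # Clients call this first to discover catalog properties and defaults.
    return f"{BASE}/v1/config"

def namespaces_route(prefix: str) -> str:
    # List or create namespaces (schemas) under the catalog.
    return f"{BASE}/v1/{prefix}/namespaces"

def tables_route(prefix: str, namespace: str) -> str:
    # List tables in a namespace, or POST here to create one.
    return f"{BASE}/v1/{prefix}/namespaces/{namespace}/tables"

def table_route(prefix: str, namespace: str, table: str) -> str:
    # Load table metadata (GET) or commit changes (POST).
    return f"{BASE}/v1/{prefix}/namespaces/{namespace}/tables/{table}"

print(table_route("my_catalog", "sales", "orders"))
```

Because every catalog backend implements the same routes, a client written once against this interface can talk to any conforming catalog.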
Jonathan Brito - 00:06:26
So from any Iceberg client, whether it's Spark, Trino, or Flink, you can create, write to, and read a managed table within Unity Catalog that we will then manage on your behalf. For Delta clients, we also make available the Unity REST APIs, which we open sourced last year at our summit around this time. So you have this really nice world where whether you're choosing a Delta or an Iceberg client, you can create a managed table and read and write to Unity Catalog.
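As a rough illustration of what "any Iceberg client" means in practice, here is a sketch of the Spark session properties an Iceberg Spark client typically sets to target a REST catalog. The catalog name, URI, and token below are placeholders, not real endpoints or credentials.

```python
# Sketch of the Spark session properties an Iceberg client typically sets
# to talk to a REST-based catalog. The catalog name "uc", the URI, and the
# token are placeholders for illustration only.

catalog = "uc"  # arbitrary local name for the catalog in Spark SQL

spark_confs = {
    # Use Iceberg's Spark catalog implementation for this catalog name.
    f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
    # Back it with the REST catalog client rather than Hive or Hadoop.
    f"spark.sql.catalog.{catalog}.type": "rest",
    # Hypothetical REST endpoint exposed by the catalog service.
    f"spark.sql.catalog.{catalog}.uri": "https://host.example.com/api/iceberg",
    # Bearer token for authentication (placeholder).
    f"spark.sql.catalog.{catalog}.token": "<personal-access-token>",
}

# With these set, a statement like
#   CREATE TABLE uc.sales.orders (id BIGINT) USING iceberg
# is routed through the REST catalog, which manages the commit.
for key, value in sorted(spark_confs.items()):
    print(f"{key}={value}")
```

Trino and Flink have analogous catalog properties pointing at the same REST endpoint, which is what makes the "one catalog, many engines" story work.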
The bottom half of the page is an equally important story. This is really the story around federation, which I'll talk about a little bit more in the next few pages. What federation does is provide an ability to interoperate with other catalogs, because for a lot of customers, what we've seen is that you're probably gonna end up in a world with multiple catalogs, just because you have multiple engines that require a specific catalog.
Jonathan Brito - 00:07:19
As a result of that, being able to actually access data that's managed by other catalogs becomes equally important for unifying the ecosystem. Lastly, we'll talk about interoperating across formats and some of the work happening in open source that makes this a lot easier. In the last session, you talked a lot about things like XTable. There's also a tool called UniForm that works with Delta Lake to do metadata translation so that tables can operate across formats. There's work happening in open source to make that even easier, so you don't actually need the metadata translation itself.
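The metadata-translation idea can be shown with a toy example. The dictionaries below are simplified stand-ins for a Delta log entry and Iceberg manifest, not their real schemas; the point is that the data files are written once and only metadata is generated for the second format.

```python
# Toy sketch of metadata translation (the UniForm / XTable approach):
# the Parquet data files are written once, and a second format's metadata
# is generated to point at the same files, rather than copying any data.

def delta_commit(files):
    # Simplified stand-in for a Delta log entry listing the table's files.
    return {"format": "delta", "add": sorted(files)}

def translate_to_iceberg(delta_meta):
    # Re-express the same file list in (simplified) Iceberg-shaped
    # metadata; no data files are rewritten, only metadata is produced.
    return {"format": "iceberg", "manifest": list(delta_meta["add"])}

data_files = {"part-000.parquet", "part-001.parquet"}
delta_meta = delta_commit(data_files)
iceberg_meta = translate_to_iceberg(delta_meta)

# Both metadata documents reference the identical underlying data files.
print(delta_meta["add"] == iceberg_meta["manifest"])
```

The open-source work discussed later in the talk aims to make the formats aligned enough that even this translation step becomes unnecessary.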
Okay, so let's start with the first of these pieces and talk a bit about the new look for managed tables. These managed tables have support for both Delta and Iceberg. So you could have a managed Delta table or a managed Iceberg table with Unity Catalog, and again, you have full access to the full ecosystem to do this.
Jonathan Brito - 00:08:17
The Iceberg tables are accessible via the Iceberg REST Catalog API, and you can also use tools within Databricks to make your Delta tables available to Iceberg clients as well. What's really nice is that all our managed tables benefit from a feature called predictive optimization, which actually looks at how you're using these tables and runs optimizations like file compaction, clustering, and storage optimizations. So you get really fast, performant tables out of the box without a lot of tuning, which for most folks isn't really their expertise or where they want to invest their resources.
Jonathan Brito - 00:08:55
In terms of federation, this is really the path that we've seen evolve in the ecosystem, where what we're trying to get to is bidirectional federation amongst catalogs. So pictured here is Unity Catalog, the Snowflake Horizon catalog, as well as the Glue Catalog. What you're seeing is this ability to federate in both directions.
On the Snowflake side, Snowflake can actually federate to Unity's Iceberg REST Catalog interface, and we have a private preview now for federating to the Snowflake Horizon catalog. On the Glue side, something very similar: we actually have the ability to federate to Glue, and we expect something similar in terms of Glue federating to Unity Catalog.
Jonathan Brito - 00:09:43
So what does this actually look like from a user experience perspective? If you want to access these foreign tables, it's really a few simple steps. First is creating a connection. In Unity Catalog, in the UI or through the SQL interface, you can create a connection to the other catalog, which gives you a way to authenticate using, say, OAuth or a personal access token.
From there, you can create a foreign catalog. This basically creates a catalog underneath your Unity Catalog metastore that will have all of your schemas and all of your tables, whether they be Delta, Iceberg, Parquet, really anything that you can possibly have in these foreign catalogs. You get a really nice unified view of your data across that catalog and across formats.
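The flow just described can be sketched as a pair of SQL statements (held here as Python strings so the shapes are easy to see). The connection type, option names, and identifiers are illustrative assumptions rather than exact syntax, so check your platform's reference for the precise OPTIONS each source requires.

```python
# Sketch of the two-step federation setup described in the talk.
# Option names and identifiers below are illustrative assumptions,
# not verified Databricks syntax.

create_connection = """
CREATE CONNECTION snowflake_conn TYPE snowflake
OPTIONS (
  host 'myaccount.snowflakecomputing.com',  -- hypothetical host
  user 'svc_user',                          -- hypothetical service user
  token 'redacted'                          -- OAuth or personal access token
)
""".strip()

create_foreign_catalog = """
CREATE FOREIGN CATALOG snowflake_sales
USING CONNECTION snowflake_conn
OPTIONS (database 'SALES')                  -- remote database to mirror
""".strip()

# Once created, the foreign catalog's tables read like any other table.
read_foreign_table = "SELECT * FROM snowflake_sales.public.orders LIMIT 10"

for stmt in (create_connection, create_foreign_catalog, read_foreign_table):
    print(stmt, end="\n\n")
```

The key design point is that credentials live on the connection object, while the foreign catalog simply mirrors the remote catalog's schemas and tables into the local namespace.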
Jonathan Brito - 00:10:41
From there, all you have to do is read the table. What Databricks and Unity Catalog do for you is a just-in-time refresh of the table metadata. So you never have to be concerned about questions like, "Hey, am I getting the latest snapshot? Did something happen on the other side? Is someone writing to this table? Are my downstream users actually seeing the latest, freshest data?" That's already guaranteed by Databricks. We'll do the refreshes for you.
One thing to call out: these tables are read only. So that is one difference between foreign tables as we call them here at Databricks and managed tables, which have both read and writes.
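The just-in-time refresh described above can be sketched conceptually. This is not Databricks internals, just a minimal cache that reloads a foreign table's metadata whenever the source catalog's current snapshot id has moved since the last read.

```python
# Conceptual sketch (not Databricks internals) of just-in-time metadata
# refresh for a foreign table: before each read, check the source
# catalog's current snapshot id and reload metadata only when it changed.

class ForeignTableReader:
    def __init__(self, current_snapshot_id, load_metadata):
        self._current_id = current_snapshot_id  # cheap "what's latest?" call
        self._load = load_metadata              # expensive full metadata load
        self._cached = None                     # (snapshot_id, metadata)

    def read(self, table):
        sid = self._current_id(table)
        if self._cached is None or self._cached[0] != sid:
            # Someone committed on the other side (or first read): refresh.
            self._cached = (sid, self._load(table, sid))
        return self._cached[1]

# Demo with a fake source catalog.
state = {"snapshot": 1}
loads = []  # records each expensive metadata load

reader = ForeignTableReader(
    current_snapshot_id=lambda t: state["snapshot"],
    load_metadata=lambda t, sid: (loads.append(sid) or f"{t}@snap{sid}"),
)

print(reader.read("orders"))  # first read loads snapshot 1
print(reader.read("orders"))  # cache hit, no reload
state["snapshot"] = 2         # a writer commits on the source side
print(reader.read("orders"))  # refreshed just in time
print(loads)
```

The check is cheap relative to a full metadata load, which is why downstream readers can always see the freshest committed snapshot without constant re-fetching.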
Jonathan Brito - 00:11:22
One really awesome thing that we've been seeing is the efforts in the open source communities to push towards format interoperability. I started the talk by mentioning how today a lot of the way you interoperate across formats is through metadata translation. There's already work happening in the open source communities to push the formats closer together.
Iceberg V3, for which there's actually a vote happening right now, is something we're really excited about, and we expect it to be adopted by the community very soon. What it does is provide consistent data and delete files across Delta and Iceberg.
Jonathan Brito - 00:12:19
A good example of this is something like deletion vectors, where deletion vectors are a feature that existed in Delta Lake, which allows you to move to more of a merge on read paradigm versus copy on write. Instead of every time you write to a table having to rewrite all the Parquet files that relate to that individual delete or update, you can actually just write out delete files, which then get merged on read. This was implemented in Delta Lake. A lot of customers loved it.
There was a similar feature in Iceberg called positional deletes, but in V2 a lot of folks in the community had issues with it around performance and how it was implemented. So in V3 there was a push to do this a bit better. Working together across these two communities, the V3 implementation ended up sharing a lot of the same concepts and being consistent with what is done in Delta Lake.
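The copy-on-write versus merge-on-read distinction described above can be shown with a toy example. Real deletion vectors are compact bitmaps over row positions; a plain set of positions is enough to illustrate the merge, so the helpers below are simplified illustrations rather than either format's actual layout.

```python
# Toy illustration of merge-on-read deletes. Instead of rewriting the
# Parquet file on every delete (copy-on-write), a small "delete file" of
# row positions is written, and readers merge the two on scan.

def write_delete_file(positions):
    # In Delta deletion vectors / Iceberg V3 this is a compact bitmap;
    # a plain set of row positions is enough to show the idea.
    return set(positions)

def merge_on_read(data_file, delete_files):
    # Skip any row whose position appears in a delete file.
    deleted = set().union(*delete_files) if delete_files else set()
    return [row for pos, row in enumerate(data_file) if pos not in deleted]

data_file = ["alice", "bob", "carol", "dave"]  # rows in one Parquet file
deletes = [write_delete_file({1}), write_delete_file({3})]  # two commits

print(merge_on_read(data_file, deletes))  # the data file is never rewritten
```

Because Delta and Iceberg V3 write these delete files in aligned ways, a reader of either format can apply the other's deletes without rewriting anything, which is the interoperability win described next.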
Jonathan Brito - 00:13:10
The good thing about this is that the delete files in both Delta and Iceberg are written with very similar implementations, in a way that if you want to interoperate across these two formats, you don't actually need to rewrite the delete files. This was done broadly across the data layer as well, unifying things like data types; for instance, the variant, geography, and geometry data types would also be available in V3.
It's a very similar theme here, where you have aligned concepts at the data layer. This is a really big push and really exciting, because it provides out-of-the-box interoperability between the two formats and resolves these incompatibilities. We're hoping and expecting to see this trend continue, just because across both Delta and Iceberg there's a lot of shared interest in moving towards common concepts and alignment to enable interoperability.
Jonathan Brito - 00:14:04
Cool. Moving a bit, wrapping things up and pulling it all together. This is the overall vision of how all this stuff comes together into what we call a Lakehouse catalog.
For a lot of folks, a Lakehouse has a lot of things in it, and Unity Catalog was really built with that vision from the ground up. Working from bottom to top, you can see that Unity Catalog now supports really all the big formats that you would consider using within a lakehouse, like Delta Lake and Iceberg, as well as legacy formats. It also supports various objects: not just tables, which most catalogs have support for, but also other things like AI models, files, notebooks, and dashboards.
Jonathan Brito - 00:15:05
In terms of actual features, what's really important here is that Unity Catalog goes well beyond access control and auditing, which is typically what you see with catalogs, and into lineage, quality monitoring, and cost controls as well.
There are a lot of things built into Unity Catalog that are important for a Lakehouse. The top part here, which is the most important, is really the connectivity. We talked a lot about that today. If you remember, early on in this presentation I talked about how you have these catalog silos and format silos. You really can access any engine that you'd be interested in using from Unity Catalog, whether through the Unity REST APIs or the Iceberg REST Catalog APIs, and then, in terms of federation, we're building out federation capabilities to any major catalog that you could actually interact with.
I talked a bit about HMS and Glue as well as Snowflake and Iceberg REST catalogs. We also have the ability to federate to some traditional OLAP warehouses through query federation, BigQuery included, as well as other warehouses you may be using. Also coming soon is the ability to connect to Salesforce Data Cloud too.
Cool. So that's the overall presentation. Definitely open to opening it up for comments and questions.
Demetrios - 00:16:17
Alright, we've got a lot of questions, that is for sure. I'm gonna just kick it off hot right now with a question from Vinod: are Delta Lake and Iceberg metadata going to converge completely, yes or no?
Jonathan Brito - 00:16:38
That's definitely the goal. So it's exciting stuff that's happening in the communities where I think this first step was really around what we've been calling data unification, where you'll see a world where you can interoperate between Delta and Iceberg without having to rewrite the files themselves.
For later releases, particularly Iceberg V4 as well as Delta V5.0, the aim is to then move on to the metadata layer and see if we can converge in that direction too. There's a lot of work required to do that, but ultimately, as a North Star, I think it'd be a great vision to accomplish through the communities.
Demetrios - 00:17:17
What's the source of truth on that?
Jonathan Brito - 00:17:21
Like,
Demetrios - 00:17:23
Delta Log versus Iceberg Manifest, which ones are the source of truth and then what's the future of Delta Uniform?
Jonathan Brito - 00:17:33
Yeah, so the idea here would be that whether you're running a Delta table or an Iceberg table, you wouldn't actually need to write out both copies of metadata. You'd write out a Delta log in a way that is understandable by Iceberg clients, because the way that we commit those Delta tables is consistent with how Iceberg commits to its tables. So there are ways that you can kind of get to this.
A lot of that stuff is being worked out amongst the community right now, but the idea here would be you wouldn't necessarily have both copies of metadata required. It would just be a consistent way of writing out, say the metadata tree for Iceberg as well as the checkpoints for Delta Lake.
Demetrios - 00:18:20
How does Unity Catalog handle tables in mixed formats, Delta, Parquet, Iceberg, in the same workspace? Is there a single metadata model under the hood, or does it federate per format?
Jonathan Brito - 00:18:36
We were built with the different formats in mind. Generally speaking, right now if you want to build out a managed table, that would be using Delta or Iceberg, and then the older formats are basically modeled as external tables. So things like a plain Parquet table or ORC are modeled as external tables under the Unity Catalog metastore.
Demetrios - 00:19:05
Nice. One of the challenges is to maintain the interoperability for the use of views between the different query engines. This is an important functionality to provide from user experience perspective. How is it being handled currently? You need me to repeat it?
Jonathan Brito - 00:19:27
Yeah, if you can repeat it; I've got a fire truck driving by in the background.
Demetrios - 00:19:30
I know, man. Are you like on the street right now? Is that why you put up this virtual background?
Jonathan Brito - 00:19:34
No, I'm in my apartment, but I live on a busy street, so sorry about that.
Demetrios - 00:19:39
That's all good, dude. Appreciate you making the time to come here and chat with us. One of the challenges is to maintain the interoperability for the use of views between the different query engines. This is an important functionality to provide from the user experience perspective. How is it being handled currently?
Jonathan Brito - 00:20:03
That's a great question. Right now, in terms of interop between platforms, there isn't really a great solution for views today. If you think about it as a crawl-walk-run sort of paradigm, we're very much in the crawling phase, where we've been able to get tables into somewhat consistent open formats and begin to share those tables across platforms.
Views are probably the next milestone for this. The Iceberg REST Catalog interface does have some semantics around sharing views, and that seems to be the direction a lot of folks in the industry are moving towards. I expect the solution for a lot of these problems to probably come through the Iceberg REST Catalog. But as of today, there isn't a great way to share views across platforms.
Demetrios - 00:20:55
Excellent. I'm gonna keep cruising. Since Unity Catalog supports reading foreign tables from HMS, does it also create the lineage information for such tables?
Jonathan Brito - 00:21:10
Great question. We do create lineage for all the downstream work that you'll do with these tables. Once you have the table registered in Unity Catalog, if downstream you are creating a whole bunch of tables that tie back to the table, you could go into Unity Catalog and see all your lineage mapped out really nicely.
There is some work to think about how to do this for third-party clients as well. That would be coming soon, but within Databricks itself, we already have that capability.
Demetrios - 00:21:41
Nice. All right. This one must be a Hudi community member. Why not also support Hudi, which is also popularly used in the community?
Jonathan Brito - 00:21:54
Great question. As we've seen, particularly from an ecosystem adoption perspective, Hudi does have a great community. But from a platform perspective, a lot of the major platforms, Snowflake, Databricks, AWS, BigQuery, have really pushed more towards Delta and Iceberg.
What we've seen is broader adoption of those two formats, which actually have a lot more similarities with each other than with Hudi. Hudi from the ground up had a lot of really cool features that make it a bit more different from Delta and Iceberg. So it just makes unification much harder, especially given that you see a lot more folks, at least from a platform perspective, pushing towards Delta or Iceberg.
Demetrios - 00:22:46
How do we get around the egress costs for cross region data replication with syncing in the Delta format?
Jonathan Brito - 00:22:57
That's a great tricky question.
Demetrios - 00:23:02
You didn't realize this is like an interview right now, <laugh>.
Jonathan Brito - 00:23:04
You guys are really drilling me here. For cross-cloud egress, I've seen a lot of folks using R2 for that. Cross-region is a little trickier; you may just have to work with the different cloud providers to see what you can push. It's a competitive space, so hopefully they start to fix that across clouds, but it is a tricky problem.
Demetrios - 00:23:34
So the official answer is let your account executive take you out to dinner and tell them you need to talk about these egress fees. All right.
Jonathan Brito - 00:23:44
That's, I've seen that work, <laugh>.
Demetrios - 00:23:48
The non-technical answer, basically.
Jonathan Brito - 00:23:50
Non-technical answer. Yeah.
Demetrios - 00:23:52
I like that. All right. Is the processing for accessing foreign tables different from Delta sharing?
Jonathan Brito - 00:24:03
Demetrios - 00:24:04
Is processing for accessing foreign tables different from Delta sharing?
Jonathan Brito - 00:24:12
Generally, the way that we think about the differences between sharing versus using things like catalog federation, catalog federation usually is really useful if you're working within an organization. So if you're one big company and you're trying to figure out how to connect your Snowflake platform to your Databricks platform, catalog federation is a great way to solve that problem.
However, if you are trying to share with someone outside of your organization, then using things like Delta sharing actually is a better way to do that, generally, because that provides abstractions around your share and your recipient in a way where it's a little bit more secure for that type of sharing. So that's probably the biggest difference between these two approaches.
There's actually some work already that we're doing to make it easier to do Delta sharing with Iceberg clients. So there would be some similarities between some of the federation that we talked about here. But generally speaking, I think the differences are more around how you would use them from a use case perspective.
Demetrios - 00:25:17
Jonathan, there are a lot more questions in the chat, but as I mentioned before, you can call me Ringo Starr. I'm keeping the time like a Rolex today, <laugh>. Dude, it's been awesome. I appreciate you coming on here. I'm gonna say farewell.