Unity Catalog and Open Table Format Unification

May 21, 2025
Speaker
Jonathan Brito
Staff Product Manager

Open table formats, including Delta Lake, Apache Hudi™, and Apache Iceberg™, have become the leading industry standards for Lakehouse storage. As these formats have grown in popularity, so has the importance of catalogs, which are responsible for managing reads and writes to tables. In this session, we will cover how new data silos have emerged along these two foundational components of a lakehouse. We will show you how Unity Catalog breaks data silos and how new features in OSS are unifying the Lakehouse ecosystem.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Speaker 0    00:00:00    
<silence>

Speaker 1    00:00:06    
I am here to welcome our second keynote of the day. Jonathan, where are you at? Let's bring you on to the stage. Hello, sir.  

Speaker 2    00:00:15    
Uh, hey folks. Uh, great to, good to have you here.  

Speaker 1    00:00:19    
Yes. So I am excited about your talk. I'm gonna jump off the stage. I'll share your screen, and I'll be back in 20 minutes to ask you a few questions.  

Speaker 2    00:00:29    
Yeah, no worries at all. Awesome. Uh, so first and foremost, wanted to introduce myself. Uh, so my name's Jonathan Brito. I'm a product manager at Databricks. Um, I particularly work on Lakehouse storage. So, uh, my focus area is really around, uh, Iceberg, uh, in particular. So I lead a lot of our Iceberg-related products, in particular, a handful of ones that we'll be talking about today, uh, our managed Iceberg and foreign Iceberg, uh, tables, and then broadly, some of the efforts that we're trying to do in the open source communities for both Delta and Iceberg to really bring the, um, ecosystem together. So the real crux of this, uh, talk is really talking about, um, ways that we can unify the lakehouse, and I'll start the talk really talking about what we see as big silos that have emerged, uh, in the lakehouse, and really the ways that we see both Databricks as well as the different communities working together to really close those gaps for folks and resolve some incompatibilities. So let's get started.  

Speaker 2    00:01:28    
Uh, so first and foremost, um, a really common, uh, silo that folks, uh, bump into is what we call table format silos. So this is kind of an inadvertent, um, outcome of all the excitement around, uh, the open lakehouse, uh, architecture, uh, in which open table formats have really taken, I think, the industry by storm, where a lot of folks have really been rushing to adopt, uh, these formats, both, uh, customers and practitioners as well as platforms. Uh, the problem though with that is that as everything grew, uh, things weren't necessarily the most coordinated in terms of how folks really adopted different formats. So what you end up seeing is a world in which really the lakehouse became split, uh, primarily around, uh, Delta Lake and Iceberg, in which you had some platforms that leaned heavily into Delta Lake, which you're seeing on the left hand side of this page.  

Speaker 2    00:02:18    
So, obviously Databricks, uh, with some of the original creators of Delta Lake, as well as Fabric, which I think you just heard from earlier in the talk, um, as well as Iceberg, which on the right hand side is, uh, particularly the AWS ecosystem, Snowflake, you'd probably add BigQuery to this as well, that leaned a little more towards Iceberg. So you end up in a world where if you are building out a lakehouse, you have this really tough decision to make, which is which format to use. If you pick one, you get half the ecosystem, you pick the other, you get the other half, but for most customers, they want both. So how do you actually make that, uh, vision a reality? And that's what we'll talk about a little bit today.  

Speaker 2    00:02:58    
A lesser known issue that, uh, folks are starting to run into now as they move more towards open formats and even really get to be able to standardize on one format is what we call catalog silos. So the issue here is that there are a variety of different catalogs, and within the context of an open table format, uh, catalogs actually play a much bigger role than they have in the past. Uh, so formats like Iceberg, for instance, actually require a catalog. Delta Lake is actually moving towards being more of a catalog-based, um, format itself. And as a result of that, the catalog becomes the way that you actually access the data. It's how you manage commits, it's how you read these tables. Also, it's how you do optimizations on a lot of these tables, uh, to make sure they're performant and healthy over time.  

Speaker 2    00:03:45    
Um, so the catalog has a really important role of really connecting your engine, uh, to your storage layer. And what we've seen in the market is that different catalogs can connect to different engines. Uh, so even if you're able to get your data into a single format and you're happy and excited, uh, picking the actual catalog that you use to manage your data can then shrink your ecosystem to a certain set of engines. So what you're seeing here is basically which catalogs can talk to which engines, which ones have full support, which ones have some limited support, and other ones that have no access at all. And the real downside for customers is that this creates a scenario where your data governance and discovery can be really separated across disparate catalogs that can't necessarily talk to each other. So you have new system silos that are created.  

Speaker 2    00:04:38    
So our solution for a lot of this is, uh, Unity Catalog, uh, which is the catalog that, um, was built within Databricks and is also available in open source. Um, Unity Catalog, which we'll talk about today, uh, now supports a whole variety of table formats. So you're seeing this at the bottom of the page, where it's not just Delta Lake, you can actually have Iceberg tables registered, Parquet, as well as other legacy file formats. From an objects perspective, you actually can manage really any object within Unity Catalog, both tables, AI models, files, et cetera. Um, and what's most important is what's on the outside of this, uh, diamond that you're seeing here, which is the ability to connect to other platforms and sources, which we'll spend a lot of time talking about today. Um, and this, I think, is a big differentiator for Unity Catalog because it really provides this openness and interoperability that breaks these data silos that we were talking about, both at the format level and at the catalog level.  

Speaker 2    00:05:38    
All right, so here's the exciting part. Let's talk about what's coming and what's new. Um, on the left hand side, this is a, a vision that we have for Unity Catalog, uh, going forward and how we'll be able to, uh, break these, uh, silos I just referenced. So the top half of the page is really the ability to connect to any client. So this is if you have any client, whether it be Delta or Iceberg, being able to create a managed table within Unity Catalog. Uh, the big change here is that we're actually implementing the ability to write to Unity Catalog via the Iceberg REST Catalog APIs, uh, which, if you're not familiar with those, are, uh, a standard interface within Iceberg, which allows any catalog backend to implement these APIs so that you have a standard interface for actual interaction with your Iceberg tables.  

Speaker 2    00:06:26    
So you can create, write, read, um, from any Iceberg client, whether it's Spark, Trino, or Flink. Uh, you can create a managed table within Unity Catalog that, uh, we will then manage, uh, on your behalf. For Delta clients, we also make available the Unity REST APIs, um, which were open sourced, uh, last year at our summit around this time. So you have this really nice world where, whether you're choosing a Delta or Iceberg client, you can create a managed table and read and write, uh, to Unity Catalog. The bottom half of the page is an equally important story. Uh, and this is really the story around Federation, which I'll talk a little bit more about in the next few pages. But really what Federation does is it provides an ability to interoperate with other catalogs, because for a lot of customers, what we've seen is that you're probably gonna end up in a world with, uh, multiple catalogs just because you have multiple engines that require a specific catalog.  
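
As a rough illustration of what "any Iceberg client" means in practice, the sketch below builds the Spark properties that point an Iceberg client at a REST catalog. The property keys (`type`, `uri`, `token`, `warehouse`) are standard Apache Iceberg catalog properties; the catalog name `uc`, the endpoint, the token, and the warehouse value are placeholders that are assumptions here, not values given in the talk.

```python
def iceberg_rest_spark_conf(catalog_name, rest_uri, token, warehouse):
    """Build Spark properties that point an Iceberg client at a REST catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Iceberg's SQL extensions for Spark.
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        # Register a Spark catalog backed by Iceberg.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Talk to the catalog through the Iceberg REST protocol.
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
        f"{prefix}.token": token,
        f"{prefix}.warehouse": warehouse,
    }


# All values below are placeholders (assumptions), to be replaced with the
# endpoint and credentials of the catalog being used.
conf = iceberg_rest_spark_conf(
    catalog_name="uc",
    rest_uri="https://<workspace-host>/<iceberg-rest-path>",
    token="<access-token>",
    warehouse="<catalog-name>",
)

for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

Because the interface is standard, the same handful of properties would apply to any catalog backend that implements the Iceberg REST protocol; only the endpoint and credentials change.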

Speaker 2    00:07:19    
So as a result of that, being able to actually access data that's managed by other catalogs becomes equally important for unifying the ecosystem. Lastly, we'll talk about, um, interoperating across formats and some of the work that's happening in open source that makes this a lot easier. In the last session, you heard a lot about, uh, things like XTable. Uh, there's also, uh, a tool called UniForm, um, that works with Delta Lake to sort of do metadata translation to help you interoperate across formats. There's some work happening in, in open source to really make that even easier, so you don't actually need the metadata translation itself. Okay, so let's start with the first of these, uh, pieces, uh, and talk a bit about the sort of new look around managed tables. So these managed tables have support from both Delta and Iceberg. So you could create a managed Delta table, you could create a managed Iceberg table within Unity Catalog, and as I said, again, you have full access to the full ecosystem to do this.  

Speaker 2    00:08:17    
Um, the Iceberg tables are accessible via the Iceberg REST Catalog API, and you also can use tools within Databricks to make your Delta tables available as well to Iceberg clients. What's really nice about all our managed tables is they benefit from a tool called predictive optimization, which actually looks at how you're using these tables to optimize them and run optimizations like file compaction, clustering, uh, storage optimizations. So you get really fast, performant tables out of the box without a lot of tuning, which for most folks isn't really what their expertise is or what they want to invest their resources in.  

Speaker 2    00:08:55    
In terms of federation, this is really the sort of, uh, path that we've seen really evolve in the ecosystem, in which what we're trying to see is, is bidirectional federation amongst catalogs. So pictured here is Unity Catalog, Snowflake Horizon Catalog, uh, as well as Glue Catalog. And what you're seeing is this ability to federate, um, in both directions. So on the Snowflake side, Snowflake actually can federate to Unity's Iceberg REST Catalog interface. Uh, we actually, uh, have a private preview now, uh, for federating to Snowflake Horizon Catalog. On the Glue side, something very similar. We actually have the ability to federate to Glue, um, and we expect something similar in terms of where, uh, Glue is heading in terms of federating to, uh, Unity Catalog.  

Speaker 2    00:09:43    
So what does this actually look like from, um, a user experience perspective? So if you want to actually access these foreign tables, it's really a few simple steps. Um, first it is creating a connection. So what you do here is, uh, in Unity Catalog in the UI or through the SQL interfaces, you can create a connection to the other catalog, which provides an ability for you to authenticate, uh, using, say, OAuth or, uh, a personal access token. From there, you can create a foreign catalog. Uh, what this does is it basically creates a, uh, a catalog underneath your Unity Catalog metastore that will have all of your schemas and all of your tables, um, whether they be Delta, Iceberg, Parquet, really anything, uh, that you can possibly have, uh, in these, uh, foreign catalogs. And you get a really nice unified view of your data, uh, across, uh, catalogs and formats.
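
The steps just described (create a connection, create a foreign catalog, then read) can be sketched as SQL text rendered by a small Python helper. `CREATE CONNECTION` and `CREATE FOREIGN CATALOG ... USING CONNECTION` are Databricks SQL statements, but the connection type, the option names and values, and every object name below are illustrative assumptions; the options a particular federation source needs may differ.

```python
def render_options(options):
    """Render a dict as SQL `key 'value'` pairs; values may not contain quotes."""
    parts = []
    for key, value in options.items():
        if "'" in value:
            raise ValueError(f"option {key!r} must not contain a single quote")
        parts.append(f"{key} '{value}'")
    return ", ".join(parts)


def federation_statements(connection, conn_type, conn_options,
                          foreign_catalog, catalog_options, table):
    """Return the three statements: connect, mount the catalog, read a table."""
    return [
        f"CREATE CONNECTION {connection} TYPE {conn_type} "
        f"OPTIONS ({render_options(conn_options)})",
        f"CREATE FOREIGN CATALOG {foreign_catalog} "
        f"USING CONNECTION {connection} "
        f"OPTIONS ({render_options(catalog_options)})",
        f"SELECT * FROM {foreign_catalog}.{table} LIMIT 10",
    ]


# Hypothetical names and options, for illustration only.
statements = federation_statements(
    connection="other_catalog_conn",
    conn_type="snowflake",
    conn_options={"host": "<account-host>", "user": "<user>"},
    foreign_catalog="other_catalog",
    catalog_options={"database": "<database>"},
    table="sales.orders",
)

for statement in statements:
    print(statement + ";")
```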

Speaker 2    00:10:41    
From there, all you have to do is read the table. Uh, what Databricks does for you and what Unity Catalog does for you is a just-in-time table metadata refresh. So you never have to be concerned about the fact that, hey, am I getting the latest snapshot? Something happened on the other side? Um, someone's writing to this table. Am I making sure that my downstream users are actually seeing the latest, uh, freshest data? That's already gonna be guaranteed by Databricks. We'll do the refreshes for you. Um, one thing to call out: all these tables are read only. Um, so that is one difference between foreign tables, uh, as we call 'em here at Databricks, and managed tables, which have both reads and writes.  
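
A toy model may help make "just-in-time metadata refresh" and "read only" concrete. This is not how Databricks implements it; it is a minimal sketch under the assumption that the foreign catalog exposes a current snapshot ID, and the class and method names are hypothetical.

```python
class RemoteCatalog:
    """Stand-in for the foreign catalog that owns the table."""

    def __init__(self):
        self.snapshot_id = 1
        self.rows = ["r1"]

    def commit(self, row):
        # A writer on the other platform adds a row and advances the snapshot.
        self.rows = self.rows + [row]
        self.snapshot_id += 1


class ForeignTable:
    """Read-only view that refreshes its metadata at read time."""

    def __init__(self, remote):
        self.remote = remote
        self.cached_snapshot_id = None
        self.cached_rows = []
        self.refresh_count = 0

    def read(self):
        # Refresh only when the remote snapshot has moved since the last read.
        if self.cached_snapshot_id != self.remote.snapshot_id:
            self.cached_snapshot_id = self.remote.snapshot_id
            self.cached_rows = list(self.remote.rows)
            self.refresh_count += 1
        return self.cached_rows

    def write(self, row):
        raise PermissionError("foreign tables are read only")


remote = RemoteCatalog()
table = ForeignTable(remote)
print(table.read())   # ['r1']
remote.commit("r2")   # someone writes on the other side
print(table.read())   # ['r1', 'r2']
```

The reader never asks for a refresh explicitly; the check happens as part of the read, which is the behavior the talk describes from the user's point of view.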

Speaker 2    00:11:22    
One really awesome thing that we've been seeing in the community is, uh, really efforts in the open source communities to push towards format interoperability. So I started the talk by mentioning a bit around how, uh, today a lot of the way you interoperate across formats is through the sort of metadata translation. Um, and there's already work happening in the open source communities, uh, in order to push the formats closer together. So Iceberg V3, which, actually there's a vote happening right now for Iceberg V3, which we're really excited about. Um, we expect that to be adopted by the community very soon. Um, what it actually does is provide consistent data and delete files, uh, across Delta and Iceberg. Um, so a good example of this is something like deletion vectors, where deletion vectors, uh, are a feature that existed in Delta Lake, which allows you to move to more of a merge-on-read, uh, paradigm versus copy-on-write.  

Speaker 2    00:12:19    
So instead of every time you write to a table having to rewrite, uh, all the Parquet files that relate to that individual insert, um, or sorry, that, uh, delete or update, you can actually, uh, just write out delete files, which then get merged on read. Um, this was implemented in Delta Lake. A lot of customers loved it. Uh, there was a similar feature in Iceberg called positional deletes, um, which in V2, a lot of folks in the community actually had issues with around performance and how it was implemented. Uh, so in V3 there was a kind of push to do this a bit better. So working together across these two communities, um, the implementation of the V3 positional deletes is actually identical and shares a lot of the same, consistent sort of, uh, concepts with what is, uh, done in Delta Lake.  
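
The difference between copy-on-write and merge-on-read can be shown with a toy example. A Python list stands in for a Parquet data file and a set of row positions stands in for a deletion vector (real formats store these positions in compressed bitmaps); the function names are hypothetical.

```python
def copy_on_write_delete(data_file, positions):
    """Copy-on-write: rewrite the whole file without the deleted rows."""
    doomed = set(positions)
    return [row for pos, row in enumerate(data_file) if pos not in doomed]


def merge_on_read_delete(deletion_vector, positions):
    """Merge-on-read: leave the data file alone, only record deleted positions."""
    return deletion_vector | set(positions)


def read(data_file, deletion_vector):
    """Apply the deletion vector while scanning the untouched data file."""
    return [
        row for pos, row in enumerate(data_file) if pos not in deletion_vector
    ]


data_file = ["a", "b", "c", "d", "e"]

# Copy-on-write produces a brand new file of three rows.
rewritten = copy_on_write_delete(data_file, [1, 3])

# Merge-on-read writes only the two deleted positions.
deletion_vector = merge_on_read_delete(set(), [1, 3])

print(rewritten)                         # ['a', 'c', 'e']
print(read(data_file, deletion_vector))  # ['a', 'c', 'e']
```

Both paths return the same rows to the reader. The write cost is what differs: one path rewrites every surviving row, the other records two integers and defers the work to read time.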

Speaker 2    00:13:10    
So the good thing about this is that the delete files then in both Delta and Iceberg are written with very similar implementations, in a way that if you want to interoperate across these two formats, you don't actually need to rewrite the delete files. And we did this in the data layer as well, broadly, unifying things like different data types. So for instance, the variant data type and geography and geometry data types will also be available in V3. And it's a very similar sort of, uh, theme here where you have aligned the concepts at the data layer. So this is a really big push, really exciting, 'cause, as I said, it provides out-of-the-box interoperability between the two formats and resolves these incompatibilities. So we're hoping and expecting to see this, uh, trend continue, just because I think across both Delta and Iceberg, there's a lot of shared interest in moving towards, um, basically common concepts and alignment to enable interoperability.  

Speaker 2    00:14:04    
Cool. And then, um, sort of wrapping things up and pulling it all together. Um, this is the overall vision of how all this stuff comes together into what we call a Lakehouse catalog. Um, so for a lot of folks, um, a Lakehouse has a lot of things in it, and Unity Catalog was really built with that vision from the ground up. Um, so as I mentioned before, working from bottom to top, you can see the fact that Unity Catalog supports now really all the big formats that you would consider using within a lakehouse, like Delta Lake <inaudible>, as well as legacy formats, um, various objects, um, whether they be tables, which most catalogs have support for, but also other things like AI models, files, notebooks, and dashboards. In terms of actual features, uh, what's really important here is that Unity Catalog goes well beyond access control and auditing, which is typically what you see with catalogs, and goes into lineage, quality monitoring, cost controls, uh, quality controls as well.  

Speaker 2    00:15:05    
So there's a lot of things built into Unity Catalog that are important for a Lakehouse. And then I think the, the top part here, which is the most important, is really the connectivity. Um, so we talked a lot about that today. Um, if you remember, early on in this presentation I talked a lot about how you have these catalog silos and format silos. Here you really can access any engine that you'd be interested in using from Unity Catalog, either through the Unity REST APIs or the Iceberg REST Catalog APIs, and then from there, uh, in terms of federation, really building out federation capabilities to any, uh, major catalog, uh, that you can actually interact with. So, uh, I talked a bit about HMS and Glue as well as, uh, Snowflake and Iceberg REST catalogs. We also have the ability to federate to, um, some traditional OLAP warehouses through Query Federation. So BigQuery is included in that, as well as, uh, other, uh, sort of, um, warehouses you may be using. And then also coming soon would be the ability to federate to Salesforce's, uh, Data Cloud too. Cool. So that's the overall presentation. Definitely open to opening it up for comments and questions.  

Speaker 1    00:16:17    
Alright, y'all, uh, have got a lot of questions, that is for sure. I'm gonna just kick it off hot right now with a question from Vinod. Is Delta Lake and Iceberg metadata going to converge completely, yes or no? <laugh>  

Speaker 2    00:16:38    
That's definitely the goal. Uh, so it's exciting stuff that's happening in the communities, where I think this first step was really around what we've been calling data unification, uh, where you'll see a world where, as I said, you can interoperate between Delta and Iceberg without having to rewrite, uh, the files themselves. Um, and then I think for later releases, particularly Iceberg V4 as well as, uh, Delta V 5.0, uh, we're looking to then move on to the metadata layer and trying to see if we can converge, uh, in that direction too. There's going to be a lot of work, I think, required to do that, but ultimately I think as a, as a North Star, it'd be a, a great sort of vision that we can accomplish through the communities.  

Speaker 1    00:17:17    
What's the source of truth on that?  

Speaker 2    00:17:21    
Uh, like,  

Speaker 1    00:17:23    
Um, Delta log versus Iceberg manifests, which one is the source of truth? And, um, then what's the future of Delta UniForm?  

Speaker 2    00:17:33    
Yeah, so the idea here would be that, um, whether you're running a Delta table or an Iceberg table, um, you wouldn't actually need to write out both copies of metadata. That'd be the idea here. The idea would be you'd write out a Delta log in a way that is understandable by Iceberg clients, 'cause like the way that we commit those Delta tables is consistent with how Iceberg commits to its tables. So there's ways that you can kind of get into this. Um, a lot of that stuff is being worked out amongst the community right now, but the idea here would be you wouldn't necessarily have both copies of metadata, uh, required. It would just be a consistent way of writing out, say, the, um, like basically the metadata tree for Iceberg as well as like the checkpoints for Delta Lake.  

Speaker 1    00:18:20  
How does Unity Catalog handle tables in mixed formats, Delta, Parquet, Iceberg, in the same workspace? Is there a single metadata model under the hood or does it federate per format?  

Speaker 2    00:18:36    
Uh, so we were built kind of with the different formats in mind. Uh, so generally speaking, um, right now, uh, if you wanna build out a managed table, uh, that would be using Delta or Iceberg, and then you can create external tables, at least that's how we model, um, basically, uh, any sort of older format. So things like just a Parquet table or, or ORC, those are basically modeled as external tables under the Unity Catalog metastore.  

Speaker 1    00:19:05    
Uhhuh. Nice. One of the challenges is to maintain the interoperability for the use of views between the different query engines. This is an important functionality to provide from user experience perspective. How does it, how is it being handled currently? You need me to repeat it?  

Speaker 2    00:19:27    
Yeah, if you could repeat, I've got like a fire truck that's driving by.  

Speaker 1    00:19:30    
I know, man. Are you, are you like on the street right now? Is that why you put up this virtual background?  

Speaker 2    00:19:34    
<laugh> No, I'm in my apartment, but I live on a busy street, so sorry about that.  

Speaker 1    00:19:39    
That's all good, dude. Uh, appreciate you making the time to come here and chat with us. Uh, one of the challenges is to maintain the interoperability for the use of views between the different query engines. This is an important functionality to provide from the user experience perspective. How does it, how is it being handled currently?

Speaker 2    00:20:03    
That's a great question. So right now, in terms of interop between platforms, there isn't really, I think, a great solution for views, um, today. If you think about it from more of like a, uh, sort of, uh, crawl, walk, run, I guess, uh, paradigm, we're very much in the, even maybe the crawling sort of phase, where, hey, we've been able to get tables into somewhat consistent open formats and are beginning to share those tables across platforms. Um, views is probably the, probably the next, uh, milestone for this, where the Iceberg REST catalog interface does have, uh, some semantics around sharing views. And that seems to be the direction where a lot of folks in the industry are moving towards. So I expect the solution for a lot of these problems to probably come through the Iceberg REST catalog. Uh, but as of today, there isn't a great way, I think, necessarily to share, uh, views across platforms.  

Speaker 1    00:20:55    
Excellent. I'm gonna keep cruising. Since Unity Catalog supports reading foreign tables from HMS, does it also create the lineage information for such tables?  

Speaker 2    00:21:10    
Great question. Uh, we do create lineage for all the downstream, um, uh, work that you'll do with these tables. So once you have the table registered in Unity Catalog, if downstream you are creating a whole bunch of, uh, tables that tie back to the table, you could go into Unity Catalog and see all your lineage mapped out really nicely. Uh, there is some work to think about how to do this for 3P clients as well. Uh, that'd be coming soon, uh, but within, uh, Databricks itself, we already have that capability.  

Speaker 1    00:21:41    
Nice. All right. This one is a, it must be a Hudi community member. Why not also support Hudi, which is also popularly used in the community?  

Speaker 2    00:21:54    
It's a great question. Uh, so Hudi, uh, as we've seen, uh, purely from an ecosystem adoption perspective, uh, Hudi definitely has a great community. Um, but from a platform perspective, a lot of the sort of major platforms, the Snowflakes, Databricks, AWS, BigQuery, et cetera, they've really pushed more towards Delta and Iceberg. Uh, so I think what we've seen a bit more is really a broader adoption of, uh, those two formats, which actually have a lot more similarities, um, than when you add in Hudi. I think Hudi from the ground up had a lot more really cool features that make it a bit more different from Delta and Iceberg. So it just makes unification a lot harder, especially given the fact that, as I said, you see a lot more folks, at least from a platform perspective, pushing towards Delta or Iceberg.  

Speaker 1    00:22:46    
How do we get around the egress costs for cross region data replication with syncing in the Delta format?  

Speaker 2    00:22:57    
That's a great question. Great tricky question. Um,  

Speaker 1    00:23:02    
You didn't realize this is like an interview right now, <laugh>. Yeah,  

Speaker 2    00:23:04    
You guys are, you guys are really, you're really drilling me here. Uh, yeah, I think the best way I've seen of getting around sort of, uh, these cross-cloud, uh, egress fees, I've seen a lot of folks using R2 for that. Uh, across regions, uh, it's a little trickier. You may have to just work with the, the different, uh, cloud providers to see what you can kind of push. Uh, it's a competitive space, so hopefully they start to fix that across, uh, clouds, but it is a, is a tricky problem.  

Speaker 1    00:23:34    
So the official answer is let your account executive take you out to dinner and tell 'em you need to talk about these egress fees. All right.  

Speaker 2    00:23:44    
That's, that's, I've seen that work, <laugh>, that's, that's  

Speaker 1    00:23:48    
The non-technical answer, basically  

Speaker 2    00:23:50    
Non-technical answer. Yeah,  

Speaker 1    00:23:52    
I like that. All right. Are the processing for accessing foreign tables different from Delta sharing?  

Speaker 2    00:24:03    
Uh,  

Speaker 1    00:24:04    
Is, is processing for accessing foreign tables different from Delta sharing?  

Speaker 2    00:24:12    
So generally, like the way that we think about the differences between, say, sharing versus, um, using things like catalog federation, uh, catalog federation usually is really useful if you're working within an organization. So if you're one big company and you're trying to figure out how to connect your Snowflake platform to your Databricks platform, catalog federation is a great way to solve that problem. However, if you are trying to, say, share with someone outside of your organization, um, then using things like Delta Sharing actually is a better, uh, way to, to do that, uh, generally, 'cause that provides abstractions around your share and your recipient in a way where, um, it's a little bit more secure for that type of sharing. So that's, that's probably the biggest difference between these two approaches. Uh, there's actually some work already that we're doing, uh, to make it easier to do Delta Sharing with Iceberg clients. So there would be some similarities between some of the federation that we, we, we talked about here. But generally speaking, I think the differences are more around how you would use them from a use case perspective.  

Speaker 1    00:25:17    
Jonathan, there's a lot more questions in the chat, but as I mentioned before, you can call me Ringo Starr. I'm keeping the time like a Rolex today, <laugh>. Dude, it's been awesome. I appreciate you coming on here. I'm gonna say farewell.