Not Just Lettuce: How Apache Iceberg™ and dbt Are Reshaping the Data Aisle

May 21, 2025
Speaker
Amy Chen
Staff Product Manager

The recent explosion of open table formats like Iceberg, Delta Lake, and Hudi has unlocked new levels of interoperability, allowing data to be stored and accessed across a growing range of engines and environments. However, this flexibility also introduces complexity, making it more challenging to maintain consistency, quality, and governance across teams and platforms.

dbt is the data control plane, bringing order to this fragmentation by centralizing business logic and enforcing best practices around data quality, documentation, and governance. Through support for the Analytics Development Lifecycle (ADLC) and deep integration with open formats, dbt empowers teams to standardize development workflows while choosing the right compute and storage for each use case — enabling a scalable, future-proof foundation for modern data platforms.

In this short talk, Amy will dig into how open table formats, with a focus on Iceberg, have changed the way the industry scales data. Open table formats give you the head of lettuce — dbt gives you the recipe to make something useful out of it.

Transcript

AI-generated; accuracy is not 100% guaranteed.

Speaker 1    00:00:06    
Without further ado, I would like to bring onto the stage our next keynote guest: Amy. Hello! Whoa, why are you so small? Hold on, you've gotta get up here nice and big.

Speaker 2   00:00:22    
No, let me hide down there.

Speaker 1   00:00:24    
Lemme hide? You're the star of the show here. There we go, this is a little bit more even footing <laugh>. Amy, I'm very excited for your talk. I'm going to hand it over to you and grab your screen, throw it up here. I will see you in about 20 minutes.

Speaker 2    00:00:42    
Okay, awesome. I guess we're starting <laugh>. Hi everyone, welcome to my talk. Today we're gonna talk about open table formats, specifically Iceberg, because, you know, it's a little hard to find the Delta of salad. I'm sorry, I promise that was my only dad joke <laugh>. Now, before we get started, I have to come clean. From the time I actually wrote my abstract to completing my deck and really understanding the story I wanted to tell, things have changed a little bit. So I'm hoping you'll humor me and come along with me on this journey as we veer a little bit off course. Now, this is also a very short session, I have 15 minutes to just talk, but I do wanna dig into something that's a tad meaty.

Speaker 2   00:01:34    
I wanna dig into the concept of what it means to have an open ecosystem that benefits the community and its users and inspires innovation. And of course, we're in the era of AI, so how do we do that with the right guardrails? Otherwise this would never scale. Now, just to introduce myself a little bit: I am Amy Chen. I am a staff product manager at dbt Labs. I currently oversee our ecosystem integrations, including how dbt actually connects to data warehouses, data platforms, whatever you would like to call them. I'm also overseeing a good chunk of our Iceberg strategy. For context, I've been at dbt Labs for over six years now, since back when we were Fishtown Analytics. And it's been really fun to see the growth, the establishment of the modern data stack, and to watch it essentially explode into the massive ecosystem we see today, with a lot of awesome people speaking about it at this conference.

Speaker 2    00:02:37    
There are a lot more engines, a lot more formats, and way more metadata. Now, to talk about open ecosystems, I should probably define the term first: what does it actually mean, and why do we want this? Open ecosystems are really built to allow organizations to truly retain their data, metadata, and business logic. The idea is that this empowers users to choose what is actually best for their ever-evolving workflows and prevents them from being locked into a single vendor's stack. What's also really fun is that it opens the door for community-driven development and innovation. And to get here, we actually need three key things. We need interoperability: tools and platforms need to be able to speak to each other without any proprietary requirements. We need open standards and protocols.

Speaker 2   00:03:35    
Can you imagine what the industry we work in today would look like if we didn't have things like SQL or REST APIs? Without these widely accepted standards, everything would look a lot different. And that also opens the door to portability and, finally, full integration. When I talk about full integration, I'm actually more curious about what organizations get from it, because ultimately where I wanna live, and the industry I think everyone wants to be in, is one where organizations actually get to choose best-in-class components rather than being stuck in, say, a bundled suite that may kind of do the thing but is not gonna delight the developer. So let's tie that to something more tangible. When I initially launched Iceberg support at Coalesce, which is our conference in October. Please do come.

Speaker 2    00:04:33    
It's a lot of fun. This was actually one of the first responses I got in our dbt Slack community, and I thought it was really amazing, because let's be honest, there is definitely a level of fatigue that can happen in the data space. Every few years the industry latches onto a big shiny toy and everyone crowds in; they write blog posts, they get really excited. Before Iceberg, it was semantic layers. And even 20 years ago, the latest Iceberg was basically called Hadoop. There's a lot of meaning in being able to question the new thing, because let's admit it, adopting new technology always comes at a cost. And if you choose wrong, that's a little scary. Now, for open ecosystems, if there's no standardization, if not enough folks are actually on board, then you're always gonna be limited in what you can do.

Speaker 2    00:05:28    
Because if vendors aren't integrating, then you don't get access to it, or you're gonna have to build it yourself, and that comes with an additional cost. And that is actually what we saw a few years ago with the semantic layer. Basically every tool published their own spec, and that limited where you could actually bring that spec, because you had to go to the tools that supported it, or you had to duplicate it across many tools. However, what's been really nice to see is that Iceberg doesn't feel like the next Hadoop. For one, it's solving a different problem. Hadoop came out of an era where engineers were trying to solve for big data and wrangle data lakes. But we've evolved since then. We are now in the era of more separate compute and storage, and we're also in a space of rising costs, especially in this economy.

Speaker 2    00:06:22    
And what's been really fun to watch is the commodification of compute engines because of this modern data stack boom. There are so many choices out there. I think Snowflake is a really good example of one that popped up. Snowflake is really well known as one of the first warehouses to push the idea of separating storage and compute, and that allowed for a lot of flexibility. It meant you could spend more money on the component you needed more effort on. But there were limitations: it was storage and compute on their own infrastructure. Now, the fun thing about the table format wars we've been seeing is that the paradigm is shifting to the next step. It's truly breaking apart compute and storage. And outside of letting me generate really fun memes like this, the table format wars have shown us the importance of what users actually want, which is an open ecosystem.

Speaker 2    00:07:24    
Now, taking a step back, Iceberg rose through this because, you know, it came out of Netflix, which of course has a lot of data, and it became very widely adopted by industry giants like Apple and LinkedIn. And last year when Databricks bought Tabular, we basically got to see the founders of Delta Lake and Iceberg brought together under the same roof, and that put a little bit of water on that fire, just like Jonathan was talking about earlier. We're starting to see some of the specs converge and, hopefully, just become one. It was really exciting to see deletion vectors come in on the new Iceberg spec. But I think one of the bigger, more pivotal milestones was the number of compute engines that started to ship support. While some did ship it just to check the box,

Speaker 2   00:08:14    
we're actually seeing vendors really lean in. And what's also exciting is that a lot of them are leaning in with support for metadata standardization to the Iceberg spec. Without that, it honestly would be another failed experiment. To have a very successful open ecosystem, tools need to meet that customer demand. And what's been nice about Iceberg is that we're seeing the technology and the business requirements come together, making this a necessity rather than just a nice-to-have. So now that I've talked about the why, I wanted to dig into the what. If you're already familiar with Iceberg, this is not gonna be news to you, but I want to make sure we're all speaking the same vocabulary, especially when it comes to what Iceberg even is. So I'm gonna give you three key terms for talking about Iceberg.

Speaker 2   00:09:04    
A lot of the time, vendors are referring to the table format, the data catalog, and the REST protocol, which is often referred to as the Iceberg REST API as well. The table format is essentially how the underlying data is actually stored. The data catalog is where the Iceberg metadata is organized and made accessible. And I think the most fun part of Iceberg is actually the REST protocol. This is the standard way of interacting across catalogs, very similar to what MCP servers are doing today for AI agents. It also means that if you have a platform supporting that protocol, it opens the door for other engines to read metadata from your catalog, so the compute engine always knows where the object is stored, what it's actually called, and even which files it needs. This protocol is very key to ensuring that Iceberg gets to live in an open ecosystem, because systems really need to be able to speak to each other with one standard rather than custom integrations, which, let's be honest, with competition, most vendors will not wanna build.
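
For readers following along, the REST protocol boils down to a small set of standard HTTP endpoints that any engine can call to discover and load tables. A rough sketch (paths follow the open-source Iceberg REST catalog spec; the exact prefix and authentication depend on the catalog):

    GET /v1/config                                   # catalog capabilities and defaults
    GET /v1/{prefix}/namespaces                      # list namespaces
    GET /v1/{prefix}/namespaces/{ns}/tables          # list tables in a namespace
    GET /v1/{prefix}/namespaces/{ns}/tables/{tbl}    # load a table: returns the schema,
                                                     # snapshots, and metadata/data file
                                                     # locations the engine needs to read it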

Speaker 2    00:10:16    
Now, I stole this wonderful slide from one of my coworkers, Leah, and one of the reasons I love it is that it really shows all the standards and tools necessary to make Iceberg powerful and part of this open ecosystem. I told you about the components of Iceberg as I think of them; now let's apply them to this ecosystem slide. You can see we're starting out with the storage. With Iceberg, in order to make it interoperable, you need your data accessible in a central location so all the compute engines can actually reach it. This also gives you the ability to say, hey, I truly own my data, it's in my infrastructure. And to do that, you're gonna wanna store it in the Iceberg table format, or, you know, through a way to make it Iceberg-compatible.

Speaker 2   00:11:06    
And then you have your catalogs. I've listed some Iceberg catalogs and also some Iceberg-compatible catalogs like Unity, which are going to manage your Iceberg metadata and tell the compute engine where to look for your data. And the more fun part is the compute engine, which is actually what processes that data and reads from and writes to the catalog. Now, this is probably not surprising for me to say, but the benefit of an open ecosystem is being able to show you a stack like this: fully integrated, interoperable, in the dbt ecosystem. And specifically among our customers, we're starting to see that this is more common. I think about 50% of our users are currently using multiple compute engines inside their company. So we're seeing more companies start to have multiple compute engines, storage layers, and even data catalogs.

Speaker 2   00:12:05    
But this also leads us to another thing. Drew Banin, who is one of our co-founders, loves to say: more data, more problems. And that's really true. The more data you have, and now that you're unlocking the blockers to accessing yet another team's data, well, you've got yourself a lot of fun. The stakes just got higher, and governance and trust become paramount to ensuring you have a good analytics workflow. And this is where dbt comes in. The goal of dbt is to help you manage that complexity and bring in the standardization you need to ensure, hey, this is the one way everyone is working together. Where we live, as an abstraction layer, is in ensuring that the way you interact with dbt on one compute engine is exactly the same as with another.

Speaker 2   00:13:03    
Basically, if I jump into a dbt project, I should know exactly what to do whether I'm executing it against Databricks, Snowflake, Spark, whatever you would like. I also just realized I might have gotten a little ahead of myself. If you are new to dbt and you're like, what are you talking about, I'll give you a high-level understanding of what dbt is. Essentially, data teams today use dbt to transform their data. They're creating cleaned, prepared, tested datasets that can be used to power downstream use cases, whether that's AI or BI. The goal of dbt is to ensure that you can trust your data, and the decisions it feeds into, making sure it's accurate and consistent without ever lowering your productivity. And in terms of where dbt sits today in the open ecosystem: dbt is an open standard.
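
To make that concrete: a dbt model is just a SELECT statement in a .sql file, and dbt handles materializing it as a table or view. A minimal sketch, with made-up table and column names:

    -- models/orders_cleaned.sql: a minimal dbt model (illustrative names)
    select
        order_id,
        customer_id,
        lower(status) as order_status,    -- light cleanup
        order_total
    from {{ ref('stg_orders') }}          -- declares a dependency on an upstream model
    where order_total is not null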

Speaker 2   00:13:59    
It's widely adopted in the data warehouse and data lake ecosystem, and it is fully integrated into that ecosystem. I might be a little biased, but we have a lot of connectors, community-built, dbt Labs-owned, and vendor-owned, to connect to your data platform, and also to downstream tools like data catalogs and BI tools that query and consume the metadata and the assets we generate. The way I see dbt coming together is to create the cohesion you need with Iceberg in this open ecosystem. In terms of how that fits into the tool stack I showed, it sits with the compute engine, the catalog, and the table format. The first part of our Iceberg strategy is very focused on the table format and the catalog. So we have two integrations: one that is already out, which we launched last year, and one that's actually gonna be released to the open source community at the end of this month.

Speaker 2   00:15:06    
Actually, you can start playing around with it already, but it is still in beta. What we launched initially is model materialization. The idea here is that when dbt materializes a model as a table, the underlying adapter, whether it's Spark, Trino, Snowflake, or Databricks, is able to create an Iceberg table. Now, the catalog configuration starts to point to what I mentioned earlier with the REST protocol. With our new catalog framework, users are now able to define the catalog that the table's metadata is actually written to, whether that's the default information schema or an external data catalog, based on what your platform supports. We've been working on this for a while, and we're going to officially release initial support for it on May 28th on select adapters. I'm gonna talk a little bit more about that once I get into the demo.

Speaker 2    00:16:02    
But the reason this is super important is that dbt today actually uses and creates a significant amount of metadata before every run. What dbt is doing is querying the catalog, or the information schema, to find out what already exists, so it knows how to compile the code for conditional logic, for example taking the ref function and resolving the three-part name of the object. It also informs where dbt actually materializes the object. By supporting all of this, dbt gets to be clever and can adjust based on the environment, the code logic, and the use cases that have been defined in the dbt project. And these features are foundational to supporting what we are calling cross-platform dbt Mesh, which is an evolution of our initial launch of dbt Mesh.
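
Concretely, that catalog lookup is what lets dbt turn a ref() into a fully qualified name at compile time. A simplified illustration (the resolved names are made up):

    -- What you write in the model:
    select * from {{ ref('stg_orders') }}

    -- What dbt compiles and sends to the warehouse, with the three-part
    -- name resolved from the catalog/information schema for this environment:
    select * from analytics.prod.stg_orders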

Speaker 2   00:16:59    
The idea here is that each domain team can now have their own dbt project to hold their logic, and all of these projects can be connected to different platforms. With this, the dbt Mesh features are, you know, creating the right guardrails with versioning and access control across projects and across platforms, so everyone always knows what's actually available and, more importantly, all the things that could potentially break based on the dependencies. Now, I know I'm running a little short on time, so I'm gonna go a little bit fast here. I'm in my dbt project, and what you're seeing here is that catalog framework I mentioned earlier. The user gets to define a catalog, and the initial launch will support one write integration. So I get to say, hey, this is my catalog, and this is the catalog type, whether it's built-in; we do plan to roll out support for Iceberg REST-compatible catalogs soon.
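
A sketch of what a catalog definition in that framework can look like, based on the features described here; the exact keys vary by adapter and dbt version, and the names below are illustrative:

    # catalogs.yml (illustrative)
    catalogs:
      - name: my_iceberg_catalog
        active_write_integration: snowflake_built_in
        write_integrations:                  # initial launch supports one
          - name: snowflake_built_in
            catalog_type: built_in           # e.g. the platform's native catalog;
                                             # REST-compatible types are planned
            table_format: iceberg
            external_volume: my_external_volume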

Speaker 2   00:18:04    
And they can define, say, the table format, and also where the S3 bucket for all of your Iceberg data storage lives. Once that's been defined, I can go into the dbt project and into a dbt model. I can also apply this at the directory level, but I'm demoing here on a model. All I have to do is say, hey, I want this to be a table, I want the metadata materialized in this catalog, and, you know, this is actually going to be materialized in the Iceberg table format. And when I go ahead and execute (I've already run this), you can see that the code that actually gets generated and sent to the warehouse says: create my Iceberg table, and in this case, write to the built-in Snowflake Horizon catalog.
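
Roughly what that model config and the generated DDL can look like on Snowflake; this is a sketch rather than the exact output, and the object and volume names are made up:

    -- models/marts/fct_orders.sql (illustrative)
    {{ config(
        materialized='table',
        catalog='my_iceberg_catalog'    -- the catalogs.yml entry above
    ) }}
    select * from {{ ref('stg_orders') }}

    -- Approximately what dbt sends to Snowflake:
    create or replace iceberg table analytics.prod.fct_orders
        external_volume = 'my_external_volume'
        catalog = 'snowflake'           -- built-in Horizon catalog
        base_location = 'fct_orders/'
    as (
        select * from analytics.prod.stg_orders
    );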

Speaker 2   00:18:54    
Soon we'll be able to write to Polaris, Open Catalog, whatever is REST-compatible. And it just creates my object. What's also fun is that this particular model is actually selecting from a Snowflake proprietary table. So I'm able to, say, build everything inside of Snowflake and, in the last stage, put it into external storage in the Iceberg format so other compute engines can start to consume it. I also applied tests; you can see I have basic not_null and unique tests. What that does is essentially allow us to add additional governance to assure, hey, I'm sending these downstream and they can be trusted. And for the next step that we're launching, if you're interested, please come to our launch showcase next week; it's on Wednesday. We're gonna talk a little bit more about where we're going with cross-platform, but being able to parse out and know what an object's name is across different platforms is going to allow us to support cross-platform mesh. Okay, I think I might have hit time, so I'm happy to turn it over and answer some
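
Those tests are declared in the project's YAML. A minimal sketch (model and column names are illustrative; on older dbt versions the key is tests: rather than data_tests:):

    # models/marts/schema.yml (illustrative)
    version: 2
    models:
      - name: fct_orders
        columns:
          - name: order_id
            data_tests:
              - not_null
              - unique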

Speaker 1   00:20:03    
Questions. And questions there are. Let us quickly cycle through them. Does Apache Iceberg go against serverless cloud computing?

Speaker 2   00:20:17    
Oh, ultimately, with any adapter that we support, whether they're referring to something like Spark serverless or a serverless data warehouse or all of that, it just works based on the platform's support.

Speaker 1   00:20:34    
All right. Next up we have one: is there any way we could see column-level lineage using dbt Core?

Speaker 2   00:20:47    
Ah, great question. I highly recommend they come to the Launch Showcase next week. Our VS Code extension is now giving our community the opportunity to run Fusion in a VS Code extension, and that will give you column-level

Speaker 1    00:21:01    
And the link to that event.  

Speaker 2   00:21:06    
I would just Google it, 'cause we have a lot of advertising somewhere. You could just look up dbt Launch Showcase, and I'm pretty sure we put some money into it. All good

Speaker 1   00:21:15    
Back there. Alright. And then there was a question from when you were going through the slides and talking about this setup. Kevin was asking: is on-prem storage not an option?

Speaker 2   00:21:29    
Ooh, that's a good question. I would say it's very dependent on the adapter. Right now, there are some platforms that support reaching into on-prem storage, but you have to do it through a private link. I haven't seen many people who've done that very easily, to be

Speaker 1   00:21:51    
Honest. Oh, interesting. All right. Somebody's asking about the survey, so I'm dropping the QR code in there. There are more questions, and since we have a break right after this, where we will be giving away the AirPods Maxes, I'm going to cut into that break, but I will not cut into the trivia that gives away the AirPods Maxes. So Amy, is there a type of data that is best suited for dbt, or perhaps a certain read/write ratio?

Speaker 2   00:22:20    
Ooh. Right now dbt is very focused on structured data; we haven't really jumped into unstructured just yet. Outside of that, I don't think it matters. At least, none

Speaker 1   00:22:34    
That comes to mind. I like that, it's good to know. This is another great question: can we write a CTE for an Iceberg table, or similar, like in Snowflake?

Speaker 2   00:22:51    
Yes. I assume they're saying CTE. What's interesting is that CTAS gets a little funny based on external catalog versus internal, but CTEs should work just the same, because we're just compiling the SQL. But yeah, it should work normally. CTAS is... ask me about Glue later,

Speaker 1    00:23:13    
<laugh>. That's getting very deep into the weeds. I like it, though. Yeah, that's a cool question. So Amy, thank you very much. We're gonna end it there, and we will keep on cruising.