OneLake: The OneDrive for data
OneLake eliminates the pervasive and chaotic data silos created when developers configure their own isolated storage. OneLake provides a single, unified storage system for all developers, making it trivial to unify data across an organization and across clouds. With the OneLake data hub, users can easily explore and discover data to reuse, manage, or gain insights from. With business domains, different business units can work independently in a data mesh pattern, without the overhead of maintaining separate data stores.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Speaker 1 00:00:07
This time we've got another hard hitter. Josh, I'm gonna bring you up onto the stage. Where you at? Where you at? There he is. How you doing, man?
Speaker 2 00:00:18
Good. Yeah. Thank you for having me.
Speaker 1 00:00:20
I'm very excited for your talk. I do not want to get in the way at all. So I'm gonna let you share your screen and I'll be back in 20 minutes to ask you a few questions.
Speaker 2 00:00:31
Great, thank you. Yeah. So I'm Josh Kapp and I'm the director of Product Management for OneLake at Microsoft. I'm going to talk about how you can unify the entire data estate with OneLake. OneLake, if you're not familiar, is actually the data foundation of Microsoft Fabric. And if you're not familiar with Fabric, Fabric is a new unified data platform we've launched that brings all the different Azure data services into one SaaS service. This includes Azure Data Factory, all the analytics services like data warehousing and data engineering, even SQL databases and real-time analytics, and Power BI; they're all part of this now-single SaaS service of Microsoft Fabric. Now, OneLake is the foundation, and in everything we did in Fabric, we took a very lake-centric approach. That wasn't always the plan. OneLake wasn't originally part of the plan for Fabric. When we were getting ready to launch it, we sat down and talked to a lot of customers about their data lake strategies, and we saw some very common themes.
Speaker 2 00:01:35
Every customer had very high expectations of data lakes. They had these visions of pristine data lakes that provided one place to land all data, whether structured or unstructured. And because it's in one place, it would automatically break down data silos and make it much easier to blend and analyze data together. Because it's in one place, it also simplifies security: you only have to secure data once. Discovery too; your data is all in one place, you look, it's there, it's easy to find. It gets into the hands of all the users and applications that need it. That was the goal; that was the vision. The reality? I like to compare it to file sharing before Dropbox and OneDrive came along. If you remember how we used to share files, it was a very storage-oriented solution. You'd buy storage, you'd rack servers, and you'd create these network file shares and folders, and you'd use that to share files.
Speaker 2 00:02:23
Then Dropbox and OneDrive come along and give you a way to share files, and they've evolved well beyond just storing and sharing files, to collaboration and governance of files. This is where I think we are today with data lakes. You don't buy a data lake; you buy storage. And very quickly that idea of a single pristine data lake goes out the window, sometimes for technical reasons, but often just for people and process reasons. It's very hard to coordinate across a large organization to have everyone agree on the same standards, the same way of storing data, the same way of placing data. It's much easier for every team to just go create their own. And that's how you end up with multiple siloed stores. Now, there are ways to connect across these, but the most common way to break down these silos is still data movement: consolidating the pieces of the data that you need in one place.
Speaker 2 00:03:10
And even once you do this, even once you get the data together, most users and most applications still can't talk directly to the storage; they're just not capable. So you build serving layers: data marts, data warehouses, cubes. And these don't typically reference the data; they're usually another copy, or sometimes a copy of a copy. And the organization has to build systems to manage this <laugh>: systems to keep this data in sync, systems to explain why two numbers that are supposed to be the same don't actually match, systems to secure and govern all of it. It's a lot of work, but there's a lot of value you get out of this data once you're actually able to achieve it. So just like Dropbox and OneDrive give you a SaaS service for document sharing, with OneLake we aim to give you a SaaS service for data lakes, providing that value out of the box so you just need to focus on the data itself. Now, how do we do this?
Speaker 2 00:04:07
We looked at some of the challenges. This idea that it's easier to create your own lake than to reuse one: we wanted to stop that. So there can only ever be one OneLake within an organization. It's automatically provisioned; just like you'd only have one Teams, SharePoint, or Office tenant, you only have one OneLake. And all the data within the organization automatically lands within that OneLake. There's a tenant admin that sets the initial controls, the initial boundaries, and ultimately the initial governance, so that anything that lands within that OneLake is governed out of the box. But we don't want that admin to be a gatekeeper. So just like in Teams or SharePoint, where you don't have to go through an admin to create a SharePoint site or a Teams channel, we have the concept of workspaces. Anyone can create a workspace. Each workspace has its own admin and its own access control. It allows different parts of the organization to work independently while all still contributing to the same lake.
Speaker 2 00:05:09
Now, we took this a bit further. A workspace tends to correspond to a team or a project, and a typical business domain will have multiple teams and multiple projects. So we borrowed from the data mesh pattern and productized those features into OneLake. The concept of domains is now available within OneLake: you can wrap a set of workspaces within a business domain. This lets you manage multiple workspaces, with an admin that can control multiple workspaces and manage multiple projects. This also gives us another opportunity. A lot of times customers will ask me, how do we prevent people from bringing data into OneLake? We want to avoid data swamps; we don't want the official data sitting next to the unofficial data. And I warn them: if you prevent people from bringing data in, it's going to go somewhere else.
Speaker 2 00:06:02
If it goes somewhere else, you don't know if it's been secured, you don't know how it's being governed, you don't even know how it's being used. So you want them to be able to land it in OneLake. And you can use data endorsements, whereby in each domain you can go in and certify and recommend the official data, and that's discoverable when someone goes to reuse that data. What should happen is the official, certified data rises to the top of the lake while everything else sinks to the bottom. But if that doesn't happen, you can take action: you can go certify that data or move it out. You're not working blind anymore. Now, another thing that's very typical in the data mesh pattern is that you're going to build data products that consume data from multiple domains.
Speaker 2 00:06:49
And since everyone's working in the same data lake, there's no need to move, copy, or duplicate data just to reuse it somewhere else. So we have this concept of shortcuts. Shortcuts are just like shortcuts in Windows or in Linux: they're a pointer. They point from one location to another, and they let you virtualize data into another location without data movement or duplication and without changing the ownership of the data. So this provides the connections between domains, and it provides a lot more. When we were building this, we realized there was no data in OneLake yet; we hadn't released it. But there was data in other places, and many of those places are open storage platforms. Azure Blob Storage, Azure ADLS Gen2, Amazon S3, Google Cloud Storage, anything that's S3-compatible: they all store data in open formats on open storage systems.
Speaker 2 00:07:41
And with shortcuts, we can just point to those directly as well: create a reference to something outside of OneLake and virtualize it back into OneLake. So whether data is physically in OneLake or virtually in OneLake, it all looks and works the same to someone consuming it. You don't have to know where it's coming from; you don't have to know it's S3. You can use it all with the same set of APIs and the same hierarchy, and consume it like any other data that had been physically loaded or moved, whether that data originates inside OneLake or comes from outside, even outside Azure.
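To make that concrete: because OneLake speaks the same ADLS Gen2 data plane protocol, existing storage clients can read it. A minimal sketch in Python, assuming the azure-storage-file-datalake and azure-identity packages; the workspace and lakehouse names are illustrative, not from the talk:

```python
# Minimal sketch: reading OneLake through its ADLS Gen2-compatible endpoint.
# "SalesWorkspace" and "SalesLakehouse" are illustrative names.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# OneLake exposes one account-style endpoint for the whole tenant;
# each workspace behaves like a container (file system).
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("SalesWorkspace")

# Shortcut targets are listed exactly like physically loaded data.
for path in fs.get_paths(path="SalesLakehouse.Lakehouse/Tables"):
    print(path.name)
```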
Speaker 2 00:08:14
Let me actually show you how this looks; we're going to build a data mesh. Here I'm in the Fabric portal, and you can see the list of workspaces I have. Creating a workspace is very lightweight, takes a few seconds. I'm going to go to an existing one, and I have a data item here. A data item is something like a lakehouse or a warehouse; here I happen to have one data warehouse, and it happens to have a schema in it. Now, this experience should feel like any data warehousing experience you've ever worked with; it should feel like SSMS, basically, in the web. I can see my one schema and my one table. Even though this looks and feels like a data warehouse, all this data is actually being stored in OneLake in an open file format. I can see this by going through Windows. In Windows, I can see the same list of workspaces, but this time they're folders. I can see that data warehouse, I can see my schema, I can see my table. Now, creating a new table is just like it would be for any data warehouse table. I'll use T-SQL here: I'll create a table, I'll insert one row.
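The demo's T-SQL is simple; as a hedged sketch, the same two statements could be run from Python against the warehouse's SQL endpoint. The server address comes from the portal, and the database and table names here are placeholders:

```python
# Sketch: the demo's T-SQL run against a Fabric warehouse SQL endpoint via ODBC.
# Server, database, and table names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<sql-endpoint-from-portal>.datawarehouse.fabric.microsoft.com;"
    "Database=SalesWarehouse;"
    "Authentication=ActiveDirectoryInteractive;"
)
cur = conn.cursor()

# Create a table and insert one row; the warehouse engine persists this
# to OneLake as a Delta log plus Parquet files, as the file view shows.
cur.execute("CREATE TABLE dbo.sales (id INT, amount DECIMAL(10, 2));")
cur.execute("INSERT INTO dbo.sales (id, amount) VALUES (1, 99.95);")
conn.commit()
```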
Speaker 2 00:09:26
And what's going to happen when I flip back over to look at the file view, the actual lake view, is you'll see the new table get created. You'll see a Delta log in here, and you'll see an actual Parquet file, which stores our data. So anything that can work with SQL, any application, any user, can read or write data to a lake in an open file format. You don't have to learn any new skills here. Now, why is this important? Well, let's say I want to reuse this data. Let's say I'm going to build a data product in a different business domain, as a different user. I don't have to ETL this data out; everyone's working over the same lake. So I flip over to a different workspace in a different domain, and this time this user is going to create what we call a lakehouse.
Speaker 2 00:10:07
A lakehouse is the most lake-like data item we have. It supports both structured and unstructured data, and anything you can do in storage containers you can basically do here in the lakehouse, but it also has that structured aspect to it. I'm going to reuse that data from the other domain: from the data warehouse we just created, the table we just created. And I don't have to go ETL it; I don't have to set up a pipeline. What I'll do is create a shortcut, and this time I'll say I want the data from OneLake. That brings up the OneLake catalog, which shows all the data I have access to, and I can even filter it down by domain. So let's filter it down to the sales domain. I can see all the data in the sales domain, and I can see the official data; you see that data endorsement there, showing me what is officially certified, plus the other stuff that's there. And I'll take the data warehouse and the table we just created.
Speaker 2 00:10:58
And I'll just link to it. Within seconds it appears here as if I had moved it, as if I had copied it, but no data movement has happened. Now, to build a consolidated data lake, I have to combine data from lots of sources. Maybe my data isn't even in Azure. So let's say I have data coming from Amazon S3; my customer data is there. I can create a shortcut to Amazon S3; all I have to do is provide my bucket information. And there's something different about this data. The last data I showed you was Delta; it had a Delta log in it. This happens to be Iceberg; it's actually being written by Snowflake as Iceberg tables.
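The demo does this through the portal UI; programmatically, the same thing goes through the Fabric REST shortcuts endpoint. A hedged sketch: the endpoint and body shape follow my reading of the OneLake shortcuts API, and every ID is a placeholder:

```python
# Hedged sketch: creating a OneLake shortcut via the Fabric REST API instead
# of the portal. All GUIDs and names are placeholders; the request body shape
# reflects the OneLake shortcuts API as I understand it.
import requests

token = "<AAD-bearer-token>"          # scoped to https://api.fabric.microsoft.com
workspace_id = "<lakehouse-workspace-guid>"
lakehouse_id = "<lakehouse-item-guid>"

resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
    f"/items/{lakehouse_id}/shortcuts",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "Tables",   # where the shortcut appears in this lakehouse
        "name": "sales",
        "target": {
            "oneLake": {    # internal shortcut pointing at the warehouse table
                "workspaceId": "<warehouse-workspace-guid>",
                "itemId": "<warehouse-item-guid>",
                "path": "Tables/dbo/sales",
            }
        },
    },
)
resp.raise_for_status()     # appears in seconds; no data is copied
```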
Speaker 2 00:11:53
I'm going to go create the shortcut to it even though there's no Delta log here. And again, within seconds it's going to appear in my lake as if I had copied it, as if I had moved it; no data movement has occurred. But what's special here is that many of the engines in Fabric only understand Delta Lake. Under the covers, we're actually using Apache XTable so that our engines and our users don't have to care. Any data provided to OneLake will automatically get both Iceberg and Delta metadata supplied. So if I look at this table now, it previews, and if I look at the files behind it, you're going to see something here.
Speaker 2 00:12:42
We have both the Delta metadata and the Iceberg metadata provided, so the engines only have to support one of the formats and they're able to read from it. So if I go and use the SQL engine in Fabric, I can select from that data, but I can actually join across these two. And by joining across these two, I'm joining across clouds, I'm joining across domains, and I'm joining across file formats. And when new data is added, like when we finally get that second sale, all I have to do is rerun my query and I always get the latest data. I don't have to worry about working off a stale copy, because there isn't ever any copy here. New metadata is virtualized automatically; new data shows up automatically the next time I run. We're building a single lake, physically and virtually, across file formats. And this is really important because, you know, Microsoft is one company, but we have multiple engines that were built over many years for many different reasons, and they're good at different things. All of them have been reworked to store their data in OneLake. Any tabular data is stored in Delta Lake format, which means they can all read the same data without creating copies, without duplicating data.
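In a Fabric notebook, the cross-cloud join from the demo might look like the sketch below; spark is the session the notebook provides, and the table and column names are illustrative:

```python
# Sketch of the demo's cross-cloud, cross-format join in a Fabric notebook.
# 'sales' is the shortcut to the warehouse table (Delta); 'customers' is the
# S3/Iceberg shortcut. Table and column names are illustrative.
df = spark.sql("""
    SELECT c.customer_name, SUM(s.amount) AS total_sales
    FROM SalesLakehouse.sales AS s
    JOIN SalesLakehouse.customers AS c
      ON s.customer_id = c.customer_id
    GROUP BY c.customer_name
""")
df.show()

# Rerunning after new rows land upstream returns the latest data:
# there is no copy to refresh, only metadata that is virtualized on read.
```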
Speaker 2 00:14:09
We started with Delta Lake, and that's what our engines write in. Now we have partnerships with Snowflake, we're working with them, and there are many others writing Iceberg. So with Apache XTable, we're building in seamless translation between Delta Lake and Iceberg. Our philosophy is that customers and engines shouldn't have to worry about it; file format should not be a way to lock you into one engine or another. As long as you support at least one open format, we will handle the translation for you and provide both sets of metadata. Right now, what's available is: you bring us Iceberg data, we'll give you Delta. We're in the final stages of making all the Delta data also available in Iceberg. At that point, any table you've brought into OneLake will have both sets of formats and work with any engine that understands either one, whether it's in Fabric or outside of Fabric. I think I hit my 15-minute mark; I can pause there for questions.
Speaker 1 00:15:08
Yo, that's some open data. All right, I like it. This is cool. Now, there's a lot of questions coming through the chat. I'm gonna start with the most upvoted; feel free, for folks that are active, to upvote the questions that you want to hear answered. The first one: how is history maintained in all the layers, bronze and silver and gold?
Speaker 2 00:15:38
So I would interpret that as the history of changes to the data. Yeah. If it's a Fabric engine running, it's going to be in Delta Lake, so we'll have time travel available there; we'll have the different versions of the Delta log, and you can decide your own compaction and cleanup. Because we're using these open formats, we're taking advantage of the features that are already there. Same if you're using, let's say, Snowflake to write through Iceberg: we'll take advantage of the features that are there in Iceberg. The great thing with these open formats is we didn't have to invent anything new here.
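That answer is pointing at Delta Lake's built-in time travel; a minimal sketch in a Spark notebook, with an illustrative table path and version number:

```python
# Minimal sketch of Delta Lake time travel: the Delta log keeps each table
# version, so older snapshots stay queryable until compaction/VACUUM removes
# them. The path and version number are illustrative.
old_snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 3)          # or .option("timestampAsOf", "2024-01-01")
    .load("Tables/sales")
)
old_snapshot.show()
```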
Speaker 1 00:16:12
<laugh>. Yep. You get to stand on the shoulders of giants. I like it. Is it a good strategy to use shortcuts for S3 and database mirroring for Databricks in OneLake? How does latency and speed for queries look?
Speaker 2 00:16:26
Yeah, shortcuts in general: they're pointers to data, which means they're not physical. But typically we're running distributed systems on top of this data anyway, so data is coming from multiple different places. If I shortcut to data in, let's say, the same region, the same data center, there's really no difference whether it's a shortcut or whether it's physically in OneLake. Now, the further apart your compute is from where the data is actually coming from, the more physics starts to play a role. If you shortcut something on the other side of the world, then yes, you can start to see latency on a cold cache, and you can also see egress charges if you're coming from another cloud. We have smart caching built in with the exact goal of reducing that: if you turn it on, we will cache data locally once it's been accessed, mainly to reduce that latency and those egress charges when you're coming from a different cloud. The nice thing is that most of these file formats are immutable, so once we cache something, we can keep it pretty much forever, and new data just has to be added on top. It's a very simple cache, but it can be very powerful.
Speaker 1 00:17:37
Hmm. How does OneLake transparently unify data storage for all Fabric experiences, data engineering, data warehousing, et cetera, and what trade-offs should I know about when choosing file formats or folder structures?
Speaker 2 00:17:56
Yeah, so that was one of the big challenges we had: these different engines were built over different decades, in a lot of cases for very different purposes. Some were originally acquisitions, so in a lot of ways it's like going across different companies. We had the motivation to do it, and I think it took a big organizational change, and a challenge, to push us all in that direction, to even be motivated to do it at the time. We all felt it was right for customers. But it took some agreement: for one thing, locking on a single format at the time, and locking on at least a top-level, similar hierarchy.
Speaker 2 00:18:39
And then making sure that the information each engine needed was centralized in one place, so that any other engine could get it without having to go through the original engine. Now, different engines have different requirements. Take our data warehouse: they want to own their data, they want to guarantee the data is consistent and accurate, and they don't want anyone who isn't them writing to it. So we put controls in there to say, okay, you're the owning engine for this data; you're the only one who can write here, but anyone else can still read. They control the consistency, they control the accuracy, and they control the writes, essentially, but they're putting enough information in there that anyone else can read without going through them. So that was the first challenge: how do we make these formats work? These things are on different architectures, so even just getting the file formats right was one thing.
Speaker 2 00:19:28
They used different encoders, they used different compression, and they couldn't always understand each other. So working through all those details was a challenge, and we got through it. I think now the biggest challenge is security. We have a feature in preview now that lets you define security in the lake, both row level and column level, with filter predicates and join predicates, that can then be consistently and performantly applied across all engines. That would take a whole other hour to talk through how we did it. But yeah, standardizing these things, and doing it in a way where it can be easily reused, has been the mission we've been on. I won't say it's easy, but the file formats were definitely the great start, and we've been building patterns and practices around that to make sure it happens across everything.
Speaker 1 00:20:13
Speaking of file formats, does the source file format import automatically into the metadata or do you have to tag it yourself?
Speaker 2 00:20:23
I'm assuming this is referring to supplying the Iceberg metadata and the Delta metadata automatically? Yeah, so essentially, if you supply one of those formats, we'll know it's a table. We virtualize the file system to show both sets of metadata there, and then at runtime, when someone actually tries to access the other metadata, it's generated on the fly, so it's there for you to access at that point. What's nice about this is you don't have to change the publisher of the data, the writer of the data, to supply both formats. You don't even have to have write access to the data itself. Any data you can read that has at least one of these formats: just shortcut it into OneLake or store it natively. Either way, the other format will be there.
Speaker 1 00:21:07
All right, last one for you. Let's see, maybe not the last one, we'll see. Does the Fabric API have control over all the products, like OneLake, in the platform? Does that make sense? <laugh>
Speaker 2 00:21:28
Uh, I'll try.
Speaker 1 00:21:29
You know, what I'm trying to get at is: does the Fabric API have control over all the products?
Speaker 2 00:21:34
So there's Fabric and there's OneLake. The OneLake API is the data plane API, and that gives you a consistent view of the entire lake no matter who the writer is: whether it was a lakehouse, a warehouse, Snowflake, whoever writes to it, it's a consistent view. That's one set of APIs. And Fabric also has control plane APIs that span all the experiences, and CI/CD spans all the experiences too. We're really trying to give you a solution where you're not integrating products together; we're giving you one product that you can just use. That's the SaaS philosophy we've been following.
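As a rough sketch of the split: the control plane is plain REST (here, listing workspaces), while the data plane is the ADLS-style OneLake endpoint shown earlier. Token acquisition is elided, and the response handling assumes the documented list shape:

```python
# Sketch: Fabric control plane vs. OneLake data plane. This lists workspaces
# through the control plane REST API; token acquisition is elided.
import requests

token = "<AAD-bearer-token>"   # scoped to https://api.fabric.microsoft.com
resp = requests.get(
    "https://api.fabric.microsoft.com/v1/workspaces",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
for ws in resp.json().get("value", []):
    print(ws["id"], ws["displayName"])
```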
Speaker 1 00:22:12
One ring to rule them all. So the clarifying question before we move on: under the hood, basically, it's pointers to the storage, and this is a managed-table approach with a query engine on top?
Speaker 2 00:22:32
So, OneLake itself... I mean, you break anything down far enough and you'll get to some of the commodities. Under the covers it's object store, it's blob store, it's ADLS, just like if you break down OneDrive you're eventually going to get to blob store as well. Mm-hmm. So it's a SaaS service, a managed service, on top of that. The engines themselves are not part of OneLake. They might be part of Fabric or they might be outside of Fabric, like the Snowflake example I gave. There are N number of those, and the goal is that they can all operate over the same copy of data. So the engines are a separate piece from the actual lake itself. The lake itself is a managed service, just like any other SaaS service; my golden example is still OneDrive. It sits on top of the actual commodity storage services.
Speaker 1 00:23:26
Is OneLake built on ADLS Gen2?
Speaker 2 00:23:29
Yes. Which is built on storage, which is... yeah, keep going down the line there.
Speaker 1 00:23:35
<laugh>. Well, Josh, this was awesome man. There's more questions in the chat if you wanna jump in there and answer 'em. We're gonna keep it moving because as you know, I just gotta do one thing right today. And that is keep time. So Josh, thank you sir.