Panel: The Rise of Open Data Platforms

May 21, 2025
Speaker
Arturas Tutkus
Engineering manager
KAYAK
Jing Li
Senior Staff Software Engineer
Uber
Dipti Borkar
VP & GM
Microsoft
Jonathan Rau
VP/Distinguished Engineer
Query

The rise of open data platforms is reshaping the future of data architectures. In this panel, we will explore the evolution of modern data ecosystems, with a focus on lakehouses, open query engines, and open table formats. We will examine how these open-source technologies are breaking down traditional data silos, enabling scalable, flexible, and cost-effective solutions. Panelists will discuss the impact of open standards on data accessibility, performance, and interoperability, while offering insights into the growing importance of community-driven development in shaping the future of data platforms. Join us for an engaging conversation about the convergence of open technologies and the next wave of data architecture evolution.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Adam - 00:00:07
The Cambrian explosion analogy was excellent, and I think it helped frame this day perfectly, for me and for a bunch of people I saw in the chat. Really, we're seeing this explosion not just in open data formats, but in engines too. Perfect analogy. And it's continuing to evolve, and I'm very excited to see what kind of light this panel will shine on this explosion moving forward. We're letting everybody in. Arturas, welcome to the stage. Hey, everyone. We've got Jing, Dipti, and Jonathan. All right, guys, take it away. I'll be back soon.

Vinoth Chandar - 00:00:47
All right, folks. Great to see everybody again. And yeah, welcome to this panel. I think this is gonna be a very casual deep dive into the world of open data. So, thanks for taking the time. I'll kick off with a quick intro about me. I'm a founder at Onehouse, and I primarily wake up and think about open data; I've been working on something around open data for a while now. And I also have a lot of amazing folks here who work in and around the area. So I'll quickly do intros here, starting with Dipti. Dipti is VP and General Manager at Microsoft, managing OneLake, and also has a long, storied career in databases and startups, but I won't go into that too deeply. Thanks for being here. We look forward to tapping into some of your wisdom around this area.

Vinoth Chandar - 00:01:46
We also have Arturas. Arturas leads the data platform at Kayak. I'm a long-term Kayak user here; I still use it pretty much every day. Arturas has seen the different variants of these systems and dealt with them in the trenches. So, Arturas, I'm hoping you'll lend a deeper practitioner lens to the conversation today. And we have Jonathan, VP and distinguished engineer at Query, who actually brings a very different perspective around data security, somebody who's looking at it from a bird's-eye view. And Jing from Uber, a senior staff engineer working on data infrastructure.

Vinoth Chandar - 00:02:43
And as you all probably know, Uber's one of these companies where we build a lot of the infrastructure. I used to work at Uber. We originated the transactional data lake, the OG term for what we call data lakehouses. So, Jing, happy to have you here and also learn as well. Alright, so let's kick it off. Dipti, I want to start from you. So you've seen this clear wave moving away from closed proprietary platforms towards open architectures. What do you think is fundamentally driving this? Is it cost, flexibility, community, something deeper? What's going on? Why now?

Dipti Borkar - 00:03:25
Yeah, thank you so much, and thank you for having us here. It's always fun talking about data and learnings as well. So there's definitely a big shift happening. Some of us who've been in this space for a while feel like we've been talking about it forever, but honestly, this is a foundational change that needs to happen across so many segments of customer bases. It's not just the Ubers of the world that need to go through it. And there are various reasons. While cost and flexibility are a big part of it, it comes down to three foundational things that we are seeing. One is just expanding use cases for data in general. There's so much more that customers and users want to do with their data. It used to be analytics and reporting; increasingly it's all sorts of AI use cases.

Dipti Borkar - 00:04:13
And so what this means is that data needs to be interoperable across systems. So the second point is interop across databases. We used to have a lot of proprietary formats in the past, and we tried to squeeze every bit of performance out of those formats. In some ways, the customer was locked in. Today it really is the interop that is very important for customers, because at the end of the day, it's their data, and they want to do different things with their data, possibly with different engines. So multiple engines then come into play. And if you truly want to interop across things, you need to have some consistency in the formats.

Dipti Borkar - 00:05:03
And those are kind of the core reasons why we're seeing this change. For Microsoft, it was a very big deal to move away from proprietary formats that have been optimized for years and years to open formats. We started off with Delta, we now support Iceberg using XTable, and we've been working with the community on this for some time. But it's not just one format, either. Flexibility means multiple formats, especially for a cloud: across our customers, we see all the options. And so that's how we're looking at it.

Vinoth Chandar - 00:05:42
Awesome. Great perspective. Let's cut to Arturas. So you've been at Kayak for over a decade. You've seen a bunch of things, right? So it's always great to hear from those building in the trenches. How do you see the move to open data? And if it's so great, why do we end up with closed platforms as the norm? And what do you think are the top three reasons, from a practitioner lens, why people move to an open platform?

Arturas Tutkus - 00:06:15
Thank you, Vinoth, for having me here. I'm really excited. As you said, very often we in the trenches don't get to say a lot; we're just digging there and doing our work. I think before talking about the move, we need to first realize why we need to move to open and what happened before: how did we end up in a world where we have these proprietary solutions available to us, and why is it so interesting or useful for us to be there? I think you are a very good example of that, which I just learned. You worked at Uber, you found a solution to a problem, and you said, well, we solved this problem; we might solve it for other users.

Arturas Tutkus - 00:06:59
And when you are a company which specializes in some activity, you don't necessarily want to solve all the problems. So you would just buy a solution that's available off the shelf. And that's great because, instead of mastering the development of a tool, you would rather rely on the team that does those things. Now, the interesting thing that happens is that if you are shaping your entire architecture around a single vendor, you basically start thinking like a vendor. And it might be that your business doesn't necessarily match the thinking of the vendor. And you might be in a situation where you need to figure out what your business actually needs. So you need to have this balance. There's always a balance between what is useful to buy and what is useful to make yourself. I think the balance is somewhere in the middle.

Arturas Tutkus - 00:07:51
And if you ask me when you should consider moving to open source or open platform solutions, I think you need to look into your business or organization and figure out what your requirements are and, of course, what the cost is. And then eventually you can think of things like that. For us it was more about how big we are and how much more we can scale with the given solution. And we ended up understanding, and I think the majority of companies find the same thing, that not all tools fit all needs. You sometimes need to have different tools, and open standards actually allow us to achieve that goal because you're not locking yourself into a single solution. You are able to move to other, maybe more optimized, solutions for different business needs.

Vinoth Chandar - 00:08:46
Got it. Got it. I think your point that you start to think like a vendor really hits home. I've seen that at a lot of companies, and I think it's a really good way of framing it. I want to go to Jing. So from these two conversations, the theme that is standing out here is that open data platforms bring a lot of flexibility. Uber has been a big champion of open data lakes and data lakehouses. Can you share more? I was there, and I think we overlapped a little bit, but I don't know how things are today. So can you share from your perspective what an open data platform means to you? First of all, what do we mean when we say open data platform? How do you define it? And also maybe share a little bit about how Uber is building and operating its massive platform: how do you use multiple engines? That's the aspect that I found very intriguing at Uber, as well as at LinkedIn before that. I'd love for you to share more of that with our audience here.

Jing Li - 00:09:57
Sure. Thanks. Super happy to be here. So, in my view, the open data platform is a multi-layer architecture. In the middle we have the transaction management, where we're leveraging the different open table formats: we talk about Iceberg, Hudi and Delta Lake. Underneath, there's also a physical data layer, which uses open data formats like Parquet, and that data is queried and read by different engines. That's where we talk about the open query engines. We at Uber have Spark and Presto. We also have a streaming solution using Flink. On top of that, we actually have a lot of applications. Take data ingestion at Uber, which is one application built on dual engines: it ingests data using Spark and also using Flink, offering different levels of latency SLA based on the business requirements.

Jing Li - 00:10:58
Beyond that, we actually also have a large number of table services, which do maintenance operations such as compaction and cleaning of old data. We also have table services managing compliance and legal operations, such as encryption and account deletion. Plus there are table services for cost efficiency, like pruning old and unused columns. Managing those table services is also a big task. There's a large number of them, so at Uber we're building systems to be able to run those table services as part of the open data platform. Beyond that, I talked about the application layer. There's also a growing trend at Uber, especially in the AI space. I think that's pretty common right now in the whole world. And all of those applications are being built on top of the open data platform at Uber.

Vinoth Chandar - 00:11:55
Got it. Got it. Awesome. So if you could share, do you guys use any closed proprietary engines at all, or are all your main engines open source engines on open formats?

Jing Li - 00:12:07
Yes. So for the open table format, we're actually using Apache Hudi, and for the open query engines, we have Presto, Spark and Flink. All of those are open source, and Uber is a big contributor to them.

Vinoth Chandar - 00:12:18
Alright, amazing. Alright. Jonathan, like I mentioned before, you have a very interesting vantage point here. You work in data security. You deal with all these different systems in one way or the other. How do you see this whole data lakehouse movement, if you will? Is this the de facto starting point now? Do you think we've achieved that level? Does it just start and end with open table formats like Delta, Hudi, Iceberg, or are there more challenges to actually making the open data lakehouse the norm in the industry?

Jonathan Rau - 00:13:01
No, I mean, well, first off, thank you for having me on the panel, and thanks to the conference for letting the security engineer talk on the open table format panel. But yeah, we sort of cross a lot of open formats as well as closed ones, right? Your BigQueries, your Redshifts and whatnot. Coming from the security perspective, I feel like my industry usually lags behind by a couple of years. Hudi and Iceberg have been around for almost half a decade, a little bit more. And now you're starting to hear security practitioners talk about them a little bit more, but not too much. But I hope it'll be de facto, if not de jure, as far as what you go with. And a lot of the impetus is around cost sensitivity, access, and just interoperability.

Jonathan Rau - 00:13:45
I think that's why my peers on the real data side of the house use the open table formats. But even still today, folks are using SIEMs like Splunk, LogRhythm, New Relic, Datadog and whatnot, which force you to put all your logs and all your telemetry into them, and I'm sure everybody listening and everybody here on the panel has probably seen one or two of those before. But now, with openness being kind of the mood, I suppose you would call it, it will be that way soon. We have a joke in the security industry: "The SIEM is dead, long live the SIEM." And I think the transactional data lake and open table formats are the closest we're actually gonna get to killing those. You have folks on the cloud side who want to return to on-prem.

Jonathan Rau - 00:14:32
So they're using things like Ceph and MinIO to build their own data lakes on S3-compatible storage on-prem. Or you have folks who want to use S3, GCS, or ADLS Gen2 over on the Azure side. It definitely starts with Iceberg, Hudi and Delta. Obviously, they all have their differences, and their own strengths and weaknesses. If I was building an alert data lake, where I needed to actually keep track of the canonical data coming from an EDR or some other security system, I might go with Hudi, but if I'm just writing in a ton of raw EDR data, I might go with Iceberg. And if I was setting it up for the first time and I don't like catalogs (you should use a catalog; I know Roy's listening somewhere, so I'm not gonna say don't use a catalog),

Jonathan Rau - 00:15:13
Maybe you'll go with Delta. But beyond that, obviously you have to work backwards from your use cases. I think the hardest part is not necessarily writing the data, even though there isn't a ton of write support across all the engines outside of the incumbent players, Flink, Spark and so on, but figuring out how to move all that data. A security company could easily have petabyte scale daily from one source system. Pretty soon we'll have companies that are at exabyte scale operating that way, so the pressure's only gonna mount. But then there are also the security requirements. You have the regulatory side of the house, where you have different privacy regimes that mandate things like the right to be forgotten, the ability to go in and scrub records.

Jonathan Rau - 00:15:55
Maybe you'll have ePHI or some other PI as defined by GDPR or HIPAA, HITRUST, that you need to remove. But by the same token, you need the data that's in your lake, your warehouse and your lakehouse to be handled appropriately. So that means building minimum necessary privileges and different security controls and constraints without hitting the brakes on people. You want to build guardrails; you don't necessarily want to hit the brakes. But it's an exciting time in the industry. And it's not that the closed formats will really go away anytime soon. We're kind of forcing them: all of them now have Iceberg support, which is pretty funny. I think the last thing is more interoperability. Like I said when we introduced ourselves before the panel today, we need a clear winner. I'll keep my opinion to myself on what the winner should be.

Vinoth Chandar - 00:16:56
Alright, fair enough. Just to touch on that, there was an earlier BigQuery talk where I asked the same question, which is, what is the majority right now? Is it closed or open? Technical opinions aside, I think it'll be good for the industry overall to flip to an open format as the default. So I'm also waiting for that day; in warehouses, open is still not the default. The closed formats are the default if you use any managed service. I think we first need to cross that bridge. We spent a lot of time on the unification of the table formats, and in some sense we've done a lot of work around that and managed to make progress. But the real elephant in the room is closed format versus open format. That's something that remains to be seen.

Vinoth Chandar - 00:17:51
And I'm really hoping that in the coming years open becomes the default across warehouses everywhere, and then we actually start to see it truly happen. Anyways, cutting through that. Dipti, you've been a huge proponent of the disaggregated data stack. I don't know what the latest cool term for that is; that's what we used to call it a couple of years ago. Now in this conversation, we've talked about open formats, closed formats, open compute engines, closed compute engines. There's also a lot of work around open catalogs and closed catalogs. If you look at it, the stack is getting disaggregated, and you have an open and a closed flavor for each layer. In this model, how should companies be thinking about build versus buy, and where to go open versus closed? What do you think are the factors that you would encourage people to closely consider to make those decisions?

Dipti Borkar - 00:18:53
Yeah, and I think this is tied to your previous comment about open, closed and so on, right? Disaggregated stack overall, I think everyone agrees that that's a non-regrettable move. You have to be moving to a disaggregated stack. The question is open or closed, off the shelf or truly open source, there are options. But from a scale perspective, especially with AI, we could argue the stack was built for AI, actually it originated before AI, but it turns out it's a very good stack for AI as well. So as we think about it, it's a non-regrettable move to have a disaggregated stack. You've gotta have that. Then on top of that, the question becomes open formats or closed formats.

Dipti Borkar - 00:19:47
I think we are very much moving towards open formats. I would argue that if a cloud company like us can move to open formats, everyone can move to open formats. It was a big deal. It's hard; it's open-heart surgery on the query engine to be able to support it, but it's possible. And I really hope that we move towards that. Then it comes to open source versus off the shelf, and that's where you have to assess what your company does, what the talent is, what your custom stack looks like. If you're on an open stack, you probably continue to be open, maybe pick Apache or Linux Foundation projects and so on. But if you're a smaller company that's starting off, you may not have the talent that's needed to build a truly open source stack. And that's where SaaS experiences really simplify things: API-first, SaaS-ified products that are super easy to use and that consolidate many different pieces while still keeping them disaggregated. So you get that flexibility, and that's gonna be really important.

Dipti Borkar - 00:20:34
So yes, there are a lot of different considerations. I would say talent and cost. Nothing is free; even if you pick open source, there's no free lunch. The cost will be in talent. But I think it's about assessing that, figuring out what your strengths are as an organization and then picking the right path towards it. With OneLake, what we are trying to do is go to multiple engines. So we can support many different flavors of table formats as well as engines, and you can mix and match. So the mix-and-match option will continue. I think it'll evolve and end up being a bit of a hybrid, is the way I see it.

Vinoth Chandar - 00:21:40
Got it. So one quick follow-up on the complexity part of it. The common criticism (and I'm a huge proponent of the disaggregated stack as well) is that it's so many vendors. Suddenly something that took one vendor is now taking four or five vendors. Sure, I have the flexibility, but how do you think about that?

Dipti Borkar - 00:22:05
Yeah, so if you piece it together yourself, then yes, there is complexity from piecing it together. A lot of platforms will have a disaggregated stack within their platform. So you might go with some options that the platform provides, but use an open source engine, for example. You can run Databricks on OneLake, for example. So that's where that combination might come in, with open formats and open engines. New open source models will come up, new engines will come up, and you don't want to get locked in. And that's where the open formats and table formats are really foundational, because up the stack you can change and mix and match, but if you don't have that foundation that's open, your interop is gonna be really hard.

Vinoth Chandar - 00:22:52
Alright. Got it. Awesome. Arturas, coming back to you, so many teams that I see, they take on a data lake project to say, oh, I'm gonna save cost on my warehouse. More recently, oh, let's do an Iceberg project to save cost on a warehouse. So how do you think an open data architecture actually saves cost compared to your warehouse? Where do these cost savings come from? And there's complexity in building your own open source stack. Do the cost and complexity of building it cancel the gains that you get? So what teams should and shouldn't be building open data in your mind?

Arturas Tutkus - 00:23:44
That's a great question. Before I answer that, I would just like to say that I completely agree with what Dipti said; that's the reality. Having the open storage format or open table format is the root point, because this is where your data is, and on the flexibility it gives you, I 100% agree. So for myself, I'm still trying to figure out the lakehouse value: what value does it propose? I'm not saying that this is a bad technology or anything like that. I just still need to validate for myself what type of problem it's solving.

Arturas Tutkus - 00:24:39
Now to your question about teams going into the lakehouse, I would say the following. Look into your data. If you have just a couple megabytes of daily data, you don't need to go to these open source formats. Postgres will do just fine. If you have gigabytes of data, then you most likely don't really have a data engineer or maybe already have one but still don't have a team. It's expensive. I completely agree again with what Dipti said, talent is expensive and just even starting on this project will be painful. So if you just have one gigabyte of data per day, just go with any of the vendor providers. They will take care of you and you will have an easy go.

Arturas Tutkus - 00:25:36
If you get into a situation where you say, look, I have staging data and I want to add a logical layer on top of it, now we're saying that you need to have a data layer. If you want to support ACID on top of that, then this is where the lakehouse comes into play. In my personal opinion, I will always be more pro data warehouse when it comes to performance and what you're able to do. So the lakehouse solves certain problems, which many companies have, but only companies of a certain size. So my recommendation for anyone going there is to figure out if you really want to go there, if you really should go there. If you really need to go there, then the conversation is different.

Arturas Tutkus - 00:26:41
And I think an open source format helps you a lot because, first of all, once you figure out your storage, you can choose from a variety of compute engines to work on top. You can switch and mix and match. Even some data warehouses have made this a tradition: they expose the Iceberg format via the external table interface, or Parquet in the past. The industry is just going there, because everyone understands it. Even today, if you look at new products, they'll always have S3 support for storage and then an Iceberg or Parquet reader to read from it. So this will eventually cost you money, but only if you truly need to go there. And that's the important thing you need to answer for yourself.

Vinoth Chandar - 00:27:18
Alright, awesome. I think we are running very tight on the clock. So very quickly just to close it out, starting from Jing, maybe go around and talk about anything that you're personally excited about in this open data space. Just stuff that you're looking forward to. Maybe you want to go first? We have I think a minute.

Jing Li - 00:27:38
Yeah, sure. Looking forward, people have already talked about interoperability between different table formats. This has been going on with Iceberg, which is the transform format and the majority makes the work easy. The clients, the customers, do not need to worry about which format to use; they just leverage the benefits. This is exactly what we want to enable: to adopt all the different kinds of open table formats and leverage the strengths of the underlying technologies. Apart from that, there's another thing: we are partnering with Datastrato to enhance the open data catalog, Gravitino, and building all the different data agents at Uber. This open catalog becomes the core concept and foundation of an information hub, where we can have those agents talk to each other, be smart about all the information, and give the business another push.

Vinoth Chandar - 00:28:39
Awesome. Jonathan, do you have any?

Jonathan Rau - 00:28:43
Yeah, multi-engine writes for sure. And next-gen tools. I love Spark. I hate managing Spark, I hate jars. It'd be great to see more Rust-based SDKs come out, so that I don't have to use 20 other tools just to write an Iceberg table. Let me do it. And I guess the milquetoast answer is more security features: the Iceberg REST catalog getting extended to build in more authentication and authorization, use SSO, things like that.

Vinoth Chandar - 00:29:08
Right. We're really up against the clock, so very quickly, Arturas, what are you looking forward to?

Arturas Tutkus - 00:29:16
Yeah, I mean, honestly I think it's a miracle that we've come down to three formats across the industry. Now let's get to another miracle and maybe come up with a catalog or something that's more standardized across the board; otherwise it becomes hard for everyone. I think AI use cases on top of open lakes and open data are gonna be phenomenal. So that's what I'm looking forward to. And personally, just quickly: I love S3 as storage, but come on. It's already been 20 years in the business, so maybe we can come up with something better. I mean, that's what I would look for.

Vinoth Chandar - 00:29:57
Oh, that's a big topic. We should chat. Alright. Thanks everyone for being here. This was awesome. This was such a great chat. We could go on and on. But thank you for being here.