Data as Software and the programmable lakehouse
Two very big things happened in recent years that completely transformed the data landscape:
- Pre-trained AI models: These models have democratized AI, enabling software engineers to integrate advanced capabilities into applications with simple API calls and without extensive machine learning expertise.
- Lakehouses: Open formats over object storage bring together the flexibility of data lakes and the data management strengths of data warehouses, offering a more streamlined and scalable approach to data management.
And yet, most data platforms remain difficult to digest for traditional software developers.
This talk introduces "Data as Software," a practical approach to data engineering and AI that uses the lakehouse architecture to simplify data platforms for developers. By adopting serverless functions as a runtime and Git-based workflows in the data catalog, we can build systems that make it exponentially simpler for data developers to apply familiar software engineering concepts to data, such as modular, reusable code, automated testing (TDD), continuous integration (CI/CD), and version control.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Speaker 1 00:00:07
It is my great pleasure to bring a good friend of mine on to the stage to round out this incredible day. Mr. Cheeto, where you at, bro?
Speaker 2 00:00:22
Can you see me?
Speaker 1 00:00:24
Yeah. How you doing, dude?
Speaker 2 00:00:25
Okay. I'm pretty good. How about you?
Speaker 1 00:00:28
I'm great. It's awesome to have you here. Let me turn this music down so you can get rocking and rolling.
Speaker 2 00:00:37
Beautiful. Okay. Thank you very much for having me. We're gonna talk a little bit about Data as Software, and what we call the programmable lakehouse, which in practice means we're gonna talk a lot about AI. And it's a good segue from the previous talk. A little bit about me: I spent a lot of time doing machine learning and AI and building infrastructure to support those use cases. I built a company doing NLP and ML for recommender systems and search, got acquired, then led AI from scale-up to IPO at Coveo. My team and I built a lot of systems from the perspective of having data-heavy workflows and models embedded inside of software applications, like a recommender system or a search engine, something that is out there
Speaker 2 00:01:33
and there's a user that interacts with it in production. And a lot of that boils down to what I'm doing today. I'm the co-founder and CEO of a company named Bauplan. Bauplan is essentially a data platform, and we focus a lot on data transformation and AI workloads. We help developers build applications on their data very quickly and very seamlessly, and we help them by abstracting away the infrastructure: all the compute provisioning and configuration, all the containerization, the environment management. That's why it is completely serverless. We also help them by bringing very simple code-first abstractions, because that is a very good way to incorporate those workflows into a software application. And we build on S3, because we believe that the separation between compute and storage is here to stay.
Speaker 2 00:02:31
We're very excited about what is happening in the data lake and lakehouse space. We're gonna talk mostly about AI and how to bring AI to fruition. If I have to summarize what AI looks like today, it is essentially data plus code. What I mean by that is that one of the biggest differences between AI five years ago and AI today is that you now have pre-trained models, and pre-trained models make these incredible inferential capabilities available through just an API call. And that has a bunch of consequences. The first one is that more software engineers are now able to access those capabilities and bring that intelligence inside of software applications. So there is a broader range of developers, which is very exciting for organizations that want to see more ROI from their AI initiatives,
Speaker 2 00:03:25
because there are literally more developers that can do that. The other one is that your applications are gonna look a little bit more like software applications, because it is an API. So the workflow and the end-to-end life cycle of your applications now embeds these capabilities, but it looks a lot like a more traditional software engineering life cycle. And one of the main points that I wanna make today is that AI is essentially pushing data into software, and companies need more and more software-engineering-friendly platforms, abstractions, and tools. A lot of that boils down to being very code-first and code-oriented, so that developers can work more easily. Now, code effectively means Python for these things. Because Python is historically the lingua franca for data scientists and machine learning people, there's a lot of great support that comes from the history of Python.
Speaker 2 00:04:18
And right now, according to GitHub, it is the most widespread language, and it's growing fast, clearly driven by AI. So, also for people who run organizations: keep an eye on your Python workloads, because the importance of those workloads will change radically compared to what you're used to in the next few years. The bottom line is that you want ways for developers to access these capabilities as simply as possible and write their code against your data. So the question that remains open is what it means to bring data into the mix, because AI became radically simpler thanks to pre-trained models, but for data, it depends; there's still a lot of fragmentation and complexity. A good place to start with data, for AI use cases, is usually object storage.
Speaker 2 00:05:14
That, again, has historical reasons: that's where your unstructured data used to live. The image that you see now on the screen is a reference implementation that AWS advises developers to follow, and this is a very familiar kind of workflow if you did data science. You have some data in S3, some files in S3, and you're gonna read those files into a pandas data frame. You do your thing, and then you save your result as another file in S3, right? Now, this has been historically the way to do things, so developers in AI are well versed in object storage, but it hasn't been necessarily a very good way to do things, because when you wanna do things at scale and more systematically, files are not great.
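The file-based workflow just described can be sketched in a few lines. The bucket and file paths here are hypothetical placeholders, and reading `s3://` URIs with pandas assumes the `s3fs` package is installed:

```python
import pandas as pd

def daily_revenue(df: pd.DataFrame) -> pd.DataFrame:
    # "do your thing": aggregate revenue per day, entirely in memory
    return df.groupby("event_date", as_index=False)["revenue"].sum()

def run(bucket: str = "my-data-bucket") -> None:
    # read the whole file from S3 into a pandas data frame
    df = pd.read_parquet(f"s3://{bucket}/raw/events.parquet")
    out = daily_revenue(df)
    # save the result as just another file in S3
    out.to_parquet(f"s3://{bucket}/processed/daily_revenue.parquet")
```

Note that the entire file is pulled into the memory of the process before any work happens, which is exactly the cost the talk goes on to discuss.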
Speaker 2 00:06:04
What about schema evolution? What about transactions? What about performance? Because clearly there are better ways to process data than reading an entire file into the memory of my process, turning it into a pandas data frame, and saving another file, which, as you can see, takes a toll, especially when you work with large-scale datasets. And what about versioning, which is a very important capability to have in your stack, because it is so important when you do AI and ML to be able to iterate quickly on different versions of models and datasets. So, object storage: great, but not so great up until now. That's why what is happening nowadays in data lakes is so exciting for us. Open table formats really come to the rescue, bringing a lot of capabilities that used to live only in the warehouse world to the data lake. Schema evolution, for example: all three major open table formats support schema evolution.
Speaker 2 00:07:07
The developer now interacts with a higher-level abstraction, which is a table, a much more intuitive abstraction, and it's easier for us to keep track of what happens, because we don't really need to know what happens at the file level, which is a bit too low in the stack to reason cleanly about. Performance is a great improvement too, because instead of reading an entire file into the memory of my process, I can now say I want these columns, and the system can query and access only the data that corresponds to certain columns and certain filters. So I can do a lot of pushdown against object storage, which brings, you know, 100x improvements, especially if you work on large datasets.
Speaker 2 00:07:57
Versioning is another very important thing, as I was saying. It is particularly important because, if I have to single out one thing that your data, machine learning, and AI teams need to do to be very effective, it is fail fast. You are gonna train a model, fine-tune a model, or just try a model against a certain dataset. You get very quickly to the point where the result is not quite what you want. You wanna iterate on that and maybe change something in your data, change something in your model, or both, and you kind of progressively walk your way to where you want to be, which is great, and also is the kind of thing that never ends, because data will keep changing. So this progressive approximation to the optimal solution is something that a machine learning team and an AI team basically never stop doing.
Speaker 2 00:08:49
It's very important that you have good, clean ways to version your datasets and your models, and that you think a little bit about that like you would think about branches. I have my dataset, I run a model against it in one branch, I save those results somewhere, and now I open another branch and do that again with a different model, or a different dataset, or both, and so on. And again, all the major open table formats do support these capabilities, which is very exciting. So, to wrap up the takeaway message around what you should do to make it very easy for your organization to get AI into production, in real use cases that are part of applications that bring value to users:
Speaker 2 00:09:37
You wanna have a lakehouse. So you wanna have your data stack built on object storage plus open formats, and you wanna make sure that your developers can develop directly in Python. SQL maintains a very important role, especially in aggregation, but ultimately your lingua franca will be Python for all those applications that you build with AI: RAG, recommender systems, chat agents, a lot of data pipelines that do data augmentation or synthetic data generation, and so on. So it's a matter of bringing Python into the lakehouse, as closely and as efficiently as possible. Now, running Python in the cloud has historically been pretty complicated. We're not gonna talk about that; there's a lot of work that we did on that, but we're not talking about that.
Speaker 2 00:10:24
Let's talk more about this instead: I'm a Python developer, how do you bring the lakehouse to me? How do we make it easy for me to just access those abstractions and work with the data that are somewhere in a data catalog, in my object storage? One very important point that I wanna make here is that to make things effective in an organization, it's very important that people and teams can work together on shared abstractions, and that you minimize the silos and the amount of effort that it takes to go from one phase to another: from prototype to production to, you know, higher scale and so on. When you build software applications, you need a little bit of everybody. You need data scientists and data engineers doing the business logic. You need software engineers managing the applications.
Speaker 2 00:11:14
You need DevOps to manage the reliability and robustness of your lifecycle. These folks will share one thing, which is code. What they don't share, on the other hand, is infrastructure. So the abstractions of the code can be made in a way that different tribes in your organization share abstractions as much as possible. The more infrastructure you bring into the mix, the more you are gonna create silos. Essentially, what you wanna do here is abstract away as much infrastructure as you can and really piggyback on abstractions that every developer understands. Make a conscious and constant effort in making things as simple as possible: if it's not already familiar to a developer, an abstraction needs a justification in your system. So I'll give you a very practical example of this. Let's say that I have a lakehouse.
Speaker 2 00:12:13
So I have a data catalog on top of my object storage, and I have some tables there. I wanna start building my application, and I wanna do that in Python. So I want to take some data from that lakehouse, process it, and write something back into the lakehouse, right? One very good way of thinking about this is: what are the things that everybody knows, and what are the things that I should not take for granted? For instance, not every developer necessarily knows what containers are, and certainly not every developer knows what containers are in your specific organization, how you manage them and how you decided to organize them. But every developer understands what a package is. So a good way to abstract that away is: why don't we allow developers to declare directly in the code what packages they need, and then the system figures out how to create a container and run that in an isolated fashion in the cloud for them.
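The idea of declaring packages in code can be sketched with a small decorator. The decorator name, the registry, and the package list are all hypothetical, illustrating the pattern rather than any real platform's API:

```python
from typing import Callable

REGISTRY: dict[str, list[str]] = {}  # function name -> declared packages

def requires(*packages: str) -> Callable:
    """Attach a package manifest to a function, instead of a Dockerfile."""
    def decorator(fn: Callable) -> Callable:
        fn.packages = list(packages)          # metadata the runtime can inspect
        REGISTRY[fn.__name__] = fn.packages   # e.g. to plan a container build
        return fn
    return decorator

@requires("pandas==2.2.0", "scikit-learn")
def train_model() -> str:
    # the platform, not the developer, turns the declared packages into a
    # container and runs this function in isolation in the cloud
    return "trained"
```

The developer only ever touches a concept they already know, the package, while containers stay an implementation detail of the system.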
Speaker 2 00:13:12
The other thing is data. I have my process; I need to get data from my lakehouse inside of my process and do something with it. Not every developer knows what a data lake is. Not every developer is super familiar with Parquet or Iceberg or Hudi or Hive, but every person on earth knows what a table is: it's just a schema, a bunch of rows, and a bunch of columns. And the system should abstract away the way in which this table is actually implemented in your data lake. We don't care whether this is an Iceberg or a Hudi table. What we care about is that we can declaratively express it in the code, in the way that you see on this slide. This is just a bunch of columns and filters. The system should figure out how to fetch the data corresponding to this table in whatever implementation your system supports.
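A declarative table input might look like the following sketch. The `TableRef` class and `plan_scan` helper are made up for illustration: a real runtime would translate the reference into an Iceberg, Hudi, or Delta scan, while here we just render the request to show what the developer expressed:

```python
from dataclasses import dataclass

@dataclass
class TableRef:
    """A declarative table reference: what data we want, not how it is stored."""
    name: str
    columns: list[str]
    filters: str = ""  # a simple predicate the planner could push down

def plan_scan(ref: TableRef) -> str:
    # a real system would resolve this against the catalog and the table format;
    # rendering it as SQL makes the declared intent visible
    cols = ", ".join(ref.columns)
    where = f" WHERE {ref.filters}" if ref.filters else ""
    return f"SELECT {cols} FROM {ref.name}{where}"

orders = TableRef("orders", ["order_id", "revenue"], "revenue > 0")
```

Nothing in the reference mentions files, formats, or partitions; the developer works only with the table abstraction.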
Speaker 2 00:14:00
And finally, running things in the cloud is also challenging, depending on your level of abstraction. Not every developer knows how to deal with Kubernetes or is super familiar with Spark, but every Python developer on earth knows how to write functions, because that is literally what you do in Python: you write functions. So if you do something like this, every developer in your organization will be able to master the major abstraction that your system has. The same goes for your data management. Not everyone knows how to version data, or keep track of it, or use open table formats to do that, but every developer is very familiar with systems like Git. So the abstractions that we expose to them should, again, leverage the fact that these concepts are very clear in their mind.
Speaker 2 00:14:52
So if you look at the snippet of code on this slide, even if you never used the system and never saw it in your life, it's pretty easy to understand what it does, and that's really what you wanna do. This creates a branch, you put your data in the branch, you run in a sandbox; if something goes wrong, you roll back using the notion of a commit, which is very familiar to everybody. If everything goes well, you can merge it into your main data lake. Again, it's easy to understand what this snippet of code does, and it boils down to how simple your system should be and how little your developers should care about the implementation details. When you bring this kind of simplicity into large enterprises, which is the final point I wanna make,
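The Git-style workflow on the slide can be sketched with a toy in-memory catalog. The class and method names (`create_branch`, `commit`, `merge`) are hypothetical stand-ins; real systems implement these verbs as zero-copy metadata operations over open table formats:

```python
class DataCatalog:
    """A toy in-memory catalog with Git-style branches over tables."""

    def __init__(self) -> None:
        self.branches = {"main": {}}  # branch name -> {table name: rows}

    def create_branch(self, name: str, source: str = "main") -> None:
        # real systems do this as a cheap, zero-copy metadata operation
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch: str, table: str, rows: list) -> None:
        # writes land in the sandbox branch, never directly on main
        self.branches[branch][table] = rows

    def merge(self, branch: str, into: str = "main") -> None:
        # publish the branch's state; to "roll back", simply drop the branch
        self.branches[into].update(self.branches[branch])

catalog = DataCatalog()
catalog.create_branch("experiment")                   # sandbox off main
catalog.commit("experiment", "daily_revenue", [("2024-01-01", 15.0)])
assert catalog.branches["main"] == {}                 # main untouched so far
catalog.merge("experiment")                           # publish to main
```

The point is exactly the one made in the talk: a developer who has used Git can read this cold and predict what it does.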
Speaker 2 00:15:38
this gets very powerful. One of the largest companies that we work with is a fairly complicated organization, because Mediaset is one of the largest broadcasters in the world, and they clearly operate at a very large scale: they have petabytes of data, billions of events, and millions of users. So there are a lot of developers, a lot of different teams, a lot of different tooling. If you bring in a system that really is code-first, abstracts away infrastructure, and is based on things that everybody can easily understand, what you get is this nice progression that starts from one team, maybe data scientists, doing something simple, maybe analytics. And then that team gets to the point where they build more things, and now they can pass it over to data engineers to go to production.
Speaker 2 00:16:32
And maybe we start doing something more complex, like ML, and now we have two teams and two use cases. And then finally, when you're really ready, you wanna put this into production at large scale. This is a large organization with large applications, so now you need DevOps. DevOps comes into the mix, you have automation, and now you can do more complicated use cases, and you really have this progression that goes across the different teams in your organization, and you can get 10x faster in bringing more use cases to production, because more developers can basically work on the same concepts. If I have to wrap up, the message is basically that I think AI is turning data into software, and more and more we need software-engineering-friendly platforms, which for data means code-first platforms, which are not very common out there. If you're doing AI, lakehouse plus Python is probably the best way for you to get more ROI out of as much AI as possible. Branching and versioning: super important, and I cannot stress this enough. Good APIs, good APIs, good APIs: keep it simple, keep it easy, and every developer in the organization can work on the data, which really unlocks the next level of productivity. Yeah, if you wanna reach out, I'm always here.
Speaker 3 00:17:47
I wanna reach out.
Speaker 1 00:17:49
That was a very articulate conversation, very clear, very easy to follow. I appreciate you so much, man. Thank you for closing out the day with us. I'm excited for when we get to hang out again soon, hopefully.
Speaker 2 00:18:07
Fantastic. Always.