Data as Software and the programmable lakehouse

May 21, 2025
Speaker
Ciro Greco
Co-founder and CEO
Bauplan

Two very big things happened in recent years that completely transformed the data landscape:

  • Pre-trained AI models: These models have democratized AI, enabling software engineers to integrate advanced capabilities into applications with simple API calls and without extensive machine learning expertise.
  • Lakehouses: Open formats over object storage bring together the flexibility of data lakes and the data management strengths of data warehouses, offering a more streamlined and scalable approach to data management.

And yet, most data platforms remain difficult to digest for traditional software developers.

This talk introduces "Data as Software," a practical approach to data engineering and AI that uses the lakehouse architecture to simplify data platforms for developers. By leveraging serverless functions as a runtime and Git-based workflows in the data catalog, we can build systems that make it exponentially simpler for data developers to apply familiar software engineering concepts to data, such as modular, reusable code, automated testing (TDD), continuous integration (CI/CD), and version control.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Demetrios - 00:00:07  

It is my great pleasure to bring a good friend of mine on to the stage to round out this incredible day. Mr. Ciro, where you at, bro?

Ciro Greco - 00:00:22  

Can you see me?  

Demetrios - 00:00:24  

Yeah. How you doing, dude?  

Ciro Greco - 00:00:25  

Okay. I'm pretty good. How about you?  

Demetrios - 00:00:28  

I'm great. It's awesome to have you here. Let me turn this music down so you can get rocking and rolling.  

Ciro Greco - 00:00:37  

Beautiful. Okay. Thank you very much for having me. We're gonna talk a little bit about Data as Software, and what we call the programmable lakehouse, which in practice means we're gonna talk a lot about AI. And it's a good segue from the previous talk. A little bit about me. I spent a lot of time doing machine learning and AI and building infrastructure to support those use cases. I built a company doing NLP and ML for recommender systems and search, got acquired, and led AI from scale-up to IPO at Coveo. My team and I built a lot of systems from the perspective of having data-heavy workflows and models embedded inside software applications like a recommender system or a search engine, something that is out there.

Ciro Greco - 00:01:33  

There's a user that interacts with it in production. A lot of that boils down to what I'm doing today. I'm the co-founder and CEO of a company named Bauplan. Bauplan is essentially a data platform. We focus a lot on data transformation and AI workloads. We help developers build applications very quickly on their data, very seamlessly, by abstracting away the infrastructure: compute provisioning and configuration, containerization, environment management. That's why it's completely serverless. We bring very simple code-first abstractions because that's a very good way to incorporate those workflows in software applications. We build on S3 because we believe the separation between compute and storage is here to stay.

Ciro Greco - 00:02:31  

We're very excited about what is happening in the data lake and lakehouse space. We're gonna talk mostly about AI and how to bring AI to fruition. If I have to summarize what AI looks like today, it's essentially data plus code. One of the biggest differences between AI five years ago and AI today is that pre-trained models make incredible inferential capabilities available through just an API call. That has a bunch of consequences. The first is that more software engineers can now access those capabilities and bring that intelligence inside software applications. There's a broader range of developers, which is very exciting for organizations that want to see more ROI from their AI initiatives.
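
To make that concrete, here is a minimal sketch of what "inferential capabilities through just an API call" looks like with a hosted model provider. The provider, model name, and prompt below are illustrative choices, not from the talk:

```python
# A minimal sketch, assuming a hosted model provider (here the OpenAI SDK);
# the model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify the sentiment of: 'great product'"}],
)
print(response.choices[0].message.content)
```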

Ciro Greco - 00:03:25  

There are literally more developers who can do that. Also, your applications are gonna look a little more like software applications because it's an API. The workflow and end-to-end lifecycle of your applications now embed these capabilities, but it looks a lot like a more traditional software engineering lifecycle. One of the main points I want to make today is that AI is essentially pushing data into software. Companies need more software engineering-friendly platforms, abstractions, and tools. A lot of that boils down to being very code-first and code-oriented so developers can work more easily. Code effectively means Python here, because Python is historically the lingua franca for data scientists and machine learning people. There's a lot of great support that comes from the history of Python.

Ciro Greco - 00:04:18  

Right now, according to GitHub, Python is the most widespread language and it's growing. That is clearly driven by AI. For people who run organizations, keep an eye on your Python workloads, because the importance of those workloads will change radically in the next few years compared to what you're used to. The bottom line is you want ways for developers to access these capabilities as simply as possible and write their code against your data. The question that remains open is what it means to bring data into the mix. AI became radically simpler thanks to pre-trained models. With data, it depends: there's still a lot of fragmentation and complexity. A good place to start for AI use cases is usually object storage.

Ciro Greco - 00:05:14  

That again has historical reasons. That's where your unstructured data used to live. The image on the screen is a reference implementation that AWS advises developers to follow. This is a familiar workflow: if you did data science, you have some data, some files in S3, you read those files into a pandas data frame, do your thing, then save your result as another file in S3. Historically, that's how things were done. Developers in AI are well versed in object storage, but it hasn't been a very good way to do things. When you want to do things at scale and more systematically, files are not great.
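
For illustration, that file-based workflow might look like the sketch below. Bucket and file names are made up, and reading s3:// paths with pandas assumes s3fs is installed:

```python
import pandas as pd

# Read the whole file from object storage into memory...
df = pd.read_parquet("s3://my-bucket/raw/events.parquet")

# ...do your thing...
df["revenue"] = df["price"] * df["quantity"]
daily = df.groupby("date", as_index=False)["revenue"].sum()

# ...and save the result back as another file.
daily.to_parquet("s3://my-bucket/processed/daily_revenue.parquet")
```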

Ciro Greco - 00:06:04  

What about schema evolution? What about transactions? What about performance? There are better ways to process data than reading an entire file into memory, turning it into a pandas data frame, and saving another file, which takes a toll, especially with large datasets. What about versioning, which is very important when doing AI and ML to iterate quickly on different versions of models and datasets? Object storage is great, but up until now it hasn't been a great way to work. That's where what's happening nowadays in data lakes is exciting. Open table formats really come to the rescue, bringing capabilities that used to live only in the warehouse world to data lakes. Schema evolution: all three major open table formats support it.

Ciro Greco - 00:07:07  

The developer now interacts with a higher-level abstraction, a table, which is more intuitive and easier to keep track of. You don't need to know what happens at the file level, which is too low in the stack to reason cleanly about. Performance is a great improvement because instead of reading an entire file into memory, you can say you want these columns. The system provides a higher level of abstraction and can query and access only the data corresponding to certain columns and filters. You can do a lot of pushdown against object storage, which can yield 100x improvements, especially with large datasets.
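
Contrast that with a scan that declares columns and filters up front. This sketch uses PyArrow datasets over the same hypothetical bucket, so only the needed columns and row groups are actually fetched:

```python
import datetime
import pyarrow.dataset as ds

# Point at a directory of Parquet files instead of one monolithic file.
dataset = ds.dataset("s3://my-bucket/raw/events/", format="parquet")

# Column pruning + predicate pushdown: only this data is read.
table = dataset.to_table(
    columns=["date", "price", "quantity"],
    filter=ds.field("date") >= datetime.date(2025, 1, 1),
)
```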

Ciro Greco - 00:07:57  

Versioning is another very important thing, particularly because if I have to single out one thing your data, machine learning, and AI team needs to do to be very effective, it's fail fast. You train a model, fine-tune a model, try a model against a dataset, and quickly get a result that's not quite what you want. You want to iterate on that, maybe change something in your data, your model, or both, progressively walking your way to where you want to be. This progressive approximation to the optimal solution is something a machine learning and AI team never stops doing.  

Ciro Greco - 00:08:49  

It's very important to have good and clean ways to version your datasets and models. Think about that like branches: I have my dataset, I run a model against that in one branch, save those results somewhere, then open another branch and do that again with a different model or dataset or both. All the major open table formats support these capabilities, which is very exciting. To wrap up the takeaway message about what you should do to make it easy for your organization to get AI in production in real use cases that bring value to users:  
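
As a sketch of that branch-per-experiment idea, here's what it can look like in code. The `catalog` client and the `train`/`evaluate` helpers below are hypothetical, shown only to make the shape of the workflow visible:

```python
# Hypothetical data-catalog client; the API is illustrative, not a real SDK.
branch = catalog.create_branch("experiment/model-v2", from_ref="main")

# Each experiment reads and writes against its own branch of the data.
train_df = catalog.read_table("training_data", ref=branch)
model = train(train_df)                                  # your model code
results = evaluate(model, catalog.read_table("holdout", ref=branch))

# Results live on the branch; main is untouched. Compare branches,
# keep the winner, delete the rest.
catalog.write_table("experiment_results", results, ref=branch)
```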

Ciro Greco - 00:09:37  

You want a lakehouse: a data stack built on object storage plus open formats. Make sure your developers can develop directly in Python. SQL maintains an important role, especially for aggregations, but ultimately your lingua franca will be Python for all those AI applications: RAG, recommender systems, chat agents, data pipelines that do data augmentation or synthetic data generation. It's about bringing Python into the lakehouse as closely and as efficiently as possible. Running Python in the cloud has historically been complicated. We're not gonna talk about that. There's a lot of work we did on that, but not today.

Ciro Greco - 00:10:24  

Let's talk about how to bring the lakehouse to a Python developer. How do we make it easy for them to access those abstractions and work with data that lives in a data catalog on object storage? One very important point is that to make things effective in an organization, people and teams need to work together on shared abstractions, minimizing silos and the effort of going from prototype to production to higher scale. When building software applications, you need a bit of everybody: data scientists and data engineers doing business logic, software engineers managing applications, DevOps managing reliability and robustness. These folks share one thing: code. What they don't share is infrastructure.

Ciro Greco - 00:11:14  

Code gives you abstractions that the different tribes in your organization can share as much as possible. The more infrastructure you bring in, the more silos you create. You want to abstract away as much infrastructure as you can and piggyback on abstractions every developer understands. Make a conscious and constant effort to make things as simple as possible. If an abstraction is not familiar to a developer, it needs a justification to be in your system.

Ciro Greco - 00:12:13  

A practical example: I have a lakehouse, a data catalog on top of object storage, some tables, and I want to build my application in Python. I want to take data from the lakehouse, process it, and write something back. What are things everybody knows and what should not be taken for granted? Not every developer knows what containers are or how your organization manages them, but every developer understands what a package is. A good abstraction is to allow developers to declare in code what packages they need, then the system figures out how to create a container and run that in an isolated fashion in the cloud for them.  
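
A sketch of what that can look like; the decorator below is hypothetical, not a real SDK, but it shows the shape of declaring packages in code and letting the platform build and run the container:

```python
# Hypothetical decorator: the developer declares the packages the function
# needs; the platform resolves them into an isolated container at run time.
@python_environment(packages={"pandas": "2.2.*", "scikit-learn": "1.5.*"})
def clean_events(events):
    import pandas as pd  # available because it was declared above
    df = pd.DataFrame(events)
    return df.dropna(subset=["user_id"])
```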

Ciro Greco - 00:13:12  

Data: I have my process, I need to get data from the lakehouse and do something. Not every developer knows what a data lake is or is familiar with Parquet, Iceberg, Hudi, or Hive, but everyone knows what a table is: a schema, rows, and columns. The system should abstract how this table is implemented in your data lake. We don't care if it's Iceberg or Hudi. We care that we can declaratively express that in code as columns and filters. The system figures out how to fetch the data corresponding to this table in whatever implementation your system supports.
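
One concrete open-source example of that table-level abstraction is PyIceberg, sketched below assuming a catalog named "default" is already configured; the table and column names are illustrative:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")               # catalog configured elsewhere
table = catalog.load_table("analytics.events")  # a table name, not file paths

# Declare columns and a filter; the engine fetches only the matching data,
# whatever the underlying file layout is.
df = table.scan(
    row_filter="event_date >= '2025-01-01'",
    selected_fields=("user_id", "event_type"),
).to_pandas()
```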

Ciro Greco - 00:14:00  

Running things in the cloud is challenging depending on your abstraction level. Not every developer knows Kubernetes or Spark, but every Python developer knows functions because that's what you do in Python: write functions. If you do something like this, every developer can master the major abstraction your system has. The same goes for data management. Not everyone knows how to version data or use open table formats, but every developer is familiar with systems like Git. The abstraction you expose should leverage these clear concepts.  

Ciro Greco - 00:14:52  

Look at the code snippet on this slide. Even if you've never used the system, it's easy to understand what it does. It creates a branch, you put your data in the branch and run in a sandbox, and if something goes wrong, you roll back, with a notion of a commit that is familiar to everybody. If all goes well, you merge into your main data lake. It's easy to understand, and it boils down to how simple your system should be and how little developers should care about implementation details.
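
The slide itself isn't reproduced in this transcript, but a rough reconstruction of that kind of snippet, with a hypothetical Git-for-data client, goes something like this:

```python
# Hypothetical Git-for-data API, reconstructing the slide's workflow:
# branch, write in a sandbox, roll back on failure, merge on success.
branch = catalog.create_branch("ingest/2025-05-21", from_ref="main")
try:
    catalog.write_table("orders", new_orders, ref=branch)
    run_quality_checks("orders", ref=branch)   # audit inside the sandbox
    catalog.merge(branch, into="main")         # publish atomically
except Exception:
    catalog.rollback(branch)                   # main never sees bad data
```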

Ciro Greco - 00:15:38  

When you bring this simplicity to large enterprises, it gets very powerful. One of the largest companies we work with is Mediaset, one of the largest broadcasters in the world, operating at very large scale with petabytes of data, billions of events, millions of users, many developers, teams, and tooling. If you bring in a code-first system that abstracts away infrastructure and is based on things everyone can understand, you get a progression: one team starts using the system, maybe data scientists doing simple analytics, then they build more things and pass them to data engineers to go to production.

Ciro Greco - 00:16:32  

Maybe you start doing more complex ML, and now it's two teams and two use cases. When you're ready to put this in production at large scale with large applications, you bring in DevOps. DevOps comes into the mix with automation, enabling more complicated use cases. You get a progression across teams, and you can get 10x faster at bringing more use cases to production because more developers work on the same concepts.

Ciro Greco - 00:17:10  

To wrap up: AI is turning data into software. We need software engineering-friendly platforms, which for data means code-first platforms, which are not very common. If you're doing AI, lakehouse plus Python is probably the best way to get more ROI. Branching and versioning are super important. Good APIs, keep it simple, keep it easy, and every developer can work on data, unlocking the next level of productivity. If you want to reach out, I'm always here.  

Demetrios - 00:17:47  

I want to reach out.  

Demetrios - 00:17:49  

That was a very articulate conversation, very clear, very easy to follow. I appreciate you so much, man. Thank you for closing out the day with us. I'm excited for when we get to hang out again soon, hopefully.

Ciro Greco - 00:18:07  

Fantastic. Always.