Building a Data Lake for the Enterprise

May 21, 2025
Speaker
Pushkar Garg
Staff Machine Learning Engineer

In this talk, I will go over some of the implementation details of how we built a data lake at Clari using a federated query engine built on Trino and Airflow, with Iceberg as the data storage format.
Clari is an enterprise revenue orchestration platform that helps customers run their revenue cadences and helps sales teams close deals more efficiently. Being an enterprise company, we have strict legal requirements to follow data governance policies.
I will cover how to design a scalable architecture for data ingestion pipelines that bring together data from various sources for AI and ML.
I will also cover some of the use cases this data lake has unlocked in the company, particularly around building agentic frameworks.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Speaker 0    00:00:00    
<silence>

Speaker 1    00:00:06    
Folks, when I wake up and I'm not feeling all right, I grab myself some of these pills and get working. Well, oh my bad, wrong virtual conference. I'm gonna bring up Pushkar right now. <laugh>, how you doing, dude? What's happening?  

Speaker 2    00:00:21    
Hey, Rios. I'm doing great. How are you?  

Speaker 1    00:00:24    
I'm great. I am excited for your talk. As always. I know that we've had quite a few of your talks in the ML ops community and on the virtual conferences, and they never disappoint, so I'm just gonna get outta the way and I'll see you back in a few minutes.  

Speaker 2    00:00:41  
Hey everyone. My name is Pushkar Garg. I work as a staff machine learning engineer at Clari, and today I'm here to present on how we built a data lake for the enterprise use case. So let's get started. A little bit about Clari: Clari started as a revenue forecasting company and has since evolved into a revenue orchestration platform, and we use AI to generate insights for the deals in the pipeline that any of our customers may have. The business need for building a data lake was that we needed a centralized, scalable data repository for our AI and ML workloads, and integrating the data sources from the acquisitions we've had over time seemed like a good idea so that our AI can be even more accurate.

Speaker 2    00:01:39    
That way we also get insights from the other data sources we acquire over time. The primary data source we ingest from is Rev DB. Rev DB is our state-of-the-art revenue database, populated from the upstream CRM systems, for example Salesforce, which contain the data about these deals and other information about our customers' sales; we ingest that into Rev DB. The other data source is MongoDB. Every deal has some kind of conversation happening around it, whether over Zoom or via emails exchanged, so those were the additional data sources we needed to incorporate into our data lake. The key requirements for building the data lake were reliable data replication with good data quality checks and support for schema evolution.

Speaker 2    00:02:46
The upstream CRM systems work on a config, and that config gets updated over time, which shows up as schema changes in Rev DB. So we needed to be flexible enough to support those config changes in the upstream CRM systems so that our AI workloads keep working without failures. One of the other requirements was efficient querying: the data persisted in Rev DB may not be well partitioned for the AI use cases, which are mostly timestamp-based, so we wanted to be able to recreate those partitions so that querying the data becomes more efficient.
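
As a rough illustration of the repartitioning idea (not Clari's actual code; the host, catalog, schema, and column names below are placeholders), an Iceberg table in Trino can declare a timestamp-based partition spec at creation time, so AI queries that filter by time only scan the relevant partitions:

    # Sketch: create a timestamp-partitioned Iceberg table via Trino.
    # Hypothetical host/catalog/schema/columns -- adjust for your deployment.
    import trino

    conn = trino.dbapi.connect(
        host="trino.internal", port=8080, user="ingest",
        catalog="lake", schema="org_401",   # one schema (database) per org
    )
    cur = conn.cursor()

    # Partition by day(updated_at) so time-range queries prune files efficiently.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS opportunity_v1 (
            opportunity_id  VARCHAR,
            amount          DOUBLE,
            stage           VARCHAR,
            updated_at      TIMESTAMP(6)
        )
        WITH (format = 'PARQUET', partitioning = ARRAY['day(updated_at)'])
    """)
    cur.fetchall()   # drain the result so the statement runs to completion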

Speaker 2    00:03:43    
And then, being an enterprise company, we obviously want proper data governance and lifecycle management of the data available in the data lake. So, going through the high-level architecture of what we built: we deployed our own instance of Trino on Kubernetes, and we built connectors to S3, to Rev DB, which is deployed on Postgres, and to MongoDB. We use Airflow for orchestration of the ingestion pipelines and for the dynamic creation of those pipelines. Our use case is that for every customer of Clari we have a different ingestion pipeline, and we use a managed instance of Airflow for orchestrating and managing those pipelines. We use Trino to run the queries that give us the data, and then we persist that in the data lake in the form of Iceberg tables.
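
To make the federation concrete, here is a minimal sketch (catalog and table names are assumptions, not the actual implementation) of the kind of statement an ingestion step issues: Trino reads from the Postgres-backed Rev DB catalog and writes directly into an Iceberg table in the lake catalog.

    # Sketch: one federated copy step -- read from the Postgres (Rev DB) catalog,
    # write into the Iceberg (lake) catalog. Catalog/table names are assumptions.
    import trino

    conn = trino.dbapi.connect(host="trino.internal", port=8080, user="ingest")
    cur = conn.cursor()

    cur.execute("""
        INSERT INTO lake.org_401.opportunity_v1
        SELECT opportunity_id, amount, stage, updated_at
        FROM revdb.public.opportunity          -- Postgres connector catalog
        WHERE org_id = '401'
    """)
    cur.fetchall()   # fetch to wait for the INSERT to finish

Because Trino pushes the read down to the source connector and writes Iceberg natively, a single SQL statement covers both sides of the copy, which is what keeps the per-org pipelines thin.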

Speaker 2    00:04:42    
Downstream consumption from the data lake happens in several different ways. First, we use SageMaker for running our training jobs. We use Anyscale for deploying our models, and that is how inference gets data from the data lake. And most recently we've deployed some AI agents, like the deal inspection agent, which also get data from the data lake for the bootstrap use case: when an agent goes live for a particular organization, it needs all of the historical data for all the deals that may be in the pipeline, and that is powered by the data lake as well. In terms of more details about the architecture, we use S3 for the storage layer, and we use Apache Iceberg, like I said, as the table format, to support schema evolution and efficient querying.

Speaker 2    00:05:36    
We use AWS Glue as the metadata store, and what we've done is create one database per org. That helps us with lifecycle management of the data, and it keeps the data lake multi-tenant but isolated across different orgs. We use Trino as the query engine. Trino has catalogs built for S3, for Rev DB, and for MongoDB, and it also has a catalog built for the data lake, so any queries running on top of the data lake are supported by both Trino and Athena. In terms of orchestration, we use Airflow for pipeline management, and we have something really interesting there, which is dynamic DAG creation. Our use case is that we need separate pipelines for each organization, or each customer of Clari.
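
A sketch of the per-org isolation idea, assuming Glue as the catalog (the database naming scheme and bucket are made up for illustration): each new customer org gets its own Glue database and S3 prefix, which is what makes per-org lifecycle actions such as churn deletion straightforward later.

    # Sketch: provision one Glue database (and S3 prefix) per org.
    # Database names and bucket are placeholders, not Clari's actual values.
    import boto3

    def provision_org_database(org_id: str, bucket: str = "my-data-lake") -> None:
        glue = boto3.client("glue")
        glue.create_database(
            DatabaseInput={
                "Name": f"org_{org_id}",
                "Description": f"Iceberg tables for org {org_id}",
                "LocationUri": f"s3://{bucket}/org_{org_id}/",
            }
        )

    provision_org_database("401")

Registering the lake tables in Glue is also what lets Athena query the same Iceberg tables that Trino writes.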

Speaker 2    00:06:33    
We use the same code plus a config to add any new organization that becomes a customer of Clari, or for which we need to run models and generate insights over the course of time. We do that dynamically, so we don't need to go in and create the ingestion pipelines manually. We also have separate pipelines for ingestion and validation, which helps us keep track of the data quality checks. In terms of ingestion modes, we have full refresh, which creates new tables; whenever there is a schema change, we create another table. And we also have incremental updates: if there is no change in the schema, we ingest and append the new data since the last ingestion timestamp.
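
A minimal sketch of config-driven DAG generation in Airflow (the config shape, task logic, schedules, and IDs are illustrative assumptions, not the actual implementation): one ingestion DAG is registered per org listed in a config, so onboarding a new customer is a config change rather than new pipeline code.

    # Sketch: generate one ingestion DAG per org from a config.
    # Config shape and task logic are assumptions for illustration.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    ORG_CONFIG = {
        "401": {"tables": ["opportunity"], "mode": "incremental"},
        "502": {"tables": ["opportunity", "account"], "mode": "full_refresh"},
    }

    def ingest(org_id: str, table: str, mode: str, **_):
        # Full refresh -> create a fresh table version; incremental -> append only
        # rows newer than the last ingestion watermark (e.g. WHERE updated_at > :watermark).
        print(f"ingesting {table} for org {org_id} in {mode} mode")

    for org_id, cfg in ORG_CONFIG.items():
        with DAG(
            dag_id=f"ingest_org_{org_id}",
            start_date=datetime(2025, 1, 1),
            schedule="@hourly",
            catchup=False,
        ) as dag:
            for table in cfg["tables"]:
                PythonOperator(
                    task_id=f"ingest_{table}",
                    python_callable=ingest,
                    op_kwargs={"org_id": org_id, "table": table, "mode": cfg["mode"]},
                )
        globals()[f"ingest_org_{org_id}"] = dag   # register each DAG with Airflow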

Speaker 2    00:07:33    
One thing we identified was that in building these ingestion pipelines, the ML platform team was becoming the bottleneck. So we decided to create self-serve APIs that our customers, the internal data scientists and machine learning engineers, use to update the config file that drives the creation of these ingestion pipelines. In this example, a user can call this particular API to add the opportunity table to org ID 401's ingestion pipeline, and can similarly do that for other orgs as well. I won't go into the details of how that happens, but we have several DAGs and several databases involved, which ultimately result in the update, creation, and adaptation of the ingestion pipelines.
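
As an illustration only (the endpoint URL, payload, and field names below are hypothetical, not the actual self-serve API), the call from the example above, adding the opportunity table to org 401's ingestion pipeline, might look something like this:

    # Sketch: self-serve request to add a table to an org's ingestion pipeline.
    # Endpoint and payload shape are hypothetical placeholders.
    import requests

    resp = requests.post(
        "https://ml-platform.internal/api/v1/ingestion/tables",
        json={
            "org_id": "401",
            "table": "opportunity",
            "mode": "incremental",
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())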

Speaker 2    00:08:31    
So it's all democratized, and that's been one learning that has helped with adoption. And then finally, in terms of data quality and governance, we have a validation pipeline which does data quality checks: it validates the counts between the source and the target and detects duplicates within the target tables. The results are stored in a dedicated management table for tracking. As any organization, we need to be compliant when it comes to data, so we have a high-level churn service, which involves the deletion of data if any org, any customer of Clari, churns. We have integrated with that, and the churn service sends a notification in the case of churn.
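
A rough sketch of the two validation checks described above, assuming Trino can reach both the source and target catalogs (catalog, table, and column names are placeholders): compare row counts between Rev DB and the lake table, and look for duplicate keys in the target.

    # Sketch: source-vs-target count check and duplicate detection via Trino.
    # Catalog/table/column names are assumptions.
    import trino

    cur = trino.dbapi.connect(host="trino.internal", port=8080, user="validate").cursor()

    cur.execute("SELECT count(*) FROM revdb.public.opportunity WHERE org_id = '401'")
    source_count = cur.fetchone()[0]
    cur.execute("SELECT count(*) FROM lake.org_401.opportunity_v1")
    target_count = cur.fetchone()[0]
    counts_match = source_count == target_count

    cur.execute("""
        SELECT opportunity_id, count(*) AS n
        FROM lake.org_401.opportunity_v1
        GROUP BY opportunity_id
        HAVING count(*) > 1
    """)
    duplicates = cur.fetchall()

    # In the real pipeline, these results would be written to a dedicated
    # management table for tracking rather than printed.
    print(counts_match, len(duplicates))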

Speaker 2    00:09:27    
Whenever there is such a notification, we go in and automatically delete the physical data, in the form of files, the metadata from Glue, and all of the ingestion artifacts for that particular org or customer of Clari. In terms of table maintenance, we do cleanup of stale table versions; we only keep the two latest table versions available. We also have retention policies for historical data, so anything older than six months gets deleted. And we run the OPTIMIZE command from time to time to merge small files into larger files so that queries remain efficient. Yeah, that's it, that's the lightning talk. <laugh> Thank you.
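
For the maintenance steps, Trino's Iceberg connector exposes table procedures that map onto what's described: OPTIMIZE compacts small files, expire_snapshots prunes old snapshots by age (the keep-two-versions policy from the talk would rely on the pipeline's own bookkeeping), and a plain DELETE enforces the six-month retention window. A hedged sketch; table names and thresholds are illustrative:

    # Sketch: periodic Iceberg maintenance via Trino. Names/thresholds are examples.
    import trino

    cur = trino.dbapi.connect(host="trino.internal", port=8080, user="maintenance").cursor()

    statements = [
        # Compact small data files into larger ones for faster scans.
        "ALTER TABLE lake.org_401.opportunity_v1 EXECUTE optimize(file_size_threshold => '128MB')",
        # Drop snapshots/metadata older than the retention threshold.
        "ALTER TABLE lake.org_401.opportunity_v1 EXECUTE expire_snapshots(retention_threshold => '7d')",
        # Time-based retention: delete rows older than six months.
        "DELETE FROM lake.org_401.opportunity_v1 WHERE updated_at < current_date - INTERVAL '6' MONTH",
    ]
    for sql in statements:
        cur.execute(sql)
        cur.fetchall()   # drain results so each statement runs to completion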

Speaker 1    00:10:25    
Boom. That was lightning fast. If you can promise to answer this one in the next two minutes, I'm gonna throw it at you. All right. So what technology are you using for the Data Lake Warehouse itself, or did I miss that in your slides?  

Speaker 2    00:10:40    
Yeah, we are using Iceberg, and we are using Trino for the querying. We are using Iceberg for the data storage, and we are using Airflow, like I mentioned, for the orchestration of the pipelines.