A Flexible, Efficient Lakehouse Architecture for Streaming Ingestion

May 21, 2025
Speaker
Rajwardhan Singh
Engineering Manager

Zoom went from a meeting platform to a household name during the COVID-19 pandemic. That kind of attention and usage required significant storage and processing to keep up. In fact, Zoom had to scale its data lakehouse to ingest 100 TB of data per day while meeting GDPR requirements.
Join this session to learn how Zoom built its lakehouse around Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon EMR clusters running Apache Spark™ Structured Streaming jobs (for optimized parallel processing of 150 million Kafka messages every 5 minutes), and Apache Hudi on Amazon S3 (for flexible, cost-efficient storage). Raj will talk through the lakehouse architecture decisions, data modeling and data layering, the medallion architecture for data engineering, and how Zoom leverages various open table formats, including Apache Hudi™, Apache Iceberg™, and Delta Lake.
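As a rough sketch of the ingestion pattern the session describes, the snippet below shows a Spark Structured Streaming job reading from Kafka (Amazon MSK) and writing to an Apache Hudi table on S3 with a 5-minute micro-batch trigger. The topic names, S3 paths, and Hudi options are illustrative assumptions, not Zoom's actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = (
    SparkSession.builder
    .appName("msk-to-hudi-bronze")
    .getOrCreate()
)

# Read raw events from an MSK (Kafka) topic as an unbounded stream.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "b-1.example-msk:9092")  # hypothetical broker
    .option("subscribe", "client-events")                        # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(key AS STRING) AS event_key",
                "CAST(value AS STRING) AS payload",
                "timestamp AS kafka_ts")
    .withColumn("ingest_ts", current_timestamp())
)

hudi_options = {
    "hoodie.table.name": "bronze_client_events",
    "hoodie.datasource.write.recordkey.field": "event_key",
    "hoodie.datasource.write.precombine.field": "kafka_ts",
    "hoodie.datasource.write.operation": "insert",  # append-only raw layer
}

# Write each 5-minute micro-batch to the Hudi table on S3.
query = (
    events.writeStream
    .format("hudi")
    .options(**hudi_options)
    .option("checkpointLocation", "s3://example-bucket/checkpoints/client-events/")
    .option("path", "s3://example-bucket/bronze/client_events/")
    .trigger(processingTime="5 minutes")
    .start()
)

query.awaitTermination()
```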

Transcript

AI-generated, accuracy is not 100% guaranteed.

Speaker 0    00:00:00    
<silence>

Speaker 1    00:00:07    
Raj, how's it going, man?  

Speaker 2   00:00:09    
Thank you, Adam. Excited to be part of this discussion and the tech talk.  

Speaker 1    00:00:14    
I had the great pleasure of speaking to Raj in the green room, and we were going at it for about 30 minutes, just asking him so many questions about data architecture and the system diagram of how they do everything at Zoom. And I am stoked for him to share his thinking about how to build very fast data ingestion with the data lakehouse. Raj, I'll be back in about 10 minutes.  

Speaker 2    00:00:42    
So, hi everyone. I'm Raj Singh. I'm working as an engineering manager at Zoom, and I'm based out of Bangalore, India. Today I will be talking about the lakehouse architecture of Zoom, and I will divide the talk into four sections. First is the introduction, where I will talk a little more about the products of Zoom, and then I will go deep inside the architecture. Basically, I will draw the bigger picture of our architecture, our data platform and data system architecture, and then I will deep dive more on the lakehouse architecture. So yeah, this is the introduction. As you all know, and this is the irony, I'm joining on Google Meet. But as you know, during the pandemic our customer usage scaled up 200 times, and that time was very challenging, where we evolved our lakehouse from on-premise to hybrid and now to cloud.  

Speaker 2    00:01:44    
During the pandemic, we started ingesting a hundred terabytes of data per day, and we were ingesting 150 million messages every five minutes. Before the pandemic, we were completely on-premise; we were using Cloudera and Hortonworks Data Platform. Then, when the pandemic hit, our data center was not able to hold all this data, so we took a hybrid approach, where we started using the cloud, and now we are totally on the cloud. We are using Amazon EMR clusters, Apache Spark and Apache Flink structured streaming jobs, and we are also using open table formats like Apache Hudi and Delta Lake. And we built a very cost-efficient lakehouse architecture at Zoom. So going back: for our audience, Zoom is not just a meeting platform; we are an AI-first work platform.  

Speaker 2    00:02:54    
We provide a lot of products apart from meetings, like Zoom Docs, Zoom Webinars, Zoom Contact Center, Zoom Chat, email, calendar, Workvivo, and a lot of other products we are offering. So Zoom is now an AI-first platform, and not just meetings; we are offering a lot of products. Please go through this slide and reach out to me in the chat if you want to know more about any of the other products, like Zoom Phone, Team Chat, Meetings, any products. Now coming back to the high-level architecture: this is the very high-level architecture of our data systems. We have divided our entire data system into four parts. One is our data sources. Then we have an ingestion platform, which we call the data ingestion platform.  

Speaker 2    00:03:59    
And then we have a transformation platform. I will deep dive more on this transformation platform, because this is exactly our lakehouse architecture, which is where we are doing the magic. And then we have a data serving layer. Our whole use case is based on our customers: mostly the product teams, DSML teams, and engineering teams are our customers. We have multiple data sources, like third-party vendors and our Salesforce, we have file storage objects, and then we have OLTP DBs, a MySQL database. And then we have telemetry data, where client events and server logs are there. So we built our own ingestion platform, and we do have our own managed streaming services, which are ingesting all the batch jobs and the real-time jobs. And once this data is ingested, using our ingestion platform, into our data lake, then on top of our data lake we created a lakehouse, and we did data modeling on top of our lakehouse.  

Speaker 2    00:05:05    
So we divided our data lake, our lakehouse, into three layers, which we call the bronze layer, the silver layer, and the gold layer. Now, when it comes to the bronze layer: in the bronze layer we store the original data in Parquet, and it contains the full DB and the CDC logs. This layer is mostly an append-only layer, and the data retention for this bronze layer is very short; we can say 15 to 30 days, as per the use case. Mostly it supports real-time frequency data, and it is also engineering and self-serve driven and template based. Now, once we have the raw layer, that is the raw data in the bronze layer, then on top of the bronze layer we built what is called the silver layer.  
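A rough sketch of the append-only bronze layout described here: raw CDC log batches landed as Parquet, partitioned by ingestion date so the short 15-30 day retention can later be enforced by dropping whole partitions. The paths and column names below are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date

spark = SparkSession.builder.appName("cdc-to-bronze").getOrCreate()

# Raw CDC records already ingested (e.g. change events captured from MySQL).
cdc_batch = spark.read.json("s3://example-bucket/landing/mysql_cdc/")  # hypothetical path

(
    cdc_batch
    .withColumn("ingest_date", current_date())
    .write
    .mode("append")                      # bronze is append-only
    .partitionBy("ingest_date")
    .parquet("s3://example-bucket/bronze/mysql_cdc/")
)
```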

Speaker 2   00:06:03    
So in the silver layer we are mostly storing all our data in Delta format. It has the desired schema, with table and column naming conventions; it is template-driven data we are storing in the silver layer, and it is self-serve enabled. All the role-based, tag-based, and policy-based access control is applied in the silver layer. We are also storing the audit logs: some of the data we are storing in the bronze layer gets deleted, and all those audit logs we are also storing in the silver layer. So now comes the more refined layer, that is the gold layer. Most of our customers are using our gold layer. The gold layer is mostly business-ready and transformed data that is being used by our customers.  
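A hedged sketch of a bronze-to-silver refinement step in the spirit just described: enforce the desired schema and column naming conventions, then upsert the latest records into a Delta table. The table names, keys, and column renames are illustrative assumptions.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze = spark.read.parquet("s3://example-bucket/bronze/mysql_cdc/")

# Apply naming conventions and keep only the curated columns.
silver_updates = (
    bronze
    .withColumnRenamed("usr_id", "user_id")     # hypothetical rename
    .withColumnRenamed("mtg_id", "meeting_id")  # hypothetical rename
    .select("user_id", "meeting_id", "event_type", "event_ts")
)

silver = DeltaTable.forPath(spark, "s3://example-bucket/silver/meeting_events/")

# Upsert by primary key so the silver table reflects the latest state.
(
    silver.alias("s")
    .merge(silver_updates.alias("u"),
           "s.user_id = u.user_id AND s.meeting_id = u.meeting_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```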

Speaker 2    00:07:01    
It is also in Delta format, and it is template-driven and self-serve enabled data we are storing in the gold layer, and it supports mostly hourly and daily frequency. When it comes to the raw layer, that is the bronze layer, it contains real-time frequency data, but in the gold layer we are storing hourly or daily frequency data, and it is also focused mostly on micro rather than macro data. So once we are storing our data in these three layers, then we have a data serving layer. We are using Trino, Databricks, Apache Superset, and also APIs, and using this data serving layer, our customers consume the data.  
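A minimal sketch of the kind of business-ready, hourly gold table just described: an aggregate built from the silver layer and written in Delta format, where it can then be queried through the serving layer (Trino, Superset, APIs). The metric and table names are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_trunc, countDistinct, count

spark = SparkSession.builder.appName("silver-to-gold").getOrCreate()

silver = spark.read.format("delta").load("s3://example-bucket/silver/meeting_events/")

# Hourly business-ready rollup from the refined silver data.
gold_hourly = (
    silver
    .withColumn("event_hour", date_trunc("hour", "event_ts"))
    .groupBy("event_hour")
    .agg(
        countDistinct("user_id").alias("active_users"),
        count("*").alias("events"),
    )
)

(
    gold_hourly.write
    .format("delta")
    .mode("overwrite")            # rebuilt on each hourly/daily run
    .save("s3://example-bucket/gold/hourly_usage/")
)
```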

Speaker 2    00:07:56    
But there are also many corner cases in our Zoom data platform. Some of our engineering teams, who are our customers, want to access the raw layer; using an API they access the raw layer, then they do their own customization on the data, and then they build a dashboard or a report on top of that. So yeah, this is the overall architecture, along with the data retention periods. If you see, in the bronze layer the data retention is 15 to 30 days. Then in the silver layer we are storing the data for 15 months. And in the gold layer the data retention period is longer. So yeah, this is the overall architecture for the Zoom data platform. And thank you. Please connect with me on LinkedIn. And also, if you have any questions, you can ask me now if time permits, or you can ask in the chat; I will try to answer as much as possible.  
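A rough sketch of how the retention windows mentioned here might be enforced. For the Delta-format silver layer (15 months) an explicit DELETE works; for the date-partitioned bronze layer (15 to 30 days), dropping old partition prefixes (for example via an S3 lifecycle rule) is a common alternative. The paths, columns, and windows are illustrative assumptions.

```python
import datetime
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver-retention").getOrCreate()

SILVER_RETENTION_DAYS = 15 * 30  # roughly 15 months
cutoff = (datetime.date.today()
          - datetime.timedelta(days=SILVER_RETENTION_DAYS)).isoformat()

silver = DeltaTable.forPath(spark, "s3://example-bucket/silver/meeting_events/")

# Remove rows older than the retention window; VACUUM would later reclaim files.
silver.delete(f"event_ts < DATE'{cutoff}'")
```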

Speaker 1   00:09:09    
We have a question: how do you manage to reprocess data that's older than one month if needed? You're saying that the bronze layer seems to have pretty short retention.  

Speaker 2    00:09:20    
Yeah, so if we need to reprocess the older data, then we have to backfill the data again. We do have the data in our data source. So if the data doesn't exist in our data lake, then we have to go back to the engineering team and backfill that data into our data lake, and then we customize that data as per the requirements from our customers.  

Speaker 1    00:09:46    
But that rests on the assumption that you do have access to that raw data in the first place. Yeah.  

Speaker 2    00:09:50    
Yeah, definitely.  

Speaker 1    00:09:51    
But is that always the case? Can you always retrieve it again?  

Speaker 2   00:09:55    
Yeah, we can always retrieve it, because our data is always there in our data source. We do have data sources like DynamoDB, MySQL, and file storage, which are being managed by different engineering teams. But in the data lake we store only the data which is being used very frequently by our customers. So if a customer needs older data for the bronze layer, then in that case we have to work with the engineering team and backfill that data into our bronze layer.
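A minimal sketch of the backfill path described in this answer: when older data has aged out of the bronze layer, re-read the requested range from the source system (here a MySQL table over JDBC) and append it back into the bronze layer. The connection details, table, and columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("bronze-backfill").getOrCreate()

# Pull only the requested historical range back from the source database.
backfill = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://example-host:3306/app")   # hypothetical source
    .option("dbtable",
            "(SELECT * FROM meeting_events "
            " WHERE event_ts BETWEEN '2024-01-01' AND '2024-02-01') AS t")
    .option("user", "reader")
    .option("password", "******")
    .load()
)

# Re-append into the date-partitioned bronze table.
(
    backfill
    .withColumn("ingest_date", to_date(col("event_ts")))
    .write
    .mode("append")
    .partitionBy("ingest_date")
    .parquet("s3://example-bucket/bronze/meeting_events/")
)
```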