A Flexible, Efficient Lakehouse Architecture for Streaming Ingestion

May 21, 2025
Speaker
Rajwardhan Singh
Engineering Manager
Zoom

Zoom went from a meeting platform to a household name during the COVID-19 pandemic. That kind of attention and usage required significant storage and processing to keep up. In fact, Zoom had to scale its data lakehouse to 100 TB/day while meeting GDPR requirements.
Join this session to learn how Zoom built its lakehouse around Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon EMR clusters running Apache Spark™ Structured Streaming jobs (for optimized parallel processing of 150 million Kafka messages every 5 minutes), and Apache Hudi on Amazon S3 (for flexible, cost-efficient storage). Raj will talk through the lakehouse architecture decisions, data modeling and data layering, the medallion architecture for data engineering, and how Zoom leverages various open table formats, including Apache Hudi™, Apache Iceberg™, and Delta Lake.

Transcript

AI-generated; accuracy is not 100% guaranteed.

Adam - 00:00:07  

Raj, how's it going, man?  

Rajwardhan Singh - 00:00:09  

Thank you, Adam. Excited to be part of this discussion and the tech talk.  

Adam - 00:00:14  

I had the great pleasure of speaking with Raj in the green room, and we were going at it for about 30 minutes, asking him so many questions about data architecture and the system diagram of how they do everything at Zoom. I am stoked for him to share his thinking on how to build very fast data ingestion with a data lakehouse. Raj, I'll be back in about 10 minutes.  

Rajwardhan Singh - 00:00:42  

So, hi everyone. I'm Raj Singh. I'm working as an engineering manager at Zoom, and I'm based out of Bangalore, India. Today, I will be talking about the lakehouse architecture at Zoom, and I will divide the talk into four sections. The first is the introduction, where I will talk a little about Zoom's products, and then I will go deep into the architecture. Basically, I will draw the bigger picture of our architecture, our data platform and data system architecture, and then deep dive into the lakehouse architecture. So yeah, this is the introduction. As you all know, it is a bit of an irony that I'm joining on Google Meet. But as you know, during the pandemic, our user and customer base scaled about 200 times. That time was very challenging, and we evolved our lakehouse from on-premise to hybrid and now to cloud.  

Rajwardhan Singh - 00:01:44  

During the pandemic, we started ingesting a hundred terabytes of data per day. We were ingesting 150 million messages every five minutes. Before the pandemic, we were completely on-premise, using the Cloudera and Hortonworks data platforms. Then, when the pandemic hit, our data center was not able to hold all this data, so we moved to a hybrid approach and started using the cloud. Now we are totally on the cloud. We are using Amazon EMR clusters, Apache Spark Structured Streaming and Apache Flink streaming jobs, and we are also using open table formats like Apache Hudi and Delta Lake. And we built a very cost-efficient lakehouse architecture at Zoom.  
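To make the streaming ingestion path a bit more concrete, here is a minimal PySpark sketch of reading from an MSK (Kafka) topic with Structured Streaming and appending the raw events into a Hudi table on S3. The topic name, event schema, record keys, and S3 paths are illustrative assumptions, not Zoom's actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Minimal sketch; topic names, schema, and S3 paths are illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("bronze-ingest-sketch")
    # Settings typically required for Hudi's Spark integration.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

# Hypothetical event schema for the Kafka payload.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

# Read raw messages from Amazon MSK (Kafka-compatible endpoint).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<msk-bootstrap-servers>")  # placeholder
    .option("subscribe", "telemetry.events")                       # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

events = raw.select(from_json(col("value").cast("string"), event_schema).alias("e")).select("e.*")

# Append into a Hudi table on S3; table name and keys are assumptions.
(
    events.writeStream
    .format("hudi")
    .option("hoodie.table.name", "bronze_telemetry_events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.operation", "insert")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/bronze_telemetry_events/")
    .outputMode("append")
    .trigger(processingTime="5 minutes")  # matches the ~5-minute cadence mentioned in the talk
    .start("s3://example-bucket/bronze/telemetry_events/")
)
```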

Rajwardhan Singh - 00:02:54  

Going back, for our audience: Zoom is not just a meeting platform. We are an AI-first work platform, and we provide a lot of products apart from meetings, like Zoom Docs, Zoom Webinars, Zoom Contact Center, Zoom Team Chat, email, calendar, Workvivo, and many more. Zoom is now an AI-first platform and not just meetings. So please go through this slide, and please reach out to me in the chat if you want to know more about any of the other products, like Zoom Phone, Team Chat, or Meetings.  

Rajwardhan Singh - 00:03:59  

Now coming back to this, the high-level architecture: this is a very, very high-level view of our data systems. We have divided our entire data system into four parts. One is our data sources. Then we have an ingestion platform, which we call the data ingestion platform.  

Rajwardhan Singh - 00:04:15  

Then we have a transformation platform. I will deep dive more into this transformation platform because this is exactly our lakehouse architecture, where we are doing the magic. And then we have a data serving layer. Our use cases are driven by our customers; mostly the product team, the DS/ML team, and the engineering team are our customers. We have multiple data sources, like third-party vendors, our Salesforce and file storage objects, and then the OLTP DBs, such as a MySQL database. Then we have telemetry data, where client events and server logs come in. We built our own ingestion platform, and we have our own managed streaming services which ingest all the batch jobs and the real-time jobs.  
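As a rough illustration of how a CDC record from the OLTP (MySQL) source might be handled once it reaches the ingestion platform, here is a small sketch that parses a generic CDC envelope (operation type, before/after row images) out of a raw message. The envelope layout, table names, and sample values are hypothetical; the actual CDC tooling and schema are not specified in the talk.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical CDC envelope; the real schema depends on the CDC tool in use.
@dataclass
class CdcEvent:
    table: str
    op: str                 # "c" = create, "u" = update, "d" = delete
    before: Optional[dict]  # row image before the change (None for inserts)
    after: Optional[dict]   # row image after the change (None for deletes)
    ts_ms: int              # change timestamp in milliseconds

def parse_cdc_message(raw_value: bytes) -> CdcEvent:
    """Parse one raw Kafka message value into a CdcEvent."""
    doc = json.loads(raw_value)
    return CdcEvent(
        table=doc["source"]["table"],
        op=doc["op"],
        before=doc.get("before"),
        after=doc.get("after"),
        ts_ms=doc["ts_ms"],
    )

# Example usage with an invented update event.
sample = json.dumps({
    "source": {"table": "meetings"},
    "op": "u",
    "before": {"id": 1, "status": "scheduled"},
    "after": {"id": 1, "status": "started"},
    "ts_ms": 1716280000000,
}).encode()

event = parse_cdc_message(sample)
print(event.table, event.op, event.after)
```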

Rajwardhan Singh - 00:05:05  

Once the data is ingested through our ingestion platform into our data lake, we created a lakehouse on top of it and did data modeling on top of the lakehouse. We divided our lakehouse into three layers: we call them the bronze layer, the silver layer, and the gold layer.  

Rajwardhan Singh - 00:05:25  

When it comes to the bronze layer, we store the original data as Parquet, and it contains the full DB loads and the CDC logs. This layer is mostly an append-only layer. The data retention for this bronze layer is very short; we can say 15 to 30 days, as per the use case. Mostly it supports real-time frequency data. It is also engineering- and self-serve-driven and template-based.  
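Since the bronze layer is append-only with a short retention window, one way to picture the cleanup is a scheduled job that drops date partitions older than the retention period. This is only a sketch under the assumption of date-partitioned S3 prefixes (the bucket and prefix names are made up); the actual mechanism could just as well be S3 lifecycle rules or table-format retention settings.

```python
from datetime import datetime, timedelta, timezone
import boto3

# Illustrative assumptions: bucket/prefix layout and a 30-day window.
BUCKET = "example-lakehouse-bucket"
PREFIX = "bronze/telemetry_events/ingest_date="
RETENTION_DAYS = 30

def purge_old_bronze_partitions() -> None:
    """Delete bronze date partitions older than the retention window."""
    cutoff = datetime.now(timezone.utc).date() - timedelta(days=RETENTION_DAYS)
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="bronze/telemetry_events/"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if PREFIX not in key:
                continue
            # Keys look like .../ingest_date=2025-05-01/part-0000.parquet
            date_str = key.split("ingest_date=")[1].split("/")[0]
            if datetime.strptime(date_str, "%Y-%m-%d").date() < cutoff:
                s3.delete_object(Bucket=BUCKET, Key=key)

if __name__ == "__main__":
    purge_old_bronze_partitions()
```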

Rajwardhan Singh - 00:06:03  

Once we have the raw data in the bronze layer, we build the silver layer on top of it. The silver layer mostly stores our data in Delta format. It has the desired schema and table and column naming conventions. The data we store in the silver layer is template-driven and self-serve enabled. All the role-based and policy-based access control is applied in the silver layer. We also store audit logs there: some of the data stored in the bronze layer gets deleted, and we keep the audit logs for that in the silver layer.  
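One way to picture the bronze-to-silver step is a job that applies the naming conventions and target schema and then merges CDC changes into a Delta table. The sketch below uses the delta-spark Python API; the table layout, column mappings, and paths are assumptions made for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .appName("silver-merge-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical paths and a column mapping that enforces silver naming conventions.
BRONZE_PATH = "s3://example-lakehouse-bucket/bronze/meetings_cdc/"
SILVER_PATH = "s3://example-lakehouse-bucket/silver/meetings/"

bronze = (
    spark.read.format("parquet").load(BRONZE_PATH)
    .select(
        col("id").alias("meeting_id"),          # enforce column naming conventions
        col("status").alias("meeting_status"),
        col("ts_ms").alias("updated_at_ms"),
        col("op"),
    )
)

silver = DeltaTable.forPath(spark, SILVER_PATH)

# Upsert CDC changes: delete on "d", insert/update otherwise.
# (Deduplication of multiple changes per key is omitted for brevity.)
(
    silver.alias("s")
    .merge(bronze.alias("b"), "s.meeting_id = b.meeting_id")
    .whenMatchedDelete(condition="b.op = 'd'")
    .whenMatchedUpdateAll(condition="b.op != 'd'")
    .whenNotMatchedInsertAll(condition="b.op != 'd'")
    .execute()
)
```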

Rajwardhan Singh - 00:07:01  

Now comes the most refined layer, the gold layer. Most of our customers use the gold layer. It holds mostly business-ready, transformed data that is used by our customers. It is also in Delta format, template-driven, and self-serve enabled, and it mostly supports hourly and daily frequency.  
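To illustrate the silver-to-gold step, here is a small sketch of an hourly aggregation job writing business-ready metrics into a Delta table. The metric, table names, and paths are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, date_trunc

spark = SparkSession.builder.appName("gold-hourly-agg-sketch").getOrCreate()

SILVER_PATH = "s3://example-lakehouse-bucket/silver/meetings/"
GOLD_PATH = "s3://example-lakehouse-bucket/gold/hourly_meeting_counts/"

# Hypothetical hourly rollup: meetings updated per hour.
hourly = (
    spark.read.format("delta").load(SILVER_PATH)
    .withColumn("hour", date_trunc("hour", (col("updated_at_ms") / 1000).cast("timestamp")))
    .groupBy("hour")
    .agg(count("meeting_id").alias("meeting_count"))
)

# A real job would overwrite only the affected partitions; a plain overwrite keeps the sketch simple.
hourly.write.format("delta").mode("overwrite").save(GOLD_PATH)
```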

Rajwardhan Singh - 00:07:25  

When it comes to the raw layer, that is the bronze layer, it contains real-time frequency data, but in the gold layer we have hourly or daily frequency data. It also focuses mostly on the micro outage, not on the macro outage data. Once we store our data in these three layers, we have a data serving layer. We are using Trino, Databricks, Apache Superset, and also APIs. Through this data serving layer, our customers consume the data.  
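As an example of the serving side, the sketch below queries a gold table through the Trino Python client. The host, catalog, schema, and table names are placeholders, not Zoom's actual deployment.

```python
import trino  # pip install trino

# Placeholder connection details for a Trino coordinator fronting the lakehouse.
conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analytics_reader",
    catalog="lakehouse",  # hypothetical catalog mapped to the gold tables
    schema="gold",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT hour, meeting_count
    FROM hourly_meeting_counts
    WHERE hour >= current_timestamp - INTERVAL '1' DAY
    ORDER BY hour
    """
)
for hour, meeting_count in cur.fetchall():
    print(hour, meeting_count)
```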

Rajwardhan Singh - 00:07:56  

There are many corner cases in our Zoom data platform. Some of our engineering team customers want to access the raw layer: they use the API to access the raw layer, then do customization on the data and build dashboards or reports on top of that.  

Rajwardhan Singh - 00:08:12  

This is the overall architecture, along with the data retention periods. In the bronze layer, the data retention is 15 to 30 days. In the silver layer, we store the data for 15 months. In the gold layer, the data retention period is longer. This is the overall architecture of the Zoom data platform. Thank you. Please connect with me on LinkedIn. If you have any questions, you can ask me now if time permits, or you can ask in the chat, and I will try to answer as much as possible.  
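Purely as a summary of the retention numbers mentioned above, here is how the per-layer policy could be captured in configuration. The structure itself is an assumption, and the gold value is left open because only "longer" is stated.

```python
from datetime import timedelta

# Retention windows as described in the talk; the gold period is not specified beyond "longer".
RETENTION_POLICY = {
    "bronze": timedelta(days=30),       # 15-30 days depending on the use case
    "silver": timedelta(days=15 * 30),  # roughly 15 months
    "gold": None,                       # longer retention; exact period not stated
}

def is_expired(layer: str, age: timedelta) -> bool:
    """Return True if data of the given age exceeds the layer's retention window."""
    window = RETENTION_POLICY[layer]
    return window is not None and age > window
```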

Adam - 00:09:09  

We have a question. How do you manage to reprocess data that's older than one month if needed? You're saying that the bronze layer seems to have pretty short retention?  

Rajwardhan Singh - 00:09:20  

Yeah, if we need to reprocess older data, then we have to backfill the data again. We do still have the data in our data sources. If the data doesn't exist in our data lake, then we go back to the engineering team and backfill that data into our data lake. Then we customize that data as per the requirements from our customers.  
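A backfill like the one described could look roughly like a one-off batch job that reads a date range from the source system's export and appends it back into the bronze layer. Everything here (export path, date range, partition column) is a made-up illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("bronze-backfill-sketch").getOrCreate()

# Hypothetical source export and bronze locations plus the window to backfill.
SOURCE_EXPORT_PATH = "s3://example-source-exports/meetings/"
BRONZE_PATH = "s3://example-lakehouse-bucket/bronze/meetings_cdc/"
START_DATE, END_DATE = "2025-03-01", "2025-03-31"

backfill = (
    spark.read.format("parquet").load(SOURCE_EXPORT_PATH)
    .where(col("ingest_date").between(START_DATE, END_DATE))
)

# Append the historical slice back into bronze, partitioned the same way as the live data.
(
    backfill.write
    .format("parquet")
    .mode("append")
    .partitionBy("ingest_date")
    .save(BRONZE_PATH)
)
```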

Adam - 00:09:46  

But that's where the assumption is that you do have access to that raw data in the first place.  

Rajwardhan Singh - 00:09:50  

Yeah, definitely.  

Adam - 00:09:51  

Is that always the case? You can retrieve it again?  

Rajwardhan Singh - 00:09:55  

Yes, we can always retrieve it, because the data is always there in our data sources. We do have data sources like DynamoDB, MySQL, and file storage, which are managed by different teams. But in the data lake, we only store data that is used very frequently by our customers. If customers need older data in the bronze layer, then we have to work with the engineering team and backfill that data into our bronze layer.