Data Mesh and Governance at Twilio

May 21, 2025
Speaker
Aakash Pradeep
Principal Software Engineer
Twilio

At Twilio, our data mesh enables data democratization by letting domains share and access data through a central analytics platform, and vice versa, without duplicating datasets. Using AWS Glue and Lake Formation, only metadata is shared across AWS accounts, which keeps the implementation efficient and low overhead while keeping data consistent, secure, and always up to date. This approach supports scalable, governed, and seamless data collaboration across the organization.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Adam - 00:00:06  

For folks tuning in, Aakash and I were talking about what it takes to migrate to and engineer a completely new, next-generation data lakehouse at Twilio. We were thinking about what it takes to get there. It's clear to everybody that there's a limitation with the previous paradigm of how things were built. It isn't that somebody did something wrong; we just couldn't quite see past the horizon as it was when we were building the last version, right?

Aakash Pradeep - 00:00:43  

Exactly. Yeah. And that's how we evolve too, right? We moved from Hadoop to Spark, then someone says, okay, we need real time now. Spark is still good, but we need more for AI applications and those kinds of frameworks. New problems always come up, and then you see how much you can change the existing system. If not, then you have to build something new.

Adam - 00:01:03  

Yeah. Ash, I'll be back in 15 minutes. The stage is yours.  

Aakash Pradeep - 00:01:10  

Good afternoon everyone. My name is Aakash. I'm a principal engineer at Twilio. Today I'm going to talk about data mesh and governance. In this talk, we'll look into what problem we are going to solve, what data mesh is, and how it helps us solve that problem. We'll look into some of the implementation strategies, the high-level architecture, and some of the challenges around it.

Before I get into that, a little bit about Twilio. Twilio is a cloud communication platform. We provide programmable APIs for all communication channels to reach out to your customers. We also like to call ourselves a customer engagement platform.  

With that, let's look into the problem we had before we got into data mesh.  

If you look at this architecture, we have data from messaging, billing, and different kinds of domains at Twilio. They provide their data to a central team, the data platform, which manages it in a data lake, manages governance on top of that, manages GDPR compliance, and provides a lot of transforms to make that data available for consumption from Looker and Tableau. This was working well for the last few years, but as Twilio grew organically, we started to see some problems.

One is that because there is a central team handling this, even though the team has automated a lot of onboarding and self-service, we still saw slowdowns whenever there is an administrative change: a schema change, a new dataset coming in with GDPR requirements, or a new quality concern, like this data doesn't have good quality and how do we fix that. Because you're part of the central team, you may not have a good understanding of all the domain data. So now you are working back and forth with the domain, and that causes slower responses and slower insights.

As Twilio grew, it made a lot of acquisitions. Now the ask is to make that data available as well, to make it queryable and to build insights on top of it. That means the central team has to build new systems to bring that data in, causing more delay in making it available.

Basically, if you look at these examples, the central team started to be a bottleneck and a scaling issue for making data more available for querying or insights.  

As you can see in the picture here, there is always an ask for more ETLs, more pipelines. We always say yes, we can build that, then the requesting team files a Jira ticket, and we go scrum by scrum. Because it all depends on the capacity of the central team, we started to see this kind of thing happening over the last six months to a year, causing us to become a bottleneck. We anticipated that as Twilio grows more, this will become more prominent. So we started to look into solutions for these challenges and started looking into data mesh as a possible solution.

Here you can see the comparison from monolithic to domain-oriented pipelines. The best thing data mesh does is decentralize the data architecture so each product and domain team owns its data and manages data quality and GDPR. That takes a lot of responsibility away from the central team, so domain teams can move faster at their own speed based on their requirements.

You can think of this as analogous to how microservices evolved. You have a monolithic codebase, you want to scale a particular system, so you break it down into microservices and scope them down. Similarly, we have to bring similar concepts to data, where each team owns and serves its own data and follows common standards.

Instead of a central team, you divide the responsibility and let domain teams take more part.  

Since we are talking about data mesh, we can't go without talking about these four principles.  

First is domain-oriented ownership. Data producers now own and manage their data rather than the central team.  

The second is data as a product. You can't just make data available and assume it will be consumed. You have to think of data as a product in a consumable form. Someone is responsible for annotating it, making it discoverable and usable, and defining objectives so consumers know how to consume it.

Basically, think in terms of consumers and how they will consume your data and expose it that way.  

The third is self-serve data infrastructure. If you're going the data mesh way, not everyone should start from scratch. You have a kiosk of tools that teams can use, rather than each team worrying about building and maintaining those tools themselves. Someone else manages and maintains them. You may have some domain-specific tools, but most common tools are platform tools available for self-service.

The last and most important is federated governance. Earlier, Twilio used role-based access control, with a central team defining roles like data analyst and the table access each role had. As Twilio grew, these roles became hard to manage, with many permutations and combinations, and it became messy.

We started to look into federated governance, thinking not in terms of roles but in terms of datasets: how they should be accessed and how sensitive they are. The domain understands this best and should be responsible for it. We are pushing these decisions to domain teams rather than a central team that may not have the domain knowledge.

Now, let's talk about how we implemented this data mesh at Twilio. We have three main actors: the data producer, the central data platform, and the consumer.

Data producers manage and own their data. They are responsible for compliance, maintaining high quality, freshness, meeting SLOs and SLAs, and defining access policies and sensitivity. When I talk about data access policy, they define the policy but do not implement it. The central data platform builds tools to manage access and define flow, but the data owner defines the policy.  

We merged data ownership and data product concepts into the data producer role.  

The central data platform produces tools required across teams and provisions and manages tools like Presto, Data Gates, Looker, and others used by consumers. They maintain tools for data discovery so not every team has to do that. They provide tools for cataloging and adding quality attributes to datasets.  

Some pipelines like CDC pipelines are generic enough to be managed by the data platform and used by domains to move data from source to mesh. High-level data modeling at Twilio is the responsibility of the data platform, but domain-level data modeling is the domain data producer's responsibility.  

The last actor is the data consumer, mostly responsible for consuming data, building ETLs, and producing new data. They produce a small amount of data but are the heavy consumers of business data.

When implementing data mesh, we set some constraints. Since we are mostly on AWS, data sharing across domains happens only through S3 for data living in S3. No one should access S3 directly; they should go through the abstraction of the AWS Glue Catalog, which has all the information for shared data.
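
As an illustration of that constraint, here is a minimal sketch of what consuming a shared table through the catalog abstraction (rather than reading S3 paths directly) could look like, using Athena via boto3. The database, table, and bucket names are hypothetical placeholders, not Twilio's actual resources.

```python
import boto3

# Query a shared table through the Glue Data Catalog (via Athena) instead of
# reading its S3 files directly. All names below are illustrative placeholders.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT account_sid, status FROM messages LIMIT 10",
    QueryExecutionContext={"Database": "messaging_db"},  # a Glue database, not an S3 path
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```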

Role-based access management required a central team to govern it, but we are moving away from that and pushing it to domain data owners to define data access.

We thought that Lake Formation tag-based access control would be a good way to do that, so domains can define tags on their datasets.
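
For a concrete picture, here is a minimal sketch of Lake Formation tag-based access control with boto3; the tag taxonomy and the database and table names are made up for illustration and are not Twilio's actual setup.

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Define a tag a domain team could use to label dataset sensitivity
# (hypothetical tag key and values).
lf.create_lf_tag(TagKey="sensitivity", TagValues=["public", "internal", "pii"])

# Attach the tag to a table the domain owns; access policies are then written
# against the tag rather than against individual roles or tables.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "messaging_db", "Name": "messages"}},
    LFTags=[{"TagKey": "sensitivity", "TagValues": ["internal"]}],
)
```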

Another constraint is to make data more queryable by using Parquet, Delta, or Iceberg formats rather than write-optimized formats, because shared data needs read performance.
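
As a rough sketch of that constraint, a domain might publish a table in a read-optimized columnar format like this. The names are placeholders, and the SparkSession is assumed to already be configured with the Glue Data Catalog as its metastore.

```python
from pyspark.sql import SparkSession

# Assumes Spark is already configured to use the Glue Data Catalog as its metastore.
spark = (
    SparkSession.builder.appName("publish-shared-table")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical staging table owned by the domain.
df = spark.table("messaging_staging.raw_messages")

# Publish in a read-optimized columnar format (Parquet here; Delta or Iceberg
# work similarly with the appropriate catalog configuration).
df.write.format("parquet").mode("overwrite").saveAsTable("messaging_db.messages")
```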

We decided never to share data copies because sharing data copies adds complications, compliance issues, potential errors, and quality issues. Avoid data copies as much as possible and only share metadata across domains, keeping data access at the source to maintain quality and compliance.  

Also, when sharing data between domains, the share should be non-grantable, meaning if domain A shares data with domain B, domain B shouldn't be able to share that data further. Anyone else needing the data should go to the original source.
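
A minimal sketch of what a metadata-only, non-grantable cross-account share could look like with boto3; the account IDs, database, and table names here are hypothetical.

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Share table metadata with a consumer account. Only catalog metadata crosses
# the account boundary; the data itself stays in the producer's S3 bucket.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "222222222222"},  # consumer account
    Resource={
        "Table": {
            "CatalogId": "111111111111",  # producer (source) account
            "DatabaseName": "messaging_db",
            "Name": "messages",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=[],  # empty: the consumer cannot re-grant (re-share) access
)
```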

Now, the high-level architecture. The diagram shows data producers on the left, the central team in the middle, and consumers on the right.

We are in a hybrid world. The original pipeline with the central data platform still exists. We are pragmatic because some small teams don't have the capacity to build their own data lake, so they still produce data to us, and we manage that.  

We are pushing data mesh to the domains that can manage their own data. Each domain here is an AWS account with its own S3 buckets and Glue Catalog, and the access policies for that catalog are defined in Lake Formation.

If a team is ready with a table in a supported format and wants to share it, they connect to a tool we built that enforces access, quality, and SLO policies. This tooling enforces all strategies. Once configured, the tool sets sharing from one account to another.  
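
One step such tooling typically has to handle on the consumer side is making the shared metadata addressable in that account's own catalog, for example with a Glue resource link. This is a hedged sketch with placeholder account IDs and names, not Twilio's actual tool.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# In the consumer account, create a resource link that points at the shared
# database in the producer account, so local tools (Athena, Spark) can address
# it through the consumer's own Glue Catalog.
glue.create_database(
    DatabaseInput={
        "Name": "messaging_db_shared",
        "TargetDatabase": {
            "CatalogId": "111111111111",   # producer account
            "DatabaseName": "messaging_db",
        },
    }
)
```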

There are two flows. One is for teams that want to access shared data: they can go to DataHub, discover the data, check its quality, and request access. The tool provisions access after approval.

The other flow is human users requesting data access through ServiceNow, which connects to an API controlling approval flows. Once approved, the dataset becomes available to the user.  
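
The ServiceNow integration itself is internal to Twilio, so purely as an illustrative assumption, the provisioning step such an approval API might run once a request is approved could look roughly like this (the role ARN, database, and table are placeholders).

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

def provision_access(requester_principal: str, database: str, table: str) -> None:
    """Hypothetical post-approval step: grant the requesting principal read
    access to one table, with no ability to re-grant it to anyone else."""
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": requester_principal},
        Resource={"Table": {"DatabaseName": database, "Name": table}},
        Permissions=["SELECT", "DESCRIBE"],
        PermissionsWithGrantOption=[],  # approved users cannot re-share the data
    )

# Example call once an approval callback arrives (placeholder IAM role ARN).
provision_access(
    "arn:aws:iam::222222222222:role/analyst-readonly",
    "messaging_db",
    "messages",
)
```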

Most data mesh responsibilities are on the domain side, producing and controlling data availability for consumers.  

Some challenges remain. Fine-grained access control is still a big problem. Some tools provide it, but there are many APIs involved, and cross-account access is problematic with Lake Formation. We are working on approval flows and evolving the process to be more streamlined, introducing a data council and data steward roles.

That's pretty much what I had to cover. Thank you. Any questions?  

Adam - 00:16:22  

Nice, Aakash. We have a few questions. First, James says: really good presentation, Aakash. Do you use data contracts? If so, what tools do you use to store them, what data do you store in them, and how do you ensure the contracts are honored?

Aakash Pradeep - 00:16:44  

Awesome question. We are thinking about it and working on it. Mostly, we plan to use DataHub to maintain contracts. When sharing happens, contracts will be created and stored in DataHub.

Adam - 00:17:05  

Another from Nel. Do you centralize data infrastructure pieces? Does ingestion have a central team or does each team do their own ingestion?  

Aakash Pradeep - 00:17:17  

We push that to the domains. We have a central team managing some CDC pipelines. We live in a hybrid world: some old systems replicate from the database using CDC pipelines, and data comes from Kafka into the mesh. Those still work. The strategy is to push ingestion to the domains. Domain teams are independent and can choose their own tools or AWS managed services. We provide CDC pipelines as tools, but domains can pick others if they want.

Adam - 00:18:05  

You're building a platform for them, and they can choose to use it if it's the best tool for the job.  

Aakash Pradeep - 00:18:18  

Exactly.  

Adam - 00:18:20  

Jason asks, did you start with a central data lake and then slowly move towards a distributed model?  

Aakash Pradeep - 00:18:27  

Exactly. For small orgs, central works well. As Twilio grew to enterprise level, bottlenecks appeared with the central platform. It slowed domains down. We adopted a hybrid culture: central for small teams and use cases, while bigger domains run their own systems.

Adam - 00:19:12  

Amh asks, what exact tools or services are you using for governance?  

Aakash Pradeep - 00:19:20  

We mostly use Lake Formation-based controls. All table definitions and access are in Lake Formation. We built a tool on top for multi-account settings. We have custom implementations for approval flows using ServiceNow to orchestrate approvals and access.  

Adam - 00:20:04  

Anand asks if it's a centralized approach on top of a decentralized model.  

Aakash Pradeep - 00:20:14  

It's more like an evolution. We started with central, and we live in a hybrid world. We are pushing for decentralization so domains can govern their own data. The hypothesis is that domain teams know their data and its governance best. Central helps build platform tools, but governance strategy and compliance go to the domain teams.

Adam - 00:21:08  

James asks, how have you measured ROI and success aligned with business objectives to demonstrate value for changing architecture?  

Aakash Pradeep - 00:21:20  

Great question, and a hard one. Success shows up in customer satisfaction, the speed of data movement, and how much new data we onboard. One measure is how many new people we onboard and how many new dashboards we facilitate. We also tie success to customer success, like AI teams building new insights and products enabled by our tools. It's a mix of these factors.

Adam - 00:22:25  

One last one from Vasant, and maybe a couple more if I can squeeze them in. Do you have FinOps or manage cost attribution for the platform?

Aakash Pradeep - 00:22:35  

We are looking into that. If you move to domains, cost attribution is easier. Data mesh pushes data quality and maintenance to the domain teams, so the cost of maintaining datasets is attributed to the domains. Cost attribution for the central team is complicated due to multi-tenancy. With data mesh, it's easier to quantify what a domain spends to make its data available. Let me know if you want to follow up.

Adam - 00:23:27  

We gotta get going for the next stage, but I want to say the conversation with you both earlier and now seeing that system diagram inspires me. I want to sit down with you for two or three hours, just zoom in on each piece, what exactly you mean, and to what extent domain owners have tools and ability. There are many questions to unpack. This was as long as we can make it now, but there should be a part two. Thank you very much for joining us today.