Data Mesh and Governance at Twilio

May 21, 2025
Speaker
Aakash Pradeep
Principal Software Engineer

At Twilio, our data mesh enables data democratization by letting domains share data with a central analytics platform, and access data from it, without duplicating datasets. Using AWS Glue and Lake Formation, only metadata is shared across AWS accounts, making the implementation efficient and low-overhead while ensuring data remains consistent, secure, and always up to date. This approach supports scalable, governed, and seamless data collaboration across the organization.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Speaker 1    00:00:06    
Uh, for folks tuning in, Aakash and I were talking about what it takes to migrate to and engineer a completely new, next-generation data lakehouse at Twilio, and what it takes to get there. It's clear to everybody that there were limitations with the previous paradigm of how things were built. It isn't that somebody did something wrong; we just couldn't quite see past the horizon as it was when we were building the last version, right?

Speaker 2    00:00:43    
Exactly. Yeah. And that's how we evolve, right? We moved from Hadoop to Spark, then someone says, okay, we need real time now, and Spark only goes so far. We need more for the AI applications we're looking at and those kinds of frameworks. New problems always come up, and then you look at how much you can change the existing system. If you can't, you have to build something new.

Speaker 1    00:01:03    
Yeah. Aakash, I'll be back in 15 minutes. The stage is yours.

Speaker 2   00:01:10    
Uh, good afternoon everyone. My name is Aakash. I'm a principal engineer at Twilio. Today I'm gonna talk about data mesh and governance. In this talk we'll look into what problem we were trying to solve at Twilio, what data mesh is and how it helped us solve that problem. We'll also look into some of the implementation, the strategies, the high-level architecture, and some of the challenges around that. So let's get into it. First, a little bit about Twilio: Twilio is a cloud communications platform. It provides programmable APIs for all the communication channels to reach out to your customers. We also like to call ourselves a customer engagement platform. With that, let's look into what problem we had before we moved to data mesh.

Speaker 2   00:02:03    
So if you look at this architecture: we had a data platform where messaging, billing, and the other domains at Twilio provided their data to a central data platform team, which managed it in a data lake, managed governance on top of it, handled GDPR-compliance kinds of things, ran a lot of transforms, and made that data available for consumption from Looker and Tableau. This was working well for the last few years, but as Twilio grew organically, we started to see some problems. One is that there is a central team handling everything. Though this team has automated a lot of the onboarding, self-servicing, and all that, we still saw that whenever there were schema changes happening, or a new dataset came up with GDPR requirements,

Speaker 2   00:02:58    
or a new quality concern came up, like this data doesn't have good quality, how do we fix that? Because you're part of the central team, you may not have a good understanding of all the domain data. So now you are working back and forth with the domain, and that slows your response, slows down the insights. And as Twilio grew, it made a lot of acquisitions, and now the ask is to get that data also queryable, to build insights on top of it. But that means the central team has to expand and build new systems to make that data available too, which causes more delay. So basically, in all these examples, the central team started to become a bottleneck and a scaling issue for making data available for querying or for insights.

Speaker 2    00:03:53    
And as you can see in the picture here, there is always an ask for more ETLs, more pipelines. We always say, yes, we can build that, and then the requesting team files a Jira and it goes scrum by scrum, because it depends on the velocity of the central team. So over the last six months to a year we saw these kinds of things happening, and we were becoming a bottleneck for a lot of them. And we anticipated that as Twilio grows more and more, it would become more prominent. So we started to look for solutions to these kinds of challenges. We looked into data mesh and thought, okay, this may be a possible solution.

Speaker 2    00:04:36    
So let's look into that. Here you can see the comparison between a monolithic pipeline and domain-oriented pipelines. The best thing data mesh brings is decentralizing the data architecture in such a way that each product and domain team starts to own its own data, handling data quality, GDPR, and all those kinds of things, taking a lot of that responsibility away from the central team so that domains can move faster based on their own speed and requirements. You can draw an analogy with the way microservices were adopted: you have a monolithic codebase, you want to scale a particular system, so you break it down into microservices and concentrate on each one, scoping each one down. In a similar way, we have to bring similar concepts to data, where each team owns and serves its data and follows common standards.

Speaker 2   00:05:32    
Instead of one central team owning everything, you divide that whole responsibility, and domain teams take a bigger part in it. Since we are talking about data mesh, we can't go without talking about these four principles. The first principle is domain-oriented ownership. It means the data producers now own, manage, and maintain the data, rather than the central team. The second concept is data as a product. You can't just treat the data as, okay, this is available for consumption and that's it. Rather, you have to think in a product way: the data itself becomes a product in a consumable form. So there will be someone responsible for annotating it, making it discoverable, making it reliable, and defining service-level objectives on top of it, so the consumer knows how to find it and how to consume it.

Speaker 2    00:06:26    
Basically, it asks you to think in terms of consumers, how they're going to consume your data, and to expose it that way. Self-serve data infrastructure is the next one. The idea is that not everyone should start from scratch; rather, there should be a kiosk of tools, and teams can pick those tools up and use them instead of worrying about building and maintaining them, because someone else is managing those tools. It's not one-size-fits-all: each domain can still have its own specific tools, but these are the common tools a platform team is supposed to serve. And the last one, the most important, is federated governance. Earlier at Twilio, if you look at the previous architecture, it was mostly role-based access control.

Speaker 2   00:07:19    
We had that, and it means a central team is defining, okay, there will be a data-analyst role, and these are the tables available to that role. But as the enterprise grows, these roles get hard to manage, and you end up with so many permutations and combinations that it starts to get messy. So we looked into how our governance could become more federated. Instead of thinking in terms of roles, you think in terms of the dataset: how it should be accessed, what its sensitivity is. The domain understands that, and they should be the ones deciding it. So we are pushing those decisions out to the teams, rather than to a central team that may not have that domain knowledge.

Speaker 2    00:08:06    
So now we're going to talk about how we implemented this data mesh at Twilio. We basically have three actors. There are others too, but these are the main three: the data producer, the central data platform, and the consumer side. I'll talk about the responsibilities of each. The data producer is the one managing and owning their data, and they are responsible for making it compliant, maintaining high quality, ensuring freshness, meeting the SLOs and SLAs, and also defining how someone can access it and what sensitivity it has. And when I talk about the data access policy, they are only responsible for defining the policy, not implementing it. It is the responsibility of the central data platform to figure out what tools it can build for access management and to define how the flow will happen; but how someone gets access to the tables, and what the policy around that should be, is the data owner's responsibility.

Speaker 2    00:09:12    
That's the data producer's responsibility. We basically merged the two concepts, data ownership and data as a product, into the role the data producer plays. The central data platform, as we were already discussing, is the team producing the tools required across all the teams, as well as provisioning and managing tools like Presto, DataHub, and Looker that can be used by the rest of the org: maintaining tools for data discovery, so that not every team has to build that, and providing tools for discovering data, cataloging it, and attaching quality attributes to your datasets. We have seen that some pipelines, like the CDC pipelines, are well-defined and generic enough that they can be managed by the data platform itself and used by domains to move their data from source into the lake. Similarly for high-level data modeling: at the Twilio level, data modeling is the responsibility of the data platform, but at the domain level, it goes to the domain data owners. And the last actor is the data consumer, who is mostly responsible for consuming this data, building ETLs, and producing new data. They produce only a small amount of data themselves; they are mostly heavy consumers of all this business data.

Speaker 2    00:10:40  
When we were implementing data mesh, we took on some constraints and made some strategic decisions. They're very simple, and I'm trying to be very pragmatic here. Because Twilio is mostly an AWS shop, we set the constraint that data sharing across domains will only happen through S3, for data that lives in S3, and S3 is the single data store. And no one should access S3 directly; rather, they should go through the abstraction of the AWS Glue catalog. That means the Glue catalog should hold all the information for whatever data is to be shared. And as I was saying earlier, role-based access management requires a central team to govern it; we are moving away from that and pushing the decision of how data should be accessed onto the domain data owners.
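To make the "go through the Glue catalog, never raw S3" constraint concrete, here is a minimal boto3 sketch of a producer domain registering a Parquet table in its own Glue Data Catalog. The database, table, column, and bucket names are illustrative assumptions, not Twilio's actual setup.

```python
# Sketch: a producer registers a table in its own AWS Glue catalog so
# consumers resolve data through the catalog abstraction, never raw S3.
# All names here are illustrative, not Twilio's actual configuration.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_database(DatabaseInput={"Name": "messaging"})

glue.create_table(
    DatabaseName="messaging",
    TableInput={
        "Name": "message_events",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "message_sid", "Type": "string"},
                {"Name": "status", "Type": "string"},
                {"Name": "event_ts", "Type": "timestamp"},
            ],
            # Data stays in the producer's S3 bucket; only metadata is shared.
            "Location": "s3://messaging-domain-lake/message_events/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```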

Speaker 2    00:11:37    
For that, we thought tag-based access control with Lake Formation (LF-TBAC) would be a good way to do it, so owners can define tags on their datasets themselves; there's a sketch of that below. The other constraint we set is to keep data in a query-friendly format like Parquet or Iceberg, rather than something write-optimized, because you're sharing the data broadly, so make sure it's performant. At the same time, we made the strategic decision to never copy the data, because the moment you copy data you add complications: an extra copy of data means compliance issues, more potential for errors, quality issues, and all those kinds of things. So avoid data copies as much as possible, in fact never make them, and only share the metadata across domains, and keep accessing the data at the source itself, so you don't have any stale data, you have high-quality data, and you don't get into compliance issues.
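Here is a minimal boto3 sketch of what LF-TBAC can look like in practice: define a tag, attach it to a table, and grant SELECT on the tag expression instead of maintaining per-role table lists. The `sensitivity` tag taxonomy and the role ARN are assumptions for illustration; the talk doesn't describe Twilio's actual tags.

```python
# Sketch of Lake Formation tag-based access control (LF-TBAC).
# Tag keys/values and the role ARN are illustrative assumptions.
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# 1. Define the tag vocabulary once for the catalog.
lf.create_lf_tag(TagKey="sensitivity", TagValues=["public", "internal", "pii"])

# 2. The data owner classifies their table.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "messaging", "Name": "message_events"}},
    LFTags=[{"TagKey": "sensitivity", "TagValues": ["internal"]}],
)

# 3. Grant access by tag expression, not by enumerating tables per role.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analysts"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "sensitivity", "TagValues": ["internal"]}],
        }
    },
    Permissions=["SELECT"],
)
```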

Speaker 2    00:12:38    
The other strategic decision is that when you share data between domains, you make sure it's non-grantable. That means if domain A shares data with domain B, domain B shouldn't be spreading that data to anyone else; domain B should be content with it, and if somebody else needs that data, they need to go to the original source. There's a sketch of that kind of grant below. With that, I'll get into the high-level architecture. If you look at the left side, those are the data producers; the central team is in the middle; and the boxes on the right show the consumer side. As you can see, we are very much in a hybrid world as of now. The original pipelines into the central data platform are still running, and we are pragmatic here: we know there are small teams that don't have the capacity to build their own data lake and all of that.
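A minimal sketch of the non-grantable cross-account share with Lake Formation: domain A grants domain B's AWS account SELECT with an empty grant-option list, so B cannot pass the grant on. Account IDs and names are made up for illustration.

```python
# Sketch of the "never re-share" rule between domain accounts.
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

lf.grant_permissions(
    # Cross-account principals are plain account IDs; Lake Formation shares
    # only catalog metadata (via AWS RAM) and never copies the data.
    Principal={"DataLakePrincipalIdentifier": "444455556666"},  # domain B
    Resource={
        "Table": {
            "CatalogId": "111122223333",  # domain A, the owning account
            "DatabaseName": "messaging",
            "Name": "message_events",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=[],  # B cannot re-grant to anyone else
)
```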

Speaker 2    00:13:33    
So they're still producing data to us, and we manage that, but we're in a hybrid world and we are pushing toward the data mesh. A lot of these domains can manage their own, and you can think of each domain box here as an AWS account. So it's very much a multi-account architecture: each AWS account has its own S3, its own Glue catalog the data is registered in so someone can access it, and its own access policies defined in Lake Formation. Now, if a team is ready, saying, okay, this table is in a queryable format and I want to share it with other domains, they come and connect to a tool we have built, a data mesh tool, which enforces that whatever data you want to share has the access controls, the quality, the SLOs, and all those things defined.

Speaker 2   00:14:20    
That's what this whole tooling allows you to do; it enforces all those strategies. And once all this configuration is provided, the tool does the sharing from one account to another, from one domain to another. At the same time, if you look at this diagram, there are two flows. One flow is the other teams, also AWS accounts, that want to access this shared data. They can go into DataHub, discover the data, look into its quality, and come back and say, I want to access this data. The tool will provision it, but it makes sure you go through the approval process. The other flow is for the human users who are going to access this data. For that we built a pipeline where they go and make a request in ServiceNow, saying, I want to access this data, and that connects with an API here which controls how the approval flow proceeds.
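On the consumer side, a common Lake Formation pattern (assumed here as an illustration, not confirmed as Twilio's exact mechanism) is for the provisioning tool to create a Glue resource link in the consuming account once approval completes, so queries in tools like Athena or Presto resolve to the producer's table in place:

```python
# Sketch: after approval, the consumer account gets a Glue "resource link"
# pointing at the producer's catalog entry. Names/IDs are illustrative.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_table(
    DatabaseName="shared",  # a local database in the consumer account
    TableInput={
        "Name": "messaging_message_events",
        "TargetTable": {  # link to the producer's catalog entry, not a copy
            "CatalogId": "111122223333",
            "DatabaseName": "messaging",
            "Name": "message_events",
        },
    },
)
```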

Speaker 2    00:15:16    
Once the approval flow and mechanics complete, the dataset becomes available for that user to consume. And you can see that with this data mesh, most of the work is on the domain side: they produce the data and control how it's governed and made available for the rest of Twilio to consume. Some of the challenges we had: fine-grained access control is still a big problem. Some tools provide it, but there are still a lot of credential-vending APIs and such involved, so cross-account access becomes a real problem with Lake Formation. And we are still working on the approval flow. We have a concept where data owners define the policy and need to approve requests, but we are still thinking through how to evolve it and make it more streamlined, introducing data councils, data stewards, those kinds of things. So that's pretty much what I had to cover. Thank you. Any questions?

Speaker 1    00:16:22    
Nice, Akash. Uh, yeah, I think we do have a few questions, so let me take them one by one. Okay, first one: James is saying, really good presentation, Akash. Do you use data contracts? If so, what tools do you use to store them, what data do you store in them, and how do you ensure that the contracts are honored?

Speaker 2    00:16:44    
Awesome. Yeah, that's a very good question, and we are in fact thinking about it. We are working on it, and mostly we are planning to use DataHub to maintain the contracts. Whenever the sharing happens, it will come with a data contract, and that definition will live in DataHub. As of now, that's what we are using.
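As a rough illustration of the kind of contract that could live in DataHub alongside a shared dataset, here is a hypothetical contract definition. Every field name below is an assumption for the sketch, not DataHub's or Twilio's actual schema.

```python
# Hypothetical data contract: the sort of metadata a producer might attach
# to a shared dataset. Field names are illustrative assumptions only.
message_events_contract = {
    "dataset": "messaging.message_events",
    "owner": "team-messaging@example.com",
    "schema": {
        "message_sid": "string",
        "status": "string",
        "event_ts": "timestamp",
    },
    "slo": {
        "freshness_minutes": 60,   # data landed within the last hour
        "completeness_pct": 99.9,  # row-count check against the source
    },
    "sensitivity": "internal",     # would drive the LF-TBAC tagging
    "breaking_change_policy": "30-day notice plus a new versioned table",
}
```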

Speaker 1    00:17:05    
Mm-hmm. We got another one, from Nel: do you centralize the data infrastructure pieces? That is, does ingestion have a central team, or does each team do their own ingestion?

Speaker 2    00:17:17    
No, we pushed that to the domains. We do have a central team here which runs some of the CDC pipelines, and we are living in a hybrid world: some of the old existing systems replicate from databases, following the CDC pipeline, with data coming from Kafka into the lake. Those are still running, but as of now the strategy is to push that to the domains, and each domain team is independent enough to decide which tool or which AWS managed service to use. At the same time, we provide a lot of these CDC pipelines as tools, as I said. So if that fits their bill, they can use it, but they're also fully independent to go and pick up something else that might help them.
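As a sketch of the kind of generic CDC tool a platform team might offer, here is a minimal Kafka-to-S3 sink in Python. The topic, bucket, event shape, and batching policy are assumptions; a production pipeline would also write Parquet, commit offsets carefully, and flush on time as well as size.

```python
# Sketch: consume row-level change events from Kafka and land them in the
# domain's S3 lake. Topic, bucket, and event shape are assumptions.
import json

import boto3
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "billing.cdc.invoices",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    group_id="billing-lake-sink",
)
s3 = boto3.client("s3")

batch = []
for msg in consumer:
    batch.append(msg.value)  # one change event per Kafka record
    if len(batch) >= 1000:   # flush in chunks; real sinks also flush on time
        key = f"invoices/changes/offset={msg.offset}.json"
        s3.put_object(
            Bucket="billing-domain-lake",
            Key=key,
            Body="\n".join(json.dumps(e) for e in batch).encode("utf-8"),
        )
        batch = []
```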

Speaker 1   00:18:05    
Yeah, you're building a platform for them, and they can choose to use it. If that tool is the best tool for the job, they're welcome to use it, but you're building tools that keep the different domains' needs in mind.

Speaker 2   00:18:18    
Exactly.  

Speaker 1   00:18:20    
We got one from Jason here. Did you guys start with the central data lake and then slowly move towards the distributed model?

Speaker 2    00:18:27    
Exactly. And in fact I would suggest that if the org is small, a central data lake generally works well. But as Twilio has grown to enterprise level, a big enterprise now, that's where we started to see the bottlenecks with the central platform: the central platform needs to do everything, so that's where the slowdowns come, and you depend on the central team for a lot of capacity, which slows the domains. That's why we're adopting a hybrid approach: we still have a central lake, so small teams with small use cases can use it, but the bigger domains with large data of their own can run their own systems.

Speaker 1    00:19:12    
Uh, we got another one here, from amh: what exact tools or services are you using for governance?

Speaker 2    00:19:20    
Okay, so as of now we are mostly using the Lake Formation based controls. All the definitions for a table, how someone can access it and so on, live in Lake Formation. The tool we built sits on top of that to do the multi-account sharing. On top of that, we have our own custom implementations for the approval flows everything goes through, and for orchestrating those approval flows, how someone can get access, we are using ServiceNow.
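A hedged sketch of how a ServiceNow-driven approval flow could feed Lake Formation grants: poll ServiceNow's Table API for approved requests, then apply the grant. The instance URL, custom table, and field names are hypothetical assumptions about how such a request form might be modeled; only the general Table API shape and the Lake Formation call are standard.

```python
# Sketch: turn approved ServiceNow requests into Lake Formation grants.
# The u_data_access_request table and u_* fields are hypothetical.
import boto3
import requests

SN_URL = "https://example.service-now.com/api/now/table/u_data_access_request"

resp = requests.get(
    SN_URL,
    params={"sysparm_query": "state=approved^u_provisioned=false"},
    auth=("svc_user", "********"),  # service-account credentials
    headers={"Accept": "application/json"},
    timeout=30,
)
lf = boto3.client("lakeformation")

for req in resp.json()["result"]:
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": req["u_principal_arn"]},
        Resource={"Table": {"DatabaseName": req["u_database"], "Name": req["u_table"]}},
        Permissions=["SELECT"],
        PermissionsWithGrantOption=[],  # keep shares non-grantable
    )
```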

Speaker 1    00:20:04    
We got one more here from Anand. I think he was responding to something you were saying earlier: is it a centralized approach on top of a decentralized model?

Speaker 2    00:20:14    
A centralized approach on top of a decentralized approach?

Speaker 1    00:20:18    
I think there probably was context when they first submitted this; I'm not sure exactly. Anand, if you want to specify what exactly you mean...

Speaker 2    00:20:27    
Yeah. It's more like an evolution for us. We started with central; now we are living in a hybrid world where some things are still central, but we are trying to push for decentralized as much as possible, so that the domains can govern. That's the hypothesis here. The strategy is that the domain teams know the data: they know how to govern it, they know the GDPR requirements and the sensitivity of it. A central team can help them, and that's the role we're trying to play, as a platform that builds the tools. But the governance and the strategy for access, GDPR compliance, quality, and all of that goes to the domain team.

Speaker 1   00:21:08    
Okay. We got one more here from James: how have you measured the ROI, and how are the success measures aligned with delivery of business objectives, to demonstrate business value and benefits from changing the architecture in such a way?

Speaker 2   00:21:20    
Oh, that's an awesome question <laugh>, and a very hard one too. The success here is in terms of customer satisfaction, the speed at which we can move data around, how much new data we can onboard. So one way to measure it, since we have this tool and this concept, is how many new teams have onboarded and how many new dashboards and so on are getting facilitated. Mm-hmm <affirmative>. That was one of the measurements we took. On top of that, we tied our success to our customers' success, the other product teams who use our tools to build more insights. One success would be the AI teams who are building new insights and new products just because we enabled them; that also counts as success for us. So it's those kinds of things. I don't have a single straight answer, but it's a mix of all of those.

Speaker 1    00:22:25    
Okay. Maybe one last one, from Vasant here; maybe a couple more if I can squeeze another one in. Do you have FinOps or managed cost attribution for the platform in some way?

Speaker 2   00:22:35    
We are looking into that as of now. And that's another goal: if you move to the domains, then cost attribution can be done easily. Right now, data cleaning, maintaining data quality, and all those kinds of things are pushed to the domain team, so the cost attribution for maintaining a dataset sits with the domain, and it can easily be reasoned about: okay, this domain is costing us this much. Whereas with the central team it had become very complicated; you have to attribute every piece, and it had become a multi-tenant product, so it's very hard to quantify and put a cost on. With data mesh, the data is pushed to its domain, so it's easy to quantify how much a team is spending just to make its data available. I don't know whether I answered the question, but let me know if you have any follow-up.
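One way the per-domain accounting could work, sketched here with AWS Cost Explorer grouped by linked account. That each domain maps one-to-one to an AWS account follows the talk's architecture, but the billing setup itself is an assumption; tag-based grouping would work similarly.

```python
# Sketch: per-domain cost attribution once each domain runs in its own
# AWS account, using Cost Explorer grouped by linked account.
import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-04-01", "End": "2025-05-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"}],
)
for group in resp["ResultsByTime"][0]["Groups"]:
    account = group["Keys"][0]
    cost = group["Metrics"]["UnblendedCost"]["Amount"]
    print(f"domain account {account}: ${float(cost):,.2f}")
```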

Speaker 1    00:23:27    
Yeah, yeah. Well, I think we've gotta get going for the next stage. But I will tell you one thing: the conversation with you, both at the beginning before we went live and now, just seeing that system diagram you put up as the final version of V2, inspires me. I feel like I want to sit down with you for two or three hours <laugh>, just to zoom in on each one of these pieces: what exactly do you mean by this, and to what extent does the domain owner actually have the tools and the ability? There are so many different questions, and I feel like we should unpack this somewhere. This was as long as we could make it right now, but there should be a part two. Akash, thank you very much for joining us today.