Powering Amazon Unit Economics at Scale Using Apache Hudi™
Understanding and improving unit-level profitability at Amazon's scale is a massive challenge, one that requires flexibility, precision, and operational efficiency. It's not only about the sheer amount of data we ingest and produce, but also the need to support our ever-growing businesses within Amazon. In this talk, we'll walk through how we built a scalable, configuration-driven platform called Nexus, and how Apache Hudi™ became the cornerstone of its data lake architecture.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Speaker 1 00:00:07
I've got not one, but two speakers coming up in this next session. Let's bring them onto the stage right about now. Let's grab you, Abha, and you, Jason. Yay. Did you guys win some headphones?
Speaker 2 00:00:21
<laugh>? Nah, <laugh>. We, uh, felt,
Speaker 1 00:00:24
Felt a little bit like cheating
Speaker 2 00:00:25
<laugh>.
Speaker 1 00:00:30
Oh man, I gotta go figure that out. While you all are giving a talk, I'm gonna have to go sort through a whole lot of chat data. Alright, I'll be back very soon. I'll see you all in a bit.
Speaker 2 00:00:42
Thanks everyone for joining in. This is Abha and this is Jason. We are both senior engineers here at Amazon. Today we are going to be talking about how we power Amazon unit economics using Hudi and a config-driven framework called Nexus. Some intro on where we belong within Amazon: we come under the worldwide Amazon Stores division, and under that we are part of an organization called Profit Intelligence. The goal of Profit Intelligence is to provide accurate, timely, and granular profitability data for Amazon stores globally. One of those metrics is contribution profit. Contribution profit is a very standard business metric that is computed across different companies. One simple example: if you ever bought a speaker from Amazon, our team pretty much computes the exact profit Amazon made on it, including the various cost and revenue segments like shipping cost, fees, and so on.
Speaker 2 00:01:53
This profitability data allows a large number of automated systems to make a billion-plus decisions every day; this includes pricing and forecasting, and it's also used by finance teams. That's our intro on where we belong. Now, some history of our team with data lakes. We've been computing contribution profit for more than 15 years, which in essence means we've been dealing with big data processing pretty much since the beginning. We initially just did the processing and published the data into a central data warehouse, but as the business requirements evolved, we eventually moved to owning our own data and maintaining our own data lakes, for better control and to solve business requirements more easily. We have gone through a lot of iterations of data lakes: a custom S3-based data lake with a custom query language, unstructured Redshift ETLs on top of that data lake, and more recently we moved to consuming streaming sources as well via Flink. The latest iteration is Nexus, which I will talk about in more detail. Across each of these iterations, the main thing is that changing business requirements drove them. Given that we are the owners of all the retail profitability logic, to keep up with the business requirements and Amazon's growth, we have to make sure our system is able to handle those requirements.
Speaker 2 00:03:37
Here I'm just showing a quick preview of how we started ingesting streaming sources. The main thing to look at is that we are using Flink for ingestion, both in streaming mode and batch mode. We have our own in-house upsert data lake, which we built using Spark and Glue, and we still have the Redshift ETLs. In this iteration we started isolating the business logic on the input side, and we had a little bit more control. But because we were still using a custom solution for our data lake, we had scaling issues, and some logic was still in the Redshift ETLs, so it was hard to make business changes.
Speaker 2 00:04:32
Our latest iteration, of not just the data lake but the overall system, is Nexus. Nexus is a config-driven data processing platform that allows customers to express their business logic. Customers here are the different business owners within Amazon retail; they are able to express their business logic as simple declarative configurations. The expectation is that they only interact with this configuration layer, and Nexus as a system is able to go from this configuration to producing the outputs by generating the relevant workflows and jobs. This brings us closer to a self-service world where business owners own their own business logic and are able to make changes on their own without relying on the engineering team, which addresses Amazon's growth and changing business requirements. At a high level, there are four main modules; I'll go over each of them briefly.
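To make the configuration idea concrete, here is a minimal, hypothetical sketch of what such a declarative business-logic config could look like, expressed as a Python dict. The field names (dataset, inputs, metrics, output) and values are illustrative assumptions, not Nexus's actual schema.

```python
# Hypothetical declarative business-logic config (illustrative only).
contribution_profit_config = {
    "dataset": "contribution_profit",
    "inputs": ["shipments", "fees", "shipping_costs"],   # upstream datasets
    "metrics": [
        # each metric is a simple declarative expression over the inputs
        {"name": "revenue", "expr": "item_price + fees_collected"},
        {"name": "cost", "expr": "cogs + shipping_cost"},
        {"name": "contribution_profit", "expr": "revenue - cost"},
    ],
    "output": {"table": "cp_by_order_item", "keys": ["order_id", "item_id"]},
}
```

The point of a layer like this is that a business owner edits only the expressions and input lists, while the platform derives the workflows, jobs, and tables from them.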
Speaker 2 00:05:37
We'll start with Nexus Flow. The main responsibility of Nexus Flow is handling all the orchestration responsibilities. Each of the Nexus modules operates at its own abstraction level, and Nexus Flow operates at what we call the workflow abstraction. A workflow here is basically a declarative configuration representing the workflow. Users are able to define it by hand, or, in most of our cases, it is generated from the higher-order config that the business owners define in the configuration layer. Nexus Flow handles these workflows in multiple layers: a logical layer and a physical layer. The logical layer is where the majority of the magic happens, in terms of building a logical DAG, doing dependency inference, and attaching runtime information. The physical layer is just a lightweight abstraction over Step Functions, where the actual jobs execute.
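As a rough illustration of the workflow abstraction and dependency inference described above (the field names are made up, not the real Nexus Flow schema), a declarative workflow might list tasks and let the orchestrator infer the DAG edges by matching outputs to inputs:

```python
# Hypothetical workflow declaration; the DAG is inferred rather than hand-wired.
workflow = {
    "name": "contribution_profit_daily",
    "tasks": [
        {"id": "ingest_fees", "type": "spark_job", "outputs": ["fees"]},
        {"id": "ingest_shipments", "type": "spark_job", "outputs": ["shipments"]},
        {"id": "compute_cp", "type": "spark_job",
         "inputs": ["fees", "shipments"], "outputs": ["contribution_profit"]},
    ],
}

def infer_edges(wf):
    """Build (upstream, downstream) edges by matching task outputs to task inputs."""
    producers = {out: task["id"] for task in wf["tasks"] for out in task.get("outputs", [])}
    return [(producers[inp], task["id"])
            for task in wf["tasks"]
            for inp in task.get("inputs", [])
            if inp in producers]

print(infer_edges(workflow))
# [('ingest_fees', 'compute_cp'), ('ingest_shipments', 'compute_cp')]
```

In the real system, a logical DAG like this would then be compiled down to the physical layer, i.e. Step Functions state machines that run the actual jobs.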
Speaker 2 00:06:35
The overall theme of all the Nexus modules is that they are very extensible. Let's say we have to add a new task type: Nexus Flow uses a federated model where you just add a new implementation and everything works out of the box. Next we'll go to Nexus ETL. The Nexus ETL module represents the compute, or ETL data processing, aspect of the Nexus platform. We can treat it as a library that goes from a job abstraction, where a job is defined as configuration, to an executable Spark job. In the slide, I'm showing an example of a job. A job is defined as a list of operators, which should be fairly intuitive from what we see here. The operators can be either built-in Spark transforms or custom user-defined functions. Currently Nexus ETL uses Spark primarily, but the concept should apply to other data processing engines as well. Next, we'll go to Nexus Data Lake. This is the storage layer. Most of our tables are Hudi tables. This is where the user-defined config at the configuration layer drives even the catalog management, including table creation, schema inference, and schema evolution. Sometimes we even provide configuration that gives hints on what distribution keys to use. All of these configurations go hand in hand with the Nexus Flow config and the Nexus ETL config; they all work together.
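As a hedged sketch of the job-as-a-list-of-operators idea (assuming a PySpark runtime; the operator names, config fields, and paths are illustrative, not the actual Nexus ETL spec):

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical job config: a job is an ordered list of operators over a source.
job = {
    "source": "s3://example-bucket/shipments/",            # illustrative path
    "operators": [
        {"op": "filter", "condition": "marketplace_id = 1"},
        {"op": "with_column", "name": "net_revenue",
         "expr": "item_price - promotion_discount"},
        {"op": "select", "columns": ["order_id", "net_revenue"]},
    ],
}

spark = SparkSession.builder.appName("nexus-etl-sketch").getOrCreate()
df = spark.read.parquet(job["source"])

# Interpret each operator as a built-in Spark transform; custom UDFs would be
# registered and dispatched the same way.
for op in job["operators"]:
    if op["op"] == "filter":
        df = df.filter(op["condition"])
    elif op["op"] == "with_column":
        df = df.withColumn(op["name"], F.expr(op["expr"]))
    elif op["op"] == "select":
        df = df.select(*op["columns"])
```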
Speaker 2 00:08:23
All of this is accessed through Nexus Flow, and there are plans to add a control plane and APIs to query the data. Yeah, that's it on the data lake. Next, Jason will go over our learnings from using Hudi for the data lake.
Speaker 4 00:08:46
Okay, cool. Thanks, Abha, for going through Nexus and the data lake. So, why Hudi? A lot of you will want to know, since it's in the title. We built the Nexus data lake on Hudi specifically. For those of you who aren't familiar, Hudi is similar to Iceberg, which we've talked about before, but compared to Iceberg it has a lot more functionality that we are utilizing. However, that also comes along with some learnings, which we'll go through here. The first one is concurrency. Originally, when we were designing our table structure, we thought we would have a couple of jobs writing to the same Hudi table. This way, when a schema update happens in one of the jobs, for example when we want to move a column from one job to another, the transition just happens seamlessly because we have a monolithic Hudi table. Which is true.
Speaker 4 00:09:57
However, when we actually ran this in a shadow environment, we discovered failures, because Hudi uses optimistic concurrency control for multi-writer scenarios: it optimistically assumes the write will go through, and if there are two jobs writing to the same file, one of them fails. In our shadow runs we saw failures because there were a lot of cases of two writers hitting the same file. So we pivoted to a new table structure with a separate Hudi table per writer, and, using the aforementioned Nexus Flow, we handle it at the orchestration layer: we determine when each of these tables has been completely upserted, then run a join job that reads each table's output via an incremental query and produces the combined Hudi table.
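For reference, a minimal sketch of the Hudi options involved in the two approaches, assuming Spark DataSource reads and writes with the Hudi bundle on the classpath; the lock provider, paths, and instant time are illustrative, not our production settings:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Hudi jars are available

# Approach 1: multiple writers on one table need optimistic concurrency control,
# a lock provider, and lazy cleanup of failed writes (illustrative values).
occ_opts = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
}

# Approach 2 (what we moved to): one table per writer, then a join job that
# reads each input table incrementally, i.e. only commits after a given instant.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("s3://example-bucket/hudi/fees_table/")           # illustrative path
)
```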
Speaker 4 00:11:09
Another learning was about how Hudi manages the metadata table, which is enabled via the hoodie.metadata.enable config. The metadata table is basically a smaller table that sits inside your Hudi table and keeps track of things like indexes and file listings, essentially metadata information. Originally, when we started running the jobs, we used synchronous cleaning, and we realized cleaning was taking a fair bit of time, so we thought we would just do it asynchronously: we had another workflow running that kept cleaning the table. Later on, we discovered that despite the data being cleaned, the metadata table was not being cleaned. We filed a GitHub issue for that, and the Hudi folks were very keen on responding. We got to know it was fixed in a later version, while we were on an older version, so to ensure the metadata table gets properly cleaned, we switched our Hudi tables back to the synchronous cleaning process. We also discovered a bug on our end that explained why the file listing during cleaning took so long.
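A minimal sketch of the configs in play here; the values are illustrative, not our exact production settings:

```python
# Hudi metadata table and cleaning configuration (illustrative values),
# passed to the writer via .options(**cleaning_opts).
cleaning_opts = {
    "hoodie.metadata.enable": "true",         # enable the internal metadata table
    "hoodie.clean.automatic": "true",         # clean as part of each write...
    "hoodie.clean.async": "false",            # ...synchronously (what we switched back to)
    "hoodie.cleaner.commits.retained": "10",  # how much commit history to keep
}
```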
Speaker 4 00:12:33
Another section of learning was related to Hudi costs. Most of our tables are copy-on-write tables. Our update pattern mostly touches the most recent 30 to 60 partitions, with the updates
Speaker 4 00:12:46
spread fairly evenly across roughly 90 partitions. With that access pattern on copy-on-write tables, we discovered about 70% of the costs were PUT requests, and PUT and GET requests combined account for about 80% of the costs. So even though, due to customer requirements, we don't really delete our data, we found that most of our Hudi-related costs are basically S3 requests rather than S3 storage. We do expect that to change eventually as we store more data. So we looked at a couple of saving strategies, one of them being S3 Intelligent-Tiering. It's a feature from AWS where the backend automatically figures out whether your data is frequently accessed, infrequently accessed, or very infrequently accessed. We also tried more aggressive table cleaning, the Hudi feature we mentioned before that cleans up old file versions for you. And we also tried EMR autoscaling to optimize our compute. As part of this migration project, we migrated a lot of workflows into Nexus, probably about 300, and a lot of them basically do the same thing. We maintain about 1,200 tables, a number that keeps increasing based on business needs, and each table gets updated about five to 15 times a day. The total data size of our data lake is about four petabytes, with about one petabyte added and roughly as much cleaned up every month.
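Coming back to the S3 Intelligent-Tiering strategy mentioned above: one generic way to apply it is an S3 lifecycle rule on the bucket that holds the Hudi tables. This is a sketch with a made-up bucket name and prefix, not our actual setup:

```python
import boto3

# Transition objects under a Hudi prefix to S3 Intelligent-Tiering so that S3
# moves them between access tiers automatically based on access patterns.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake-bucket",            # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "hudi-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "hudi/"},    # illustrative prefix
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```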
Speaker 4 00:14:43
As for our daily data size, every day we ingest a few hundred <inaudible>, and there are days where it's even higher. And as part of using Apache Hudi rather than an in-house-built data lake, we saved about a year of developer time, because Hudi is really something we can just pick up and start using. Okay, that's probably about it for the talk today. We have a full version of this talk that we did for a Hudi Community Sync; you can scan the QR code for that. The content of the talk has also been converted to a blog, thanks to the wonderful Onehouse folks, which is also in the QR code.
Speaker 1 00:15:34
You guys were awesome. I gotta keep things moving. For anyone that has questions, drop 'em in the chat and I am assuming you guys are gonna be hanging around there and you can answer some questions that come through. Thank you fellas.