Moving fast and not causing chaos
Data engineering teams often struggle to balance speed with stability, creating friction between innovation and reliability. This talk explores how to strategically adapt software engineering best practices specifically for data environments, addressing unique challenges like unpredictable data quality and complex dependencies. Through practical examples and a detailed case study, we'll demonstrate how properly implemented testing, versioning, observability, and incremental deployment patterns enable data teams to move quickly without sacrificing stability. Attendees will leave with a concrete roadmap for implementing these practices in their organizations, allowing their teams to build and ship with both speed and confidence.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Adam - 00:00:06
And we have Joseph with us. Can you come onto the stage? Can you?
Joseph Machado - 00:00:11
Hey, thanks for having me, Adam.
Adam - 00:00:14
How you doing there, Joseph? Where are you calling in from?
Joseph Machado - 00:00:17
Jersey City, New Jersey.
Adam - 00:00:18
Jersey City. We're not too far away.
Joseph Machado - 00:00:21
Not too far away.
Adam - 00:00:24
No, I'm in New York, so I'm not too far.
Joseph Machado - 00:00:26
Okay.
Adam - 00:00:27
Gotcha. The other folks are a little bit further. I'm gonna give you the stage and without further ado, Joseph, take it away.
Joseph Machado - 00:00:37
Sounds good. Thank you. Hello everyone. My name is Joseph. I'm a senior data engineer over at Netflix. I wanted to take this time to talk about some of the things that I've learned in my career over the past decade or so. Whenever I have ignored these things, it has come back to bite me, hence the name "not causing chaos."
The first, most important thing as a data engineer is data modeling. Your data model is your product. Without a good data model, no amount of optimization or fancy tools will help because that is what your end user is going to consume. When we talk about data model, there are a lot of nuances, but at a high level, there are primarily five types of tables.
The fact tables represent actual events that happen in real life, like an order or some event on a website. Those are considered facts. It's something that happens. Then there are dimensions which represent business entities, so consumer, product, seller, merchant, etc. These are the people who interact with the business. It can also be things like date.
Then there are bridge tables, which are mapping tables; many-to-many relationship mappings are stored in the bridge tables. A more recent one is the one big table, where you typically take a fact table and left join every dimension referenced in that fact. What this allows you to do is serve it as a single table, so people don't have to repeat the joins every time they query it. We're basically sacrificing space to improve performance.
Finally, there's the summary table. You can think of it as data marts. These are essentially the final stage or final layer of tables that you present to end users. Typically, people who use this are using it via a BI tool so that when they access the summary table, they're not incurring costs because it's already aggregated.
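To make those shapes concrete, here is a minimal sketch of a fact table, a dimension, and a one big table built from them. The table and column names are made up for illustration, and DuckDB is used only because it runs in-process with no setup.

```python
# A toy illustration of fact, dimension, and "one big table" shapes.
# Table and column names are hypothetical; DuckDB is just a convenient in-process engine.
import duckdb

con = duckdb.connect()

# Dimension: business entities (here, customers).
con.execute("""
    CREATE TABLE dim_customer AS
    SELECT * FROM (VALUES
        (1, 'Alice', 'NJ'),
        (2, 'Bob',   'NY')
    ) AS t(customer_id, customer_name, state)
""")

# Fact: events that happened (here, orders), keyed to dimensions.
con.execute("""
    CREATE TABLE fct_orders AS
    SELECT * FROM (VALUES
        (100, 1, DATE '2024-11-28', 25.00),
        (101, 2, DATE '2024-11-29', 40.00)
    ) AS t(order_id, customer_id, order_date, amount)
""")

# "One big table": the fact left-joined to every dimension, trading storage
# for simpler, join-free downstream queries.
one_big_table = con.sql("""
    SELECT f.*, d.customer_name, d.state
    FROM fct_orders f
    LEFT JOIN dim_customer d USING (customer_id)
""")
print(one_big_table.fetchall())
```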
The next most important thing is data quality. We always want to make sure our data is valid before stakeholders use that data, because unlike backend engineering or other parts of software, once our data is out and people start using it, it's hard to undo the actions business users have already taken based on incorrect data.
When there are too many data quality issues, it causes a loss of trust by the stakeholders. Then you get questions like, "Hey, is this data accurate? I don't think so," etc. On the other hand, you also don't want too many data quality checks because you'll get a lot of alerts and your on-call engineers are going to suffer. Finding the balance is key.
As for implementation, you typically want to do something called a write-audit-publish pattern where you create your data, but before you expose it to your consumers, you do your data checks. Only if it passes do you expose it. If it doesn't, you raise an alert and fix that issue.
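A bare-bones sketch of write-audit-publish in plain Python, with hypothetical staging and published paths and a placeholder check; real pipelines usually implement this on top of a table format's staging or branching features, but the shape is the same.

```python
# Write-audit-publish sketch: write to a staging location, run checks,
# and only promote to the published location if the checks pass.
# Paths and the audit rule are hypothetical placeholders.
import shutil
from pathlib import Path

STAGING = Path("/tmp/orders_staging")
PUBLISHED = Path("/tmp/orders_published")


def write(rows: list[dict]) -> None:
    STAGING.mkdir(parents=True, exist_ok=True)
    (STAGING / "part-0.csv").write_text(
        "\n".join(f"{r['order_id']},{r['amount']}" for r in rows)
    )


def audit() -> bool:
    # Example check: staged data exists and has no negative amounts.
    staged = (STAGING / "part-0.csv").read_text().splitlines()
    return len(staged) > 0 and all(float(line.split(",")[1]) >= 0 for line in staged)


def publish() -> None:
    if PUBLISHED.exists():
        shutil.rmtree(PUBLISHED)
    shutil.copytree(STAGING, PUBLISHED)


write([{"order_id": 1, "amount": 25.0}, {"order_id": 2, "amount": 40.0}])
if audit():
    publish()
else:
    # Raise an alert instead of exposing bad data to consumers.
    raise ValueError("Audit failed; data not published")
```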
I want to quickly touch upon some types of data quality checks. The standard ones are table constraint checks. These are checks where you make sure that your columns don't have nulls, they're unique, they have all values populated, etc. Then there are business rule checks. These are specific to the type of data you're working with, like start date cannot be after end date.
Then there are schema checks so you don't break downstream processes. The metrics variance check is important: you monitor key metrics over time and make sure they don't vary too much. If you see an outlier, that's something to look at, but you do have to account for seasonality. For example, on days like Thanksgiving, there might be a spike in revenue.
Finally, there's a reconciliation check. This type of check is typically used when you want to make sure that you're not losing any data due to any code bug. You take your output and compare the number of rows against the input driving table. Typically, you'll have one driving table in your inputs.
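A few of these check types sketched as plain Python assertions, with hypothetical column names; in practice they would usually live in a data quality framework, but the underlying logic is this simple.

```python
# Hypothetical examples of the check types above, written as plain assertions.

# Table constraint check: key column is non-null and unique.
def check_constraints(rows: list[dict]) -> None:
    keys = [r["order_id"] for r in rows]
    assert all(k is not None for k in keys), "order_id contains nulls"
    assert len(keys) == len(set(keys)), "order_id is not unique"


# Business rule check: start date cannot be after end date.
def check_business_rules(rows: list[dict]) -> None:
    assert all(r["start_date"] <= r["end_date"] for r in rows), "start_date after end_date"


# Metrics variance check: today's value should stay within a tolerance of the
# trailing average (seasonality handling omitted for brevity).
def check_metric_variance(history: list[float], today: float, tolerance: float = 0.5) -> None:
    baseline = sum(history) / len(history)
    assert abs(today - baseline) <= tolerance * baseline, "metric moved more than expected"


# Reconciliation check: output row count matches the driving input table.
def check_reconciliation(input_row_count: int, output_row_count: int) -> None:
    assert input_row_count == output_row_count, (
        f"row count mismatch: input={input_row_count}, output={output_row_count}"
    )
```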
The next one is functional programming patterns for pipelines. When you write your pipelines as functions, especially the transformation part, it reflects how humans think. When we think of data pipelines, we think take table A, take table B, join them. We think in a series of steps. Functional programming enables us to represent that in code more succinctly.
It ideally should be atomic, but the definition of atomic is a little tricky. It basically means it should do one thing and only one thing, but that one thing you need to decipher yourself based on your code base and your team's expertise.
Finally, a key part is that your functions, especially the transformation and loading functions, should be idempotent. What that means is no matter how many times you run the function with the same input, it should not result in duplicate data or partial data. This is typically done by overwriting partitions.
If you don't have this, any issues will require you to clean up the output and all the outputs generated as part of your pipeline and then restart it. It becomes an extremely tedious process. However, if you make your pipeline idempotent, you don't have to worry about that. You can just rerun it as many times as you want without duplicate data.
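A small sketch of what idempotency via partition overwrite can look like, using made-up local paths; with Spark or a lakehouse table format you would rely on their overwrite modes, but the principle is the same: a rerun replaces its partition rather than appending to it.

```python
# Idempotent load sketch: each run fully overwrites its output partition,
# so rerunning with the same input can never produce duplicates or leftovers.
# The base path and partitioning scheme here are hypothetical.
import csv
import shutil
from pathlib import Path

BASE = Path("/tmp/daily_orders")


def load_partition(run_date: str, rows: list[dict]) -> None:
    partition = BASE / f"date={run_date}"
    if partition.exists():
        shutil.rmtree(partition)  # drop any previous (possibly partial) output
    partition.mkdir(parents=True)
    with open(partition / "part-0.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)


rows = [{"order_id": 1, "amount": 25.0}]
load_partition("2024-11-28", rows)
load_partition("2024-11-28", rows)  # rerun: same output, no duplicate data
```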
Adam - 00:06:50
Joseph, real quick. Can I step back? Do you mind putting your presentation in full presentation mode?
Joseph Machado - 00:06:59
Sure.
Adam - 00:07:00
I mean full screen the slides, maybe the slideshow button at the top right?
Joseph Machado - 00:07:10
Oh, there we go. Sorry. Is that better?
Adam - 00:07:13
Thanks.
Joseph Machado - 00:07:14
Yeah, sorry about that. Thank you. I wanted to show a quick screenshot of functional programming, or as functional as you can get in Python. Here, you can see how we can combine transformations in order, which also makes them easy to test and easy to debug, especially if they're idempotent. You can just whip up a REPL and test one function at a time. It makes it so much easier to debug and also to rerun, especially with backfills.
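Something along the lines of that slide, sketched with made-up transformations; each step is a small pure function you can test on its own in a REPL.

```python
# Composing small, pure transformation functions. The data and the steps are
# hypothetical; the point is that each step is independently testable and rerunnable.
from functools import reduce
from typing import Callable

Row = dict
Transform = Callable[[list[Row]], list[Row]]


def drop_cancelled(rows: list[Row]) -> list[Row]:
    return [r for r in rows if r["status"] != "cancelled"]


def add_tax(rows: list[Row]) -> list[Row]:
    return [{**r, "amount_with_tax": round(r["amount"] * 1.08, 2)} for r in rows]


def run_pipeline(rows: list[Row], steps: list[Transform]) -> list[Row]:
    # Apply each transformation in order: steps[n](...steps[1](steps[0](rows))...)
    return reduce(lambda acc, step: step(acc), steps, rows)


orders = [
    {"order_id": 1, "status": "shipped", "amount": 25.0},
    {"order_id": 2, "status": "cancelled", "amount": 40.0},
]
print(run_pipeline(orders, [drop_cancelled, add_tax]))
```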
Then LLMs for coding, also known back in the day as Googling and copy-pasting from Stack Overflow. When you are using LLMs for coding, you need to be very clear on what you want done and how you want it done. If you do not specify, it will use a very unconventional approach or one that is not in line with your organization's conventions. Also, please make sure to review the code before using it.
I want to go over a specific example. I wrote a script a few hours ago where I wanted to create a Python script to generate TPCH data. TPCH is just a benchmarking data set. Most companies use it to benchmark their query performance, and there are utilities to do this. When I gave a prompt just saying what I needed to do, it wrote this huge script, like 300 lines, and it didn't work.
However, when I specified what and how on the left-hand side, I said do this, use these tools, accept these as input parameters. It produced perfectly working code, which I reviewed and it did run. So always specify the what and how, especially when you're doing auto generation.
The next one is testing your code. Make sure to test your code. It might take a few hours, but I can assure you it will save you ten to hundreds of times that in debugging and maintenance, especially if it's a pipeline that runs frequently. Tests also serve as working documentation because they represent the reality of your code.
When you write tests, don't just test for the happy path. Test for at least one happy path and multiple failure paths that will show the next person, six months from now, what not to do with a function. For advanced use cases, if you have multiple tests, run them in parallel. They are usually independent.
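A tiny pytest example in that spirit, testing a hypothetical transformation with one happy path and two failure paths; running independent tests in parallel can be done with a plugin like pytest-xdist.

```python
# test_add_tax.py -- hypothetical example: one happy path, multiple failure paths.
import pytest


def add_tax(rows: list[dict]) -> list[dict]:
    if any(r["amount"] < 0 for r in rows):
        raise ValueError("amount cannot be negative")
    return [{**r, "amount_with_tax": round(r["amount"] * 1.08, 2)} for r in rows]


def test_add_tax_happy_path():
    assert add_tax([{"amount": 100.0}]) == [{"amount": 100.0, "amount_with_tax": 108.0}]


def test_add_tax_rejects_negative_amount():
    with pytest.raises(ValueError):
        add_tax([{"amount": -1.0}])


def test_add_tax_fails_loudly_on_missing_column():
    # Documents for the next person that rows without "amount" are not supported.
    with pytest.raises(KeyError):
        add_tax([{"price": 10.0}])
```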
Finally, automate repetitive tasks. If you find yourself doing the same thing over and over again, make sure that you automate it because you are prone to make mistakes and you could save a lot of time. Two of my favorite ways are makefiles and async scripts.
Makefiles are essentially a way where you can alias long commands with simple commands and you can chain those commands. It's a way of aliasing complex commands and it makes things easy to run.
Async coding: what I generally do is if I am running a pipeline test and it takes like five minutes, I'm not going to stay on the terminal and see what runs and what doesn't. I just run it in the background and do my own thing. It will notify me when it's done. You can think of this as a pipeline as well. When you deploy a pipeline and it builds and takes a long time, you just run it in the background and you will be notified when it's done.
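A rough sketch of that background-run habit in Python, with a placeholder command standing in for a slow pipeline or test run; the notification here is just a terminal bell and a print.

```python
# Hypothetical helper: kick off a long-running command in the background,
# keep working, and get notified when it finishes.
import subprocess
import threading


def run_in_background(cmd: list[str]) -> threading.Thread:
    def _run() -> None:
        result = subprocess.run(cmd, capture_output=True, text=True)
        status = "succeeded" if result.returncode == 0 else "failed"
        print(f"\a[notify] {' '.join(cmd)} {status}")  # \a rings the terminal bell

    thread = threading.Thread(target=_run)
    thread.start()
    return thread


if __name__ == "__main__":
    job = run_in_background(["sleep", "5"])  # stand-in for a slow pipeline test
    print("Doing other work while the job runs...")
    job.join()
```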
Finally, a lot of these problems, not the tech part but the approach and design part, have already been solved. I recommend you read these three books: The Data Warehouse Toolkit, Designing Data-Intensive Applications, and Fluent Python. The concepts in these books still hold. The tech has gotten so much better with cheap storage and table formats, but when you adopt these underlying foundational concepts, it will make your pipelines more resilient and you won't have to deal with so many issues when you're on call.
That was it. Do we have time for questions?
Adam - 00:11:53
First of all, this was excellent. Thank you very much. You can see in the chat, the chat is bustling with activity and excitement. Joseph, thank you very much for that. We don't really have time, but lots of people want the slides and people have been sharing the podcast that you recently appeared on.
I'll just ask maybe one last thing here. Miguel is asking, what would you say is the optimal balance between data completeness and data availability? How do you mitigate sacrifices made when wanting to serve data as real time as possible?
Joseph Machado - 00:12:26
Again, I'm going to say the most computer science answer possible: it depends. But I'll give you a specific example. If you're dealing with revenue or finance stuff, you want data completeness. However, if you're dealing with directional numbers, so think of charts that just go like this, you are typically fine with sacrificing completeness.
So it's a balance. The key thing is it's good that you're asking the question because that needs to be brought up with the stakeholder, and you guys need to make that decision together.
Adam - 00:12:59
Joseph, thank you very much for joining us. Everybody wants the slides. We will send them and package them into the videos once they're all released. For people tuning in, I will just say, if I had had this slideshow or listened to your presentation 10 years ago, it would've saved me probably years of frustration and agony.
Joseph Machado - 00:13:22
And you learned that the hard way.
Adam - 00:13:24
Oh my goodness. So many of these things. First of all, it is a brilliant distillation of wisdom that feels hard won. I hear the pain that is underlying all of this. I share some of it, and for anybody who could use these insights to avoid some of this pain, Joseph, thank you very much for helping them.
Joseph Machado - 00:13:52
Thank you for having me. Have a good rest of your day.