Moving fast and not causing chaos
Data engineering teams often struggle to balance speed with stability, creating friction between innovation and reliability. This talk explores how to strategically adapt software engineering best practices specifically for data environments, addressing unique challenges like unpredictable data quality and complex dependencies. Through practical examples and a detailed case study, we'll demonstrate how properly implemented testing, versioning, observability, and incremental deployment patterns enable data teams to move quickly without sacrificing stability. Attendees will leave with a concrete roadmap for implementing these practices in their organizations, allowing their teams to build and ship with both speed and confidence.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Adam - 00:00:06
And we have Joseph with us. Can you come onto the stage? Can you?
Joseph Machado - 00:00:11
Hey, thanks for having me, Adam.
Adam - 00:00:14
How you doing there, Joseph? Where are you calling in from?
Joseph Machado - 00:00:17
Jersey City, New Jersey.
Adam - 00:00:18
Jersey City. We're not too far away.
Joseph Machado - 00:00:21
Not too far away.
Adam - 00:00:24
No, I'm in New York, so I'm not too far.
Joseph Machado - 00:00:26
Okay.
Adam - 00:00:27
Gotcha. The other folks are a little bit further. I'm gonna give you the stage and without further ado, Joseph, take it away.
Joseph Machado - 00:00:37
Sounds good. Thank you. Hello everyone. My name is Joseph. I'm a senior data engineer over at Netflix. I wanted to take this time to talk about some of the things that I've learned in my career over the past decade or so. Whenever I have ignored these things, it has come back to bite me, hence the name "not causing chaos."
The first, most important thing as a data engineer is data modeling. Your data model is your product. Without a good data model, no amount of optimization or fancy tools will help because that is what your end user is going to consume. When we talk about data model, there are a lot of nuances, but at a high level, there are primarily five types of tables.
The fact tables represent actual events that happen in real life, like an order or some event on a website. Those are considered facts. It's something that happens. Then there are dimensions which represent business entities, so consumer, product, seller, merchant, etc. These are the people who interact with the business. It can also be things like date.
Then there are bridge tables, which are mapping tables; many-to-many relationship mappings are stored in the bridge tables. A more recent one is the one big table, where you typically take a fact table and left join every dimension referenced in that fact. What this allows you to do is serve it as a single table, so people don't have to repeat the joins every time they query it. We're basically sacrificing space to improve performance.
Finally, there's the summary table. You can think of it as data marts. These are essentially the final stage or final layer of tables that you present to end users. Typically, people who use this are using it via a BI tool so that when they access the summary table, they're not incurring costs because it's already aggregated.
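To make those shapes concrete, here is a minimal sketch of a fact table, a dimension, and a one big table built from them. The table and column names are made up for illustration, and DuckDB is used only because it runs in-process with no setup.

```python
# A toy illustration of fact, dimension, and "one big table" shapes.
# Table and column names are hypothetical; DuckDB is just a convenient in-process engine.
import duckdb

con = duckdb.connect()

# Dimension: business entities (here, customers).
con.execute("""
    CREATE TABLE dim_customer AS
    SELECT * FROM (VALUES
        (1, 'Alice', 'NJ'),
        (2, 'Bob',   'NY')
    ) AS t(customer_id, customer_name, state)
""")

# Fact: events that happened (here, orders), keyed to dimensions.
con.execute("""
    CREATE TABLE fct_orders AS
    SELECT * FROM (VALUES
        (100, 1, DATE '2024-11-28', 25.00),
        (101, 2, DATE '2024-11-29', 40.00)
    ) AS t(order_id, customer_id, order_date, amount)
""")

# "One big table": the fact left-joined to every dimension, trading storage
# for simpler, join-free downstream queries.
one_big_table = con.sql("""
    SELECT f.*, d.customer_name, d.state
    FROM fct_orders f
    LEFT JOIN dim_customer d USING (customer_id)
""")
print(one_big_table.fetchall())
```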
The next most important thing is data quality. We always want to make sure our data is valid before stakeholders use that data, because unlike backend engineering or other parts of software, once our data is out and people start using it, it's hard to undo the actions business users have already taken based on incorrect data.
When there are too many data quality issues, it causes a loss of trust by the stakeholders. Then you get questions like, "Hey, is this data accurate? I don't think so," etc. On the other hand, you also don't want too many data quality checks because you'll get a lot of alerts and your on-call engineers are going to suffer. Finding the balance is key.
As for implementation, you typically want to do something called a write-audit-publish pattern where you create your data, but before you expose it to your consumers, you do your data checks. Only if it passes do you expose it. If it doesn't, you raise an alert and fix that issue.
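A bare-bones sketch of write-audit-publish in plain Python, with hypothetical staging and published paths and a placeholder check; real pipelines usually implement this on top of a table format's staging or branching features, but the shape is the same.

```python
# Write-audit-publish sketch: write to a staging location, run checks,
# and only promote to the published location if the checks pass.
# Paths and the audit rule are hypothetical placeholders.
import shutil
from pathlib import Path

STAGING = Path("/tmp/orders_staging")
PUBLISHED = Path("/tmp/orders_published")


def write(rows: list[dict]) -> None:
    STAGING.mkdir(parents=True, exist_ok=True)
    (STAGING / "part-0.csv").write_text(
        "\n".join(f"{r['order_id']},{r['amount']}" for r in rows)
    )


def audit() -> bool:
    # Example check: staged data exists and has no negative amounts.
    staged = (STAGING / "part-0.csv").read_text().splitlines()
    return len(staged) > 0 and all(float(line.split(",")[1]) >= 0 for line in staged)


def publish() -> None:
    if PUBLISHED.exists():
        shutil.rmtree(PUBLISHED)
    shutil.copytree(STAGING, PUBLISHED)


write([{"order_id": 1, "amount": 25.0}, {"order_id": 2, "amount": 40.0}])
if audit():
    publish()
else:
    # Raise an alert instead of exposing bad data to consumers.
    raise ValueError("Audit failed; data not published")
```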
I want to quickly touch upon some types of data quality checks. The standard ones are table constraint checks. These are checks where you make sure that your columns don't have nulls, they're unique, they have all values populated, etc. Then there are business rule checks. These are specific to the type of data you're working with, like start date cannot be after end date.
Then there are schema checks so you don't break downstream processes. The metrics variance check is important: you monitor key metrics over time and make sure they don't vary too much. If you see an outlier, that's something to look at, but you do have to account for seasonality. For example, on days like Thanksgiving, there might be a spike in revenue.
Finally, there's a reconciliation check. This type of check is typically used when you want to make sure that you're not losing any data due to any code bug. You take your output and compare the number of rows against the input driving table. Typically, you'll have one driving table in your inputs.
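A few of these check types sketched as plain Python assertions, with hypothetical column names; in practice they would usually live in a data quality framework, but the underlying logic is this simple.

```python
# Hypothetical examples of the check types above, written as plain assertions.

# Table constraint check: key column is non-null and unique.
def check_constraints(rows: list[dict]) -> None:
    keys = [r["order_id"] for r in rows]
    assert all(k is not None for k in keys), "order_id contains nulls"
    assert len(keys) == len(set(keys)), "order_id is not unique"


# Business rule check: start date cannot be after end date.
def check_business_rules(rows: list[dict]) -> None:
    assert all(r["start_date"] <= r["end_date"] for r in rows), "start_date after end_date"


# Metrics variance check: today's value should stay within a tolerance of the
# trailing average (seasonality handling omitted for brevity).
def check_metric_variance(history: list[float], today: float, tolerance: float = 0.5) -> None:
    baseline = sum(history) / len(history)
    assert abs(today - baseline) <= tolerance * baseline, "metric moved more than expected"


# Reconciliation check: output row count matches the driving input table.
def check_reconciliation(input_row_count: int, output_row_count: int) -> None:
    assert input_row_count == output_row_count, (
        f"row count mismatch: input={input_row_count}, output={output_row_count}"
    )
```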
The next one is functional programming patterns for pipelines. When you write your pipelines as functions, especially the transformation part, it reflects how humans think. When we think of data pipelines, we think take table A, take table B, join them. We think in a series of steps. Functional programming enables us to represent that in code more succinctly.
It ideally should be atomic, but the definition of atomic is a little tricky. It basically means it should do one thing and only one thing, but that one thing you need to decipher yourself based on your code base and your team's expertise.
Finally, a key part is that your functions, especially the transformation and loading functions, should be idempotent. What that means is no matter how many times you run the function with the same input, it should not result in duplicate data or partial data. This is typically done by overwriting partitions.
If you don't have this, any issues will require you to clean up the output and all the outputs generated as part of your pipeline and then restart it. It becomes an extremely tedious process. However, if you make your pipeline idempotent, you don't have to worry about that. You can just rerun it as many times as you want without duplicate data.
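A small sketch of what idempotency via partition overwrite can look like, using made-up local paths; with Spark or a lakehouse table format you would rely on their overwrite modes, but the principle is the same: a rerun replaces its partition rather than appending to it.

```python
# Idempotent load sketch: each run fully overwrites its output partition,
# so rerunning with the same input can never produce duplicates or leftovers.
# The base path and partitioning scheme here are hypothetical.
import csv
import shutil
from pathlib import Path

BASE = Path("/tmp/daily_orders")


def load_partition(run_date: str, rows: list[dict]) -> None:
    partition = BASE / f"date={run_date}"
    if partition.exists():
        shutil.rmtree(partition)  # drop any previous (possibly partial) output
    partition.mkdir(parents=True)
    with open(partition / "part-0.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)


rows = [{"order_id": 1, "amount": 25.0}]
load_partition("2024-11-28", rows)
load_partition("2024-11-28", rows)  # rerun: same output, no duplicate data
```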
Adam - 00:06:50
Joseph, real quick. Can I step back? Do you mind putting your presentation in full presentation mode?
Joseph Machado - 00:06:59
Sure.
Adam - 00:07:00
I mean full screen the slides, maybe the slideshow button at the top right?
Joseph Machado - 00:07:10
Oh, there we go. Sorry. Is that better?
Adam - 00:07:13
Thanks.
Joseph Machado - 00:07:14
Yeah, sorry about that. Thank you. I wanted to show a quick screenshot of functional programming, or as functional as you can get in Python. Here, you can see how we can combine transformations in order, which also makes them easy to test and easy to debug, especially if they're idempotent. You can just whip up a REPL and test one function at a time. It makes it so much easier to debug and also to rerun, especially with backfills.
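Something along the lines of that slide, sketched with made-up transformations; each step is a small pure function you can test on its own in a REPL.

```python
# Composing small, pure transformation functions. The data and the steps are
# hypothetical; the point is that each step is independently testable and rerunnable.
from functools import reduce
from typing import Callable

Row = dict
Transform = Callable[[list[Row]], list[Row]]


def drop_cancelled(rows: list[Row]) -> list[Row]:
    return [r for r in rows if r["status"] != "cancelled"]


def add_tax(rows: list[Row]) -> list[Row]:
    return [{**r, "amount_with_tax": round(r["amount"] * 1.08, 2)} for r in rows]


def run_pipeline(rows: list[Row], steps: list[Transform]) -> list[Row]:
    # Apply each transformation in order: steps[n](...steps[1](steps[0](rows))...)
    return reduce(lambda acc, step: step(acc), steps, rows)


orders = [
    {"order_id": 1, "status": "shipped", "amount": 25.0},
    {"order_id": 2, "status": "cancelled", "amount": 40.0},
]
print(run_pipeline(orders, [drop_cancelled, add_tax]))
```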
Then LLMs for coding, also known back in the day as Googling and copy-pasting from Stack Overflow. When you are using LLMs for coding, you need to be very clear on what you want done and how you want it done. If you do not specify, it will use a very unconventional approach or one that is not in line with your organization's conventions. Also, please make sure to review the code before using it.
I want to go over a specific example. I wrote a script a few hours ago where I wanted to create a Python script to generate TPCH data. TPCH is just a benchmarking data set. Most companies use it to benchmark their query performance, and there are utilities to do this. When I gave a prompt just saying what I needed to do, it wrote this huge script, like 300 lines, and it didn't work.
However, when I specified what and how on the left-hand side, I said do this, use these tools, accept these as input parameters. It produced perfectly working code, which I reviewed and it did run. So always specify the what and how, especially when you're doing auto generation.
The next one is testing your code. Make sure to test your code. It might take a few hours, but I can assure you it will save you ten to hundreds of times that in debugging and maintenance, especially if it's a pipeline that runs frequently. Tests also serve as working documentation because they represent the reality of your code.
When you write tests, don't just test for the happy path. Test for at least one happy path and multiple failure paths that will show the next person, six months from now, what not to do with a function. For advanced use cases, if you have multiple tests, run them in parallel. They are usually independent.
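A tiny pytest example in that spirit, testing a hypothetical transformation with one happy path and two failure paths; running independent tests in parallel can be done with a plugin like pytest-xdist.

```python
# test_add_tax.py -- hypothetical example: one happy path, multiple failure paths.
import pytest


def add_tax(rows: list[dict]) -> list[dict]:
    if any(r["amount"] < 0 for r in rows):
        raise ValueError("amount cannot be negative")
    return [{**r, "amount_with_tax": round(r["amount"] * 1.08, 2)} for r in rows]


def test_add_tax_happy_path():
    assert add_tax([{"amount": 100.0}]) == [{"amount": 100.0, "amount_with_tax": 108.0}]


def test_add_tax_rejects_negative_amount():
    with pytest.raises(ValueError):
        add_tax([{"amount": -1.0}])


def test_add_tax_fails_loudly_on_missing_column():
    # Documents for the next person that rows without "amount" are not supported.
    with pytest.raises(KeyError):
        add_tax([{"price": 10.0}])
```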
Finally, automate repetitive tasks. If you find yourself doing the same thing over and over again, make sure that you automate it because you are prone to make mistakes and you could save a lot of time. Two of my favorite ways are makefiles and async scripts.
Makefiles are essentially a way where you can alias long commands with simple commands and you can chain those commands. It's a way of aliasing complex commands and it makes things easy to run.
Async coding: what I generally do is if I am running a pipeline test and it takes like five minutes, I'm not going to stay on the terminal and see what runs and what doesn't. I just run it in the background and do my own thing. It will notify me when it's done. You can think of this as a pipeline as well. When you deploy a pipeline and it builds and takes a long time, you just run it in the background and you will be notified when it's done.
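A rough sketch of that background-run habit in Python, with a placeholder command standing in for a slow pipeline or test run; the notification here is just a terminal bell and a print.

```python
# Hypothetical helper: kick off a long-running command in the background,
# keep working, and get notified when it finishes.
import subprocess
import threading


def run_in_background(cmd: list[str]) -> threading.Thread:
    def _run() -> None:
        result = subprocess.run(cmd, capture_output=True, text=True)
        status = "succeeded" if result.returncode == 0 else "failed"
        print(f"\a[notify] {' '.join(cmd)} {status}")  # \a rings the terminal bell

    thread = threading.Thread(target=_run)
    thread.start()
    return thread


if __name__ == "__main__":
    job = run_in_background(["sleep", "5"])  # stand-in for a slow pipeline test
    print("Doing other work while the job runs...")
    job.join()
```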
Finally, a lot of these problems, not the tech part but the approach and design part, have already been solved. I recommend you read these three books: The Data Warehouse Toolkit, Designing Data-Intensive Applications, and Fluent Python. The concepts in these books still hold. The tech has gotten so much better with cheap storage and table formats, but when you adopt these underlying foundational concepts, it will make your pipelines more resilient and you won't have to deal with so many issues when you're on call.
That was it. Do we have time for questions?
Adam - 00:11:53
First of all, this was excellent. Thank you very much. You can see in the chat, the chat is bustling with activity and excitement. Joseph, thank you very much for that. We don't really have time, but lots of people want the slides and people have been sharing the podcast that you recently appeared on.
I'll just ask maybe one last thing here. Miguel is asking, what would you say is the optimal balance between data completeness and data availability? How do you mitigate sacrifices made when wanting to serve data as real time as possible?
Joseph Machado - 00:12:26
Again, I'm going to say the most computer science answer possible: it depends. But I'll give you a specific example. If you're dealing with revenue or finance stuff, you want data completeness. However, if you're dealing with directional numbers, so think of charts that just go like this, you are typically fine with sacrificing completeness.
So it's a balance. The key thing is it's good that you're asking the question because that needs to be brought up with the stakeholder, and you guys need to make that decision together.
Adam - 00:12:59
Joseph, thank you very much for joining us. Everybody wants the slides. We will send them and package them into the videos once they're all released. For people tuning in, I will just say, if I had had this slideshow or listened to your presentation 10 years ago, it would've saved me probably years of frustration and agony.
Joseph Machado - 00:13:22
And you learned that the hard way.
Adam - 00:13:24
Oh my goodness. So many of these things. First of all, it is a brilliant distillation of wisdom that feels hard won. I hear the pain that is underlying all of this. I share some of it, and for anybody who could use these insights to avoid some of this pain, Joseph, thank you very much for helping them.
Joseph Machado - 00:13:52
Thank you for having me. Have a good rest of your day.