Cross-paradigm compute engine for AI/ML data
AI/ML systems require real-time information derived from many data sources; this context is needed to create prompts and features, and the most successful models draw it from a vast number of sources.
To power this, engineers must manually split their logic across various data-processing "paradigms" - stream processing, batch processing, embedding generation, and inference services.
Today, practitioners spend tremendous effort stitching together disparate technologies to power *each* piece of context.
While at Airbnb, we created a system to automate the data and systems engineering required to power AI models, both for training/fine-tuning and for online inference.
It is deployed in critical ML pathways and actively developed by Stripe, Uber, OpenAI and Roku (in addition to Airbnb).
In this talk I will cover use cases, an overview of the Chronon project, and future directions.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Speaker 1 00:00:06
The next talk for us is Nikhil. Nikhil, are you around? Can you hear us?
Speaker 2 00:00:13
Hey Adam, I can hear you. Can you guys hear me?
Speaker 1 00:00:15
How you doing? And where are you dialing in from?
Speaker 2 00:00:19
I'm dialing in from East Bay. Um, Fremont.
Speaker 1 00:00:22
From Fremont, nice. We'll be there in just a few days for another big conference coming up. So Nikhil, I'm going to clear the stage and step down. I'll be back in 10 minutes. Nikhil, the floor is yours.
Speaker 2 00:00:37
Hello everyone. Thanks for taking the time today. I'm going to talk to you about Chronon. Chronon is a data platform we built for ML and AI use cases while at Airbnb, jointly with Stripe, and we open sourced it last year. A little bit about me: I'm Nikhil. I worked on ML infra at Airbnb for a long time. Before that, I was doing stream processing-related work at Facebook; we built our own stream processing engines while there. And before that I was doing ML infra again, back at Amazon and WalmartLabs. Currently we have founded a company called Zipline AI to help people use Chronon. So what is Chronon? Chronon serves two main purposes: it turns raw data into training data, and it helps serve features. And because there is a lot of commonality in how people compute metrics, it also gets used, almost accidentally, for generating offline and online metrics.
Speaker 2 00:01:41
So we have early adoption from a few companies: contributors, evaluators, and adopters. It's been only a year, so it's really early adoption, and we work with a few of them to get their deployments into production. So what do people use Chronon for? The main one is predictive machine learning use cases: search indexing and ranking, ads ranking, feed personalization, fraud and abuse prevention (both monetary abuse and things like hate speech), and also personalizing marketing material and doing pricing. So there's a wide variety of use cases. This is what it has traditionally been used for, but in the last two or three years we have seen a growing number of use cases built around LLMs and generative AI, such as customer support, which used to be traditionally predictive ML but is now more generative-AI oriented.
Speaker 2 00:02:46
It's also used for creating virtual assistants, both for shopping, travel, et cetera. It's used for rule engines: sometimes you don't have enough time to retrain a model when new fraud patterns occur, so you want people to come up with heuristics that run for a day or two and filter traffic out. That's where rule engines come into play: you give your heuristic to the rule engine and it filters traffic out. And it's also used for user-facing metrics, like I mentioned: high-traffic landing pages, essentially listing ratings or item ratings. You can imagine the amount of traffic these pages see is quite high, and you want these ratings to be updated in real time. And then there are regular offline business metrics, like Customer 360 and Listings 360.
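To make the rule-engine idea concrete, here is a toy sketch of analyst-authored heuristics filtering traffic while a model awaits retraining. The rule format, feature names, and thresholds are all invented for illustration; the talk doesn't describe Chronon's actual rule-engine interface.

```python
# Toy rule engine: declarative heuristics filter traffic until the
# model can be retrained. Rule shape and feature names are hypothetical.
RULES = [
    {"name": "new_account_burst",
     "when": lambda f: f["account_age_days"] < 2 and f["txn_count_1h"] > 20,
     "action": "block"},
    {"name": "mismatched_geo",
     "when": lambda f: f["ip_country"] != f["card_country"],
     "action": "review"},
]

def evaluate(features: dict):
    # First matching rule wins; the default is to let the request through.
    for rule in RULES:
        if rule["when"](features):
            return rule["action"], rule["name"]
    return "allow", None

print(evaluate({"account_age_days": 1, "txn_count_1h": 35,
                "ip_country": "US", "card_country": "US"}))
# -> ('block', 'new_account_burst')
```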
Speaker 2 00:03:37
So why do people use Chronon? The crux of the Chronon engine is something called incrementalization. To give you an intuition for what incrementalization is: let's say you want to compute the average rating of a listing over the last 90 days. You compute it today, and you want to compute it again tomorrow. There are about 89 days of overlap between the two windows, so instead of aggregating 90 days today and another 90 days tomorrow, you can incrementalize the computation and save roughly 45x of the work, or up to 90x in extreme cases. That's the crux of it, and it's what makes the Chronon engine scale to generating training data and serving features at large volume. And it's unified, meaning you write the feature definition once, and it creates both training data and an online serving endpoint from that single definition.
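To make the incrementalization intuition concrete, here is a toy sliding-window average that does roughly constant work per day, adding the newest day and evicting the oldest instead of re-aggregating all 90 days. This is an illustrative sketch, not Chronon's actual engine.

```python
from collections import deque

# Toy incrementalized 90-day average: ~2 days of work per update
# instead of re-aggregating the full 90-day window each day.
class SlidingAverage:
    def __init__(self, window_days=90):
        self.window_days = window_days
        self.days = deque()        # (day_index, daily_sum, daily_count)
        self.total_sum = 0.0
        self.total_count = 0

    def add_day(self, day, daily_sum, daily_count):
        self.days.append((day, daily_sum, daily_count))
        self.total_sum += daily_sum
        self.total_count += daily_count
        # Evict days that have fallen out of the 90-day window.
        while self.days and self.days[0][0] <= day - self.window_days:
            _, old_sum, old_count = self.days.popleft()
            self.total_sum -= old_sum
            self.total_count -= old_count

    def average(self):
        return self.total_sum / self.total_count if self.total_count else None

agg = SlidingAverage()
for day in range(200):                  # simulate 200 days of ratings
    agg.add_day(day, daily_sum=4.2 * 10, daily_count=10)
print(agg.average())                    # -> 4.2
```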
Speaker 2 00:04:44
And it makes sure that the data being served is fresh: if you have event streams, it incorporates data from the event stream into the endpoints. It's also very pluggable, so you can plug in your event streams (Kafka, Pub/Sub) or your warehouse, which could be on Iceberg, traditional Hive, or Hudi, or on Google BigQuery, and we'll be able to pull data from it and transform it into both training data and features for serving. Under the hood we are built on a bunch of technologies, some open source, some not. We connect to these systems and pull data from them, and Chronon implements its incrementalization algorithms over two engines: Apache Spark for batch processing and Flink for stream processing. It stores and indexes online data in key-value stores; the connector interface is open, so anyone can implement a connector to any database, but we have implemented connectors for these three.
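As an illustration of "write the definition once," here is a sketch of a GroupBy for the 90-day average rating from earlier, loosely following the open-source Chronon Python API; the exact module paths and field names, and the table/topic names used here, are assumptions and may differ between versions.

```python
# Import paths follow Chronon's documented Python API but may vary by version.
from ai.chronon.api.ttypes import Source, EventSource
from ai.chronon.query import Query, select
from ai.chronon.group_by import (
    GroupBy, Aggregation, Operation, Window, TimeUnit,
)

# Ratings events, read from the warehouse for backfills and from a
# stream for freshness (table/topic names are hypothetical).
source = Source(
    events=EventSource(
        table="data.listing_ratings",
        topic="events.listing_ratings",
        query=Query(
            selects=select("listing_id", "rating"),
            time_column="ts",
        ),
    )
)

# One declarative definition drives both the training-data backfill
# and the online, stream-updated serving endpoint.
avg_rating_90d = GroupBy(
    sources=[source],
    keys=["listing_id"],
    online=True,
    aggregations=[
        Aggregation(
            input_column="rating",
            operation=Operation.AVERAGE,
            windows=[Window(length=90, timeUnit=TimeUnit.DAYS)],
        )
    ],
)
```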
Speaker 2 00:06:02
All right, so how is it related to Zipline? Zipline is a bring-your-own-cloud offering of the Chronon platform. We have things that make it production grade, like built-in ML observability and governance (I'm going to talk a bit more about these today) and other things like experiment isolation. In ML you want a very fast experimentation loop when you come up with new features, and you want to isolate these experiments without affecting production. The other thing is compute sharing: a lot of the time these experiments do need to share compute, so we want isolation but also want to share as much as we can. I'll talk a bit more about this later as well. We are integrated with some of the ML platforms out there, like Vertex AI and SageMaker, and we do native embedding generation. That's the delta between Chronon and Zipline.
Speaker 2 00:06:57
So, about governance. The first thing is compliance; the main question that gets asked here is: which models depend on a given column? Then there is another aspect of governance, lifecycle management, which asks: what models depend on a given column that I'm about to deprecate or change? These questions are awfully similar, and what is needed to answer them is a global column lineage graph, which basically says: this column's data flows across all of these pipelines and ends up in a model, tracing from the source of the data all the way to the endpoint serving the model, regardless of how many stages there are in between. One thing that makes Chronon really effective at doing this is that it's declarative: there is no black-box Python or Java code that we allow; it's dataframe-like at every stage. And because it's dataframe-like, we can extract the lineage statically, so users don't have to tell us what the lineage is; we can pull it out of the feature definitions they have already written.
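To illustrate the idea of static lineage extraction from declarative definitions, here is a toy sketch that scans the expressions of dataframe-like stages to build a column lineage graph; the stage format and column names are invented for illustration and are not Chronon's internal representation.

```python
import re
from collections import defaultdict

# Toy declarative stages: each output column is an expression over the
# input's columns. (Invented format, not Chronon's internals.)
stages = [
    {"output": "stg_ratings", "input": "raw_events",
     "selects": {"listing_id": "listing_id",
                 "rating": "CAST(score AS DOUBLE)"}},
    {"output": "feat_listing", "input": "stg_ratings",
     "selects": {"avg_rating_90d": "AVG(rating)"}},
]

def referenced_columns(expr):
    # Crude: lowercase identifiers are column references; uppercase SQL
    # keywords and functions (CAST, AVG, ...) are skipped by the pattern.
    return set(re.findall(r"\b[a-z_][a-z0-9_]*\b", expr))

# (table, column) -> set of upstream (table, column) pairs
lineage = defaultdict(set)
for stage in stages:
    for out_col, expr in stage["selects"].items():
        for in_col in referenced_columns(expr):
            lineage[(stage["output"], out_col)].add((stage["input"], in_col))

def upstream(node, seen=None):
    # Answers "which source columns feed this model input?"
    seen = seen if seen is not None else set()
    for parent in lineage.get(node, ()):
        if parent not in seen:
            seen.add(parent)
            upstream(parent, seen)
    return seen

print(upstream(("feat_listing", "avg_rating_90d")))
# -> {('stg_ratings', 'rating'), ('raw_events', 'score')}
```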
Speaker 2 00:08:11
This is roughly what gets pulled out: we have tables, and columns in between, and we know how columns are connected, what filters get applied, et cetera. And we can create this graph globally, across every source of data and every model. The other piece is observability. ML and AI observability is pretty tricky; it's a large area, I would say much larger than the core work of building features and prompts itself. There are many metrics here. One is prediction drift, which measures how predictions have drifted relative to the past. Similarly, there's label drift and feature drift; the point of observing these is to know if something is going wrong with the model. Then there is feature importance, which determines the influence of a given feature on a decision. If you want to analyze why a model predicted a certain way, this is very useful, so that when someone calls and says, "Hey, I'm banned, why is that?", people can look at the model <inaudible> and say, okay, this is probably why. And the other one is model decay: as real-world
Speaker 2 00:09:28
preferences of users change, labels and predictions no longer match, and this is what we call model decay. And then there is consistency, which means the difference between training data and serving data. These are all orthogonal, so we need to measure all of them independently to know whether an ML system is healthy or not. The difficulty is that there are thousands of features per model, and each feature has many data-quality metrics around it; multiply those out and that's a lot of metrics. Building them is usually harder than building the feature pipelines themselves, so most often they don't get built. Another advantage Chronon has here is that it's declarative, which means we can derive what needs to be measured, auto-compute these metrics, and automate the observability of all of these pipelines. This is roughly what it looks like: you just write your feature definitions and we generate the drift metrics and all the other metrics I showed you before.
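As one concrete example of a drift metric, here is a sketch of the population stability index (PSI) comparing a feature's current distribution against a baseline window. The talk doesn't name the specific metrics Chronon computes, so this is an illustrative choice, with made-up bucket counts.

```python
import math

def psi(baseline_counts, current_counts, eps=1e-6):
    """Population stability index between two histograms over the same
    buckets; higher means more drift (0.2+ is a common alert threshold)."""
    b_total, c_total = sum(baseline_counts), sum(current_counts)
    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        b_frac = max(b / b_total, eps)   # eps guards against empty buckets
        c_frac = max(c / c_total, eps)
        score += (c_frac - b_frac) * math.log(c_frac / b_frac)
    return score

# Made-up rating-bucket counts: last month vs. this week.
baseline = [120, 340, 500, 280, 60]
current = [80, 300, 450, 400, 120]
print(f"PSI = {psi(baseline, current):.3f}")
```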
Speaker 2 00:10:37
Right. Another cool advantage is that we can put lineage and drift side by side and say: hey, this feature is drifting, so let's see which pipeline is producing it, then go upstream to see where the drift originates and talk to the owner of that data. All right, the other area I'll quickly cover is experimentation. We have safety and isolation as a requirement: we don't want to impact any production data, but we do want to use all existing data. What this means is we can read from production but cannot overwrite it. So let's say this is a production set of pipelines and I want to make a change to one of the nodes, denoted with C. I want to make that change, make a copy of everything downstream, keep the untouched (purple) nodes as they are, and at the same time reuse dependencies from production. This is what needs to happen, and it's a very complex orchestration problem; it's something the Zipline engine solves for.
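Here is a toy sketch of that copy-on-write orchestration: changing one node forks it and everything downstream into an experiment namespace, while untouched upstream production outputs are read as-is. The DAG shape and naming scheme are invented; this is not Zipline's actual engine.

```python
# Production DAG: node -> downstream children. (Invented example.)
dag = {
    "raw": ["A"],
    "A": ["B", "C"],
    "B": ["model_input"],
    "C": ["model_input"],
    "model_input": [],
}

def fork_experiment(dag, changed, prefix="exp_"):
    # Collect the changed node plus all of its descendants.
    to_copy, stack = set(), [changed]
    while stack:
        node = stack.pop()
        if node not in to_copy:
            to_copy.add(node)
            stack.extend(dag[node])
    # Forked nodes get experiment names; untouched production nodes
    # keep theirs and are only read, never overwritten.
    rename = lambda n: prefix + n if n in to_copy else n
    return {rename(n): [rename(c) for c in children]
            for n, children in dag.items()}

print(fork_experiment(dag, changed="C"))
# 'C' and 'model_input' are forked; 'raw', 'A', 'B' are reused from prod:
# {'raw': ['A'], 'A': ['B', 'exp_C'], 'B': ['exp_model_input'],
#  'exp_C': ['exp_model_input'], 'exp_model_input': []}
```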
Speaker 2 00:11:43
So, to recap: Chronon is good for building ML systems, especially for generating training data, serving features that applications can consume, and computing metrics. Zipline, on top of that, adds observability, governance, and experimentation. Things I didn't talk about are embeddings and model integration. If you want to learn more, please visit chronon.ai, join the Slack channel, and start using the open source project. If you're interested in a managed offering, you can reach out to us at hello@zipline.ai. Thank you for your time.
Speaker 1 00:12:16
Awesome, Nikhil, that was excellent. It felt like it went by really quickly; that reframing slide was sick. I feel like we could spend an hour just zooming into how you guys are actually pulling this off. It does not sound trivial.