Scaling Multi-modal Data using Ray Data

May 21, 2025
Speaker
Jaikumar Ganesh
Head of Engineering

In the coming years, the use of unstructured and multi-modal data for AI workloads will grow exponentially. This talk will focus on how Ray Data effectively scales data processing for these modalities across heterogeneous architectures and is positioned to become a key component of future AI platforms.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Speaker 1   00:00:07    
I am about to bring out our next speaker, Jaikumar, who is over at Anyscale. Let's get him to the stage. Hey, there he is. How you doing, Jaikumar?  

Speaker 2    00:00:20    
Doing great, Ture.  

Speaker 1   00:00:21    
I'm good, man. I'm excited for your talk. I'm a huge fan of Ray, so it goes without saying that I'm looking forward to everything you're about to say. Now I'm bringing your slides onto the stage, and we've got some multi-modal data with Ray.  

Speaker 2    00:00:39    
Awesome. Thanks for the great intro. My name is Jaikumar, I lead engineering at Anyscale, and today I'm going to talk about scaling multi-modal data with Ray Data. As you all probably know, multi-modal data is everywhere. It started with healthcare applications, you know, analyzing x-rays together with sensor data, and it has reached areas like geospatial data too. Of course, there's all the industrial automation with IoT and sensors, and these days, with robots and self-driving cars, the sheer quantity of multi-modal data has grown into tons of petabytes. The market size is also growing exponentially. Um, uh, there's an image from <inaudible> that says the market cap will go from 1.7 billion to 68 billion in the next 10 years. I actually asked ChatGPT, what do you think about this? And ChatGPT sarcastically replied, ha, now I get a  

Speaker 2   00:01:43    
Chance to fail at three modalities, not just one. Um, alright, keeping jokes aside, the question for us technologists is: what's the best data processing framework and architecture for multi-modal data? Let's look at the existing data processing systems. They made some core assumptions around 10 years back: that data is mostly structured and tabular, that bottlenecks are CPU and memory, that operators themselves are cheap and lightweight, and everything is in Java and Scala. AI data processing systems, on the other hand, have a different set of core assumptions. Data is now mixed modality and includes tensors. Bottlenecks are not just CPU and memory, but also GPUs. Operators are heavy and distributed, as they need to, uh, take care of data across multiple machines. And the AI ecosystem is built in Python.  

Speaker 2   00:02:40    
So the key point is that Ray Data is the data processing engine for this new world. So let's try to understand why. First, there is native tensor data type support, built on NumPy and pandas. We have readers for images, video, audio, and more. There's robust support for accelerators: GPU-based scheduling, support for TPUs, optimized GPU ingest, and we can handle heterogeneous CPU-GPU architectures. Then, because it's built on Ray Core, the popular compute framework, the operators are stateful, and we can do things like streaming execution to reduce memory pressure; I'll get into that in more detail. And there's ecosystem interoperability: we work with all the major data sources, Hudi, Delta, Iceberg, Parquet, we have integrations with PyTorch and TensorFlow, and all the LLM serving engines, like vLLM for batch inference. Let's take a quick sneak peek into the APIs. You have the read and write APIs for your standard formats and connections with various, uh, data repositories.  
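For instance, a basic read and write might look something like this; it's a minimal sketch, and the bucket paths are hypothetical placeholders:

    import ray

    # Read multi-modal inputs from standard formats (paths are hypothetical).
    images = ray.data.read_images("s3://example-bucket/images/")
    metadata = ray.data.read_parquet("s3://example-bucket/metadata/")

    # Write results back out in a standard format.
    metadata.write_parquet("s3://example-bucket/output/")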

Speaker 2    00:03:48    
And then you can do transform operations on them: map, map_batches, et cetera. Then you have the standard consumption APIs to take the data coming out of the transforms, and you have the aggregation APIs: group-by, max, standard deviation, et cetera, uh, all the standard ones. And the biggest delta is that you have compute APIs: you can determine the number of CPUs, fractional GPUs, and how you want to specify the concurrency. This is where the power of Ray actually comes in. So let's take a look at an example computer vision application. Here you are reading images from an S3 bucket. Then you have map_batches running across multiple machines for resizing the images. You have two models, a segmentation model and then a classification model, each with a different set of GPUs and a different concurrency. Then you write the images back across the cluster. This is how it looks: the read-images stage is CPU-bound, the segmentation model and classification model run on GPUs and each can be independently scaled by just specifying the concurrency, and the write-results stage is CPU-bound.  
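As a rough sketch of that pipeline (the model classes, loader functions, paths, and resource numbers below are illustrative assumptions, not the exact code from the slide):

    import ray

    def resize(batch):
        # CPU-bound preprocessing, fanned out across machines.
        batch["image"] = [img[:256, :256] for img in batch["image"]]
        return batch

    class SegmentationModel:
        def __init__(self):
            self.model = load_segmentation_model()  # hypothetical loader

        def __call__(self, batch):
            batch["mask"] = self.model(batch["image"])
            return batch

    class ClassificationModel:
        def __init__(self):
            self.model = load_classification_model()  # hypothetical loader

        def __call__(self, batch):
            batch["label"] = self.model(batch["image"])
            return batch

    ds = ray.data.read_images("s3://example-bucket/raw/")  # CPU-bound read
    ds = ds.map_batches(resize)  # CPU resize across the cluster
    ds = ds.map_batches(SegmentationModel, batch_size=32,
                        num_gpus=1, concurrency=4)  # GPU stage one
    ds = ds.map_batches(ClassificationModel, batch_size=64,
                        num_gpus=0.5, concurrency=8)  # GPU stage two, fractional GPUs
    ds.write_parquet("s3://example-bucket/results/")  # CPU-bound write

Each GPU stage scales independently just by changing its concurrency argument.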

Speaker 2    00:04:47    
The other aspect is that the GPU tasks don't have to wait sequentially until the CPU tasks are finished. So if a chunk of data is ready to be processed on GPUs, it can be streamed to the GPUs, thus increasing throughput and efficiency. Alright, let's peel the onion and get into the architecture. So we have a simple Ray Data API here: you read images, you do a map for the transform, you do an inference. So first we define the logical plan, the what to do: the read and transform run on CPUs, the inference runs on GPUs, and the write stage runs on CPUs. And then the physical plan defines how to do it. It can fuse the read and transform operators together, uh, because they both run on the CPU; there's an operator, uh, that runs on the GPU; and then another map operator that writes back the images. One of the key concepts is streaming execution. So the basic data processing unit is a data block, a Ray Data block. These operators process data blocks and put them in an output queue, and because this is built on Ray and its object store, these data blocks sit in the object store, so between stages they don't need to be copied. The other aspect is that this inference map operator, just like I mentioned before, doesn't have to sequentially wait for all of the load and transform operators to finish.  
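Here is a minimal sketch of that three-stage plan, assuming a hypothetical model loader; because execution streams block by block, the GPU operator starts as soon as the first blocks leave the CPU operators:

    import ray

    def normalize(row):
        # CPU transform; the planner can fuse this with the read operator.
        row["image"] = row["image"] / 255.0
        return row

    class Inference:
        def __init__(self):
            self.model = load_model()  # hypothetical loader

        def __call__(self, batch):
            batch["pred"] = self.model(batch["image"])
            return batch

    ds = (
        ray.data.read_images("s3://example-bucket/raw/")  # CPU: read operator
        .map(normalize)  # CPU: transform, fused with the read
        .map_batches(Inference, num_gpus=1, concurrency=2)  # GPU: inference operator
    )
    ds.write_parquet("s3://example-bucket/predictions/")  # CPU: write operator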

Speaker 2    00:06:09    
We have also added some enhancements, you know, because, uh, at Anyscale we take the Ray Data open source package and we provide the, uh, infrastructure, um, and the VMs and the machines it runs on. So you can detect low GPU utilization, detect better configurations, and make some, um, suggestions around them. The other thing that we noticed was that many times during data processing you have the loading part and the data processing part, and these were serialized. We can actually, uh, stream them together, which gave a 4.5x reduction in just job startup time. Another big aspect in AI systems is that spot nodes are used, and they can get preempted at any point in time. So your system must handle massive node preemptions without the scheduler failing. This is again built in; it, uh, provides options, and we have job checkpointing built in so that we are able to efficiently recover: here, node two fails and the job is resumed, um, on node three.  

Speaker 2   00:07:18    
Another thing that we added was this transition from a range-based shuffle to a hash-based shuffle. So in a range-based shuffle, what happens is we sample blocks to determine approximate range boundaries. Each input data block is split into N partitions; individual partitions are shuffled and then, uh, combined with other partitions in the same range. So if N is the number of incoming blocks and M is, um, the target number of resulting partitions, you have on the order of N times M pieces. Say, for example, it's a thousand blocks as N and, um, a thousand partitions as M: you have one million range partitions, and that puts a lot of pressure on the node. Instead, in the hash-based partition method, each arriving block is hash-partitioned into shards by key. The shards are then sent to the corresponding actor. Actors can then combine these shards into new partitions, uh, and then they can also, uh, execute group-bys, et cetera. So this design allows us to start the shuffling part of the operation immediately as the first input block arrives instead of deferring it until the whole dataset is, uh, materialized. Um, check out our, uh, Ray Data blog; we actually wrote a detailed post about this. And we just introduced joins yesterday, which are built on the hash-based shuffle.  
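As a toy illustration of the hash-based idea in plain Python (this shows the concept only, not Ray Data's actual implementation):

    from collections import defaultdict

    NUM_PARTITIONS = 8  # illustrative target partition count

    def hash_partition(block):
        # Split one arriving block into per-partition shards by key hash.
        shards = defaultdict(list)
        for key, value in block:
            shards[hash(key) % NUM_PARTITIONS].append((key, value))
        return shards

    # Stand-in for per-partition actors: each actor merges shards as they
    # arrive, so shuffling starts with the first block instead of waiting
    # for the whole dataset to materialize.
    partitions = defaultdict(list)
    for block in [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]:
        for pid, shard in hash_partition(block).items():
            partitions[pid].extend(shard)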

Speaker 2    00:08:35    
So to conclude on the architecture advantages: streaming execution and handling back pressure allow us to scale to petabytes of data. Because it's built on Ray Core, we get automatic task retries upon failures. We have a scheduler which handles heterogeneous CPU-GPU, uh, architectures, so we get high GPU utilization. And the object store, that's again built in: zero-copy, efficient, uh, cross-node data transfer. Let's take the example of Stable Diffusion. As you are probably all very familiar with Stable Diffusion, you have to pre-compute image latents and text embeddings for greater than a billion, uh, samples. This is what it looks like with Ray Data: again, you read the Parquet files, you do a map transforming the images on the CPU, then you do map_batches; this is the encoder, where you specify the batch size, the number of CPUs, the concurrency; and then you write the Parquet back. If someone wants to try this out, this entire, uh, example is, uh, available as a template; you just have to, uh, go run it for free. And we were able to show that, you know, we can reduce, uh, the training cost of Stable Diffusion by 3.5x using Ray Data.  
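In spirit, that pre-compute pipeline looks something like the following; the encoder class, preprocessing function, and paths are illustrative placeholders, not the template's exact code:

    import ray

    def preprocess(row):
        # CPU-side image transform; the details here are hypothetical.
        row["image"] = row["image"] / 255.0
        return row

    class LatentEncoder:
        def __init__(self):
            self.vae = load_vae()  # hypothetical model loader

        def __call__(self, batch):
            batch["latents"] = self.vae.encode(batch["image"])
            return batch

    ds = ray.data.read_parquet("s3://example-bucket/samples/")  # billions of rows
    ds = ds.map(preprocess)  # CPU transform
    ds = ds.map_batches(LatentEncoder, batch_size=128,
                        num_gpus=1, concurrency=16)  # GPU encoding
    ds.write_parquet("s3://example-bucket/latents/")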

Speaker 2   00:09:50    
Many companies have used Ray Data. We have Amazon, which was able to see, uh, reduced latency and improved cost efficiency. Roblox uses it. Pinterest has seen, you know, good results running Ray Data for their GPU-based models. So to conclude, the two points I want to make sure, um, that people take away from this talk: multi-modal data is growing exponentially, and Ray Data is uniquely positioned to process and scale multi-modal data. With that, thank you so much. If you have any questions, feel free to send an email to my email ID, and more details are available on the Ray docs website. Ray Data has, uh, examples and lots of details, and if you want to try out sample code, uh, so that you don't have to set up your own clusters, just go to anyscale.com and you'll get all the goodness for free there. Um, yeah, that's basically what I got. Thank you so much.