Scaling Multi-modal Data using Ray Data

May 21, 2025
Speaker
Jaikumar Ganesh
Head of Engineering
Anyscale

In the coming years, use of unstructured and multi-modal data for AI workloads will grow exponentially. This talk will focus on how Ray Data effectively scales data processing for these modalities across heterogeneous architectures and is positioned to become a key component of future AI platforms.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Demetrios - 00:00:07  

I am about to bring out our next speaker, Jaikumar, who is over at Anyscale. Let's get him to the stage. Hey, there he is. How you doing, Jai?  

Jaikumar Ganesh - 00:00:20  

Doing great, Demetrios.  

Demetrios - 00:00:21  

I'm good man. I'm excited for your talk. I'm a huge fan of Ray, so it goes without saying that I'm looking forward to everything you're about to say. Now I'm bringing your slides onto the stage and we've got some multi-modal data with Ray.  

Jaikumar Ganesh - 00:00:39  

Awesome. Thanks for the great intro. I'm Jaikumar, I lead engineering at Anyscale, and today I'm going to talk about scaling multi-modal data with Ray Data. As you all probably know, multi-modal data is everywhere. It started with healthcare applications analyzing X-rays together with sensor data, and it has reached geospatial data too. Of course, there's all the industrial automation with IoT and sensors, and these days, with robots and self-driving cars, the sheer quantity of multi-modal data has grown into tons of petabytes. On market size, it's also growing exponentially: this chart from <inaudible> says the market will grow from 1.7 billion to 68 billion in the next 10 years. I actually asked ChatGPT, "What do you think about this?" and it sarcastically replied, "Ha, now I get a chance to fail at three modalities, not just one."  

Jaikumar Ganesh - 00:01:43  

Alright, jokes aside, the question for us technologists is: what's the best data processing framework and architecture for multi-modal data? Let's look at the existing data processing systems. They made some core assumptions around 10 years back: data is mostly structured and tabular, the bottlenecks are CPU and memory, operators are cheap and lightweight, and the systems are written in Java and Scala. AI data processing systems have a different set of core assumptions. Data is now mixed-modality, including tensors. The bottlenecks are not just CPU and memory, but also GPUs. Operators are heavy and distributed, since they need to handle data across multiple machines. And the AI ecosystem is built in Python.  

Jaikumar Ganesh - 00:02:40  

So the key point is that Ray Data is an AI data processing engine. Let's try to understand why. First, there is native data type support, built on NumPy and Pandas, with readers for images, video, audio, and more. There's robust support for accelerators: GPU-based scheduling, support for TPUs, an optimized GPU input feed, and handling of mixed CPU-GPU architectures. Because it's built on Ray Core, the popular compute framework, operators can be stateful, and we can do things like streaming execution to reduce memory pressure; I'll get into more detail. And there's ecosystem interoperability: we work with all the major data sources and formats, like Hudi, Delta, Iceberg, Parquet, and Pandas, with PyTorch and TensorFlow, and with the LLM serving engines, like vLLM for batch inference. Let's take a quick sneak peek into the APIs. You have the read and write APIs for your standard formats and connections to various data repositories.  

Jaikumar Ganesh - 00:03:48  

Then you can do transform operations on them: map, map_batches, and so on. You have the standard consumption APIs to take the data coming out of the transforms, and the aggregation APIs: group by, max, min, standard deviation, all the standard ones. And the biggest delta is the compute APIs: you can specify the number of CPUs, fractional GPUs, and how you want to set the concurrency. This is where the power of Ray actually comes in.  
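
To make the API families mentioned above concrete, here is a minimal sketch covering read/write, transforms, aggregation, and the compute arguments. The paths and column names are hypothetical, and exact argument names can vary across Ray versions; this is an illustration, not the speaker's slide code.

```python
# Sketch of the Ray Data API families: read/write, transform,
# aggregation, consumption, and compute arguments.
import ray

# Read API: load a tabular dataset from Parquet (hypothetical path).
ds = ray.data.read_parquet("s3://my-bucket/events/")

# Transform API: row-level map (hypothetical "value" column).
ds = ds.map(lambda row: {**row, "value_sq": row["value"] ** 2})

# Aggregation API: group by a key and compute a max.
stats = ds.groupby("category").max("value_sq")

# Compute API: per-operator resources and concurrency.
def embed_batch(batch):
    # ... run a model over the batch (placeholder) ...
    return batch

embedded = ds.map_batches(
    embed_batch,
    batch_size=256,
    num_gpus=0.5,    # fractional GPU per worker
    concurrency=4,   # number of parallel workers
)

# Consumption and write APIs.
print(stats.take(5))
embedded.write_parquet("s3://my-bucket/embeddings/")  # hypothetical path
```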

Jaikumar Ganesh - 00:04:20  

So let's take a look at an example computer vision application. Here you are reading images from an S3 bucket. Then you have map_batches running across multiple machines to resize the images. You have two models, a segmentation model and a classification model, each with a different set of GPUs and a different concurrency. Then you write the images back out. This is how it looks: the image read is CPU-bound, the segmentation model and the classification model run on GPUs and can each be scaled independently just by specifying the concurrency, and the write of the results is CPU-bound.  

Jaikumar Ganesh - 00:04:47  

The other aspect is that the GPU tasks don't have to wait until all the CPU tasks are finished. As soon as a chunk of data is ready to be processed on GPUs, it can be streamed to the GPUs, increasing throughput and efficiency.  
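
Below is a hedged sketch of the computer vision pipeline described here: read images from S3, resize on CPUs, run two GPU model stages with independent resources and concurrency, and write the results back. The bucket paths, model classes, and transform logic are placeholders, not the actual example from the talk.

```python
# Sketch: CPU-bound read -> CPU resize -> two GPU model stages -> write.
import ray
import numpy as np

ds = ray.data.read_images("s3://my-bucket/raw-images/")  # CPU-bound read

def resize(batch):
    # Placeholder resize: crop each image to a fixed shape.
    batch["image"] = [img[:224, :224] for img in batch["image"]]
    return batch

ds = ds.map_batches(resize, batch_size=64)  # runs on CPU workers

class SegmentationModel:
    def __init__(self):
        pass  # load segmentation weights onto the GPU here
    def __call__(self, batch):
        batch["mask"] = [np.zeros_like(img) for img in batch["image"]]
        return batch

class ClassificationModel:
    def __init__(self):
        pass  # load classification weights onto the GPU here
    def __call__(self, batch):
        batch["label"] = ["cat"] * len(batch["image"])
        return batch

# Each model stage gets its own GPU allocation and concurrency,
# and can be scaled independently.
ds = ds.map_batches(SegmentationModel, batch_size=32, num_gpus=1, concurrency=2)
ds = ds.map_batches(ClassificationModel, batch_size=32, num_gpus=0.5, concurrency=4)

ds.write_parquet("s3://my-bucket/predictions/")  # CPU-bound write
```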

Jaikumar Ganesh - 00:05:00  

Alright, let's peel the onion and get into the architecture. We have a simple Ray Data pipeline here: read images, do a map for the transform, then do inference. First, the logical plan defines what to do: the read and transform are on CPUs, the inference is on GPUs, and the write stage is on CPUs. Then the physical plan defines how to do it. It can fuse the read and transform operators, since they both run on the CPU, followed by the inference operator that runs on the GPU and another map operator that writes the images back.  

Jaikumar Ganesh - 00:05:40  

One of the key concepts is streaming execution. The basic data processing unit is a data block. Operators process data blocks and put them in an output queue. Because Ray Data is built on Ray, these data blocks live in Ray's object store, so they don't need to be copied between stages. The other aspect is that the inference map operator, like I mentioned before, doesn't have to wait for all of the load and transform operators to finish.  
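
As a conceptual illustration of the lazy plan and streaming execution described here (not the engine internals), the pipeline below only builds a logical plan; blocks start flowing through the fused CPU stage and the GPU stage as results are consumed. Paths and classes are hypothetical.

```python
# Sketch: lazy plan construction and streaming consumption.
import ray

def preprocess(batch):
    # CPU transform; the planner can fuse this with the read stage.
    return batch

class InferenceModel:
    def __call__(self, batch):
        # GPU inference over one data block at a time (placeholder).
        return batch

ds = (
    ray.data.read_images("s3://my-bucket/images/")            # CPU stage
    .map_batches(preprocess)                                   # CPU stage
    .map_batches(InferenceModel, num_gpus=1, concurrency=2)    # GPU stage
)

# Execution is streaming: blocks move through the operators via the
# object store (no copies between stages) as this loop consumes them.
for batch in ds.iter_batches(batch_size=64):
    pass  # write or inspect results here
```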

Jaikumar Ganesh - 00:06:09  

We have also added some enhancements. On Anyscale, we take the open-source Ray Data package and provide the infrastructure, the VMs and machines it runs on. So we can detect low GPU utilization, detect suboptimal configurations, and make suggestions around them.  

Jaikumar Ganesh - 00:06:30  

The other thing we noticed was that many times you have the data loading part and the data processing part, and these were serialized. We can actually overlap the two, achieving a 4.5x reduction in job startup time. A big aspect of AI systems is that spot nodes are used, and they can get preempted at any point in time, so your system must handle massive node preemptions. Failure handling in the scheduler is again built in, and we have job checkpointing built in, so that we can efficiently recover when nodes fail and resume the job.  

Jaikumar Ganesh - 00:07:18  

One thing we did was this transition from range-based shuffle to hash-based shuffle. In a range-based shuffle, what happens is we sample blocks to determine approximate range boundaries. Each input data block is split into partitions by range, and individual partitions are then shuffled and combined with other partitions in the same range. So if n is the number of incoming blocks and m is the target number of resulting partitions, you have on the order of n times m intermediate partitions. For example, with a thousand blocks as n and a thousand partitions as m, you have 1 million range partitions. That puts a lot of pressure on the nodes.  

Jaikumar Ganesh - 00:08:00  

Instead, in the hash-based method, each arriving block is hash-partitioned into key-value shards. The shards are then sent to the corresponding actor, and the actors combine these shards into new partitions. They can also execute operations like group-by on them. This design allows us to start the shuffling part of the operation immediately, as the first input block arrives, instead of deferring it until the whole dataset is materialized.  
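
Here is a toy, pure-Python illustration of the hash-based shuffle idea (not Ray Data's implementation): each arriving block is hashed into shards keyed by target partition, and a per-partition combiner can start work as soon as the first block arrives. The record layout and partition count are made up for the example.

```python
# Toy hash-based shuffle: shard blocks as they arrive, combine eagerly.
from collections import defaultdict

NUM_PARTITIONS = 4

def hash_partition_block(block):
    """Split one incoming block of records into per-partition shards."""
    shards = defaultdict(list)
    for record in block:
        shards[hash(record["key"]) % NUM_PARTITIONS].append(record)
    return shards

# Stand-ins for the per-partition actors that combine shards.
partitions = defaultdict(list)

incoming_blocks = [
    [{"key": "a", "value": 1}, {"key": "b", "value": 2}],
    [{"key": "a", "value": 3}, {"key": "c", "value": 4}],
]

for block in incoming_blocks:            # blocks can arrive one at a time
    for pid, shard in hash_partition_block(block).items():
        partitions[pid].extend(shard)    # combine immediately, no need to
                                         # materialize the whole dataset first

print({pid: len(records) for pid, records in partitions.items()})
```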

Jaikumar Ganesh - 00:08:35  

Check out the Ray Data blog. We wrote a detailed post about this, and we just introduced joins yesterday, which are based on the hash-based shuffle.  

Jaikumar Ganesh - 00:08:45  

To conclude on the architecture advantages: streaming execution and back-pressure handling allow us to scale to petabytes of data. Because it's built on Ray Core, we get automatic task retries upon failure and a scheduler that handles heterogeneous CPU-GPU architectures, which gives us high GPU utilization. And the built-in object store gives us zero-copy, efficient cross-node data transfer.  

Jaikumar Ganesh - 00:09:10  

Let's take the example of Stable Diffusion. As you're probably all very familiar with Stable Diffusion, you have to pre-compute image latents and text embeddings for more than a billion samples. This is what it looks like with Ray Data: you read the Parquet files, you do a map to transform the images on the CPU, then you do map_batches for the encoder, where you specify the batch size, the number of CPUs, and the concurrency, and then you write the Parquet back.  
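
A hedged sketch of that pre-processing pipeline is below: read Parquet, transform images with map, run an encoder stage with map_batches, and write Parquet back. The dataset path, the encoder class, and the specific arguments are placeholders rather than the code from the template.

```python
# Sketch: Stable Diffusion pre-processing with Ray Data.
import ray

ds = ray.data.read_parquet("s3://my-bucket/image-text-pairs/")  # hypothetical

def crop_and_normalize(row):
    # CPU-side image transform (placeholder).
    return row

ds = ds.map(crop_and_normalize)

class LatentAndTextEncoder:
    def __init__(self):
        pass  # load the image encoder and text encoder onto the GPU here
    def __call__(self, batch):
        # Compute image latents and text embeddings for the batch (placeholder).
        return batch

ds = ds.map_batches(
    LatentAndTextEncoder,
    batch_size=128,
    num_gpus=1,
    concurrency=8,   # scale the encoder fleet independently
)

ds.write_parquet("s3://my-bucket/latents/")  # write pre-computed features back
```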

Jaikumar Ganesh - 00:09:50  

If you want to try this out, the entire example is available as a template; you just have to grab it and run it for free. We were able to show that we reduced the cost of this Stable Diffusion workload by 3.5x using Ray Data.  

Jaikumar Ganesh - 00:10:05  

Many companies use Ray Data. Amazon saw reduced latency and improved cost efficiency. Roblox uses it. Pinterest runs Ray Data for their GPU-based models.  

Jaikumar Ganesh - 00:10:20  

To conclude, the two points I want to make sure people take away from this talk are: multi-modal data is growing exponentially, and Ray Data is uniquely positioned to process and scale multi-modal data.  

Jaikumar Ganesh - 00:10:35  

With that, thank you so much. If you have any questions, feel free to send me an email. More details are available on the Anyscale website: Ray Data has examples and lots of details, and if you want to try out sample code without having to set up your own clusters, just go to Anyscale.com and you'll get all the goodness for free there.  

Jaikumar Ganesh - 00:10:55  

Yeah, that's basically what I got. Thank you so much.