Apache Gluten: Revolutionizing Big Data Processing Efficiency

May 21, 2025
Speaker
Binwei Yang
Principal Software Engineer
IBM

Apache Gluten (incubating) is an emerging open-source project in the Apache software ecosystem. It's designed to enhance the performance and scalability of data processing frameworks such as Apache Spark. By leveraging technologies like vectorized execution, columnar data formats, and advanced memory management, Apache Gluten aims to deliver significant improvements in data processing speed and efficiency.

The primary goal of Apache Gluten is to address the ever-growing demand for real-time data analytics and large-scale data processing. It achieves this by optimizing the execution of complex data processing tasks and reducing the overall resource consumption. As a result, organizations can process massive datasets more quickly and cost-effectively, enabling them to gain valuable insights and make data-driven decisions faster than ever before.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Demetrios - 00:00:06  

Next up, we are going to be bringing on Binwei. Where are you at man? Hey, there he is. How you doing?  

Binwei Yang - 00:00:17  

Good.  

Demetrios - 00:00:19  

I'm good. I'm ready for your talk. Now the floor is yours. Let's hear what you got about Gluten.  

Binwei Yang - 00:00:26  

Yeah. Hello everyone, and welcome to the session. Today I will share the Apache Gluten project, and I will focus on the lakehouse support in Apache Gluten. I'm from IBM. We have a GitHub repo, and since it is an open-source project, we also have a LinkedIn page and a channel in the ASF Slack workspace. Today I will cover several topics: first a short introduction, then the latest performance data of Gluten, then I will focus on the lakehouse support, and finally we'll have a Q&A.

Binwei Yang - 00:01:22  

Gluten itself is a Spark plugin. It sits on top of Spark, and we reuse Spark's control flow as much as possible while offloading the compute-intensive operators to a native library. In a Spark workload you have a driver node and worker nodes, each worker node runs several executors, and each executor runs one or more tasks. If an operator is supported by the native library, we offload it through JNI; if not, we fall back to Spark. This way we can offload operators, or whole queries, to the native library when supported, and otherwise fall back to Spark.
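
The offload-or-fallback decision can be sketched in a few lines. This is a toy model, not Gluten's real planner API: the class name, operator names, and the supported set are all illustrative, and real support depends on the chosen backend and its expression coverage.

```java
import java.util.Set;

// Hypothetical sketch (names are illustrative, not Gluten's real API):
// Gluten walks the physical plan and keeps an operator in the JVM unless
// the native backend reports support for it.
public class OffloadPlanner {
    // Assumed set of operators the native backend can run.
    private static final Set<String> NATIVE_SUPPORTED =
            Set.of("TableScan", "Filter", "Project", "HashAggregate", "HashJoin");

    /** Returns "native" if the operator is offloaded via JNI, else "spark" (fallback). */
    public static String place(String operator) {
        return NATIVE_SUPPORTED.contains(operator) ? "native" : "spark";
    }

    public static void main(String[] args) {
        System.out.println("Filter -> " + place("Filter"));       // native
        System.out.println("CustomUDF -> " + place("CustomUDF")); // spark fallback
    }
}
```

The important property is that the decision is per operator, so a single query can run partly native and partly in vanilla Spark.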

Binwei Yang - 00:02:17  

Gluten doesn't need any extra hardware or dependency libraries, because we pack all the third-party dependencies and distribute them separately from Spark itself. We also have a way to build the Gluten library statically, linking all dependencies into Gluten, so you don't need to install any third-party libraries on your worker nodes; the dependency footprint is minimal. You can use a single Gluten distribution and it works transparently with your Spark workload. You just need to configure Gluten in Spark.
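
The configuration he mentions typically looks like the sketch below. The exact keys and class names vary by Gluten release (older releases used `io.glutenproject.GlutenPlugin`), so treat these values as illustrative and check the Gluten GitHub documentation for your version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal sketch of the Spark configuration usually needed to enable Gluten.
// Values here are assumptions for illustration, not authoritative defaults.
public class GlutenConf {
    public static Map<String, String> baseline() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.plugins", "org.apache.gluten.GlutenPlugin");
        // Gluten's native backend allocates from off-heap memory.
        conf.put("spark.memory.offHeap.enabled", "true");
        conf.put("spark.memory.offHeap.size", "8g");
        // Columnar shuffle keeps data in columnar form across the exchange.
        conf.put("spark.shuffle.manager",
                 "org.apache.spark.shuffle.sort.ColumnarShuffleManager");
        return conf;
    }

    public static void main(String[] args) {
        baseline().forEach((k, v) -> System.out.println("--conf " + k + "=" + v));
    }
}
```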

Binwei Yang - 00:03:19  

Currently, we have two backends: Velox, developed with Meta, and the ClickHouse backend, developed by another company. We abstract the compute engine in Gluten, so we can easily support third-party engines. On current performance: on TPC-H we can boost performance about three times, and on TPC-DS-like workloads, 2.3 to 3.5 times. The data size is 500GB and the cluster is four AWS 6id.2xlarge instances. All the DDL files and analysis scripts are upstream, so everyone can try it on a single node at SF100 or SF500.

Binwei Yang - 00:04:31  

One highlight is that the partition number is configured differently between OSS Spark and Gluten. We make sure there is no spill in either vanilla Spark or Gluten, but Gluten uses much less memory, so in vanilla Spark you need to configure a larger partition number, while in Gluten you can configure a smaller one.
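
The arithmetic behind that tuning difference is simple: fewer, larger partitions for the same input. This is illustrative arithmetic only, not a Gluten API; the per-task target sizes are assumptions for the example.

```java
// Illustrative: because Gluten's native operators use less memory per row
// than vanilla Spark, each task can handle a larger partition, so the same
// input needs fewer shuffle partitions.
public class PartitionSizing {
    /** Partitions needed so each one stays under targetBytes, at least 1. */
    public static long partitionsFor(long inputBytes, long targetBytes) {
        return Math.max(1, (inputBytes + targetBytes - 1) / targetBytes);
    }

    public static void main(String[] args) {
        long input = 500L << 30;       // 500 GB input
        long sparkTarget = 128L << 20; // assumed 128 MB per task in vanilla Spark
        long glutenTarget = 512L << 20; // assumed 512 MB per task under Gluten
        System.out.println("vanilla Spark: " + partitionsFor(input, sparkTarget)); // 4000
        System.out.println("Gluten:        " + partitionsFor(input, glutenTarget)); // 1000
    }
}
```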

Binwei Yang - 00:05:00  

The Gluten roadmap in 2025 focuses on lakehouse support. In Q1 we passed the Iceberg UTs: we ported the Iceberg Spark unit tests to Gluten and passed all of them. Not all Iceberg features are offloaded in Gluten today; some fall back to Spark. We already support the Iceberg reader and are working on the Iceberg writer, planned for Q3. We also plan to support Hudi and Delta Lake on Spark. Currently, Hudi and Delta Lake are supported, but their UTs are not passed yet.

Binwei Yang - 00:06:29  

In Q3, we will port the UTs for Hudi. Once most of the UTs pass, it will be stable enough for customers to try. Many customers already use Iceberg today for their workloads. We only support read now; in Q4, we will support Hudi data write using a similar approach to Iceberg. Customers are moving from Hive to Iceberg and Delta Lake, so we need to support them underneath.

Binwei Yang - 00:07:28  

We will also work on GPU support. Nvidia is working with the Velox developers on GPU support in Velox, with operators offloaded to the GPU. We support the Iceberg reader and writer differently. The Iceberg reader is part of the Velox pipeline, implemented in C++ in Velox, so we pass down the whole query plan, including the table scan and all the other operators.

Binwei Yang - 00:08:35  

The Iceberg writer uses the Iceberg Spark writer Java code and offloads part of the write to native; the operator itself stays in the JVM, and we reuse many Iceberg Java features. When the data is ready in native, we get the columnar data pointer from Velox after the last operator and send the pointer to Spark. Spark starts the next operator, which takes the data pointer and passes it down to the native writer. The native writer consumes the data and returns metadata to the Iceberg write operator, which handles the rest. There is no data copy or movement between off-heap and the JVM, just pointer passing.
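
The zero-copy handoff can be modeled as below. All names are illustrative; the real mechanism is a JNI call carrying a 64-bit native memory address, with the batch itself staying in off-heap memory the whole time.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the pointer-passing handoff: off-heap batches live in a
// registry keyed by address; JVM-side operators pass the address along
// instead of copying the data.
public class PointerHandoff {
    // Stand-in for native off-heap memory: address -> column batch.
    private static final Map<Long, int[]> OFF_HEAP = new HashMap<>();
    private static long nextAddr = 0x1000;

    /** "Native" side exports a batch and returns its address. */
    public static long exportBatch(int[] batch) {
        long addr = nextAddr++;
        OFF_HEAP.put(addr, batch);
        return addr;
    }

    /** JVM-side operator: receives only the address, never the data. */
    public static long passAlong(long addr) {
        return addr; // no copy, no movement between off-heap and JVM
    }

    /** Writer-side metadata, e.g. the row count reported back to the Iceberg operator. */
    public static int rowCount(long addr) {
        return OFF_HEAP.get(addr).length;
    }

    public static void main(String[] args) {
        long addr = exportBatch(new int[]{1, 2, 3});
        System.out.println("rows = " + rowCount(passAlong(addr))); // rows = 3
    }
}
```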

Binwei Yang - 00:10:21  

The design is flexible. As long as the native engine supports the Arrow data format and the Substrait plan, we can offload to different libraries, like ClickHouse or other engines. We didn't implement the Parquet write through Iceberg yet. We plan to use it, because the Iceberg native write path isn't supported yet; once it is, we can offload the Parquet write from Gluten directly. We get the exported data pointer from Velox, convert the data from Velox to Arrow format, then send the Arrow data to the Parquet writer and complete the write.

Binwei Yang - 00:11:32  

We can reuse the data and get better performance than the Java version. Gluten is flexible: we can also push down filters or aggregations to storage. If we have a sub-plan with filters that filter out most of the data, we can push the filter and table scan down to the storage layer. The storage layer can use the Arrow library, the Velox library, or another data processing tool. We filter the data on the storage node, return the data, gather it on the compute node as Arrow data, hand it to Velox, and feed it into the Velox pipeline with no data copy.
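
The split he describes, between a storage-side fragment and a compute-side fragment, can be sketched like this. It is a hypothetical model: real plans are trees and real pushdown ships a serialized plan fragment, but the partitioning idea is the same.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: take the longest prefix of pushable operators
// (scan and filter) to run next to the data; the rest stays in the
// compute-node pipeline.
public class FilterPushdown {
    private static final List<String> PUSHABLE = List.of("TableScan", "Filter");

    /** Returns {storagePlan, computePlan}. */
    public static List<List<String>> split(String... plan) {
        List<String> all = List.of(plan);
        List<String> storage = new ArrayList<>();
        int i = 0;
        while (i < all.size() && PUSHABLE.contains(all.get(i))) {
            storage.add(all.get(i++));
        }
        return List.of(storage, all.subList(i, all.size()));
    }

    public static void main(String[] args) {
        System.out.println(split("TableScan", "Filter", "HashAggregate", "Sort"));
        // [[TableScan, Filter], [HashAggregate, Sort]]
    }
}
```

When the filter is selective, only the surviving rows cross the network, which is where the win comes from.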

Binwei Yang - 00:12:41  

This is how Gluten supports push-down filters now. One pipeline has already passed and uses Gluten, but it didn't use the Substrait sub-plan. They still convert part of the SQL, pick out part of the SQL, gather the data, and return the data to the Spark pipeline. We gather the data pointer and feed it in as an Arrow pointer, and the whole pipeline passed. Similarly, we can use Iceberg or Hudi or anything else in the future, by using the open APIs from Iceberg and pushing down to the storage layer.

Binwei Yang - 00:13:37  

When designing Gluten, we saw many opportunities. Among the metadata that can boost performance, the partition number is essential. A smaller partition number means a faster Gluten run, but if a partition is very small, the Spark overhead outweighs the native library's gain, so there is no performance boost. We want a large partition size for each task. Gluten uses less memory, so we can use a larger partition size for better performance.

Binwei Yang - 00:15:00  

Another example is hot columns. Gluten supports a cache on local SSD at row-group granularity, and we should cache the hot columns. Currently we can cache in Velox, but it's better to cache just the hot columns so we cache less data. We could get this info from Iceberg and cache accordingly. Another characteristic is the column data itself. We support dictionary encoding for each column, and we check whether a column is dictionary-friendly by checking the first record batch. If we could get this info from Iceberg or Hudi instead, it would be useful as an input for using dictionary encoding on the column.
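
The dictionary-friendliness check can be sketched as a distinct-ratio heuristic over the first batch. This is illustrative only, not Gluten's actual code; the threshold is an assumption, and the point of the talk is that table-format metadata could replace this sampling entirely.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative heuristic: sample the first record batch of a column and
// call it dictionary-friendly when the ratio of distinct values to rows
// is low enough.
public class DictionaryCheck {
    /** True when distinct/total is at or below the given threshold. */
    public static boolean dictionaryFriendly(String[] firstBatch, double maxDistinctRatio) {
        Set<String> distinct = new HashSet<>();
        for (String v : firstBatch) distinct.add(v);
        return (double) distinct.size() / firstBatch.length <= maxDistinctRatio;
    }

    public static void main(String[] args) {
        String[] country = {"US", "US", "CN", "US", "CN", "DE", "US", "CN"};
        System.out.println(dictionaryFriendly(country, 0.5)); // true: 3 distinct / 8 rows
    }
}
```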

Binwei Yang - 00:16:56  

We need to record that. Another one is pre-allocating memory for column data in the native library. We often reallocate memory, which means a data copy. If we know the average string length in advance, we can pre-allocate memory in some operators for better performance. Currently, we parse the first record batch and get the string length, then pre-allocate the memory, but it's not perfect.
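
The saving here is the classic grow-by-doubling cost. The sketch below is illustrative arithmetic, not Velox's allocator: sizing the buffer once from a known average string length avoids every reallocation-plus-copy that incremental growth would incur.

```java
// Illustrative: compare one up-front reservation against the number of
// doubling reallocations needed when growing from a small initial buffer.
public class Preallocate {
    /** Bytes to reserve for a string column given rows and average length. */
    public static long reserveBytes(long rows, double avgStringLen) {
        return (long) Math.ceil(rows * avgStringLen);
    }

    /** Number of grow-by-doubling reallocations needed without preallocation. */
    public static int reallocCount(long neededBytes, long initialCapacity) {
        int reallocs = 0;
        long cap = initialCapacity;
        while (cap < neededBytes) { cap *= 2; reallocs++; }
        return reallocs;
    }

    public static void main(String[] args) {
        long need = reserveBytes(1_000_000, 16.0); // 16 MB for 1M rows, avg 16-byte strings
        System.out.println("reserve " + need + " bytes up front, vs "
                + reallocCount(need, 4096) + " reallocations growing from 4 KB");
    }
}
```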

Binwei Yang - 00:18:03  

Row groups matter for how a file is split. Parquet supports row-group-level splits, but before parsing the Parquet file we don't know where the row-group boundaries are, so we don't know whether a task needs to read over into the next split. If we get this info in advance, we can feed it into Velox and the catalog. Currently, customers configure the size of the split, but it would be better if we could get the info from the metastore. Some of this info is available in Iceberg, but we need a way to get it and feed it into the Velox library. This is the Gluten lakehouse story.
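
Given row-group offsets up front, split planning becomes trivial. This is a hypothetical sketch of the idea, not Gluten or Iceberg code: cut file splits only at row-group boundaries so no task has to read past its split for a half-covered row group.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative: compute split start offsets aligned to row-group
// boundaries, assuming the table format's metadata supplies the
// row-group start offsets in advance.
public class RowGroupSplits {
    /** Returns split start offsets, each on a row-group boundary. */
    public static List<Long> alignedSplits(long[] rowGroupOffsets, long targetSplitSize) {
        List<Long> splits = new ArrayList<>();
        long splitStart = 0;
        splits.add(splitStart);
        for (long offset : rowGroupOffsets) {
            if (offset - splitStart >= targetSplitSize) {
                splits.add(offset);
                splitStart = offset;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 100, 220, 330, 470, 560}; // row-group starts, in bytes
        System.out.println(alignedSplits(offsets, 200)); // [0, 220, 470]
    }
}
```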

Binwei Yang - 00:19:48  

That's all. We have some channels and groups; welcome to join and try Gluten. Raise issues on GitHub or in the Slack channel. Any questions we can answer now?

Demetrios - 00:19:48  

There are questions. I'll let some trickle in. I am going to kick off one because this platform is very robust I think. Is the best way that I can explain it. I am assuming the journey from zero to hero here took quite a bit of time. What are some opinions that you took along the way that you are now thinking about updating?  

Binwei Yang - 00:20:21  

We have lots of features to implement. As more customers use Gluten, they raise issues, message us on Slack, and we fix them. Initially, we fixed most bugs together with customers; then they become familiar with Gluten and fix bugs themselves, making Gluten more robust. It's a general effort across the whole Gluten community, not just our project. The more bugs fixed, the more robust it is. Many companies running Spark have already tried or landed Gluten and fixed lots of bugs for us.

Demetrios - 00:21:43  

Excellent. Next up, Pedro is asking how hard is it to migrate existing Spark workloads to Gluten? Is it plug and play or is there some migration effort?  

Binwei Yang - 00:21:58  

It is plug and play. You just need to add the Gluten jar as a Spark plugin and set the starter configs. Several essential configs are documented in the Gluten GitHub repo. Once configured, you get a Gluten run.

Demetrios - 00:22:17  

Well, that sounds nice. All right, I like it. This has been incredible. I want to thank you very much for giving this talk and sharing all about Gluten. I'm sure there is a joke in there somewhere about how some people are gluten-free, or celiac. You probably get that joke quite a bit.

Binwei Yang - 00:22:41  

Yes.  

Demetrios - 00:22:43  

But I appreciate this, Binwei.