Apache Gluten: Revolutionizing Big Data Processing Efficiency
Apache Gluten™ (incubating) is an emerging open-source project in the Apache software ecosystem. It's designed to enhance the performance and scalability of data processing frameworks such as Apache Spark™. By leveraging cutting-edge technologies such as vectorized execution, columnar data formats, and advanced memory management techniques, Apache Gluten aims to deliver significant improvements in data processing speed and efficiency.
The primary goal of Apache Gluten is to address the ever-growing demand for real-time data analytics and large-scale data processing. It achieves this by optimizing the execution of complex data processing tasks and reducing the overall resource consumption. As a result, organizations can process massive datasets more quickly and cost-effectively, enabling them to gain valuable insights and make data-driven decisions faster than ever before.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Speaker 0 00:00:00
<silence>
Speaker 1 00:00:06
Next up, we are going to be bringing on Binwei. Where are you at, man? Hey, there he is. How you doing?
Speaker 2 00:00:17
Good,
Speaker 1 00:00:19
I'm good. I'm ready for your talk. Now the floor is yours. Let's hear what you got about Gluten.
Speaker 2 00:00:26
Yeah. Hello everyone, and welcome to the session. Today I will share the Apache Gluten project, and I will focus on the lakehouse support in Apache Gluten. I'm Binwei, and I'm from IBM. We have a GitHub; it is, of course, an open-source project. We also have a LinkedIn page, a Slack channel on the ASF workspace, and a group workspace you can join. Today I will cover several topics. First, a simple introduction and some of the latest performance data for Gluten. Then I will focus on the lakehouse support, and finally we'll have a Q&A. So first, Gluten itself: Gluten is a plugin, a Spark plugin. It sits on top of Spark.
Speaker 2 00:01:22
So what happens is that we reuse the Spark control flow as much as possible, and we just offload the compute-intensive operators to the native library. If you have a Spark workload, you have a driver node and worker nodes. On each worker node you have several executors, and in each executor you run one or more tasks. In each task, we check each operator: is this operator supported by the native library or not? If it is already supported by the native library, we offload it to the native library through JNI. If it's not supported, we fall back to vanilla Spark. In this way we can route the operators, or whole queries, to the native library when supported, and otherwise fall back to Spark.
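The per-operator offload-or-fallback decision the speaker describes can be sketched roughly as follows. This is a hedged illustration, not Gluten's actual code: the operator names and the supported set are invented for the example.

```python
# Hypothetical sketch of Gluten's per-operator offload decision.
# The set of natively supported operators is invented for illustration.
NATIVE_SUPPORTED = {"TableScan", "Filter", "Project", "HashAggregate", "HashJoin"}

def plan_operator(op_name):
    """Decide which engine runs the operator: the native library (via JNI)
    if it is supported, otherwise fall back to vanilla Spark's JVM operator."""
    if op_name in NATIVE_SUPPORTED:
        return "native"   # offloaded through JNI to the native backend
    return "spark"        # fall back to the row-based Spark operator

def plan_query(operators):
    """Tag every operator in a task's pipeline with its execution engine."""
    return [(op, plan_operator(op)) for op in operators]
```

For a pipeline like `["TableScan", "Filter", "Window"]`, the first two would be offloaded and `Window` would fall back, which matches the mixed execution the talk describes.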
Speaker 2 00:02:17
Gluten itself doesn't need any extra hardware or independent dependency libraries, because we already pack all the dependency libraries into the third-party bundle, and we can distribute that with Gluten, from Spark itself. We also have a way to build the Gluten library statically. In that way, all the dependency libraries are already linked into libgluten, so you don't need to install any third-party libraries on your worker nodes. The dependency footprint is very minimal if you build it that way: you can use a single Gluten jar, distribute it to the workload, and it works, without installing any libraries on the worker nodes. And of course it's transparent to the Spark workload: you don't need to modify any Spark SQL; you just need to configure Gluten in Spark.
Speaker 2 00:03:19
Currently, in Gluten we have two backends. One is Velox; we work with the Facebook folks on Velox. The other one is the ClickHouse backend, developed by Kyligence, another company. We also work with other compute engines in Gluten, and we can easily support third-party engines. This is the current performance boost: on TPC-H we can get about a 3x speedup, and on a TPC-DS-like workload about 3.35x. The data size is SF500 and the instances are four 6id.2xlarge instances. All the Docker files and the computation analysis scripts are already upstreamed, so everyone can try it: you can run on a single node at SF100, or SF500 on four nodes, and reproduce the ratio easily.
Speaker 2 00:04:31
One highlight here is that a different partition number is configured for vanilla Spark and for Gluten, so that we can make sure there is no spill in either vanilla Spark or Gluten. Because Gluten uses much less memory when we run the queries, while vanilla Spark consumes more memory and needs a larger partition number configured, in Gluten we can configure a smaller partition number. And this is the current Gluten roadmap for 2025. What I want to highlight is that we will focus mostly on the lakehouse support. First, in Q1, we already passed the Iceberg UTs: we ported the Iceberg Spark unit tests to Gluten and passed all of them. But not all Iceberg features are supported in Gluten today, so some of the Iceberg features fall back to Spark.
Speaker 2 00:05:29
So if your job uses Iceberg with Spark, we correctly fall back where needed, and we offload the operators we already support. We already support the basic Parquet writer and the basic Iceberg reader, and we are working on the Iceberg writer now; Iceberg write support is planned for Q3, offloading the Iceberg write to Velox. Meanwhile we will also add Hudi Spark support and Delta Lake Spark support. Currently we already support Hudi and Delta Lake reads, but the UTs are not passed yet; we didn't port the UTs the way we did for Iceberg. So we don't know how many cases Hudi can support, or fall back correctly, or where there are issues. For now, only the basic Spark reader is supported in Gluten for Hudi and Delta Lake.
Speaker 2 00:06:29
In Q3 we will port the UTs. Once we pass the UTs, or at least a partial set, we can be sure it's stable enough for customers to try. Of course, there are already a lot of Gluten customers using Iceberg today for their workloads; in their scenarios the queries pass, so they can use it, but overall the UTs are not all passed yet; in Q3 we will pass them. Also, we only support Hudi read and Delta read now. In Q4 we will add Hudi and Delta write: we will use a similar approach as for Iceberg to add the Delta Lake write support. So this year we will put more effort on the data lake support, because customers are really moving from Hive to Iceberg and Delta Lake.
Speaker 2 00:07:28
So we need to add that support. For the other part, we will also work on some GPU support: the NVIDIA folks are working with Velox on cuDF support in Velox, so with the accelerators in Velox we can offload to GPU as well. Now, let me cover how we support Iceberg in Gluten today. We support the Iceberg reader and the Iceberg writer, but they are supported in different ways. The Iceberg reader is part of the Velox pipeline: today we implemented the whole Iceberg reader in Velox, in C++ classes in Velox. So we can pass the whole query sub-plan, including the table scan and all the other operators in the same sub-plan, to Velox, and Velox can run the whole pipeline from the scan through the other operators.
Speaker 2 00:08:35
The Iceberg writer is implemented in a different way, because we reuse the Iceberg Spark writer Java code as much as possible and just offload the Parquet write part from the Iceberg writer to Velox. So the Parquet write itself is offloaded to Velox, but the operator itself stays in the Spark JVM. It's not exactly the same as the table write support in Velox; we use lots of Iceberg Java features here. When the data is ready, what happens is that we gather the columnar data pointer from Velox after the last operator, say a join or an aggregation. We get the data pointer from Velox, not the data itself, and we send the data pointer to Spark. Spark then starts the next operator, the Iceberg write operator, which is a customized operator. That operator takes the data pointer and passes it to the Velox writer; the Velox writer uses the data, writes the Parquet files, and returns the metadata to the Iceberg write operator, which then handles everything else. So there is no data copy, no data movement between off-heap and heap, between the native side and the JVM; just the pointer is passed through JNI. This is how we support the Iceberg reader and writer today.
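The pointer-pass-through idea can be sketched as follows. This is a toy illustration of the pattern only: an opaque handle stands in for the off-heap data pointer, and all names are invented; the real boundary crossing in Gluten happens through JNI.

```python
# Toy sketch of zero-copy handoff: the JVM side never touches the batch,
# only an opaque integer handle crosses the boundary (names invented).
_native_batches = {}   # stands in for off-heap memory owned by the native engine
_next_handle = 0

def native_produce(batch):
    """Native operator output: keep the batch off-heap, return only a handle."""
    global _next_handle
    _next_handle += 1
    _native_batches[_next_handle] = batch
    return _next_handle          # only this integer is sent to the "JVM" side

def native_write(handle):
    """Writer side: dereference the handle and write; the batch was never copied."""
    batch = _native_batches[handle]
    return {"rows_written": len(batch)}   # metadata returned to the write operator
```

The customized write operator in the talk plays the role of the code that receives the handle and forwards it to the native writer, then consumes only the returned metadata.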
Speaker 2 00:10:21
But Gluten itself is designed to be very flexible. As long as the native library can support the Substrait sub-plan and can use the Arrow data format, we can easily offload to a different library, just like the ClickHouse backend. For example, we didn't implement the Parquet write for Iceberg in Velox, so what we are planning now is to use <inaudible>. Currently <inaudible> doesn't support write yet, but once it's supported, we can use <inaudible> directly and offload the Parquet write from Gluten directly. What happens is, similarly, we get the data pointer from Velox, and then we need a conversion. The Velox and Arrow formats are mostly compatible, so there's no data copy; it's just a conversion from Velox to Arrow.
Speaker 2 00:11:32
Once we convert the data from Velox to Arrow, we send the Arrow data to the Parquet writer and complete the write there. So we can implement it this way: we can reuse the data without copying, and if the native version gets better performance than the Java version, we can reuse that. That's the flexibility of Gluten. Furthermore, we can also push down the filter, or a hash aggregation, to storage. That's another flexibility of Gluten: if we already have the Substrait plan and there is a filter that filters out most of the data, we have a way to push the filter, together with the table scan, down to the storage layer.
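The plan-splitting step behind that pushdown can be sketched as follows. The plan representation and operator names here are invented for illustration; a real implementation would operate on the Substrait plan.

```python
# Hedged sketch of filter pushdown: leading scan/filter operators are handed
# to the storage layer, everything after them stays in the compute pipeline.
def push_down_filters(plan):
    """plan: list of (operator, payload) tuples ordered from source to sink."""
    storage_side, compute_side = [], []
    for op, payload in plan:
        # Only a leading run of scans/filters can execute next to the data;
        # once any other operator appears, the rest stays on the compute side.
        if op in ("TableScan", "Filter") and not compute_side:
            storage_side.append((op, payload))
        else:
            compute_side.append((op, payload))
    return storage_side, compute_side
```

A selective filter thus runs on the storage node, and only the surviving rows flow back into the Velox pipeline, which is the benefit the talk describes.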
Speaker 2 00:12:41
In the storage layer we can use the Arrow library, or the Velox library, or even, currently, we work with Google BigQuery directly. In that way we can filter the data in the storage layer, on the storage node, and then return the data. We gather the data on the worker node, the computation node, convert it to the Velox format, and feed it into the Velox pipeline. Again, there's no data copy, no memory copy; it's just a conversion to the Velox format, and then it's fed into the Velox pipeline to run the rest of the operators. So this is what Gluten can support now, and the BigQuery proof of concept has already passed; they already use it, but the BigQuery path doesn't use the sub-plan.
Speaker 2 00:13:37
Currently they still cut out part of the SQL, convert it, send it to <inaudible>, and gather the data back; then we return the data to the Spark pipeline: we gather the Arrow data pointer and feed it in. So the whole pipeline has already been validated. Similarly, we could use Iceberg, or Hudi, or in the future anything: just open the API from Iceberg, push the Substrait sub-plan down to the storage layer, and complete the filter there. That's the filter pushdown design. When we worked on Gluten and <inaudible>, we saw lots of opportunities, so let me summarize the metadata requirements from Gluten that could boost performance. The first one is the partition number. The partition number is essential to Gluten, because for small partitions the Gluten side runs very fast, so the overhead of the scan, the scheduling, and everything on the Spark side becomes much larger than the native library's execution time.
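The partition-sizing trade-off the speaker raises can be sketched with a simple calculation. The function and the sizes below are illustrative only; they just show that targeting larger partitions (which Gluten's lower memory use permits) yields far fewer tasks for the scheduler.

```python
# Sketch of the trade-off: fewer, larger partitions amortize Spark's per-task
# scheduling overhead over more native work. Sizes here are illustrative.
def choose_num_partitions(total_bytes, target_partition_bytes):
    """Ceiling division: how many partitions to cover the dataset."""
    return max(1, -(-total_bytes // target_partition_bytes))

GB = 2**30
# e.g. a 100 GB input: 128 MB partitions vs 1 GB partitions
small = choose_num_partitions(100 * GB, 128 * 2**20)   # many tasks
large = choose_num_partitions(100 * GB, 1 * GB)        # far fewer tasks
```

With 128 MB partitions the 100 GB input becomes 800 tasks; at 1 GB it is 100 tasks, so much less time is spent in scheduling relative to native execution.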
Speaker 2 00:15:00
In that case you won't see the performance boost; you'll just see that all the time is spent on the Spark side. So we need a large partition size for the data, and because Gluten uses much less memory than vanilla Spark, we can use an even larger partition size and get much better performance. The other example is hot columns. Currently Gluten with Velox already supports caching in local SSD; it can cache the row groups in local SSD. But what we don't know is which columns are hot columns that we should cache. Currently we can cache in Velox, of course, but it makes the most sense if we can collect this data: as we run queries, we can collect the statistics and save them somewhere in the middle, in the lakehouse, say in Iceberg.
Speaker 2 00:16:01
Then for the next query, we can read this information from the metadata and cache accordingly. Another one is the characteristics of the column data. For example, currently we support dictionary encoding in the shuffle. It's a feature of the columnar shuffle: we can shuffle every column one by one and create a dictionary for each column. But we have an issue: we don't know in advance whether a column is dictionary-friendly or not. Currently we use a heuristic: we check the first record batch and see whether it is dictionary-friendly; if it is, we use the dictionary. But if we could get this information from Iceberg or from Hudi, that would be very useful for us to feed in and decide whether to use a dictionary for the column.
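The first-record-batch heuristic can be sketched like this. It is a minimal illustration, not Gluten's code; the distinct-ratio threshold is invented.

```python
# Sketch of the heuristic: sample the first record batch of a column and
# call it dictionary-friendly if it has few distinct values. Threshold invented.
def dictionary_friendly(first_batch, max_distinct_ratio=0.5):
    """first_batch: list of values for one column from the first record batch."""
    if not first_batch:
        return False
    distinct_ratio = len(set(first_batch)) / len(first_batch)
    return distinct_ratio <= max_distinct_ratio
```

A low-cardinality column like a country code would pass the check, while a unique key column would not; the talk's point is that table-format metadata could replace this sampling entirely.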
Speaker 2 00:16:56
It may be derived from some existing dictionaries; we would need that recorded. Another one: in the native library, we need to pre-allocate the memory for the column data a lot of the time. If the allocation is wrong, we need to reallocate the memory, which mostly means a memory copy. If we reallocate memory a lot, most of the time is spent in the memory copy. So if we know the average length of the strings in advance, we can use this information to pre-allocate memory in some operators and get much better performance. Currently we use a similar heuristic: we parse the first record batch, get the average string length, and use that number to pre-allocate the memory, but of course it's not accurate.
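The pre-allocation estimate can be sketched as follows. Again this is an invented illustration of the idea, including the slack factor; the real allocator works on native buffers.

```python
# Sketch of pre-sizing a string buffer from a sampled batch's average length,
# to avoid repeated grow-and-copy reallocations. The slack factor is invented.
def preallocate_bytes(num_rows, sampled_strings, slack=1.25):
    """Estimate buffer size for num_rows strings from a sampled record batch."""
    if not sampled_strings:
        return 0
    avg_len = sum(len(s) for s in sampled_strings) / len(sampled_strings)
    return int(num_rows * avg_len * slack)
```

If the table format stored the true average string length per column, this estimate would come from metadata instead of from parsing the first batch, which is exactly the request in the talk.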
Speaker 2 00:18:03
Another example is the row-group split. Parquet itself supports row-group-level splits, but there is an issue: before we parse the Parquet file, we don't know whether the row group aligns with the split, or whether one split maps to one row group; we don't have that information. We can know there are several splits, but we don't know whether we need to read ahead into the next split. If we could get this information in advance, we could feed it into Velox and cache it. Currently we have to rely on the customer to configure the read-ahead and the split size; it's the customer's responsibility now, but we could get this information from the metastore. So these are just examples: as we use Iceberg with customer workloads, we will learn more. Some of this is already supported in Iceberg, but we need a way to get the information and feed it into the Velox libraries. So this is the Gluten lakehouse story. Okay, that's all. Again, there are some links here, a Slack channel, and a group; welcome to join, and welcome to try Gluten. Please report issues on GitHub or on the Slack channel. Okay, any questions we can answer now?
Speaker 1 00:19:48
Oh, there are questions. I'll let some trickle in. I'm going to kick off one, because "this platform is very robust" is the best way I can explain it. I'm assuming the journey from zero to hero here took quite a bit of time. What are some opinions that you took along the way that you are now thinking about updating?
Speaker 2 00:20:21
Mm, so we have lots of features here that we need to implement. I think that if more and more customers use Gluten, they will raise issues and file bugs, then they ping us and we fix them. Initially we fixed most of the bugs, and as time goes on, the customers themselves become more and more familiar with the Gluten project; then they can fix the bugs by themselves and make Gluten more robust. So I think it's really about the whole Gluten community, not just the Gluten project itself: the more bugs are fixed, the more robust it is. I really appreciate the customers. Currently many of the companies that run Spark have already tried or already landed Gluten, already run Gluten in their platforms today, and they fix lots of bugs for us.
Speaker 1 00:21:43
Excellent. Next up, Pedro is asking: how hard is it to migrate existing Spark workloads to Gluten? Is it plug and play, or is there some migration effort?
Speaker 2 00:21:58
It's almost plug and play. You just need to configure the Gluten jar into the Spark plugin setting and set the configs. There are several essential configs, documented in the Gluten GitHub, and once you configure them, you can get Gluten running.
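A minimal sketch of what that configuration might look like is below. The config keys and class names follow the Gluten documentation at the time of writing, and the 20g off-heap size is purely illustrative; verify the exact keys against the docs for your Gluten release and Spark version.

```python
# Hedged sketch of the essential Spark confs for enabling Gluten.
# Values (e.g. the off-heap size) are illustrative; check the Gluten docs.
gluten_confs = {
    "spark.plugins": "org.apache.gluten.GlutenPlugin",
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "20g",
    "spark.shuffle.manager":
        "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
}

def spark_submit_args(confs):
    """Render the confs as spark-submit --conf flags."""
    return [f"--conf {k}={v}" for k, v in confs.items()]
```

Passing these flags (plus `--jars` with the Gluten jar) to `spark-submit` is the "almost plug and play" path the answer describes; no SQL changes are needed.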
Speaker 1 00:22:17
Well, that sounds nice. All right, I like it. This has been incredible. I want to thank you very much for giving this talk and sharing all about Gluten. I'm sure there is a joke in there somewhere about how some people are gluten-free <laugh> or celiac. You probably make that joke quite a bit.
Speaker 2 00:22:41
Yes.
Speaker 1 00:22:43
Uh, but I appreciate this, Binwei.