Open Source Query Performance - Inside the next-gen Presto C++ engine

May 21, 2025
Speakers
Aditi Pandit
Principal Engineer
Amit Dutta
Software Engineer

Presto (https://prestodb.io/) is a popular open source SQL query engine for high-performance analytics in the Open Data Lakehouse. Originally developed at Meta, Presto has been adopted by some of the largest data-driven companies in the world, including Uber, ByteDance, Alibaba, and Bolt. Today it's available to run self-managed or through managed services such as IBM watsonx.data and AWS Athena.

Presto is fast, reliable, and efficient at scale. The latest innovation in Presto is a state-of-the-art native C++ query execution engine that replaces the old Java execution engine. Presto C++ is built using Velox, another Meta open source project that provides common runtime primitives across query engines. Deployments of the new Presto native engine show massive price-performance improvements, with fleet sizes shrinking to almost one third of their Java cluster counterparts, leading to enormous cost savings.

The Presto native engine project began in 2020 and has since matured into production use at Meta, Uber, and IBM watsonx.data. This talk gives an in-depth look at that journey, covering:

  • Introduction to Prestissimo/Velox architecture
  • Production experiences and learnings from Meta
  • Benchmarking results from TPC-DS workloads
  • New lakehouse capabilities enabled by the native engine

Beyond the product features, we will highlight how the open source community shaped this innovation and the benefits of building technology like this openly across many companies.

Transcript

AI-generated; accuracy is not guaranteed.


Speaker 1   00:00:06    
Right now we have Amit and Aditi about to join the stage. Are you with us, Amit, Aditi? Hello? Can you hear me okay?

Speaker 2    00:00:17    
Yeah, I can hear you.  

Speaker 1   00:00:18    
Okay, awesome. Take it away, and I'll be back in 15 minutes. We're going to have a bunch of questions from the audience; I'm sure we'll have about five minutes at the end to ask you those. Looking forward, very excited to hear what you have to share with us today. Take it away.

Speaker 2    00:00:38    
Yeah, thanks Adam. Hi everyone, I'm Aditi Pandit. I'm an engineer at Ahana, now IBM, and I'm joined by Amit Dutta, who is a software engineer at Meta. We're going to talk today about the Presto C++ engine, which we built in open source and which is now available in IBM watsonx.data and in Meta production. In terms of this talk, we'll start with a quick introduction to Presto, and I'll give some background about the Presto native engine, the C++ engine: what it is and why we're doing it. Then a quick introduction to the watsonx.data platform and the Presto native engine offering on it, and then Amit will take over and talk about the production experience at Meta.

Speaker 2   00:01:30    
So, uh, introduction to Presto. So Presto is a fashion reliable open source SQL query engine for data analytics and open data lakehouse. So this project started at Meta over 10 years ago and has since expanded its, uh, region like, uh, big data companies like Uber, ByteDance, and a few look at the numbers here. They're like massive. And, um, so, uh, this engine, uh, is built in open source. Um, and, um, if you see in terms of the ecosystem, it's a distributed SQL query engine that sits between, uh, the underlying, uh, it could be the sag, uh, storage on S3, um, other cloud, uh, stores, um, or like just regular databases. Uh, it's a federated query engine, so it can query many systems. And, um, at the higher levels, you have your tools, uh, Looker superset that can be used to submit queries to the press two engine.  

Speaker 2   00:02:29    
And the system is massive. It has a very vibrant open source community, both in terms of users and contributors, and you can see a lot of very cool logos here: Meta, Uber, IBM, Bolt, ByteDance, Twitter, Alibaba, you name it. So, on to the Presto native engine: what and why? Around 2020, the teams using Presto realized that for the use cases Presto was being put to, very advanced analytics and machine learning on the data lakehouse, the Java worker was reaching its limits in terms of what it could do. This led the teams in the direction where a C++ query evaluation engine was desired.

Speaker 2   00:03:27    
The system is built with first-class vectorization, runtime optimizations, and built-in memory management, and this gives us a lot of amazing things for our current use cases. This was also in line with similar initiatives: there's Photon at Databricks, and Intel and Meta started Gluten for Spark acceleration, so it went along with a lot of what was going on in the industry at the time. So what is the Presto native engine? It is a full rewrite of the Presto worker in C++, a drop-in replacement for the Presto Java worker. And it leverages Velox, another open source project started at Meta. Velox is a library of data processing primitives and memory management; think of hash joins, window functions, all these things provided by Velox.
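
To make the vectorization idea concrete, here is a minimal toy sketch in C++. It assumes nothing about Velox's actual APIs (all names here are illustrative): a filter is evaluated over a whole column batch at once, producing a selection vector of passing row indices, instead of interpreting the expression row by row.

    // Toy illustration of vectorized filter evaluation over a column batch.
    // This is NOT Velox code; names and structure are hypothetical.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // A "selection vector": indices of the rows that passed the filter.
    std::vector<int32_t> filterGreaterThan(const std::vector<int64_t>& column,
                                           int64_t threshold) {
      std::vector<int32_t> selected;
      selected.reserve(column.size());
      // One tight loop over the whole batch; compilers can auto-vectorize
      // this with SIMD, which is part of where the C++ engine's speed comes from.
      for (int32_t i = 0; i < static_cast<int32_t>(column.size()); ++i) {
        if (column[i] > threshold) {
          selected.push_back(i);
        }
      }
      return selected;
    }

    int main() {
      std::vector<int64_t> prices = {10, 55, 7, 90, 42};
      auto passing = filterGreaterThan(prices, 40);
      for (auto i : passing) {
        std::cout << "row " << i << " value " << prices[i] << "\n";
      }
      return 0;
    }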

Speaker 2   00:04:21    
To show at a very high level, visually, what's going on here: you have the Presto cluster, shown on the left-hand side. A typical Presto cluster has a coordinator, which talks to a Hive metastore for table information and partition information. But the main work happens on the worker nodes; you can have multiple worker nodes in your cluster, and these worker nodes do all the reading, execution, and exchanges between workers to make your query happen. If you notice, there is a control API and a data exchange API that the workers have to honor to interact with the coordinator and with each other for exchanges. Now, in the Java-based system there were a lot of limitations and inefficiencies.

Speaker 2   00:05:18    
GC pauses, performance cliffs, operational difficulties, hard-to-control performance. So we just flip to an architecture that looks exactly the same, except that the underlying executable for the worker is swapped for a C++ executable. That's all we do to migrate your clusters. The Prestissimo worker speaks the same control API and data exchange API as the previous system, so by just dropping in the C++ worker, all of this works. The C++ worker leverages SIMD, runtime optimizations, smart IO prefetching, and caching; it's got a lot of cool stuff that gives it a big edge. So, Velox at a glance: this is what's used by the underlying engine. A task, which corresponds to a fragment of the SQL execution, is handed to drivers.
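
A crude way to picture the drop-in swap, as a hypothetical sketch rather than Presto's actual interfaces: the coordinator depends only on the worker's control and data exchange contract, so any implementation honoring that contract can be substituted without coordinator changes.

    // Hypothetical sketch: the coordinator depends only on the worker contract,
    // so a C++ worker can replace the Java one transparently.
    // These interfaces are illustrative, not Presto's real control/data APIs.
    #include <iostream>
    #include <string>

    struct WorkerApi {
      virtual ~WorkerApi() = default;
      virtual std::string createTask(const std::string& planFragment) = 0;  // control API
      virtual std::string fetchResults(const std::string& taskId) = 0;      // data exchange API
    };

    struct NativeWorker : WorkerApi {
      std::string createTask(const std::string& planFragment) override {
        return "native-task-for:" + planFragment;
      }
      std::string fetchResults(const std::string& taskId) override {
        return "pages-from:" + taskId;
      }
    };

    // The "coordinator" is oblivious to which implementation it drives.
    void runQuery(WorkerApi& worker, const std::string& fragment) {
      auto taskId = worker.createTask(fragment);
      std::cout << worker.fetchResults(taskId) << "\n";
    }

    int main() {
      NativeWorker worker;  // a JavaWorker honoring the same contract would work too
      runQuery(worker, "scan+filter+agg");
      return 0;
    }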

Speaker 2    00:06:22    
Drivers set up pipelines and operators; you can see all the operators for joins and aggregations, and the exchanges, on the left-hand side. And there is extensive built-in memory management, comprising pools and a memory manager, that is used by the operators. So what are the advantages of Presto native? A huge performance boost, avoiding performance cliffs, and, because of Velox, we were able to build reusable and extensible primitives. Velox is also used in the Spark acceleration project Gluten, most notably. Just to brag a little about what we were able to achieve: we ran a whole lot of TPC-DS workloads at IBM, and here are the results. For a 1K run, the native engine did almost three times better than Java. The blue is the Java run, red is the native run, and you can see that the native numbers are good all throughout.
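
The task/driver/pipeline model just described can be sketched in miniature. The following toy C++ is illustrative only (Velox's real Task, Driver, and Operator classes and its memory pools are far more sophisticated): a driver loop pulls batches through a chain of operators while a toy memory pool tracks allocations.

    // Toy pipeline: a driver pulls row batches through chained operators.
    // Purely illustrative; not Velox's actual Task/Driver/Operator APIs.
    #include <cstddef>
    #include <iostream>
    #include <optional>
    #include <vector>

    using Batch = std::vector<int64_t>;  // stand-in for a columnar row batch

    struct MemoryPool {  // toy stand-in for an engine-managed memory pool
      size_t bytesUsed = 0;
      void allocate(size_t bytes) { bytesUsed += bytes; }
    };

    struct Operator {
      virtual ~Operator() = default;
      virtual std::optional<Batch> next() = 0;  // pull the next batch, or none
    };

    struct ScanOp : Operator {  // source operator: produces batches
      std::vector<Batch> batches{{1, 5, 9}, {12, 3}};
      size_t i = 0;
      std::optional<Batch> next() override {
        if (i >= batches.size()) return std::nullopt;
        return batches[i++];
      }
    };

    struct FilterOp : Operator {  // keeps values > 4, charging the pool
      Operator& child;
      MemoryPool& pool;
      FilterOp(Operator& c, MemoryPool& p) : child(c), pool(p) {}
      std::optional<Batch> next() override {
        auto in = child.next();
        if (!in) return std::nullopt;
        Batch out;
        for (auto v : *in)
          if (v > 4) out.push_back(v);
        pool.allocate(out.size() * sizeof(int64_t));
        return out;
      }
    };

    int main() {
      MemoryPool pool;
      ScanOp scan;
      FilterOp filter(scan, pool);  // the "pipeline": scan -> filter
      // The "driver" loop pulls batches through the pipeline to completion.
      while (auto batch = filter.next()) {
        for (auto v : *batch) std::cout << v << " ";
      }
      std::cout << "\n(bytes tracked: " << pool.bytesUsed << ")\n";
      return 0;
    }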

Speaker 2    00:07:26    
At 10K we did almost double as well: the native engine finished in two hours, whereas Java took almost four hours, and again the numbers were better across the board. And the most amazing thing is that we were able to do a 10K run in less than four hours, which Java was not able to do at all. So this gave us amazing improvements. In terms of what you can do with watsonx.data: watsonx.data is a platform for enterprises to build a data lakehouse, and it's built on the concept of openness at all layers. Many engines are available for the user to set up. If you look at the left-hand side, those are all the layers in the stack.

Speaker 2   00:08:21    
You have the engines at the topmost layer, and they work with open table formats like Iceberg and Delta, and file formats like Parquet. We have a metadata store which is Hive-compatible, and your underlying storage could be any cloud store or legacy storage. The infrastructure is based on Kubernetes, so this works on hybrid cloud. So I'll stop here and hand the stage over to Amit, who will be talking about the Presto native engine at Meta and how far they've gotten. So yeah, take it away.

Speaker 3    00:09:02    
Yeah. So we'll be talking about the native engine in Meta production. Next slide. Roughly, this is the history of Presto inside Meta. In 2015 to 2017, we were working on the Hive migration and external analytics, when Presto was initially launched. In 2018-2019, we were working on interactive and batch workloads, and the Presto project was donated to the Linux Foundation at this point. In 2020-21 there was a lot of work on caching, on making Presto work on elastic compute, and the efficiency work was started. And from 2020 on, we started working on native execution. Next slide. Presto at Meta is used for various kinds of workloads. It has internal functions, external functions, internal connectors, and the wall times of these workloads can range from a few seconds to a few hours.

Speaker 3    00:10:15    
They can be periodic workloads, ad hoc, slice-and-dice, and machine learning workloads. There are different kinds of writers and exchange operations, and we had a strong need for faster execution, supporting all of this with limited hardware while keeping up with customer expectations. Next slide. We also had many evolving requirements, like machine learning file formats: we had requirements to support SQL on machine learning file formats, similar to general file formats, and we wanted to do it in the same deployment, with the regular and the new formats together. Next slide. Also, one of the important parts of Presto was correctness, performance, and the release and deployment process. When migrating Presto to Presto C++, because Presto was already running as one of the major engines at Meta, we needed to make sure the replacement functioned as well as Presto and kept its promises.

Speaker 3    00:11:36    
It's almost like changing the engine while the plane is flying. Next slide. Presto has almost a decade of fixes and functionality, and it's the source of truth for all company platforms within Meta, with backward compatibility of results within Presto. So when we were developing the C++ engine, one of the major challenges was making sure it is correct, and building the verification strategy. There is a wide variety of workloads: non-deterministic workloads, the same query with different parameters, retriable and non-retriable workloads. And different libraries got introduced, like JSON parsing and regex libraries. So we had to work in many stages to make sure we could replace Presto with the new engine. Next slide.
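
To make the verification challenge concrete, here is a hedged toy sketch (not Meta's actual harness): shadow-run the same query on both engines and compare the result sets after sorting, so that legitimate row-order non-determinism is not flagged as a mismatch. A real harness also has to handle floating-point tolerance, non-deterministic functions, and retries.

    // Toy shadow-verification: run the same query on both engines and compare.
    // Illustrative only; the real strategy covers far more cases (floats,
    // non-deterministic functions, retriable vs. non-retriable queries, etc.).
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    using Row = std::vector<std::string>;

    // Compare result sets ignoring row order, which SQL does not guarantee.
    bool sameResults(std::vector<Row> java, std::vector<Row> native) {
      std::sort(java.begin(), java.end());
      std::sort(native.begin(), native.end());
      return java == native;
    }

    int main() {
      std::vector<Row> fromJava = {{"alice", "3"}, {"bob", "7"}};
      std::vector<Row> fromNative = {{"bob", "7"}, {"alice", "3"}};  // reordered
      std::cout << (sameResults(fromJava, fromNative) ? "MATCH" : "MISMATCH")
                << "\n";
      return 0;
    }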

Speaker 3   00:12:45    
And then the main driving point for the complete <inaudible> was to make sure the performance metrics were really good. In this case, we measure performance by taking the sum of the CPU time (or execution time) across queries for the Presto Java engine, taking the sum across the same queries for the C++ engine, and looking at the ratio. If it's greater than one, the C++ system is doing better. This is how we figured out how the system was going to be deployed. Next slide.
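
As a worked example of this metric, assuming per-query CPU times are summed per engine: if the Java engine spent a total of 900 CPU seconds on a workload and the C++ engine spent 300, the ratio is 900 / 300 = 3.0, meaning the C++ system is three times better. A tiny sketch:

    // Compute the Java-vs-C++ performance ratio described above:
    // ratio = sum(Java CPU time) / sum(C++ CPU time); > 1 means C++ wins.
    #include <iostream>
    #include <numeric>
    #include <vector>

    double speedupRatio(const std::vector<double>& javaCpuSecs,
                        const std::vector<double>& cppCpuSecs) {
      double javaTotal =
          std::accumulate(javaCpuSecs.begin(), javaCpuSecs.end(), 0.0);
      double cppTotal =
          std::accumulate(cppCpuSecs.begin(), cppCpuSecs.end(), 0.0);
      return javaTotal / cppTotal;
    }

    int main() {
      // Hypothetical per-query CPU seconds for the same workload on each engine.
      std::vector<double> java = {120.0, 300.0, 480.0};  // sums to 900
      std::vector<double> cpp = {40.0, 110.0, 150.0};    // sums to 300
      std::cout << "ratio = " << speedupRatio(java, cpp) << "\n";  // 3.0
      return 0;
    }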

Speaker 3    00:13:30    
And once we deployed the system, over the last one and a half years we were able to make it much more reliable and maintainable, with really fast execution, which also provided a very good customer experience. We have observed three to four times better CPU time than the legacy Java system, and almost two times better execution time. And this was done with almost 60% less hardware, while growing throughput almost three times. Today at Meta, 70% of the workload is running on this native engine. Next slide.

Speaker 1    00:14:15    
Awesome. Thank you very much, folks. Let me see if we have any questions from the audience. I mean, you're saying a decade's worth of fixes and product development for Presto internally at Meta, right? Am I understanding you right?

Speaker 3    00:14:34    
Well, no, it's not just internal, because Presto is in open source, as you were saying. So it has many functionalities used within Meta and outside as well. To make sure Presto could be replaced with the new native engine, we had to find parity with all of this decade of work in the new platform, but within a short period of time.

Speaker 1    00:15:02    
Yeah. How far apart are they, and how likely are they to continue to diverge? Let's say whatever version you need internally versus the one that's available.

Speaker 3    00:15:16    
I think there is no special internal version within Meta; it's the same version that we have in open source, though with some new functionalities. But as for divergence from the Java stack to the C++ stack: there is machine learning file format support that is new, like the Nimble file format, that is not available in the Java stack at all. Plus the regular expression and JSON libraries; these are practically different from what we have in Java.

Speaker 1    00:15:49    
I'm getting a question here from Colorado: how long did it take to develop this drop-in replacement?

Speaker 3    00:15:58    
The development is still ongoing, but I think the Velox project and the Prestissimo project started around 2022, so almost two to three years.

Speaker 2    00:16:08    
Yeah, we started around '21; that's when I joined the team. And we started this in open source, so this is open-source-first development. It was Meta, folks from Ahana who are now at IBM, Intel folks, and ByteDance folks. So we kind of started then, and it's been going on for four-ish years.

Speaker 1    00:16:30    
Yeah. We got another one from Yuri saying maybe Trino on C++? I'm not sure what, uh, yeah.

Speaker 2   00:16:41    
So, of course, we are not trying that. But using Velox with Trino plans: those plans are also close to Presto plans, so it's doable. And I believe someone did put something out in open source, but we've not tried it ourselves. Yeah.

Speaker 1    00:17:06    
Got another one here. Let's see. Isha was asking: does it support GPU acceleration? And do we know how well this compares to newer systems, like, uh, data feed?

Speaker 2    00:17:18    
Yeah, so the Velox layer is being enhanced for hardware acceleration as we speak, by a bunch of hardware accelerator vendors. There's NVIDIA working on it, there's a company called Rivos, and there's another, NeuroBlade. The accelerator support comes in at the Velox layer, so Gluten and Presto just have to get that into their clusters, I believe. Gluten with Spark has done that already, and Presto is getting there. Yeah.

Speaker 1    00:17:51    
We got one more here from Yuri, another one deep in the weeds: did you use the Arrow Flight connector to connect the clusters to each other?

Speaker 2    00:18:03    
So we are not connecting the clusters to each other with Arrow Flight; that's not something we've tried yet. But we do have an Arrow Flight connector, and we are using it to connect to external systems.

Speaker 1    00:18:17    
Trino versus Presto: are both the same, or is there any difference in the query engine? And does Presto support open tables like Iceberg and Delta tables?

Speaker 2   00:18:28    
So I didn't quite follow that. <crosstalk>

Speaker 1    00:18:30    
Yeah, I'm not sure either. So if you want to repackage this as more of a question, I think it's: Trino versus Presto, first of all, what's the difference in query engine there? That's the question.

Speaker 2    00:18:47    
Yeah. So things have diverged quite a bit since the split between Presto and Trino, though there is a lot that's similar. The runtime, of course, is different now because of Presto C++. In terms of the coordinator, we've put a bunch of new things into Presto, like history-based optimization; I believe that's not in Trino. But of course there's always a bit of sharing that goes on as well. And the distributed execution, like the scheduler and all that, is still very similar.

Speaker 1    00:19:20    
Last one here,  

Speaker 2    00:19:21    
And we added all the lakehouse formats for <inaudible>.

Speaker 1    00:19:25    
Okay, cool. I think that was the second half of that question. Last question here from Yuri: any metrics comparison between Presto and, say, Databricks Photon?

Speaker 2   00:19:36    
So those would be published directly by IBM, so we're not going to talk about that yet.

Speaker 1    00:19:45    
Stay tuned, stay tuned. Aditi, Amit, thank you very much; we really appreciate it.