Open Source Query Performance - Inside the next-gen Presto C++ engine
Presto (https://prestodb.io/) is a popular open source SQL query engine for high-performance analytics in the open data lakehouse. Originally developed at Meta, Presto has been adopted by some of the largest data-driven companies in the world, including Uber, ByteDance, Alibaba, and Bolt. Today it's available to run on your own infrastructure or through managed services such as IBM watsonx.data and AWS Athena.
Presto is fast, reliable, and efficient at scale. The latest innovation in Presto is a state-of-the-art native C++ query execution engine that replaces the old Java execution engine. Presto C++ is built on Velox, another Meta open source project that provides common runtime primitives across query engines. Deployments with the new Presto native engine show massive price-performance improvements, with fleet sizes shrinking to almost a third of their Java cluster counterparts, leading to enormous cost savings.
The Presto Native Engine project began in 2020 and has since matured into production use at Meta, Uber, and IBM watsonx.data. This talk gives an in-depth look at this journey, covering:
- Introduction to Prestissimo/Velox architecture
- Production experiences and learnings from Meta
- Benchmarking results from TPC-DS workloads
- New lakehouse capabilities enabled by the native engine
Beyond the product features, we will highlight how the open source community shaped this innovation and the benefits of building technology like this openly across many companies.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Adam - 00:00:06
Right now, Amit and Aditi are about to join the stage. Are you guys with us? Amit, Aditi? Hello? Can you hear me okay?
Aditi Pandit - 00:00:17
Yeah, I can hear you.
Adam - 00:00:18
Okay. Awesome. Guys, take it away. I'll be back in 15 minutes, and we're going to have a bunch of questions from the audience. I'm sure we'll have about five minutes at the end to ask you those questions. Looking forward to it, very excited to hear what you have to share with us today. Take it away.
Aditi Pandit - 00:00:38
Yeah, thanks Adam. So, hi everyone. I am Aditi Pandit. I'm an engineer at Ahana, now part of IBM, and I'm joined by Amit Dutta, who is a software engineer at Meta. We are going to talk today about the Presto C++ engine, which we built in open source and which is now available in IBM watsonx.data and in production at Meta.
In terms of this talk, we'll start with a quick introduction to Presto, and I'll give some background about the Presto C++ engine: what it is and why we are doing it. Then a quick introduction to the IBM watsonx.data platform and the Presto Native Engine offering on it. Then Amit will take over and talk about the production experience at Meta.
Aditi Pandit - 00:01:30
So, an introduction to Presto. Presto is a fast and reliable open source SQL query engine for data analytics and the open data lakehouse. This project started at Meta over 10 years ago and has since expanded its reach to big data companies like Uber, ByteDance, and others. Look at the numbers here; they're massive.
This engine is built in open source. In terms of the ecosystem, it's a distributed SQL query engine that sits on top of the underlying storage, which could be S3, other cloud stores, or regular databases. It's a federated query engine, so it can query many systems. At the higher layers, you have tools like Looker and Superset that can be used to submit queries to the Presto engine.
The system is massive. It has a very vibrant open source community, both in terms of users and contributors, and you can see a lot of cool logos here: Meta, Uber, IBM, ByteDance, Twitter, Alibaba, you name it.
Moving on to the Presto Native Engine: what and why? Around 2020, the teams using Presto realized that for the use cases it was being applied to, like very advanced analytics and machine learning on the data lakehouse, the Java worker was reaching its limits in terms of what it could do.
This led the team toward a C++ query evaluation engine. The system is built with first-class vectorization, runtime optimizations, and built-in memory management, which gives us a lot of amazing things for our current use cases. This was in line with similar initiatives in the industry at the time: Databricks built Photon, and Intel and Meta started Gluten for Spark acceleration.
What is the Presto Native Engine? It is a full rewrite of the Presto worker in C++, a drop-in replacement for the Presto Java worker. It leverages Velox, another open source project started at Meta. Velox is a library of data processing primitives and in-memory data management: hash joins, window functions, all these things are provided by Velox.
To show at a high level what's going on: you have the Presto cluster on the left-hand side. A typical Presto cluster has a coordinator that talks to a Hive metastore for table and partition information, plus a set of worker nodes. These workers do all the reading, execution, and exchanges between each other to make your query happen.
There is a control API and a data exchange API that the workers have to honor to interact with the coordinator and with each other for exchanges. In the Java-based system, there were a lot of limitations and inefficiencies: GC pauses, performance cliffs, operational difficulties, performance that was hard to control.
We moved to an architecture that looks exactly the same, except the worker's underlying executable is swapped for the C++ executable. That's what we do to migrate your clusters: the C++ worker speaks the same control API and data exchange API as the previous system, so by just dropping it in, all of this works.
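To make the drop-in idea concrete, below is a rough sketch of the protocol surface such a replacement worker implements. The paths follow Presto's REST API as best we can recall; the exact set of endpoints and verbs is simplified, so treat this as illustrative rather than authoritative.

```cpp
// Illustrative only: the HTTP endpoints a Presto worker serves so the
// coordinator (control API) and peer workers (data exchange API) can talk
// to it. Paths are simplified from Presto's REST protocol.
#include <string>
#include <vector>

struct WorkerProtocolSurface {
  // Control API: the coordinator creates, updates, and cancels task
  // fragments on the worker.
  std::vector<std::string> controlApi = {
      "POST   /v1/task/{taskId}",        // create or update a task fragment
      "GET    /v1/task/{taskId}/status", // poll task status
      "DELETE /v1/task/{taskId}",        // abort a task
  };
  // Data exchange API: downstream workers pull result pages produced by
  // upstream tasks, identified by an output buffer and a sequence token.
  std::vector<std::string> exchangeApi = {
      "GET /v1/task/{taskId}/results/{bufferId}/{token}",
  };
};
```

Because the C++ worker honors the same surface, the coordinator and the rest of the cluster cannot tell the difference.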
The C++ worker leverages runtime optimizations, smart IO prefetching, and caching. It has a lot of cool stuff that gives it a big edge.
Velox at a glance: this is what the underlying engine uses. A task fragment, which comprises part of the SQL execution, is given to drivers. Drivers run pipelines of operators; you can see the operators for joins, aggregations, and the exchanges on the left-hand side. And there's a massive built-in memory management layer, comprising memory pools and a memory manager, that the operators use.
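As a rough illustration of those pieces, here is a minimal sketch of building a Velox operator pipeline with the PlanBuilder utility. It follows the style of Velox's public examples, but setup calls and signatures vary across versions, so treat it as a sketch rather than a verbatim recipe.

```cpp
// A minimal sketch of handing work to Velox, in the style of Velox's own
// examples. Setup calls and signatures vary by version; illustrative only.
#include "velox/common/memory/Memory.h"
#include "velox/exec/tests/utils/PlanBuilder.h"
#include "velox/functions/prestosql/registration/RegistrationFunctions.h"
#include "velox/parse/TypeResolver.h"

using namespace facebook::velox;

int main() {
  // One-time setup: the memory manager (pools come from here) and the
  // Presto SQL scalar function package.
  memory::MemoryManager::initialize({});
  functions::prestosql::registerAllScalarFunctions();
  parse::registerTypeResolver();

  // Build a plan fragment: scan -> filter -> aggregate. This is the kind
  // of operator pipeline a Presto task fragment is compiled into; drivers
  // then pull vectors (batches of rows) through these operators.
  auto rowType = ROW({"c0", "c1"}, {BIGINT(), DOUBLE()});
  auto plan = exec::test::PlanBuilder()
                  .tableScan(rowType)
                  .filter("c0 > 100")
                  .singleAggregation({"c0"}, {"sum(c1)"})
                  .planNode();

  // Executing the plan requires a Task with a QueryCtx and connector
  // splits, which Prestissimo wires up from the coordinator's task
  // description (elided here).
  return 0;
}
```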
What are the advantages of Presto Native? A huge performance boost without the Java engine's performance cliffs, and, because of Velox, reusable and extensible primitives. Velox is also used in Spark acceleration, most notably in Gluten.
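To give a feel for the "extensible primitives" point, this is roughly what adding a scalar function to Velox looks like, following the simple-function API shown in Velox's documentation (details are from memory, so verify against your version):

```cpp
// Roughly the Velox "simple function" API from its documentation: define
// the per-row logic once, and Velox handles vectorization, nulls, and
// registration under the given SQL name.
#include "velox/functions/Macros.h"
#include "velox/functions/Registerer.h"

template <typename TExec>
struct PlusOneFunction {
  VELOX_DEFINE_FUNCTION_TYPES(TExec);

  // Called once per row; Velox wraps this into a vectorized expression.
  FOLLY_ALWAYS_INLINE void call(int64_t& result, const int64_t& input) {
    result = input + 1;
  }
};

void registerMyFunctions() {
  // After this, "plus_one(x)" is usable from SQL in any engine embedding
  // Velox, whether Prestissimo or Gluten.
  facebook::velox::registerFunction<PlusOneFunction, int64_t, int64_t>(
      {"plus_one"});
}
```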
To show what we achieved, we ran a lot of TPC-DS workloads at IBM, among other results. For the 1K run (scale factor 1,000), the native engine did almost three times better than Java. The blue is the Java run, red is the native run, and you can see the native numbers are good all throughout. At 10K, we did almost twice as well: the native engine finished in two hours, whereas Java took almost four hours. Better numbers across the board. The most amazing thing is that we were able to do a 10K run in less than four hours, which Java was not able to do at all. This gave us amazing improvements.
In terms of what you can do with watsonx.data: it's a platform for building a data lakehouse for enterprises, built on the concept of openness at all layers. Many engines are available for the user to set up, and these engines work with open formats like Iceberg, Delta, and Parquet. We have a metadata store that is Hive-compatible. Your underlying storage could be any cloud store, or legacy query engines. The infrastructure is based on Kubernetes, so this works on hybrid cloud.
I'll stop here and hand the stage over to Amit, who will talk about the Presto Native Engine at Meta and how far they've gotten with it. Take it away.
Amit Dutta - 00:09:02
Yeah. We'll be talking about the native engine in Meta production.
Roughly, this is the history of Presto inside Meta. From 2015 to 2017, we were working on the Hive migration and external analytics, after Presto was initially launched. In 2018-2019, we were working on interactive and batch workloads, and the Presto project was donated to the Linux Foundation at this time.
In 2020-21, there was a lot of work on caching and on making Presto work with elastic compute, and efficiency work started. From 2020 on, we started working on native execution.
Presto at Meta is used for many different kinds of workloads. It has internal functions, external functions, and internal connectors. The wall times of these workloads range from a few seconds to a few hours. They can be periodic workloads, ad hoc slice-and-dice, and machine learning workloads, with different kinds of writers and exchange operations. There was a strong need for faster execution, and for supporting all of this with limited hardware while still meeting customer expectations.
We also had many evolving requirements, like machine learning file formats. We needed SQL on machine learning file formats to work just as it does on general file formats, and we wanted to do it in the same deployment, with both the regular and the new formats.
Important aspects for Presto were correctness, performance, and release and deployment. When migrating Presto to Presto C++, because Presto was already running as one of the major engines at Meta, we needed to make sure that while we changed it, it functioned as well as Presto and kept its promises. It's almost like changing the engine while the plane is flying.
Presto has almost a decade of fixes and functionality. It's the source of truth for all company platforms within Meta, and there is backward compatibility of results within Presto. When developing the new engine, one major challenge was the correctness and verification strategy. There is a wide variety of workloads: some non-deterministic, some the same query with different parameters, retriable and non-retriable workloads. Different libraries were introduced, like JSON parsing and regex libraries. We had to work in many stages to make sure we could replace Presto with the new engine.
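As a sketch of what that verification can look like in practice, imagine shadow-running each query on both engines and comparing the results. Everything below (the runOnJava/runOnNative shims, text-encoded rows) is hypothetical scaffolding for illustration, not Meta's actual tooling.

```cpp
// Hypothetical sketch of a correctness check: shadow-run the same query on
// the Java and C++ engines and compare result sets. The runOnJava/runOnNative
// shims and the Row encoding are invented for illustration.
#include <algorithm>
#include <string>
#include <vector>

using Row = std::vector<std::string>; // one result row, columns as text

std::vector<Row> runOnJava(const std::string& sql);   // hypothetical shim
std::vector<Row> runOnNative(const std::string& sql); // hypothetical shim

// Most SQL queries have no defined row order, so compare as multisets by
// sorting first. Non-deterministic queries (e.g. floating-point aggregates)
// need looser, per-column tolerance checks instead.
bool resultsMatch(const std::string& sql) {
  auto java = runOnJava(sql);
  auto native = runOnNative(sql);
  std::sort(java.begin(), java.end());
  std::sort(native.begin(), native.end());
  return java == native;
}
```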
The main driving point for the complete migration was to ensure the performance metrics were really good. We measure performance by taking the CPU time or execution time of Presto Java evaluation and comparing it with the C++ execution time. If the ratio is greater than one, then the C++ system is doing better. This is how we figured out how the system was going to be deployed.
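In code form, the metric is just this ratio (the numbers in the comment are made up for illustration):

```cpp
// The comparison metric as described: ratio of the Java engine's cost to the
// native engine's cost for the same query. Above 1.0 means C++ did better.
double nativeSpeedup(double javaCpuSeconds, double nativeCpuSeconds) {
  return javaCpuSeconds / nativeCpuSeconds; // e.g. 360.0 / 120.0 = 3.0x
}
```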
Since deploying it over the last one and a half years, we have made it much more reliable and maintainable, with really fast execution providing a very good customer experience. We observed three to four times better CPU time than the legacy Java system and almost two times better execution time. This was done with almost 60% less hardware while growing throughput by almost three times. Today, about 70% of the workload is running on this native engine.
Adam - 00:14:15
Awesome. Thank you very much, folks. Let me see if we have any questions from the audience. You're saying a decade's worth of fixes and product development for Presto internally at Meta, right? Am I understanding you right?
Amit Dutta - 00:14:34
Yeah, well, no, it's not just internal, because Presto is open source, as you were saying. It has many functionalities within Meta and outside as well. To make sure Presto could be replaced with the new native engine, we had to reach parity with all this decade of work in the new platform, and within a short period of time.
Adam - 00:15:02
How far apart are they, and how likely are they to continue to diverge? Like the version you guys need internally versus the one that's available?
Amit Dutta - 00:15:16
I think there is no special internal version within Meta; it's the same version that we have in OSS, but there are some new functionalities. The divergence from the Java stack to the C++ stack includes new machine learning file format support, like the Nimble file format and others, which is not available in the Java stack at all. Plus, the regular expression and JSON libraries are practically different from what we have in Java.
Adam - 00:15:49
I'm getting a question here from Colorado. How long did it take to develop this drop-in replacement?
Amit Dutta - 00:15:58
The development is still ongoing, but the Velox project and the Presto project started around 2022, so almost two to three years.
Aditi Pandit - 00:16:08
Yeah, we started around 2021; that's when I joined the team. We started this in open source, so this is open-source-first development. It was Meta folks, Ahana folks (now at IBM), Intel folks, and ByteDance folks. It's been going on for about four years.
Adam - 00:16:30
We got another one from Yuri saying maybe Trino on C++; not sure what exactly.
Aditi Pandit - 00:16:41
Of course we are not trying that ourselves, but using Velox with Trino plans is doable; those plans are close to Presto plans. I believe someone has put something out in open source, but we have not tried it.
Adam - 00:17:06
Got another one here. Isha was asking does it support GPU acceleration? Do we know how well this compares to newer systems like Databricks Photon?
Aditi Pandit - 00:17:18
The Velox layer is being enhanced for hardware acceleration as we speak, with a bunch of hardware accelerators. NVIDIA is working on it with RAPIDS, and there's another company called NeuroBlade. The accelerator comes in at the Velox layer, so Gluten and Presto just have to get that into their clusters. Gluten with Spark has done that already, and Presto is getting there.
Adam - 00:17:51
One more here from Yuri, deep in the weeds. Did you use the Arrow Flight connector to connect the clusters to each other?
Aditi Pandit - 00:18:03
We are not connecting clusters together with Arrow Flight. It's not something we've tried yet. But we have an Arrow Flight connector and we are using it to connect to external systems.
Adam - 00:18:17
Trino versus Presto: are they the same, or is there any difference in the query engine? And does Presto support open table formats like Iceberg and Delta?
Aditi Pandit - 00:18:28
I didn't quite follow.
Adam - 00:18:30
If I can repackage this as a question, it might be: Trino versus Presto, first, the difference in the query engines.
Aditi Pandit - 00:18:47
Things have diverged quite a bit since the split between Presto and Trino. The runtime is different now because of Presto C++. In terms of the coordinator, we've put a bunch of new things into Presto, like history-based optimization, which I believe is not in Trino. There's always some sharing, though; the distributed execution, like the scheduler, is still very similar.
Adam - 00:19:20
Last one here.
Aditi Pandit - 00:19:21
We added all the lakehouse table formats to Presto.
Adam - 00:19:25
Okay, cool. Last question here from Yuri, any metrics comparison between Presto and Databricks Photon?
Aditi Pandit - 00:19:36
Those would be published directly by IBM, so we are not going to talk about it yet.
Adam - 00:19:45
Stay tuned. Stay tuned. Amit, Aditi, thank you very much. We really appreciate it.