ODBC Takes an Arrow to the Knee
For decades, ODBC/JDBC have been the standard for row-oriented database access. However, modern OLAP systems tend instead to be column-oriented for performance, leading to significant conversion costs when requesting data from database systems. This is where Arrow Database Connectivity (ADBC) comes in!
ADBC is similar to ODBC/JDBC in that it defines a single API which is implemented by drivers to provide access to different databases. However, ADBC's API is defined in terms of the Apache Arrow in-memory columnar format. Applications can code to this standard API much like they would for ODBC or JDBC, but fetch result sets in the Arrow format, avoiding transposition and conversion costs if possible.
This talk will cover goals, use cases, and examples of using ADBC to communicate with different data APIs (such as Snowflake, Flight SQL, or Postgres) with Arrow-native in-memory data.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Adam - 00:00:06
And we're back this time with Matthew. Matt, are you with us?
Matthew Topol - 00:00:13
Hey. Yeah, I'm here.
Adam - 00:00:14
Good to see you. We have a whole packed day ahead of us. So I'm just gonna have you take the stage.
Matthew Topol - 00:00:22
Let's see if I can squeeze this into 10 minutes. So, I'm gonna talk about ADBC, what makes it better than ODBC, and what you can do with it. First things first, I'm gonna blaze through this: I am the author of In-Memory Analytics with Apache Arrow, a PMC member on the Apache Arrow project, and an Arrow committer.
So, quick primer on what Apache Arrow is, in case you're not familiar with it. It is an in-memory, column-oriented data format. It's a spec, and there are implementations in a whole mess of languages: Go, C++, Rust, Python, you name it. The entire point of it is that, aside from being super performant by being column-oriented in memory, it's also the exact same layout in memory as it is on the wire.
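To make that concrete, here's a minimal sketch using the pyarrow Python bindings; the table contents are made up for illustration:

```python
import pyarrow as pa

# Build a small Arrow table; each column is one contiguous, typed array.
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "name": pa.array(["alice", "bob", "carol"]),
})

print(table.schema)        # the language-independent, spec'd schema
print(table.column("id"))  # grab one whole column, no row-by-row access
```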
Matthew Topol - 00:01:21
So you get to avoid all of those serialization and deserialization costs. You can pass data within your systems and between nodes and machines without having to serialize and deserialize; you can just pass the raw bytes through. If you're not familiar with the term columnar, which I hope everybody is by now, it's just a matter of how you orient and store the data in memory.
If you have your table of data and you have your buffer of memory, your traditional row-oriented buffer is gonna be row one's column one, column two, column three, then row two's column one, column two, column three, in order. Whereas if you do it in a column-oriented way, you end up with all the data for a single column in a contiguous chunk of memory. This is important because it improves your IO performance.
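As a rough illustration of the two layouts, here's the same three-row table flattened both ways in plain Python (real systems do this with raw typed buffers, not Python lists):

```python
# Three rows of (id, name, score).
rows = [(1, "alice", 0.5), (2, "bob", 0.7), (3, "carol", 0.9)]

# Row-oriented: values interleaved, one row after another.
row_major = [value for row in rows for value in row]
# -> [1, 'alice', 0.5, 2, 'bob', 0.7, 3, 'carol', 0.9]

# Column-oriented: all values for a column sit together, so scanning
# one column touches one unbroken run of memory.
col_major = [[row[i] for row in rows] for i in range(3)]
# -> [[1, 2, 3], ['alice', 'bob', 'carol'], [0.5, 0.7, 0.9]]
```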
Matthew Topol - 00:02:14
That's why, say, Parquet is column-oriented. It also allows you to only read, and only allocate memory for, the columns that you actually need to operate on. And it means that your data is already laid out in a configuration that is ideal for vectorized processing and SIMD.
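For example, with pyarrow's Parquet reader you can pull just the columns you need; the file name and column names below are hypothetical:

```python
import pyarrow.parquet as pq

# Only these two columns are read and allocated; the file's other
# columns are never touched, which is the IO win of a columnar layout.
table = pq.read_table("events.parquet", columns=["user_id", "ts"])
```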
So now that we've sped through what Apache Arrow is, let's actually talk about what I'm here to talk about. The most common way people interact with databases nowadays is still, unfortunately, JDBC and ODBC. In your typical case, you have your application, you send your SQL query to the ODBC or JDBC API, and you load a driver of some kind. That driver implements the database-specific network protocol, sends the query to the database, retrieves the results from the database, converts that data into the format ODBC or JDBC expects, and finally returns those results through the ODBC or JDBC API.
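In Python, that traditional flow looks something like this pyodbc sketch (the DSN and table are hypothetical); note that everything comes back as rows:

```python
import pyodbc

# Hypothetical DSN; the driver speaks the database's wire protocol and
# hands results back row by row through the ODBC API.
conn = pyodbc.connect("DSN=mydb")
cursor = conn.cursor()
cursor.execute("SELECT id, name FROM users")
for row in cursor.fetchall():   # one tuple-like Row object per row
    print(row.id, row.name)
```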
Matthew Topol - 00:03:24
One drawback there is the fact that if your database and your system are column-oriented, now you're forcing a whole transposition from columns to rows and rows back to columns just to pass your data through. People use it because ODBC and JDBC aren't going anywhere. They've been around for decades, and they're gonna keep being around. If you're dealing with your standard, traditional transactional use cases, they do what you need.
But nearly all analytical database systems nowadays are column-oriented, for all the reasons I mentioned earlier: improved IO, SIMD vectorization, and, when it comes to analytic processing, column-oriented is just the way to go. Which means that if you're using ODBC or JDBC, you now have to do a conversion just to pass the data over.
Matthew Topol - 00:04:24
And the conversion is also a copy, which is expensive and slow. If you want to avoid that conversion, then you need specific SDK integrations with each and every system you want to integrate with, and that's just complex. Look at all the connectors that Trino and others have to implement just to be performant.
There's another way, and that's entirely the point of what I'm talking about here. ADBC, Arrow Database Connectivity, is a similar concept to ODBC: you have a single API that your application interfaces with, and swappable drivers. The big difference is that the data being passed through is all Arrow-based. It's column-oriented the whole way through, so your column-oriented data never needs to get transposed.
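Here's what that looks like with the ADBC Postgres driver's Python bindings; the connection URI and query are hypothetical:

```python
import adbc_driver_postgresql.dbapi

# Same DB-API-style shape as before, but the result set comes back as
# a pyarrow.Table: column oriented the whole way through.
with adbc_driver_postgresql.dbapi.connect(
    "postgresql://localhost:5432/demo"
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT id, name FROM users")
        table = cur.fetch_arrow_table()
```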
Matthew Topol - 00:05:30
If we look at that diagram from earlier: you send the query to the API, you load an ADBC driver, the API sends the query to the driver, and the driver talks to the database. Now, if the data system returns data as Arrow already, there's no conversion and no copy. You just pass it right through.
At the API level, you're just passing pointers to the raw buffers, because the data that comes back as Arrow is already the exact layout in memory that you want it to be. That's what Arrow is good for: interoperability. If the data system does not support column-oriented Arrow data, then the driver does the conversion to Arrow, which means that your application side only ever has to deal with Arrow data for its column-oriented handling.
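SQLite is a good example of that second case: it's row oriented internally, but the ADBC driver hands you Arrow anyway. A minimal sketch, assuming the adbc-driver-sqlite package (which defaults to an in-memory database):

```python
import adbc_driver_sqlite.dbapi

# SQLite itself returns rows; the driver does the one conversion to
# Arrow inside itself, so the application still only ever sees Arrow.
with adbc_driver_sqlite.dbapi.connect() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1 AS x, 'hello' AS y")
        print(cur.fetch_arrow_table())
```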
Matthew Topol - 00:06:23
Nowadays, the primary thing you're gonna be doing is shoving that into data frames, which are column-oriented. So you avoid copies and transposes on both sides, and everything becomes massively faster.
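Since data frame libraries are Arrow-friendly, the hand-off is cheap. A small sketch (the table here is built inline, standing in for one fetched over ADBC, and pandas must be installed):

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})

# Columns map straight to columns; no transposing on the way in.
df = table.to_pandas()
# polars works the same way: import polars as pl; pl.from_arrow(table)
```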
It's a specification. You get zero-copy data movement via the Arrow C data interface, we make ABI-compatible releases, and we have a bunch of drivers that are already open source. Snowflake and DuckDB implement the interface; BigQuery, Postgres, and SQLite all have open source ADBC drivers that you can go see. The whole thing is versioned, and we have bindings in a large number of languages already.
At its core, it's a header file, a C interface. It's vendor-neutral and column-oriented. If your system already returns Arrow, like Snowflake and DuckDB already do, then it is a significant performance improvement.
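That vendor neutrality shows up in the driver manager, which loads any ADBC driver by name behind the same API; a sketch assuming the adbc-driver-manager and adbc-driver-sqlite packages are installed:

```python
import adbc_driver_manager.dbapi

# Swap the driver name and the application code stays the same.
conn = adbc_driver_manager.dbapi.connect(driver="adbc_driver_sqlite")
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetch_arrow_table())
```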
Matthew Topol - 00:07:40
If you're familiar with Flight SQL, you might be wondering: what's the difference here? How does ADBC complement Flight SQL? The point is that ADBC is the client-side API, as opposed to Flight SQL, which is the wire protocol.
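So a Flight SQL server is just another driver from the application's point of view; a sketch assuming the adbc-driver-flightsql package and a hypothetical server address:

```python
import adbc_driver_flightsql.dbapi

# Flight SQL is the protocol on the wire; ADBC is the client-side API
# in front of it. The application code is identical to other drivers.
with adbc_driver_flightsql.dbapi.connect("grpc://localhost:8815") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetch_arrow_table())
```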
Just to give a quick example of the performance benefits we're seeing here: DuckDB implements the ADBC interface, and they benchmarked their ODBC driver against their ADBC interface. Remember, this is local data, which means there's no network communication, no network traffic confounding this benchmark. The difference in time here is purely the cost of that copy and convert.
Matthew Topol - 00:08:08
If you want more information, go see the docs. Check out my book; the second edition has a whole chapter on ADBC. Check out everything columnar. We're building out ADBC drivers and new connectors for everything we can, to improve performance everywhere that needs connectivity to databases.
So, wow, I managed.
Adam - 00:08:33
Did it. I think you did it.
Matthew Topol - 00:08:36
Woo. I have never done that talk that fast.
Adam - 00:08:39
It was excellent. I'm sure folks would have tons of questions had we had time for those. Matt, thank you very much for sticking with us. I am very excited to be reading your book. I already saw it on your LinkedIn when I was researching. I was like, I think this is exactly the kind of book that I want to read. So, man, thanks again.