ODBC Takes an Arrow to the Knee
For decades, ODBC/JDBC have been the standard for row-oriented database access. However, modern OLAP systems tend instead to be column-oriented for performance, leading to significant conversion costs when requesting data from database systems. This is where Arrow Database Connectivity (ADBC) comes in!
ADBC is similar to ODBC/JDBC in that it defines a single API which is implemented by drivers to provide access to different databases. However, ADBC's API is defined in terms of the Apache Arrow in-memory columnar format. Applications can code to this standard API much like they would for ODBC or JDBC, but fetch result sets in the Arrow format, avoiding transposition and conversion costs if possible.
This talk will cover goals, use-cases, and examples of using ADBC to communicate with different Data APIs (such as Snowflake, Flight SQL or Postgres) with Arrow-native in-memory data.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Speaker 1 00:00:06
And we're back this time with Matthew. Matt, are you with us?
Speaker 2 00:00:13
Hey. Yeah, I'm here.
Speaker 1 00:00:14
Good to see you. We have a whole packed day ahead of us, so I'm just gonna have you take the stage.
Speaker 2 00:00:22
Let's see if I can squeeze this into 10 minutes. So I'm gonna talk about ADBC: what makes it better than ODBC and what you can do with it. First things first, I'm gonna blaze through this: I am the author of In-Memory Analytics with Apache Arrow and a PMC member and committer on the Apache Arrow project. So, quick primer on what Apache Arrow is, in case you're not familiar with it. It is an in-memory, column-oriented data format. It's a spec, and there are implementations in a whole mess of languages: Go, C++, Rust, Python, you name it. The entire point of it is, aside from being super performant by being column-oriented in memory, it's also the exact same layout in memory as it is on the wire.
Speaker 2 00:01:21
So you get to avoid all of those serialization and deserialization costs. You can pass data between your systems and between nodes and machines without having to serialize and deserialize; you can just pass the raw bytes through. And if you're not familiar with the term columnar, which I hope everybody is by now, it's just a matter of how you orient and store the data in memory. If you have your table of data and your buffer of memory, your traditional row-oriented buffer is gonna be column one, column two, column three, column one, column two, column three, in that order. Whereas if you do it in a column-oriented way, you end up with all the data for a single column in a contiguous chunk of memory. This is important because it improves your IO performance.
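The layout distinction the speaker describes can be sketched in a few lines of plain Python. This is a toy illustration, not Arrow itself: the row-major list interleaves values exactly as described ("column one, column two, column three, column one, ..."), while the column-major dict keeps each column contiguous, the way Arrow buffers do.

```python
# Three rows of a table with columns (id, score, label).
rows = [(1, 9, "a"), (2, 7, "b"), (3, 8, "c")]

# Row-oriented: values interleave col1, col2, col3, col1, col2, col3, ...
row_major = [value for row in rows for value in row]

# Column-oriented: all values of one column sit together in one buffer,
# which is what makes column scans IO- and cache-friendly.
col_major = {
    "id":    [r[0] for r in rows],
    "score": [r[1] for r in rows],
    "label": [r[2] for r in rows],
}

# Scanning a single column only touches its own contiguous values:
print(row_major)                 # [1, 9, 'a', 2, 7, 'b', 3, 8, 'c']
print(sum(col_major["score"]))   # 24
```

Reading `score` from the row-major list would mean striding over `id` and `label` too; the column-major version never touches them.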
Speaker 2 00:02:14
That's why, say, Parquet is column-oriented. It also allows you to only read, or allocate memory for, the columns that you actually need to operate on. And it means that your data is already in a configuration that is ideal for vectorized, SIMD processing. So now that we've sped through what Apache Arrow is, let's actually talk about what I'm here to talk about. The most common way people interact with databases nowadays is still, unfortunately, JDBC and ODBC. In your typical case, you have your application, and you send your SQL query to the ODBC or JDBC API, which has loaded a driver of some kind. That driver implements the database-specific network protocol, sends the query to the database, retrieves the results from the database, converts that data into the format for ODBC and JDBC, and finally returns those results through the ODBC and JDBC API.
Speaker 2 00:03:24
And one drawback there is that if your database and your system are column-oriented, you're now forcing a whole transposition from rows to columns and columns to rows just to pass your data through. So, you know, people use it because, well, ODBC and JDBC aren't going anywhere. They've been around for decades, and they're gonna keep being around. And if you're dealing with your standard, traditional transactional use cases, they do what you need. But nearly all analytical database systems nowadays are column-oriented, for all the reasons I mentioned earlier: improved IO, SIMD vectorization. When it comes to analytic processing, column-oriented is the way to go, which means that if you're using ODBC and JDBC, you now have to do a conversion just to pass the data over.
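The transposition the speaker is talking about can be made concrete with a toy sketch (plain Python, not a real ODBC call): a row-oriented API hands back tuples, and a columnar consumer has to copy every single value into per-column containers before it can do anything columnar with them.

```python
# What an ODBC/JDBC-style fetch hands back: a list of row tuples.
fetched_rows = [(1, "x"), (2, "y"), (3, "z")]

# A columnar consumer must transpose rows into columns. zip(*rows) does the
# transposition, but every element still gets touched and copied once.
ids, labels = (list(col) for col in zip(*fetched_rows))

print(ids)     # [1, 2, 3]
print(labels)  # ['x', 'y', 'z']
```

If the database was columnar to begin with, it already paid for the opposite conversion (columns to rows) just to satisfy the row API, so this copy happens on both ends of the wire.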
Speaker 2 00:04:24
And the conversion is also a copy, which is expensive and slow. If you wanna avoid that conversion, then you need specific SDK integrations with each and every system you want to integrate with, and that is just complex. Look at all the connectors that Trino and others have to implement just to be performant. So there's another way, and that's entirely the point of what I'm talking about here. ADBC, Arrow Database Connectivity, is a similar concept to ODBC: you have a single API that your application interfaces with, and swappable drivers. The big difference is that the data being passed through is all Arrow-based. It's column-oriented the whole way through, so your column-oriented data doesn't need to get transposed. So if we look at that diagram from earlier, you can see that you send the query to the API just like before, except you've loaded an ADBC driver.
Speaker 2 00:05:30
So it sends it to the driver, and the driver talks to the database. Now, if the data system returns data as Arrow already, there's no conversion and there's no copy; you just pass it right through. In fact, at the API level, you're just passing pointers to the raw buffers, because the data that comes back as Arrow is already the exact same layout in memory that you want it to be, 'cause that's what Arrow is good for: that interoperability. If the data system does not support column-oriented Arrow data, then the driver does the conversion to Arrow, which means that your application side only ever has to deal with Arrow data for your column-oriented handling. And nowadays, the primary thing you're gonna be doing is shoving that into data frames, which are column-oriented.
Speaker 2 00:06:23
And so you avoid copies and transposes on both sides, and everything becomes massively faster. Like I said, it's a specification: you have zero-copy data movement through the Arrow C data interface, it makes ABI-compatible releases, and we have a bunch of drivers already open source. There's Snowflake; DuckDB implements the interface; BigQuery, Postgres, and SQLite all have open-source ADBC drivers that you can go see. The whole thing is versioned, and we have bindings in a large number of languages already. At its core, it's a header file, a C interface. It's a vendor-neutral API, it's column-oriented, and if your system already returns Arrow, like, say, Snowflake or BigQuery already do, then it is a significant performance improvement. Now, if you're familiar with Flight SQL, you might be wondering: what's the difference here? Where does ADBC complement Flight SQL? The point is that ADBC is the client-side API, as opposed to Flight SQL, which is the protocol. And just to give a quick example of the performance benefits we're seeing here: DuckDB implements the ADBC interface, and they benchmarked their ODBC driver versus their ADBC interface. Remember, this is local data, which means there's no network traffic confounding this benchmark. The difference in time here is purely the difference of that copy and convert.
Speaker 2 00:08:08
So if you want more information, go see the docs. Check out my book; the second edition has a whole chapter on ADBC. Check out everything for columnar: we're building out ADBC drivers and new connectors for everything that we can, to improve performance everywhere that needs connectivity to databases. So, wow, I managed...
Speaker 1 00:08:33
You did it. I think you did it.
Speaker 2 00:08:36
Woo. I have never done that talk that fast.
Speaker 1 00:08:39
It was excellent, and I'm sure folks would have tons of questions had we had time for them. But Matt, thank you very much for sticking with us. I am very excited to be reading your book. Thank you. I already saw it on your LinkedIn when I was researching, and I was like, this is exactly the kind of book that I wanna read. So, man, thanks again.