From Kafka to Open Tables: Simplify Data Streaming Integrations with Confluent Tableflow
Modern data platforms demand real-time data—but integrating streaming pipelines with open table formats like Apache Iceberg™, Delta Lake, and Apache Hudi™ has traditionally been complex, expensive, and risky. In this session, you’ll learn how Confluent’s data streaming platform—with unified Apache Kafka® and Apache Flink®—makes it simple to stream all your data into Iceberg tables and Onehouse with Tableflow. Built for open lakehouse architectures, Tableflow lets you represent Kafka topics and their associated schemas as open table formats in just a few clicks, eliminating the need for custom, brittle integrations and batch jobs. See how Confluent enables faster delivery of real-time data products ready for use across open data systems.
Transcript
AI-generated; accuracy is not 100% guaranteed.
Speaker 1 00:00:06
Look at that. We're back and we've got a keynote. This is our last keynote of the day from the folks at Confluent. I'm gonna bring them onto the stage right now. Kasun and Yash, where you all at? Hey, how's it going, fellas?
Speaker 2 00:00:22
It's going well.
Speaker 1 00:00:23
Y'all have got a talk. I'm going to share your screen right now. It is on the stage. I'm gonna get outta here, and I'll be back in about 15 minutes to ask you some questions.
Speaker 2 00:00:39
Good evening, folks, and good afternoon to some of you. I'm Yash, a senior product marketing manager here at Confluent, and I'm joined by Kasun, who is a senior product manager at Confluent. We are going to talk about simplifying streaming integrations with Confluent Tableflow: essentially, going from Kafka to open table formats such as Iceberg and Delta Lake. In case you haven't heard of Confluent, we were founded by the original creators of Apache Kafka, and we are happy to see that we are used by 75% of the Fortune 500 companies. If you want to sign up for Confluent Cloud, explore the product, and see its features and functionality, please go ahead and scan the QR code and get started; today we are giving $400 worth of free credits for the first 30 days of your usage. Before going deep into the topic, let's understand the historical context of the operational and analytical divide.
Speaker 2 00:01:49
So, data across organizations is typically split across two estates: the operational estate and the analytical estate. The operational estate is essentially all your operational apps, such as SaaS and ERP systems and so on, and Apache Kafka has become the de facto standard for organizing all that operational, real-time data. On the other side we have the analytical estate, which is basically all the data lakes, data warehouses, and data lakehouses. And if you look at it here, Apache Iceberg, Delta Lake, and even Hudi are becoming the standard open table formats for analytics. But converting real-time Kafka streaming data into your data lake or data warehouse, basically the analytical estate, is very painful and time consuming: there's a lot of duplication, there are brittle pipelines, and then there's endless maintenance. Let's take a look at what it involves to convert Kafka topics into the open table formats we are talking about.
Speaker 2 00:03:03
So there's this whole setup you have to do to get the streaming data out of your Kafka topics. It can be any connector you want to use, your own connector or a Confluent connector. Then you have your whole ingest pipeline, which is basically converting Kafka topics into universally accepted formats such as Parquet. Then there is schema evolution, type conversion, compaction, and so on and so forth, and obviously syncing metadata into the catalog of your choice. And all of this is just to convert Kafka topics into raw formats, or bronze tables for that matter. Then you also have to do a lot of prep after the ingest stage, right? You have to apply business-specific rules and logic, CDC materialization, deduplication, filtering, and so on to actually make the data ready for analytics, which essentially means converting it into silver and gold tables. So here we go: we are introducing Tableflow, which represents Kafka topics as open table formats such as Apache Iceberg or Delta Lake in just a few clicks, to feed any data warehouse, data lake, or analytics engine.
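To make the "before" picture concrete, here is a rough sketch of the kind of hand-rolled ingest job Tableflow is meant to replace: a Spark Structured Streaming pipeline reading a Kafka topic and appending it into a bronze Iceberg table. Topic, schema, bucket, and catalog names are placeholders, and the Kafka and Iceberg Spark packages are assumed to be on the classpath.

```python
# Hypothetical hand-rolled Kafka-to-Iceberg "bronze" ingest job (illustrative only).
# Assumes spark-sql-kafka and the Iceberg Spark runtime are available on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-to-iceberg-bronze")
         .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.lake.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
         .config("spark.sql.catalog.lake.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
         .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
         .getOrCreate())

# The target table has to exist before streaming into it.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.bronze.orders (
        order_id STRING, customer_id STRING, amount DOUBLE, order_ts TIMESTAMP)
    USING iceberg
""")

order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("order_ts", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "orders")
       .option("startingOffsets", "earliest")
       .load())

# Parse JSON payloads; an Avro topic would need Schema Registry wiring instead.
orders = (raw.selectExpr("CAST(value AS STRING) AS value")
          .select(from_json(col("value"), order_schema).alias("o"))
          .select("o.*"))

# Continuously append into the bronze table. Schema evolution, compaction,
# and catalog sync still have to be handled separately; that is the pain Tableflow removes.
query = (orders.writeStream
         .format("iceberg")
         .outputMode("append")
         .option("checkpointLocation", "s3://my-bucket/checkpoints/orders")
         .toTable("lake.bronze.orders"))
query.awaitTermination()
```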
Speaker 2 00:04:22
Imagine a solution that automates the entire ingest and prep process we just talked about; that's essentially what Tableflow is. So how does Tableflow do that? Before going further into the topic: we have something called Kora, which is the cloud-native Kafka engine developed by Confluent, and we are using the storage layer of Kora to represent Kafka streams as tables and then, subsequently, as Iceberg or Delta tables, storing them in either Confluent-managed storage or the storage of your choice. It can be as simple as an S3 bucket.
Speaker 2 00:05:08
This is the automated process I've been talking about: all of the ingest and prep work is automated by Tableflow. And once the Kafka topics are converted into Iceberg and Delta Lake tables, we can use any catalog sync of your choice, for example AWS Glue, the built-in Iceberg REST catalog, Apache Polaris, or the Databricks Unity Catalog, so that you can feed commercial data warehouses and data lakes such as Onehouse, Amazon Athena, Snowflake, Databricks, Starburst, and Dremio. This is also compatible with any third-party open-source engine such as Spark, DuckDB, Trino, and others. This is the entire process we have been talking about: taking a Kafka topic, which is your real-time operational data, automating the whole conversion with just a few clicks, using the catalog of your choice, and then accessing the data through any commercial or OSS third-party engine. Please note that the Unity Catalog integration is coming soon, but the rest is GA. And this is what our customers have been saying about Tableflow. Busie is one of our early access customers; they provide transportation solutions to businesses, and they're using Tableflow for real-time analytics with Apache Iceberg and Snowflake. They see a lot of potential, they're using it as part of their production real-time workloads, and it's giving them a more efficient and cost-effective data architecture.
Speaker 2 00:06:56
Here is a snapshot of all the Tableflow partners we have: commercial ecosystem partners, system integrators, and compatible technologies. We are very happy to announce that Onehouse is one of our ecosystem partners, alongside Dremio, Imply, and Starburst, as well as commercial partners such as AWS, Databricks, and Snowflake, plus all the system integrators and compatible technologies I mentioned earlier. With this, I would like to pass it on to Kasun, who will walk you through the functionality of Tableflow and the entire process of converting real-time operational data into an open table format such as Iceberg. Over to you.
Speaker 4 00:07:44
Sure, yes. Let me share my screen.
Speaker 3 00:07:50
Game on.
Speaker 4 00:07:51
Alright, thank you. So let me walk you through a quick demo of using Tableflow. As Yash mentioned, it is a feature enabled in Confluent Cloud. Here I have a Confluent Cloud Kafka cluster, and I have my streaming data in one of these topics; I'm using this orders topic, which I continuously ingest streaming data into. This is one of the sample event payloads, and this is the Avro schema associated with the topic. The destination table will be created based on the Avro schema you have here. Now, to enable Tableflow, you just need to click Enable, and with that you have to choose the table format. We currently support Iceberg as the generally available format, and Delta Lake <inaudible>.
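As a hypothetical stand-in for the schema shown on screen, the Avro value schema driving the table definition for an orders topic might look something like this; field names and types are illustrative only.

```python
# Hypothetical Avro value schema for the "orders" topic; Tableflow derives the
# destination table's columns from fields like these.
orders_value_schema = """
{
  "type": "record",
  "name": "Order",
  "namespace": "demo.orders",
  "fields": [
    {"name": "order_id",    "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "amount",      "type": "double"},
    {"name": "order_ts",    "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
"""
```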
Speaker 4 00:08:59
For this demo I'm using Iceberg. You can also select your storage option: you have the choice of using custom storage or Confluent-managed storage, but for this demo I'm choosing custom storage. Since we are using custom storage, you also need to configure a provider integration to allow access from Confluent Cloud to your own S3 bucket, using AWS IAM roles and IAM policies, so you need to have those created in your account. Once you've created the provider integration, you can provide the S3 bucket name; this is the storage that will be used to store all your Iceberg tables. Then you can review the default configuration and launch Tableflow. That is all you have to do to turn an existing Kafka topic into an Iceberg table; it automatically gets materialized into one.
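For reference, here is a rough sketch of the AWS-side prerequisites under the assumptions above: an S3 bucket plus an IAM role that Confluent's provider integration can assume. The principal ARN and external ID come from the provider integration wizard in Confluent Cloud; every value below is a placeholder, not Confluent's documented policy.

```python
import json
import boto3

s3 = boto3.client("s3")
iam = boto3.client("iam")

# Bucket that will hold the materialized Iceberg tables (us-east-1 assumed for brevity).
s3.create_bucket(Bucket="my-tableflow-bucket")

# Trust policy letting the Confluent-provided principal assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "<confluent-provided-principal-arn>"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "<confluent-provided-external-id>"}},
    }],
}

# Permissions the role needs on the bucket.
access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject",
                   "s3:ListBucket", "s3:GetBucketLocation"],
        "Resource": ["arn:aws:s3:::my-tableflow-bucket",
                     "arn:aws:s3:::my-tableflow-bucket/*"],
    }],
}

iam.create_role(RoleName="tableflow-access-role",
                AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.put_role_policy(RoleName="tableflow-access-role",
                    PolicyName="tableflow-s3-access",
                    PolicyDocument=json.dumps(access_policy))
```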
Speaker 4 00:10:08
Also, Tableflow comes with a built-in Iceberg REST catalog, and you can get the Iceberg REST catalog endpoint from the topic. So here you have the Iceberg REST catalog endpoint, and any engine compatible with the Iceberg REST catalog can connect to it directly. For this demo, I'm using Amazon Athena with PySpark, and I'm going to configure my notebook to point to my Iceberg REST catalog endpoint. You also need to obtain the required API key and secret to access this REST catalog, which can be done through the Confluent Cloud console.
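A minimal sketch of that notebook configuration, assuming a generic PySpark session with the Iceberg runtime available; the exact property names can vary by Iceberg version, and the endpoint, API key, and secret are placeholders copied from the Confluent Cloud console.

```python
from pyspark.sql import SparkSession

# Point a Spark catalog named "tableflow" at the Tableflow Iceberg REST catalog.
spark = (SparkSession.builder
         .appName("tableflow-rest-catalog-demo")
         .config("spark.sql.catalog.tableflow", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.tableflow.type", "rest")
         .config("spark.sql.catalog.tableflow.uri", "https://<tableflow-rest-catalog-endpoint>")
         .config("spark.sql.catalog.tableflow.credential", "<api-key>:<api-secret>")
         .getOrCreate())
```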
Speaker 4 00:10:58
After configuring everything, you are ready to start querying the topics, or tables, that we just enabled Tableflow on. For that you need the cluster ID; the cluster ID directly maps to an Iceberg namespace, so you can navigate to your notebook and list all the tables that are part of this namespace. Here you can see the orders table, or topic, listed under that namespace, and now we can start querying the orders table. As part of the Tableflow materialization, it takes care of type mapping, and it can also perform table maintenance, so compaction, garbage collection, and all of those things are handled for you. Here you can view the data we saw in the Confluent console earlier. Now, if you're not using the Iceberg REST catalog, or if you already have an Iceberg catalog such as AWS Glue that you're using with your other data lakehouse projects, you can also integrate with that. To do that, you create a catalog integration at the Kafka cluster level: a Kafka cluster directly maps to a database in your Glue Data Catalog, and a topic maps to a table in the Glue Data Catalog. Here, I have already configured the Glue Data Catalog integration for my Kafka cluster, so I should be able to discover these tables from my Glue Data Catalog. Let me navigate to the Glue Data Catalog, and under the Tables section I can try to discover our orders table.
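A sketch of the queries from this part of the demo, continuing the session configured above; "tableflow" is the catalog name chosen there, and "lkc-abc123" is a placeholder for the Kafka cluster ID that maps to the Iceberg namespace.

```python
# List the tables materialized for this cluster (namespace = cluster ID).
spark.sql("SHOW TABLES IN tableflow.`lkc-abc123`").show()

# Query the orders table just like any other Iceberg table.
spark.sql("SELECT * FROM tableflow.`lkc-abc123`.orders LIMIT 10").show()
```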
Speaker 1 00:13:00
One thing, can you try to make it a little bit bigger?
Speaker 4 00:13:04
Sure,
Speaker 1 00:13:05
Thanks.
Speaker 4 00:13:10
Okay. Now, my orders topic resides in this cluster, and I should be able to see it mapped here. I can click View data, which will take you to the Athena SQL notebook. In this case we are using Athena, and Athena is using the Glue Data Catalog to connect to your Iceberg tables, so unlike the previous case, where we connected directly to the Iceberg REST catalog, any Glue-compatible compute engine can start consuming the data this way. Now here we have an extension to the demo with Onehouse, which directly consumes this Tableflow table. To do that, we create a new cluster in Onehouse, and this is the sample script we are going to run in Onehouse. These values point to the Iceberg REST catalog endpoint of Tableflow, and we have also provided the Iceberg REST catalog credentials. As part of the script, we read the table that we have created, convert it, and store it in a separate S3 location. Now, to run this script, go back to the Jobs section in Onehouse and create a new job,
Speaker 4 00:14:53
And then select the type as Python and attach it to the cluster that we have just created. You also need to point to the Tableflow Python script that I showed you earlier, and then you can create the job, go inside it, and execute it. You can then view the status of the job execution; it has executed and completed successfully, so you can go to your workspace and you should be able to see a new table created by this script. So basically, in this use case, we converted a Kafka topic to an Iceberg table with Tableflow, and then ran a job inside Onehouse to convert it into a Hudi table. You can include any processing logic you like as part of the script I showed. That basically concludes the demo. Yeah, sure.
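The script itself isn't shown in full, so here is a hypothetical stand-in for what such a job could look like: read the Tableflow-materialized Iceberg table through the REST catalog and rewrite it as a Hudi table in a separate S3 location. Endpoint, credentials, cluster ID, field names, and paths are placeholders, and the Iceberg and Hudi Spark bundles are assumed to be available on the cluster.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tableflow-to-hudi")
         .config("spark.sql.catalog.tableflow", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.tableflow.type", "rest")
         .config("spark.sql.catalog.tableflow.uri", "https://<tableflow-rest-catalog-endpoint>")
         .config("spark.sql.catalog.tableflow.credential", "<api-key>:<api-secret>")
         .getOrCreate())

# Read the Iceberg table that Tableflow materialized from the Kafka topic.
orders = spark.table("tableflow.`lkc-abc123`.orders")

# Any processing logic could go here; then rewrite the result as a Hudi table.
(orders.write
 .format("hudi")
 .option("hoodie.table.name", "orders_hudi")
 .option("hoodie.datasource.write.recordkey.field", "order_id")
 .option("hoodie.datasource.write.precombine.field", "order_ts")
 .option("hoodie.datasource.write.partitionpath.field", "customer_id")
 .mode("overwrite")
 .save("s3://my-bucket/hudi/orders_hudi"))
```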
Speaker 2 00:16:03
Sounds good, sounds good. Thank you, Kasun, for such a nice demo. In case you want to read and learn more about Confluent, we have a nice blog about Tableflow, and we also have a very short explainer video on Tableflow on YouTube. Please scan the QR codes shown on the slide to get started with Confluent Cloud and to read and watch more about Tableflow if you need any further information. Thank you so much.
Speaker 1 00:16:39
Sweet, fellas, there's a lot of questions coming through, so I'm gonna just start firing away. Correct me if I'm wrong here: if the ingestion rate is very high, we might end up with several small Parquet files. How is the compaction performed? Compaction, uh, not sure if that's a word, but I like it. Is this something automatic or something the user would configure, i.e., compaction frequency? Is that a word that I just am not aware of, or is it? It is a word, isn't it? They're using it like they're very confident it's a word. All right, so did you get that question?
Speaker 4 00:17:20
Yeah, good question. For real-time ingest, Tableflow continuously takes care of compacting all of these data files. It's not user-configurable; we optimize it based on the parameters we have internally and create compacted files for you, so you don't have anything to configure or set.
Speaker 1 00:17:53
All right. Your audio kind of broke up there, but it was basically like the compaction happens on your side of the fence, is what I understood.
Speaker 4 00:18:03
Yeah, it's entirely handled by us.
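For contrast, on a self-managed Iceberg table this small-file compaction is something you would typically run yourself with Iceberg's maintenance procedures. A minimal example from a Spark session configured with an Iceberg catalog named "lake"; catalog, table name, and target file size are placeholders.

```python
# Manually compacting small files on a self-managed Iceberg table
# (the kind of maintenance Tableflow runs automatically).
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'bronze.orders',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```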
Speaker 1 00:18:05
Excellent. All right. So, when looking at Tableflow, how does it handle data coming from a streaming source like Confluent? I've found Iceberg frequently chokes with my streaming use cases. Do you have plans to add Hudi as well, since we use some of both?
Speaker 4 00:18:28
Maybe in the future, based on the demand, we might also consider it, but right now we don't have plans for Hudi.
Speaker 1 00:18:34
All right. So right now, no. Sorry man, your audio's a little bit choppy, so I wanna make sure that was clear: right now, no, but if enough people ask for it in the future, yes. Later.
Speaker 4 00:18:49
Yeah. Yeah.
Speaker 1 00:18:51
All right. What is the current latency? This is where you get to show off some numbers,
Speaker 4 00:18:59
So, uh, it depends on the... did you hear me properly?
Speaker 1 00:19:05
All right. I think I got it. Basically it's, um, <laugh>, sorry, I'm just gonna try and replay this back for you: it is potentially 15 minutes, but it all depends on the amount of data and the shape of the data and all that fun stuff. Yeah,
Speaker 4 00:19:26
Correct.
Speaker 1 00:19:27
All right, cool. Oh, folks would love to know what the GitHub URL is and how does Tableflow for Delta Lake compare with Databricks DLT?
Speaker 4 00:19:43
Yeah, so these are two options for creating <inaudible>. With <inaudible>, Tableflow supports the topic and the Kafka infrastructure directly, and then DLT is another option as well. It's completely up to the user to decide. But if you want to keep direct parity between a Kafka topic and a table, Tableflow might be the best option. If you're already using DLT, and if that is easier for you to use, then DLT would be the best option to convert a Kafka topic to a table.
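For readers unfamiliar with the DLT route mentioned here, this is a rough sketch of landing a Kafka topic as a Delta table with Databricks Delta Live Tables, assuming a Databricks DLT pipeline with access to the Kafka cluster; the bootstrap servers and topic name are placeholders, and authentication options are omitted.

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(name="orders_bronze", comment="Raw Kafka events landed as a Delta table")
def orders_bronze():
    # Stream the Kafka topic into a managed Delta table; DLT handles the table lifecycle.
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "<bootstrap-servers>")
            .option("subscribe", "orders")
            .option("startingOffsets", "earliest")
            .load()
            .select(col("key").cast("string").alias("key"),
                    col("value").cast("string").alias("value"),
                    col("timestamp")))
```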
Speaker 1 00:20:35
What capabilities would you highlight in Tableflow for analytical workloads?
Speaker 4 00:20:43
Yeah, again, the purpose of Tableflow is to create destination or target tables, right? And we try to optimize them for analytical work. Currently, something we don't yet expose is partitioning, which we plan to include in upcoming releases. So we keep optimizing for analytical workloads with compaction and table maintenance, and at the same time we'll add controls like partitioning and various other ways of optimizing these tables for your analytical workloads.
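As a point of reference for what user-controlled partitioning generally looks like on an Iceberg table (not a statement about how Tableflow will expose it), a partition field can be added through standard Iceberg DDL from any Spark session with the catalog configured; catalog, table, and column names below are placeholders.

```python
# Standard Iceberg partition evolution on an existing table: partition by day of order_ts.
spark.sql("ALTER TABLE lake.bronze.orders ADD PARTITION FIELD days(order_ts)")
```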
Speaker 1 00:21:21
Excellent. Fellas, this was wonderful. I wanna give you all a huge shout out and really appreciate your participation in this.