From Kafka to Open Tables: Simplify Data Streaming Integrations with Confluent Tableflow

May 21, 2025
Speakers
Kasun Indrasiri Gamage
Senior Product Manager
Confluent
Yashwanth Dasari
Senior Manager, Product Marketing & GTM Strategy
Confluent

Modern data platforms demand real-time data—but integrating streaming pipelines with open table formats like Apache Iceberg™, Delta Lake, and Apache Hudi™ has traditionally been complex, expensive, and risky. In this session, you’ll learn how Confluent’s data streaming platform—with unified Apache Kafka® and Apache Flink®—makes it simple to stream all your data into Iceberg tables and Onehouse with Tableflow. Built for open lakehouse architectures, Tableflow lets you represent Kafka topics and their associated schemas as open table formats in just a few clicks, eliminating the need for custom, brittle integrations and batch jobs. See how Confluent enables faster delivery of real-time data products ready for use across open data systems.

Transcript

AI-generated; accuracy is not 100% guaranteed.

Demetrios - 00:00:06  

Look at that. We're back and we've got a keynote. This is our last keynote of the day from the folks at Confluent. I'm gonna bring them onto the stage right now. Kasun and Yashwanth, where you all at? Hey, how's it going, fellas?  

Yashwanth Dasari - 00:00:22  

It's going well.  

Demetrios - 00:00:23  

Y'all have gotta talk. I'm going to share your screen right now. It is on the stage. I'm gonna get outta here, and I'll be back in about 15 minutes to ask you some questions.  

Yashwanth Dasari - 00:00:39  

Good evening, folks, and good afternoon to some of you. I'm Yashwanth. I'm a senior product marketing manager here at Confluent, and I'm joined by Kasun, who is a senior product manager at Confluent. We are going to talk about simplifying streaming integrations with Confluent Tableflow, essentially going from Kafka to open table formats such as Iceberg and Delta Lake. In case you haven't heard of Confluent, we were founded by the original creators of Apache Kafka, and we're happy to say we're used by 75% of the Fortune 500. If you want to sign up for Confluent Cloud and explore the product's features and functionality, please scan the QR code and get started. Today, we are offering $400 worth of free credits for your first 30 days of usage. Before going deep into the topic, let's understand the historical context of the operational and analytical divide.  

Yashwanth Dasari - 00:01:49  

So, data across organizations is typically split across two estates: the operational estate and the analytical estate. The operational estate is essentially all your operational apps, such as SaaS, ERP, and so on. Apache Kafka has become the de facto standard for organizing all of this operational, real-time data. On the other side we have the analytical estate: all the data lakes, data warehouses, and data lakehouses. Apache Iceberg, Delta Lake, and even Hudi are becoming the standard open table formats for analytics. But converting real-time Kafka streaming data into your data lake or data warehouse, basically the analytical estate, is very painful and time consuming. There's a lot of duplication, there are brittle pipelines, and there's endless maintenance. Let's take a look at what it involves to convert Kafka topics into the open table formats we're talking about.  

Yashwanth Dasari - 00:03:03  

So there's this whole setup you have to do to get the streaming data out of your Kafka topics. It can be any connector you want to use: your own connector or a Confluent connector. Then you have your whole ingest pipeline, which is converting Kafka topics into universally accepted formats such as Parquet. Then there is schema evolution, type conversion, compaction, and so on, and then obviously syncing metadata into the catalog of your choice. All of this is just to convert Kafka topics into raw-format, or bronze, tables. Then you also have to do a lot of prep after the ETL stage: business-specific rules and logic, CDC, normalization, deduplication, filtering, and so on, to actually make the data ready for analytics. That is essentially converting it into silver and gold tables. So here we go: we are introducing Tableflow, which represents Kafka topics as open table formats such as Apache Iceberg or Delta Lake, with a few clicks, to feed any data warehouse, data lake, or analytics engine.  
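
To make the contrast concrete, here is a rough sketch of what one of those hand-built ingest pipelines often looks like: a Spark Structured Streaming job reading the Kafka topic and appending it to a bronze Iceberg table. This is an illustration only; the broker addresses, bucket paths, topic, and table names are hypothetical, and the required Kafka and Iceberg Spark packages are assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Hypothetical manual pipeline: Kafka topic -> bronze Iceberg table.
spark = (
    SparkSession.builder
    .appName("manual-kafka-to-iceberg")
    # Hadoop-style Iceberg catalog over a placeholder S3 warehouse path.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-lake/warehouse")
    .getOrCreate()
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
    .option("subscribe", "orders")
    .option("startingOffsets", "earliest")
    .load()
)

# In a real pipeline this step also handles Avro decoding via Schema Registry,
# type conversion, and schema evolution; simplified to plain strings here.
bronze = raw.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
    col("timestamp"),
)

# Assumes the target Iceberg table was created beforehand. Compaction and
# other table maintenance remain the operator's responsibility.
query = (
    bronze.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://my-lake/checkpoints/orders")
    .toTable("lake.bronze.orders")
)
query.awaitTermination()
```

Every piece of this sketch, from connector tuning and checkpointing to schema handling, compaction, and catalog sync, is the kind of operational overhead the speakers describe Tableflow absorbing.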

Yashwanth Dasari - 00:04:22  

Imagine a solution that automates the entire ingest and prep process we just talked about. That's essentially what Tableflow is. How does Tableflow do that? Under the hood, we have something called Kora, the cloud-native engine developed by Confluent. We use the storage layer of Kora to represent Kafka streams as tables, convert them to Iceberg or Delta tables, and store them in either Confluent-managed storage or the storage of your choice. It can be as simple as an S3 bucket.  

Yashwanth Dasari - 00:05:08  

This is the automated process I've been talking about: all of the ingestion and prep is handled by Tableflow. Once the Kafka topics are materialized as Iceberg or Delta Lake tables, you can use any catalog sync of your choice, for example AWS Glue, the built-in Iceberg REST catalog, Apache Polaris, or the Databricks Unity Catalog, to feed commercial data warehouses and data lakes such as Onehouse, Amazon Athena, Snowflake, Databricks, Starburst, and Mio. This is also compatible with third-party open source engines such as Spark, Dremio, Trino, and Presto. This is the entire process we have been talking about: take a Kafka topic, which is basically your real-time operational data, automate the entire conversion with just a few clicks, sync to the catalog of your choice, and then access the data through any commercial or open source engine. Please note that the Unity Catalog integration is coming soon, but everything else is generally available. Here is what our customers have been saying about Tableflow. Busie is one of our early access customers; they provide transportation solutions to businesses. They are using Tableflow for real-time analytics with Apache Iceberg and Snowflake, and they see a lot of potential. They have been using it as part of their production real-time workloads, and it is helping them build a more efficient and cost-effective data architecture.  

Yashwanth Dasari - 00:06:56  

Here is a snapshot of all the Tableflow partners we have. We have commercial ecosystem partners, system integrators, and compatible technologies, and we are very happy to announce that Onehouse is one of our ecosystem partners, along with Mio, Endpoint, and Starburst, as well as commercial partners such as AWS, Databricks, and Snowflake. We also have the system integrators and compatible technologies I discussed earlier. With that, I'd like to hand it over to Kasun, who will walk you through the functionality of Tableflow and the entire process of converting real-time operational data into an open table format such as Iceberg. Over to you.  

Kasun Indrasiri Gamage - 00:07:44  

Sure, yes. Let me share my screen.  

Demetrios - 00:07:50  

Game on.  

Kasun Indrasiri Gamage - 00:07:51  

Alright, thank you. Let me walk you through a quick demo of Tableflow. As Yashwanth mentioned, it is a feature enabled in Confluent Cloud. Here I have a Confluent Cloud Kafka cluster, and I have my streaming data stored in one of these topics. I'm using this orders topic, which I continuously ingest streaming data into. This is one of the sample event payloads, and this is the Avro schema associated with the topic; the destination table will be created based on this schema. To enable Tableflow, you just click Enable and then select the table format. We currently support Apache Iceberg and Delta Lake.  
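
For readers following along, the kind of Avro schema the demo relies on could look like the sketch below, registered against the topic's value subject in Schema Registry. The field names, Schema Registry URL, and credentials are hypothetical placeholders, not the actual schema shown in the demo; Tableflow derives the destination table's columns from whatever schema is attached to the topic.

```python
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

# A made-up Avro schema for an "orders" topic (illustrative only).
orders_schema_str = """
{
  "type": "record",
  "name": "Order",
  "namespace": "demo.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "order_ts", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
"""

client = SchemaRegistryClient({
    "url": "https://psrc-xxxxx.us-east-1.aws.confluent.cloud",  # placeholder endpoint
    "basic.auth.user.info": "SR_API_KEY:SR_API_SECRET",         # placeholder credentials
})

# Register the schema under the topic's value subject; Tableflow uses the
# subject's schema to build the Iceberg (or Delta) table definition.
schema_id = client.register_schema("orders-value", Schema(orders_schema_str, schema_type="AVRO"))
print(f"Registered orders-value schema with id {schema_id}")
```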

Kasun Indrasiri Gamage - 00:08:59  

For this demo I'm using Iceberg. You can also select your storage option: custom storage or Confluent Managed Storage. For this demo, I'm choosing custom storage. Since we are using custom storage, you need to configure a provider integration so that Confluent Cloud can access your own S3 buckets, using AWS IAM roles and policies that you have created in your account. Once you've created the provider integration, you also provide the S3 bucket name. This is the storage that will be used to store all your Iceberg tables. Then you review the default configuration and launch Tableflow. That is all you have to do to convert an existing Kafka topic into an Iceberg table; it gets materialized automatically.  

Kasun Indrasiri Gamage - 00:10:08  

Tableflow comes with a built-in Iceberg REST catalog. You can use the Iceberg REST catalog endpoint of the topic, and any Iceberg REST catalog-compatible engine can consume it directly. For this demo, I'm using Amazon Athena via Spark, and I'm going to point my notebook configuration at my Iceberg REST catalog endpoint. You also need to obtain the required API key and secret to access this REST catalog, which can be done through the Confluent Cloud console.  
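
As a rough illustration of that notebook configuration, a Spark session can be pointed at an Iceberg REST catalog with the standard Iceberg catalog properties, along the lines of the sketch below. The catalog name, endpoint URL, and API key and secret are placeholders; check the Tableflow documentation for the exact endpoint format and any additional properties it requires.

```python
from pyspark.sql import SparkSession

# Spark session configured against an Iceberg REST catalog (placeholder values).
# Assumes the Iceberg Spark runtime jar is available to the session.
spark = (
    SparkSession.builder
    .appName("tableflow-rest-catalog")
    .config("spark.sql.catalog.tableflow", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.tableflow.type", "rest")
    .config("spark.sql.catalog.tableflow.uri",
            "https://tableflow.<region>.aws.confluent.cloud/iceberg/catalog")  # placeholder endpoint
    .config("spark.sql.catalog.tableflow.credential",
            "TABLEFLOW_API_KEY:TABLEFLOW_API_SECRET")  # placeholder key:secret
    .getOrCreate()
)
```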

Kasun Indrasiri Gamage - 00:10:58  

After configuring everything, you are ready to start querying the topics, or tables, that we just enabled Tableflow for. You need the cluster ID, which maps directly to an Iceberg namespace. You can navigate to your notebook and list all the tables that are part of this namespace; here you can see the orders table, or topic, listed under that namespace. Now we can start querying the orders table. The table materialization takes care of type mapping, and it also performs table maintenance such as garbage collection; these are completely automated. Here you can view what we saw earlier in the Confluent console.  
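
Continuing that sketch, the cluster-ID-as-namespace mapping means listing and querying the materialized table looks roughly like this; "lkc-xxxxx" stands in for a real Kafka cluster ID.

```python
from pyspark.sql import SparkSession

# Assumes the "tableflow" REST catalog was configured as in the previous sketch.
spark = SparkSession.builder.getOrCreate()

# The Kafka cluster ID is the Iceberg namespace; list its tables.
spark.sql("SHOW TABLES IN tableflow.`lkc-xxxxx`").show()

# Query the materialized orders table. Type mapping, compaction, and other
# maintenance such as garbage collection are handled by Tableflow.
orders = spark.table("tableflow.`lkc-xxxxx`.orders")
orders.show(10)
```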

Kasun Indrasiri Gamage - 00:12:30  

If you're not using the Iceberg REST catalog, or if you already have an Iceberg catalog such as AWS Glue that you're using with other data lakehouse projects, you can integrate with that as well. To do that, you create a catalog integration at the Kafka cluster level. The Kafka cluster maps directly to a database in your Glue Data Catalog, and a topic maps to a table in the Glue Data Catalog. I have already configured the Glue Data Catalog integration for my Kafka cluster, so I should be able to discover these tables from my Glue Data Catalog. Let me navigate to the Glue Data Catalog, and under the tables section I can look for our orders table.  
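
For the Glue path, a hedged sketch of reading the same table through the AWS Glue Data Catalog with Iceberg's Glue catalog implementation is shown below. The catalog name and warehouse bucket are placeholders, as is the assumption that the Glue database created by the integration is named after the Kafka cluster ID.

```python
from pyspark.sql import SparkSession

# Spark session using Iceberg's AWS Glue catalog (placeholder values).
# Assumes the Iceberg Spark runtime and AWS bundle jars are available.
spark = (
    SparkSession.builder
    .appName("tableflow-glue-catalog")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-tableflow-bucket/warehouse")  # placeholder bucket
    .getOrCreate()
)

# Database corresponds to the Kafka cluster, table to the topic (placeholder names).
spark.sql("SELECT COUNT(*) AS order_count FROM glue.`lkc-xxxxx`.orders").show()
```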

Demetrios - 00:13:00  

One thing, can you try to make it a little bit bigger?  

Kasun Indrasiri Gamage - 00:13:04  

Sure.  

Demetrios - 00:13:05  

Thanks.  

Kasun Indrasiri Gamage - 00:13:10  

Okay. Now, my orders topic resides in this cluster, and I should be able to map that here. You can view the data, which takes you to the Athena SQL notebook. In this case, Athena is using the Glue Data Catalog to connect to your Iceberg tables. Unlike the previous case, where we connected directly to the Iceberg REST catalog, any Glue-compatible compute engine can start consuming the data this way.  

Kasun Indrasiri Gamage - 00:14:00  

Now here we have the extension to the demo, where Onehouse directly consumes this Tableflow table. To do that, we have created a new cluster in Onehouse. This is the sample script that we are going to run in Onehouse. It points to the Iceberg REST catalog endpoint of Tableflow and provides the Iceberg REST catalog credentials. As part of the script, we read the table that Tableflow created and convert it to be stored in a separate S3 location.  
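
The Onehouse job itself is not reproduced here, but a hypothetical PySpark script with the same shape, reading the Tableflow-materialized Iceberg table through the REST catalog and writing a Hudi copy to a separate S3 location, might look like the following. Endpoints, credentials, paths, and the record and precombine key fields are placeholders, and the Hudi Spark bundle is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tableflow-to-hudi")
    # Tableflow's Iceberg REST catalog (placeholder endpoint and credentials).
    .config("spark.sql.catalog.tableflow", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.tableflow.type", "rest")
    .config("spark.sql.catalog.tableflow.uri",
            "https://tableflow.<region>.aws.confluent.cloud/iceberg/catalog")  # placeholder
    .config("spark.sql.catalog.tableflow.credential",
            "TABLEFLOW_API_KEY:TABLEFLOW_API_SECRET")  # placeholder
    # Hudi requires the Kryo serializer.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Read the table Tableflow materialized (placeholder cluster ID).
orders = spark.table("tableflow.`lkc-xxxxx`.orders")

# Write a Hudi copy to a separate S3 location; any extra processing logic
# could be applied to the DataFrame before this step.
(
    orders.write.format("hudi")
    .option("hoodie.table.name", "orders_hudi")
    .option("hoodie.datasource.write.recordkey.field", "order_id")   # hypothetical key field
    .option("hoodie.datasource.write.precombine.field", "order_ts")  # hypothetical ordering field
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .mode("overwrite")
    .save("s3a://my-other-bucket/hudi/orders_hudi")  # placeholder target location
)
```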

Kasun Indrasiri Gamage - 00:14:53  

To run this script, go back to the jobs section in Onehouse, create a new job, select Python as the job type along with the cluster we just created, and point it to the Tableflow Python script I showed you earlier. You can then create the job and execute it, and view the status of the job execution. It has executed and completed successfully. If you go to your workspace, you should see a new table created by this script. Basically, in this use case, we converted a Kafka topic to an Iceberg table, then ran a job inside Onehouse to convert it into another table, a Hudi table. You can include any processing logic as part of the script that I showed. That concludes the demo.  

Yashwanth Dasari - 00:16:03  

Sounds good. Thank you, Kasun, for such a nice demo. In case you want to read and learn more, we have a nice blog post about Tableflow, and we also have a short explainer video about Tableflow on YouTube. Please scan the QR codes shown on the slide to get started with Confluent Cloud, and to read and watch more about Tableflow if you need further information. Thank you so much.  

Demetrios - 00:16:39  

Sweet fellas, there's a lot of questions coming through, so I'm gonna just start firing away. Correct me if I'm wrong here. If the ingestion rate is very high, we might end up with several small Parquet files. How is the compaction performed? Compaction, not sure if that's a word, but I like it. Is this something automatic or something the user would configure? Compaction frequency? Is that a word that I just am not aware of, or is that it? It is a word, isn't it? They're using it like they're very confident it's a word. All right, so did you get that question?  

Kasun Indrasiri Gamage - 00:17:20  

Yeah, good question. For real-time ingestion, we continuously take care of compacting all these small data files. It's not user-configurable; we optimize it based on the parameters we have internally and perform the compaction for you. You don't have to set anything.  

Demetrios - 00:17:53  

All right. Your audio kind of broke up there, but it was basically like the compaction happens on your side of the fence, is what I understood.  

Kasun Indrasiri Gamage - 00:18:03  

Yeah, it's handled by us.  

Demetrios - 00:18:05  

Excellent. All right. So when looking at Tableflow, how does it handle data coming from a streaming source like Confluent? I found Iceberg frequently chokes with my streaming use cases. Do you have plans to add Hudi as well, since we use some of both?  

Kasun Indrasiri Gamage - 00:18:28  

We don't have Hudi support right now, but based on demand, we might consider it in the future.  

Demetrios - 00:18:34  

All right. So right now, no. Sorry man, your audio's a little bit choppy, so I wanna make sure that was clear. Right now, no, but if enough people ask for it in the future, yes. Later.  

Kasun Indrasiri Gamage - 00:18:49  

Yeah.  

Demetrios - 00:18:51  

All right. What is the current latency? This is where you get to show off some numbers.  

Kasun Indrasiri Gamage - 00:18:59  

So, depends on the...  

Demetrios - 00:19:05  

Did you hear me properly?  

Kasun Indrasiri Gamage - 00:19:05  

Yes.  

Demetrios - 00:19:05  

All right. I think I got it. Basically it's potentially 15 minutes, but it all depends on the amount of data and the shape of the data and all that fun stuff.  

Kasun Indrasiri Gamage - 00:19:26  

Correct.  

Demetrios - 00:19:27  

All right, cool. Folks would love to know what the GitHub URL is and how does Tableflow for Delta Lake compare with Databricks DLT?  

Kasun Indrasiri Gamage - 00:19:43  

There are two options for creating tables: with Databricks DLT, or from your Kafka infrastructure with Tableflow. It's completely up to the user to decide. If you want to keep direct parity between a Kafka topic and a table, Confluent might be the best option. But if you're already using DLT and that is easier for you, then DLT would be the better option for converting a Kafka topic to a table.  

Demetrios - 00:20:35  

What capabilities would you highlight in Tableflow for analytical workloads?  

Kasun Indrasiri Gamage - 00:20:43  

The purpose of Tableflow is to create destination, or target, tables, and we try to optimize them for analytical work. Partitioning is something we plan to include in upcoming releases. We keep optimizing for analytical workloads with compaction, and over time with controls like partitioning and other ways of optimizing these tables for your analytical workloads.  

Demetrios - 00:21:21  

Excellent. Fellas, this was wonderful. I wanna give you all a huge shout out and really appreciate your participation in this.