AI Agents for ETL/ELT Code Generation: Multiply Productivity with Generative AI
This talk delves into the revolutionary potential of AI agents, powered by generative AI and large language models (LLMs), in transforming ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes. With an emphasis on automating code generation, streamlining workflows, and enhancing data quality, this session explores how AI-driven solutions are reshaping the landscape of data engineering. Attendees will learn how these intelligent agents can reduce manual coding, eliminate errors, improve operational efficiency, and meet compliance requirements, all while accelerating development timelines. We will also cover key use cases where AI agents facilitate real-time data transformation, ensure data governance, and promote seamless deployment across cloud environments, giving businesses a competitive edge in today's data-driven world.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Speaker 1 00:00:06
Let's keep it rocking. We've got Koti coming up for another lightning talk from Southwest. Where you at, Koti? Hey, there he is. 'Sup, dude.
Speaker 2 00:00:19
Hey, thanks for having me.
Speaker 1 00:00:22
I'm gonna jump off stage because I wanna stay on time. I see your screen is right here. Let's get rocking and rolling.
Speaker 2 00:00:31
Hello everyone, this is Koti Darla. I'm a tech lead data engineer at Southwest Airlines, and I'm super excited to talk about AI agents for ETL code generation, which help multiply productivity with the latest generative AI technologies. To give you a very high-level introduction of why this matters: in today's fast-paced data landscape, time to insight is mission critical. AI agents can dramatically reduce the effort to generate insights by automating complex data workflows, and a generative AI, LLM-driven approach transforms raw data into insights in real time. I've listed the current ETL challenges here, but before I talk about them, I want to give a very high-level brief on what ETL is, for those who don't know. ETL has three phases: extract, transform, and load.
Speaker 2 00:01:31
In the extract phase, we pull the data from various data sources; in the transform step, we do standardization and cleansing and convert the data to the desired output format; and then we load it into databases. Here are some of the manual ETL bottlenecks: even with low-code or no-code tools, hand coding is still needed, especially for company-specific requirements, which leads to extended development cycles and slow time to market. Development cost is higher, along with training and onboarding costs, and there is a scarcity of skilled resources for specialized tools. With that, there is a paradigm shift enabled by current developments in the technology: we can build generative-AI-powered automation that creates and manages data pipelines based on natural language input, handling structured, semi-structured, and streaming data.
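A minimal sketch of the three ETL phases described above, using pandas and SQLite; the file name, column names, and target table are illustrative assumptions rather than details from the talk.

```python
import pandas as pd
import sqlite3

# Extract: pull raw data from a source system (here, a hypothetical CSV export)
raw = pd.read_csv("orders_raw.csv")

# Transform: standardize and cleanse, converting to the desired output format
clean = raw.dropna(subset=["payment_status"]).copy()  # drop records missing payment status
clean["order_date"] = pd.to_datetime(clean["order_date"]).dt.strftime("%Y-%m-%d")

# Load: write the transformed data into a target database
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```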
Speaker 2 00:02:38
We can build end-to-end pipeline automation, from data ingestion to transformation. As LLMs are still evolving, it is very important to check the quality of the data pipelines they generate, so we also need quality analyst agents. Once the quality checks are done, business users can query and visualize insights using simple natural language on open table formats. These agents, built on top of LLM-powered automation, can be seamlessly deployed on AWS, Azure, GCP, or any cloud, ensuring flexibility and scalability. Here I'm giving you a high-level conceptual architecture overview. All the user needs is a front-end application where they can prompt what they want to create from a data-pipeline perspective.
Speaker 2 00:03:43
The front end can be created using any front-end framework you're comfortable with, such as Angular, and you can use Python as the backend with Flask as the framework. If you see the screenshot, the user is prompting here; it may not be visible to you, so I can read it out: "Create a pipeline to load the attached orders JSON into Hudi." The user wants to create a Hudi table by reading the orders JSON file, and gives the agents more instructions: cleanse the data by removing records with a missing payment status, standardize the dates to YYYY-MM-DD format, map some of the columns, and ensure the pipeline can handle incremental updates.
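A sketch of a Flask backend route that could receive this kind of natural-language request plus an attached sample file; the endpoint name and the generate_etl_code helper (sketched after the next segment) are hypothetical, not taken from the talk.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/generate-pipeline", methods=["POST"])
def generate_pipeline():
    # e.g. "Create a pipeline to load the attached orders JSON into Hudi, remove
    # records with missing payment status, standardize dates to YYYY-MM-DD..."
    prompt = request.form["prompt"]
    sample = request.files.get("sample")  # optional attached orders JSON
    code = generate_etl_code(prompt, sample.read() if sample else None)
    return jsonify({"pipeline_code": code})

if __name__ == "__main__":
    app.run(debug=True)
```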
Speaker 2 00:04:38
So this is what the user wants to do to create the pipeline. Once they hit submit, the request goes to the LLM; we can use LangChain, an open-source Python library, to interact with LLMs and submit this request. The LLM looks at the request and the orders JSON, acts as a data modeler, and models the data. We can also work with Hugging Face embeddings and a Chroma vector database so that the natural language can be represented in numerical form. When the user submits the prompt, the agent can generate Spark or Flink code to read the data from cloud storage, or even from Kafka streaming or any other source.
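A sketch of that LLM layer using LangChain, Hugging Face embeddings, a Chroma vector store, and Gemini (the model mentioned in the Q&A); exact import paths and model names vary by library version, so treat this as an illustration of the idea rather than the talk's actual implementation.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_google_genai import ChatGoogleGenerativeAI  # requires GOOGLE_API_KEY

# Embeddings + vector store hold reference pipeline snippets / table metadata
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(collection_name="pipeline_examples", embedding_function=embeddings)

llm = ChatGoogleGenerativeAI(model="gemini-pro")

def generate_etl_code(prompt: str, sample_json: bytes | None = None) -> str:
    # Ground the generation with similar examples retrieved from the vector store
    context = "\n".join(doc.page_content for doc in store.similarity_search(prompt, k=3))
    request = (
        "You are a data engineer. Generate a PySpark job for the request below.\n"
        f"Reference context:\n{context}\n\nRequest:\n{prompt}\n"
        + (f"\nSample input data:\n{sample_json.decode()[:2000]}" if sample_json else "")
    )
    # The returned text is the generated pipeline code, e.g. reading from cloud
    # storage or Kafka and writing to a Hudi table
    return llm.invoke(request).content
```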
Speaker 2 00:05:36
It also creates the Hudi table; in this case it's Hudi, but the user can ask for Iceberg, Delta Lake, or even Redshift, MySQL, or any database they want, and all they need to do is ask for it in the prompt. I know LLMs are still evolving and there is a chance of quality issues in whatever code they generate, but we can always take a multi-model approach, where we pass the output of one model to another for verification and validation. We can also have quality assistant agents perform schema validation, duplicate detection, and outlier flagging, plus automated test-case generation using metadata.
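A sketch of the kind of checks such a quality analyst agent could run on a generated table, assuming the source and loaded target data are available as pandas DataFrames; the order_id and amount columns are hypothetical.

```python
import pandas as pd

def quality_report(source: pd.DataFrame, target: pd.DataFrame, expected_schema: dict) -> dict:
    report = {}
    # Schema validation: every expected column is present with the expected dtype
    report["schema_ok"] = all(
        col in target.columns and str(target[col].dtype) == dtype
        for col, dtype in expected_schema.items()
    )
    # Count validation: the loaded row count matches the source
    report["counts_ok"] = len(source) == len(target)
    # Duplicate detection on the (hypothetical) primary key
    report["duplicate_rows"] = int(target.duplicated(subset=["order_id"]).sum())
    # Outlier flagging: amounts more than 3 standard deviations from the mean
    amounts = target["amount"]
    report["outlier_rows"] = int(((amounts - amounts.mean()).abs() > 3 * amounts.std()).sum())
    report["status"] = "PASS" if report["schema_ok"] and report["counts_ok"] and report["duplicate_rows"] == 0 else "FAIL"
    return report
```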
Speaker 2 00:06:47
If you see the screenshot here, when the user prompted to create a pipeline, it built the pipeline from the orders JSON input and created tables like orders, customers, and order items. The quality agent then checked how many fields are in each table and whether they match the JSON we get from the file or the Kafka topic, whatever the user provides. Here it says failed, so we go back to the data pipeline agent and fine-tune it by giving more input through the prompt. In this case the customers table passed from both the schema-validation and counts perspectives, with no row or field mismatches. This is how we validate the data pipeline that gets created. Once the quality check is done, end users can go ahead and use the visualization agent, where they can enter simple queries in natural language; here, in this case, "visualize sales by category."
Speaker 2 00:07:54
It then looks at the Hudi table created in the backend, pulls the data from there, and creates a bar chart. These are real-time dashboards that business users can create directly with natural language, with no need to write any SQL or wait for data teams. With that, the key takeaways: manual ETL is too resource-heavy for today's needs; this approach accelerates delivery, reduces cost, and <inaudible> users; and we went through real-time visualization, cross-cloud deployment options, and an open-format-compatible architecture. Finally, this is just a conceptual view; based on your needs you can play with the tools and create a data pipeline agent, a quality agent, and a visualization agent. I think that's all I have for this session.
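A sketch of that visualization-agent step, where a natural-language request such as "visualize sales by category" becomes a SQL query and a bar chart; it reuses the hypothetical llm object from the earlier sketch and queries SQLite to stay self-contained, whereas in practice it would query the Hudi or Iceberg table through Spark or a SQL engine.

```python
import sqlite3
import matplotlib.pyplot as plt
import pandas as pd

def visualize(nl_request: str, conn: sqlite3.Connection) -> None:
    # Ask the LLM to translate the business user's request into SQL
    sql = llm.invoke(
        "Write a SQL query over the table orders(category, amount) for this request, "
        f"returning only the SQL: {nl_request}"
    ).content
    result = pd.read_sql_query(sql, conn)  # e.g. total sales per category
    result.plot(kind="bar", x=result.columns[0], y=result.columns[1], legend=False)
    plt.title(nl_request)
    plt.tight_layout()
    plt.show()

# visualize("visualize sales by category", sqlite3.connect("warehouse.db"))
```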
Speaker 1 00:08:56
Oh, yeah. There are a few questions coming through, and I'm gonna ask them very fast, and hopefully you can answer them very fast; we have a break right after this that we can eat into. What LLM is being used?
Speaker 2 00:09:13
Yeah, here I used Gemini.
Speaker 1 00:09:16
Nice. Good choice.
Speaker 2 00:09:18
Yeah, but we can use any; it's not a hard-and-fast rule, and any LLM works in this case. Just to play with it, I used Gemini.
Speaker 1 00:09:30
Is this capability open sourced?
Speaker 2 00:09:35
It is just a proof of concept; I didn't really do anything there. Yeah, it is a conceptual view.
Speaker 1 00:09:42
Very cool. All right, Koti. Thank you man.