AI Agents for ETL/ELT Code Generation: Multiply Productivity with Generative AI
This talk delves into the revolutionary potential of AI agents, powered by generative AI and large language models (LLMs), in transforming ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes. With an emphasis on automating code generation, streamlining workflows, and enhancing data quality, this session explores how AI-driven solutions are reshaping the landscape of data engineering. Attendees will learn how these intelligent agents can reduce manual coding, eliminate errors, improve operational efficiency, and meet compliance requirements, all while accelerating development timelines. We will also cover key use cases where AI agents facilitate real-time data transformation, ensure data governance, and promote seamless deployment across cloud environments, giving businesses a competitive edge in today's data-driven world.
Transcript
AI-generated; accuracy is not 100% guaranteed.
Demetrios - 00:00:06
Let's keep it rocking. We've got Koti coming up for another lightning talk, from Southwest. Where you at, Koti? Hey, there he is. 'Sup, dude.
Koti Darla - 00:00:19
Hey, thanks for having me.
Demetrios - 00:00:22
I'm gonna jump off stage because I wanna stay on time. I see your screen is right here. Let's get rocking and rolling.
Koti Darla - 00:00:31
Hello everyone. This is Koti Darla. I'm a tech lead data engineer at Southwest Airlines. I'm super excited to talk about AI agents for ETL code generation, which helps multiply productivity with the latest generative AI technologies. Just to give you a very high-level introduction to why this matters: in today's fast-paced data landscape, time is mission critical. AI agents can dramatically reduce the effort to generate insights by automating complex data workflows. This generative AI, LLM-driven approach transforms raw data interactions into insights in real time.
I listed here the current ETL challenges, but before I talk about them, I want to give a very high-level brief on what ETL is for those who don't know. ETL has three phases: extract, transform, and load. In the extract phase, we pull the data from various data sources. In the transform step, we do the standard cleansing and convert the data to the desired output format. In the load step, we load the results into target databases.
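To make the three phases concrete, here is a minimal, hypothetical Python sketch (not shown in the talk); the file name, column names, and the SQLite target are purely illustrative:

```python
import json
import sqlite3

def extract(path: str) -> list[dict]:
    # Extract: read raw records from a source file (could equally be an API or a queue).
    with open(path) as f:
        return json.load(f)

def transform(records: list[dict]) -> list[tuple]:
    # Transform: drop incomplete records and normalize field formats.
    cleaned = []
    for r in records:
        if not r.get("payment_status"):
            continue  # cleansing: skip records with a missing payment status
        cleaned.append((r["order_id"], r["payment_status"], r["order_date"][:10]))
    return cleaned

def load(rows: list[tuple], db_path: str = "orders.db") -> None:
    # Load: write the cleaned rows into a target database table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, payment_status TEXT, order_date TEXT)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.json")))
```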
Here are some of the manual ETL bottlenecks. Even with low-code or no-code tools, hand coding is sometimes still required, especially for company-specific requirements, which leads to extended development cycles and slow time to market. Development cost is higher in terms of training and onboarding, and there is also a scarcity of skilled resources for specialized tools.
That brings us to a paradigm shift: with the latest developments in technology, we can build generative AI-powered automation that creates and manages data pipelines based on natural language input and handles structured, semi-structured, and streaming data.
We can build end-to-end pipeline automation from data ingestion to transformation. As LLMs are still evolving, it is very important to check the quality of the pipelines they generate, so we also need quality analyst agents. Once the quality checks are done, business users can query and visualize insights using simple natural language on top of open table formats.
These agents, built on top of LLM-powered automation, can be deployed on AWS, Azure, GCP, or any other cloud to ensure flexibility and scalability.
Here I'm giving you a high-level conceptual architecture overview. The user only needs a front-end application where they can prompt for what they want to create from a data pipeline perspective.
The front end can be created with any framework you're comfortable with, such as Angular, and you can use Python on the back end with Flask as the framework.
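As a rough sketch of that split, assuming a JSON API between the UI and the Python back end, a Flask endpoint could accept the prompt and hand it to the LLM layer. The route name, payload shape, and the `generate_pipeline` helper are hypothetical placeholders, not code from the talk:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_pipeline(prompt: str) -> str:
    # Placeholder: in the real flow this would call the LLM via LangChain
    # (see the sketch further down) and return the generated pipeline code.
    return f"# pipeline code generated for: {prompt}"

@app.route("/api/pipeline", methods=["POST"])
def create_pipeline():
    # The UI (Angular or anything else) posts the natural-language request here.
    payload = request.get_json()
    code = generate_pipeline(payload.get("prompt", ""))
    return jsonify({"generated_code": code})

if __name__ == "__main__":
    app.run(debug=True)
```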
Here, if you see, the user is prompting, maybe it's not visible to you, but I can read it: "Create a pipeline to load the attached order JSON into Hudi." The user wants to create a Hudi table reading the order JSON file and cleanse the data by removing records with missing payment status, standardize the dates to YYYY-MM-DD format, map some columns, and ensure the pipeline can handle incremental updates.
This is what the user wants to do when creating the pipeline. Once they hit submit, the request goes to the LLM. We can use LangChain, an open-source Python library, to interact with LLMs and submit this request. The LLM can look at the request, inspect the order JSON, and act as a data modeler to model the data.
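A minimal sketch of submitting that request through LangChain, assuming the Gemini model mentioned later in the Q&A; the exact packages, model name, and prompt wording are assumptions, not code shown in the talk:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

# Assumes GOOGLE_API_KEY is set in the environment; the model name is illustrative.
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a data engineering agent. Given a request and a sample record, "
     "produce PySpark code for a data pipeline."),
    ("human", "Request: {request}\nSample record: {sample_record}"),
])

chain = prompt | llm

response = chain.invoke({
    "request": "Create a pipeline to load the attached order JSON into Hudi, "
               "drop records with missing payment status, standardize dates to "
               "YYYY-MM-DD, and support incremental updates.",
    "sample_record": '{"order_id": "1001", "payment_status": "PAID", "order_date": "2024/01/05"}',
})
print(response.content)  # the generated pipeline code
```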
We can also work with Hugging Face embeddings and a Chroma vector database to represent natural language in numerical form. When the user submits the prompt, the agent can create Spark or Flink code to read data from cloud stores, Kafka streams, or any other source.
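One way the embeddings and vector store could be wired, as a sketch using LangChain's community integrations; the embedding model, example documents, and collection name are illustrative assumptions:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Embed reference material (schemas, naming standards, past pipeline snippets)
# so the agent can retrieve relevant context for a new prompt.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "orders table: order_id (string), payment_status (string), order_date (date)",
    "standard: all dates are stored as YYYY-MM-DD",
]

store = Chroma.from_texts(docs, embedding=embeddings, collection_name="etl_context")

# Retrieve the most relevant context for the user's pipeline request.
hits = store.similarity_search("load order JSON into Hudi with cleansed dates", k=2)
for d in hits:
    print(d.page_content)
```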
It also creates the Hudi table. The user can ask for Iceberg, Delta Lake, Redshift, MySQL, or any other target by specifying it in the prompt.
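The code such an agent emits for the earlier order-JSON request might look roughly like this PySpark-to-Hudi sketch; the storage paths, record key, and Hudi options are illustrative, and the Hudi Spark bundle is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("orders_to_hudi")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Extract: read the order JSON from cloud storage (path is illustrative).
orders = spark.read.json("s3://my-bucket/raw/orders/")

# Transform: drop records with a missing payment status, standardize dates.
cleaned = (orders
           .filter(F.col("payment_status").isNotNull())
           .withColumn("order_date", F.date_format(F.to_date("order_date"), "yyyy-MM-dd")))

# Load: upsert into a Hudi table so reruns handle incremental updates.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}
(cleaned.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://my-bucket/lake/orders/"))
```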
I know LLMs are still evolving and there can be quality issues in the generated code, but we can use a multi-model approach where we take the output of one model and give it to another for verification and validation.
We can also have quality assistant agents that perform schema validation, duplicate detection, outlier flagging, and automated test case generation using metadata.
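A deterministic sketch of those checks over the generated table; the expected schema, key column, and outlier rule are illustrative, and a real quality agent would read them from metadata rather than hard-code them:

```python
from pyspark.sql import DataFrame, functions as F

EXPECTED_SCHEMA = {"order_id", "payment_status", "order_date"}  # illustrative

def quality_report(df: DataFrame) -> dict:
    report = {}

    # Schema validation: the generated table must contain the expected fields.
    missing = EXPECTED_SCHEMA - set(df.columns)
    report["schema_validation"] = "passed" if not missing else f"failed: missing {sorted(missing)}"

    # Duplicate detection on the record key.
    dupes = df.groupBy("order_id").count().filter(F.col("count") > 1).count()
    report["duplicate_check"] = "passed" if dupes == 0 else f"failed: {dupes} duplicate keys"

    # Simple outlier flag: order dates outside a plausible range (illustrative rule).
    outliers = df.filter(F.col("order_date") < "2000-01-01").count()
    report["outlier_check"] = "passed" if outliers == 0 else f"flagged: {outliers} records"

    return report
```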
If you see the screenshot here, when the user prompted to create a pipeline, it created tables like orders, customers, and order items. The quality agent checked how many fields are in each table and whether they match the JSON files coming from the Kafka topic or whatever source the user provides.
Here it says "failed" in some cases, so we have to go back to the data pipeline and fine-tune it by giving more input through the prompt. In this case, the customers table passed both the schema validation and the record-count checks with no mismatches.
This way, we can get a data pipeline created. Once the quality check is done, end users can use the visualization agent to enter simple queries using natural language. For example, "Visualize sales by category."
It looks at the Hudi table created in the back end and pulls the data to create a bar chart. This is a real-time dashboard where business users can create visualizations directly with natural language, with no need to write SQL or wait for data teams.
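A sketch of that visualization step, assuming the LLM has already translated "Visualize sales by category" into Spark SQL (the query is hard-coded here to keep the sketch self-contained); the table path and the category/amount columns are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("viz_agent").getOrCreate()

# Register the Hudi table for SQL access (path is illustrative).
spark.read.format("hudi").load("s3://my-bucket/lake/orders/").createOrReplaceTempView("orders")

# In the full flow an LLM would emit this SQL from the natural-language request.
sql = "SELECT category, SUM(amount) AS sales FROM orders GROUP BY category ORDER BY sales DESC"
result = spark.sql(sql).toPandas()

# Render the bar chart the business user asked for.
result.plot(kind="bar", x="category", y="sales", legend=False)
plt.ylabel("sales")
plt.tight_layout()
plt.show()
```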
Key takeaways: manual ETL is too resource-heavy for today's needs. This approach accelerates delivery, reduces cost, and empowers users with real-time visualization, cross-cloud deployment options, and an architecture compatible with open table formats.
This is just a conceptual view, and based on your needs, you can play with the tools to create data pipelines, quality checks, and visualizations. I think that's all I have for this session.
Demetrios - 00:08:56
There are a few questions coming through, and I'm gonna ask them very fast, hopefully you can answer them quickly. We have a break right after this. What LLM is being used?
Koti Darla - 00:09:13
I used Gemini.
Demetrios - 00:09:16
Nice. Good choice.
Koti Darla - 00:09:18
But it's not a hard and fast rule; we can use any LLM here. For just playing with it, I used Gemini.
Demetrios - 00:09:30
Is this capability open sourced?
Koti Darla - 00:09:35
It is just a proof of concept. I didn't really do anything there. It's a conceptual view.
Demetrios - 00:09:42
Very cool. All right, Koti. Thank you man.