May 13, 2024

Optimizing JobTarget’s Data Lake: Migrating to Hudi, a Serverless Architecture, and Templated Glue Jobs

Apache Hudi-based data lakes, together with templated AWS Glue jobs and a serverless architecture, are critical to JobTarget’s strategy for managing scale, resources, and performance in the face of rapid data growth. In his talk at Open Source Data Summit 2023, Soumil Shah, Data Engineering Lead at JobTarget, shared insights, practical examples, and related source code. View the full talk or keep reading for a detailed overview. 

As part of their mission of helping job seekers and employers connect, JobTarget has been accumulating data at a rapid rate; the amount of data they store doubled just between June and October last year. To keep up with that growth, they decided to migrate to an Apache Hudi-based architecture. Data Engineering Lead Soumil Shah shared why they made this decision, and how they aggressively optimized their Hudi data lake by implementing an AWS Glue-based ingestion framework called LakeBoost, which is now available as an open source project.

As a result of the migration and adoption of the AWS Glue framework, Shah’s team saw improvements across the board. Data ingestion is now automated, making it a much faster and more streamlined process. Data is efficiently deduplicated on ingestion, with its uniqueness properties maintained as the system scales. The system makes more efficient use of storage and compute resources, which translates into much faster querying and a reduction in both storage and compute-related costs.

JobTarget Data Systems are Experiencing Rapid Growth

Figure 1: Rate of data growth at JobTarget

JobTarget provides a comprehensive suite of recruitment, advertising, job distribution, and analytics tools. Data analytics is a core product offering, helping companies make data-driven decisions around recruiting. To power their analytics, JobTarget ingests and processes vast amounts of job advertising, distribution, and other third-party data, with their underlying dataset approximately doubling in size in four months.

This rapid rate of growth has presented Shah’s team with a broad set of challenges related to data visualization, quality, security, analysis, and real-time processing. Any kind of work done at this scale and rate of change can be difficult.

Migrating to an Apache Hudi-based Data Lake

The team at JobTarget saw Apache Hudi as a strong fit for these growth challenges:

  • It provides ACID transactions and other traditional database features on top of a data lake.
  • It supports table structures and isolated transactions.
  • It supports both streaming and batch processing.
  • It’s popular and widely adopted across the industry, including by high-profile large companies like Uber, Amazon, and ByteDance.
  • It’s compatible with open source file formats and all cloud storage platforms.
  • It integrates with multiple popular query engines to support performant analytics.

Shah said JobTarget has seen many benefits from adopting Apache Hudi. Hudi’s support for performant querying has allowed JobTarget to provide their clients with faster and more accurate insight extraction and reporting capabilities. They were able to effectively address data duplication issues by building on top of Hudi’s precombine key feature and global indexes. And they significantly improved their data management and organization capabilities using Hudi’s data management framework.
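
As a rough illustration of the deduplication behavior described above, the sketch below mimics in plain Python what a precombine key does: among records that share a record key, the one with the largest precombine value wins. It also lists Hudi write options that typically enable this together with a global index. The table and field names (`job_events`, `event_id`, `updated_at`, `event_date`) and the sample records are hypothetical and are not taken from JobTarget’s code.

```python
from typing import Any, Dict, Iterable, List


def dedupe_by_precombine(
    records: Iterable[Dict[str, Any]], record_key: str, precombine_field: str
) -> List[Dict[str, Any]]:
    """Keep one record per key: the one with the largest precombine value.

    Ties go to the record seen last; that tie-break is a choice of this sketch.
    """
    latest: Dict[Any, Dict[str, Any]] = {}
    for rec in records:
        current = latest.get(rec[record_key])
        if current is None or rec[precombine_field] >= current[precombine_field]:
            latest[rec[record_key]] = rec
    return list(latest.values())


def hudi_upsert_options(
    table: str, record_key: str, precombine_field: str, partition_field: str
) -> Dict[str, str]:
    """Hudi write options for upserts that stay unique across partitions."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # A global index enforces key uniqueness across all partitions.
        "hoodie.index.type": "GLOBAL_BLOOM",
        # Move a record to its new partition if its partition value changes.
        "hoodie.bloom.index.update.partition.path": "true",
    }


# Hypothetical sample data: event 1 arrives twice with different timestamps.
events = [
    {"event_id": 1, "updated_at": 1, "status": "open"},
    {"event_id": 2, "updated_at": 5, "status": "open"},
    {"event_id": 1, "updated_at": 3, "status": "closed"},
]
deduped = dedupe_by_precombine(events, "event_id", "updated_at")
options = hudi_upsert_options("job_events", "event_id", "updated_at", "event_date")
```

In a Spark or Glue job, options like these would usually be passed to the writer, for example `df.write.format("hudi").options(**options).mode("append").save(path)`.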

JobTarget has also achieved significant cost savings. Hudi supports the Parquet file format for storage, which is more efficient in both storage and compute costs than the previous systems in use at JobTarget. With Hudi, they’ve also been able to maintain only the most recent version of files in a transactional data lake, significantly optimizing their storage costs.

“Because now we don't need to write infrastructure code, we can ingest data from any source in a much faster way. And we are building a large Apache Hudi transactional data lake for our customers. Customers can simply query the lake, using Redshift Spectrum, Athena, Spark SQL, or whatever their favorite query tool is. We have maximized efficiency by using this approach.” — Soumil Shah, Data Engineering Team Lead at JobTarget

Using the AWS Glue-based Framework LakeBoost

Shah delved into the specifics of a core part of the Apache Hudi pipeline implementation: the now open-sourced AWS Glue-based ingestion framework LakeBoost. 

Figure 2: High-level architecture for the Hudi-based data pipelines at JobTarget

At a high level, the data team at JobTarget uses templated Glue jobs to move data through four stages:

  1. The team ingests raw data, as produced by each application, into an app-specific AWS bucket (equivalent to a Bronze zone).
  2. A Hudi-managed Silver zone, represented as a data team-controlled AWS bucket, stores clean data.
  3. A Hudi-managed Gold zone, represented as a data team-controlled AWS bucket, stores enriched data.
  4. The team outputs data via Hudi-managed access controls, in various formats, to various clients.

During each stage transition, Glue jobs perform different tasks. Raw data from the first stage is cleaned, deduped, transformed to Parquet format, prepared for performant querying and manipulation, and then stored in the Silver zone. Once the data is cleansed, it is enriched in the Gold zone, where Glue jobs connect it to dimension and fact tables. The final result, once ready for consumption, is exposed to consumers through Apache Hudi-managed interfaces, which can be included in catalogs such as the AWS Glue Catalog. Data team clients then consume the output via a variety of query engines and tools.

Figure 3: Detailed architecture for JobTarget’s AWS Glue job-based ingestion framework

At each point, as data moves across the four stages, transactional data is ingested into one or more Hudi tables. To support doing this work at scale, JobTarget implemented an easy-to-use framework (see Figure 3).

The framework exposes an API to engineers and allows them to define data ingestion jobs programmatically, without ever having to write or understand any kind of infrastructure-related code. Data processing starts with a programmatic request to an API. The request includes a data source; metadata such as the schedule on which the data should be processed, how it is indexed, how it should be deduplicated, and whether and how it should be transformed; and a data destination.
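
To make the shape of such a request concrete, here is a minimal sketch of what a job definition could look like. The framework’s actual request schema is not described in this post, so the class name, every field name, the bucket paths, and the schedule string below are hypothetical stand-ins.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class IngestionJobRequest:
    """Hypothetical request body for defining an ingestion job."""

    source_path: str            # where the raw data lives
    destination_path: str       # where the Hudi table should be written
    table_name: str
    schedule: str               # how often the job should run
    record_key: str             # field that identifies a record
    precombine_field: str       # field used to pick the latest duplicate
    index_type: str = "GLOBAL_BLOOM"
    transform_sql: Optional[str] = None  # optional transformation step

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the request is usable."""
        required = (
            "source_path", "destination_path", "table_name",
            "schedule", "record_key", "precombine_field",
        )
        return [f"{name} must not be empty" for name in required
                if not getattr(self, name)]


# Illustrative values only.
request = IngestionJobRequest(
    source_path="s3://example-raw-bucket/job_events/",
    destination_path="s3://example-silver-bucket/job_events/",
    table_name="job_events",
    schedule="rate(1 hour)",
    record_key="event_id",
    precombine_field="updated_at",
)
problems = request.validate()
payload = asdict(request)  # plain dict, ready to serialize as JSON
```

The point of a declarative request like this is that the engineer describes what should happen to the data, while the framework decides which Glue job template and infrastructure to use.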

Based on that request, the framework takes care of everything that might need to happen, invisibly to the engineer. This includes using templated tools to automate ingestion (built on top of AWS Glue); a serverless architecture, using AWS Lambda, to support auto-scaling and automated management of processing jobs; and incremental, scalable processing using Apache Hudi change data capture (CDC).
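
Incremental processing keeps each run proportional to the new data rather than to the whole table. As a hedged sketch of the idea, the helpers below build the read options for a Hudi incremental query and advance a checkpoint after each run. The two option keys are Hudi datasource read configs as found in Hudi 0.x releases; the checkpoint logic, function names, and commit timestamps are illustrative assumptions, not JobTarget’s implementation.

```python
from typing import Dict, List


def incremental_read_options(begin_instant: str) -> Dict[str, str]:
    """Read only records committed after begin_instant (exclusive)."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def next_checkpoint(commit_times: List[str], current: str) -> str:
    """Return the newest commit time seen, or the current one if nothing is newer.

    Hudi commit times are fixed-width timestamp strings, so comparing them
    as strings orders them chronologically.
    """
    newer = [c for c in commit_times if c > current]
    return max(newer) if newer else current


# Hypothetical commit timeline and stored checkpoint.
commits = ["20231001120000", "20231002120000", "20231003120000"]
checkpoint = "20231001120000"

read_options = incremental_read_options(checkpoint)
new_checkpoint = next_checkpoint(commits, checkpoint)
```

In a Spark or Glue job, the options would typically be applied as `spark.read.format("hudi").options(**read_options).load(path)`, with the new checkpoint persisted once the run succeeds.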

The framework also provides all supporting ops infrastructure, including retry logic, scheduling, and monitoring and alerting. For example, Glue job failure events are automatically transformed into emails, notifying the correct folks in case a job has run into trouble. 
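
One plausible way to wire up this kind of alerting is a small Lambda function subscribed to Glue job state change events. The sketch below assumes the event arrives in the Amazon EventBridge format for Glue (`source` of `aws.glue`, `detail-type` of `Glue Job State Change`); the function names, the choice of failure states, and the injected `publish` callable are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Optional

# States treated as failures in this sketch.
FAILURE_STATES = {"FAILED", "TIMEOUT"}


def build_failure_email(event: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Turn a Glue job state change event into an email, or None if not a failure."""
    if event.get("source") != "aws.glue":
        return None
    if event.get("detail-type") != "Glue Job State Change":
        return None
    detail = event.get("detail", {})
    state = detail.get("state")
    if state not in FAILURE_STATES:
        return None
    job = detail.get("jobName", "unknown-job")
    return {
        "Subject": f"Glue job {job} {state}",
        "Message": (
            f"Job: {job}\n"
            f"Run ID: {detail.get('jobRunId', 'unknown')}\n"
            f"State: {state}\n"
            f"Details: {detail.get('message', 'n/a')}"
        ),
    }


def handler(
    event: Dict[str, Any],
    context: Any,
    publish: Optional[Callable[..., Any]] = None,
) -> Optional[Dict[str, str]]:
    """Lambda-style entry point; publish is injected so the sketch runs without AWS."""
    email = build_failure_email(event)
    if email is not None and publish is not None:
        publish(**email)
    return email


# Hypothetical failure event and a stand-in publisher that records calls.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {
        "jobName": "nightly_ingest",
        "jobRunId": "jr_123",
        "state": "FAILED",
        "message": "Command failed with exit code 1",
    },
}
sent = []
result = handler(sample_event, None, publish=lambda **kwargs: sent.append(kwargs))
```

In a real deployment, `publish` could be an Amazon SNS client’s `publish` method bound to a topic with email subscribers, which accepts `Subject` and `Message` keyword arguments alongside `TopicArn`.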

Figure 4: A common type of job one might build with JobTarget’s framework

To illustrate just how powerful the ingestion framework is, Shah also highlighted one of the most common types of transitions performed when bringing data into the Gold zone. In order to enrich and standardize events data to JobTarget-specific buckets, external events are regularly mapped to JobTarget names, and to internal JobTarget-specific formats and definitions of calendar dates. This kind of transition should be familiar to most data engineers—and it’s very easy to define, build, and run at scale using the framework.
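
A transformation of this kind can be pictured as a lookup plus a date normalization. The sketch below is only an illustration: the mapping table, the field names, the `UNKNOWN` fallback, and the integer `date_key` format are all hypothetical, and the input timestamp is assumed to be an ISO 8601 string that carries a UTC offset.

```python
from datetime import datetime, timezone
from typing import Any, Dict

# Hypothetical mapping from external event names to internal names.
EVENT_NAME_MAP = {
    "job_view": "JOB_VIEWED",
    "apply_click": "APPLICATION_STARTED",
}


def standardize_event(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map an external event to an internal name and a calendar date key."""
    name = EVENT_NAME_MAP.get(raw["event"].strip().lower(), "UNKNOWN")
    # Normalize to UTC so that every event lands on a consistent calendar date.
    ts = datetime.fromisoformat(raw["timestamp"]).astimezone(timezone.utc)
    return {
        "event_name": name,
        "date_key": int(ts.strftime("%Y%m%d")),  # joins to a date dimension
        "event_ts": ts.isoformat(),
    }


# Illustrative input: a late-evening event that falls on the next day in UTC.
standardized = standardize_event(
    {"event": " Apply_Click ", "timestamp": "2023-10-01T23:30:00-05:00"}
)
```

In a Glue job, the same logic would more likely be expressed as a join against a mapping table and a date dimension, but the per-record version shows the intent.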

Conclusion

JobTarget has reaped significant benefits by migrating to an Apache Hudi-based implementation for their data lakehouse. Their optimization efforts, using streamlined AWS Glue jobs, automated Lambda functions, and an efficient mechanism for handling failure, have allowed them to:

  • Streamline their system. The system now ingests data from any source with significantly less manual effort, can be managed programmatically without support from data or infrastructure team members, and provides a unified interface for all consumers of the data. Whether they use Redshift or Spark or any other engine when consuming, consumers just have to deal with one source interface. 
  • Query large data sets quickly. Hudi’s efficient file management system (removing old files and keeping only the most recent ones), the move to the Parquet format for internal storage, and clustering have made querying faster.
  • Deduplicate with ease. Hudi’s precombine key deduplicates data at ingestion, and global indexes help keep data unique across partitions and zones. Update and upsert operations are much faster as a result.
  • Process data efficiently. By relying on Hudi’s support for CDC and incremental processing, they were able to significantly reduce processing time and resource use.
  • Save costs. More efficient storage, processing, and management of files translate into lowered costs across the board.

If you’re interested in details of the implementation of JobTarget’s templated Glue job-based ingestion framework, its source is open and available on GitHub. For a more detailed technical description of their work and other related information, including a deep dive into the developer journey with JobTarget’s framework, see the complete talk.
