December 13, 2023

Overhauling Data Management at Apna

Written by:

Floyd Smith

This blog post is a brief summary of the Apna talk from Open Source Data Summit 2023, a virtual event held on November 15th, 2023. Onehouse served as founding sponsor. Visit the event page to register and view the talk.

Ronak Shah, Head of Data, and Sarfaraz Hussain, Sr. Data Engineer, shared the results from their recent implementation of the Universal Data Lakehouse at Apna.

Key results included:

The architecture makes the new data platform central to all data-related operations
There are fewer potential points of failure
Data that was ingested by batch jobs, typically on a daily cadence, is now delivered via streaming in near real-time
SLAs for data freshness can be finely tuned, from a minute to a full day as needed
Data is more consistent and more easily available, with fewer issues
Many sources of cost are eliminated
Storage and compute are fully decoupled
Easier to handle time travel and near real-time synchronization
Users can continue to use BigQuery SQL for queries - and can now easily add new compute engines such as Spark and Presto

Ronak Shah introduced the talk, describing Apna as the largest and fastest-growing site for professional opportunities in India, with 33 million diverse users. Users create profiles, search for jobs, and connect through the online community. The site features an AI/ML-powered smart job-matching engine.

*Figure 1. Apna is a large presence and a rising star in its home market.*

The original architecture is a data warehouse built by Apna on top of Google BigQuery, which provides an easy warehousing experience. It uses structured transactional data that comes in via CDC from Postgres and MongoDB, ingested using Hevo, along with clickstream sources brought in through Mixpanel. Updates are batched, with frequencies ranging from hourly to daily.

Apna experienced many issues with the original architecture:

Batch processing hurts data freshness and consistency
Time travel is only supported for seven days
Near real-time sync for CDC data via Hevo is possible, but too costly due to the complex ETL operation (an UPSERT), so sync frequency was set to 1-3 hours to reduce cost
When processing large batches of Hevo change streams, a large number of partitions are created for a temporary external BigQuery table; this is costly and time-consuming to merge/UPSERT; also, the process sometimes fails with no notification
There is limited alerting for issues with Hevo; operators had to check in the UI to see whether problems occurred
The use of multiple third-party systems along with in-house code makes development and maintenance difficult
Multiple interacting systems creates multiple potential points of failure, so operations for the production system are challenging

A diagram of a diagramDescription automatically generated — *Figure 2. The Apna architecture uses CDC and streaming sources.*

Apna saw these limitations as an opportunity to create something better. A new system could better meet the needs of a diverse set of users who are bringing new and existing use cases to the data in BigQuery:

Most current usage of the data is via a SQL interface to BigQuery
Analysts and product managers track the performance of product features using both CDC and Clickstream data
Data scientists derive features to train models, also using both CDC and Clickstream data
Data products built on the data include a job feed service (mostly using CDC data) and growth campaigns and other data products
There is growing demand to use Spark for data pipelines, ML, and ad-hoc data engineering analysis
Data scientists need point-in-time queries, which are currently limited to seven days

In response to these varied and growing demands, Apna is building a new platform, Apna Analytics Ground (AAG – which is also the word for fire in Hindi). The new platform is built on the data lakehouse architecture, which unifies the flexibility of a data warehouse for updates and SQL queries with the advanced analytics capabilities of a data lake.

AAG uses the medallion architecture (link), with multiple inputs unified in Kafka to Onehouse, to a bronze table, through incremental ETLs using Onehouse services to a silver table.

Sarfaraz Hussain provided details for AAG. AAG wanted to implement their data lake using open source Apache Hudi™; they decided to use the Onehouse managed service to speed up development of their lakehouse, while using Hudi open source for dbt and Spark downstream. Onehouse offers managed ingestion from sources like Postgres, MongoDB, and Kafka, and it automatically streams data from these sources into managed Hudi tables on GCS or S3.

A diagram of a companyDescription automatically generated — *Figure 3. The new Apna flow uses Onehouse for bronze and silver tables.*

For AAG, transactional data comes from Postgres via a Debezium Kafka connector and from Mongo via the Mongo Atlas connector. Apna has written a logging service to bring data from their web and mobile applications into a Kafka topic. This results in one topic per table; for instance, if there are 100 tables in the Postgres database, there will be 100 Kafka topics with the changelogs from Postgres.

The bronze layer is in Google Cloud Storage in Hudi format. It serves as long-term storage with months or years of raw data – an exact append-only data log, with no filtering nor transformations. Analytics users do not have access to the bronze layer. The relevant bronze tables can always be used to recreate downstream tables from any point in time, using Onehouse incremental processing capabilities, in case of any errors.

AAG then uses Onehouse streams to create tables in the silver layer, queryable for analysts and all downstream users. The silver streams do a lot of work: schema validation, timestamp validation, and transformations such as flattening of complex struct fields, deriving new columns where needed, adding current timestamp values, and other steps as needed.

A diagram of a cloud computing systemDescription automatically generated — *Figure 4. Previous query options are preserved and advanced analytics options are added.*

Clickstream data is received as a complex JSON object with events of multiple types, so it’s flattened and fanned out. Data received from CDC goes through schema validation; then it is filtered, flattened and so on, and finally materialized into an up-to-date version of the source table from MongoDB or Postgres. At the same time, it writes into the Hive metastore and also exposes the same data in BigQuery/BigLake as external tables. This allows for data checking and validation with BigQuery, with no need to write and deploy a Spark job to do this work.

Exposing data in BigQuery using the BigQuery-Hudi sync preserves familiarity for analysts in the organization, with the ability to migrate users to Presto or Spark if desired over time.

dbt on Hudi/Spark is then used to create a gold layer on top of the silver layer, with Presto serving as an ad hoc query engine which can query data from the silver and gold layers as needed. This allows Apna to use managed services for part of their data needs, while leveraging the power of open-source for other use-cases.

In the talk, Sarfaraz provides many additional specifics about handling records that fail schema matching in a quarantine table; partitioning data on ingestion date to help in reprocessing whenever needed; and how compaction and clustering are used to maintain silver tables, including for very large tables with billions of records. Aggregated tables, such as feature stores for machine learning or precomputed tables for business intelligence, are kept up-to-date as desired, usually refreshed on a daily cadence. All of this avoids a lot of heavy joins.

There are many benefits to the new architecture:

Data is available in near real-time
SLAs are set for data coming in, tuned per table, from one minute to a full day as needed
Data is more consistent and more available, with fewer issues
Custom alerts and monitoring are provided using Grafana, facilitated by Onehouse services
Hudi external transformers deliver SCD2 support
Storage and compute are fully decoupled, allowing (for example) clusters to be shrunk during slow night-time hours to save cost
There is now the ability to leverage any compute engines such as Spark and Presto, with no vendor tie-in

Authors

No items found.