This blog post is a brief summary of the Apna talk from Open Source Data Summit 2023, a virtual event held on November 15th, 2023. Onehouse served as the founding sponsor. Visit the event page to register to view the talk.
Ronak Shah introduced the talk, describing Apna as the largest and fastest-growing site for professional opportunities in India, with 33 million diverse users. Users create profiles, search for jobs, and connect through the online community. The site features an AI/ML-powered smart job-matching engine.
The original architecture was a data warehouse that Apna built on top of Google BigQuery, which provides an easy warehousing experience. Structured transactional data arrived via CDC from Postgres and MongoDB, ingested using Hevo, along with clickstream sources brought in through Mixpanel. Updates were batched, with frequencies ranging from hourly to daily.
Apna experienced many issues with the original architecture.
Apna saw these limitations as an opportunity to create something better. A new system could better meet the needs of a diverse set of users who are bringing new and existing use cases to the data in BigQuery.
In response to these varied and growing demands, Apna is building a new platform, Apna Analytics Ground (AAG – which is also the word for fire in Hindi). The new platform is built on the data lakehouse architecture, which combines a data warehouse's support for updates and SQL queries with the flexibility and advanced analytics capabilities of a data lake.
AAG uses the medallion architecture: multiple inputs are unified in Kafka and ingested by Onehouse into bronze tables, then promoted to silver tables through incremental ETLs built with Onehouse services.
Sarfaraz Hussain provided details on AAG. Apna wanted to implement the AAG data lakehouse using open source Apache Hudi; they decided to use the Onehouse managed service to speed up development of the lakehouse, while using open source Hudi with dbt and Spark downstream. Onehouse offers managed ingestion from sources like Postgres, MongoDB, and Kafka, and it automatically streams data from these sources into managed Hudi tables on GCS or S3.
For AAG, transactional data comes from Postgres via a Debezium Kafka connector and from MongoDB via the MongoDB Atlas connector. This results in one topic per table; for instance, if there are 100 tables in the Postgres database, there will be 100 Kafka topics carrying the changelogs from Postgres. Apna has also written a logging service to bring data from their web and mobile applications into a Kafka topic.
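The one-topic-per-table mapping follows Debezium's default topic naming convention. A minimal sketch of that mapping (the server prefix and table names below are hypothetical, not from the talk):

```python
# Sketch of how Debezium-style CDC topics map one-to-one onto source tables.
# "apna_pg" and the table list are illustrative assumptions.

def cdc_topic(server: str, schema: str, table: str) -> str:
    """Debezium's default Postgres topic naming: <prefix>.<schema>.<table>."""
    return f"{server}.{schema}.{table}"

tables = ["users", "jobs", "applications"]  # hypothetical Postgres tables
topics = [cdc_topic("apna_pg", "public", t) for t in tables]

print(topics)  # one Kafka topic per table, each carrying that table's changelog
```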
The bronze layer lives in Google Cloud Storage in Hudi format. It serves as long-term storage with months or years of raw data – an exact append-only data log, with no filtering or transformations. Analytics users do not have access to the bronze layer. In case of any errors, the relevant bronze tables can always be used to recreate downstream tables from any point in time, using Onehouse's incremental processing capabilities.
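Because bronze is an append-only log, rebuilding a downstream table amounts to replaying every event committed after a chosen point in time. A minimal pure-Python sketch of that idea (the event shape and field names are illustrative assumptions, not Apna's actual schema):

```python
# Replay-from-bronze sketch: re-read all events committed after a chosen
# point in time to rebuild downstream state. Field names are hypothetical.

bronze_log = [
    {"commit_ts": "2023-11-01T00:00:00", "op": "insert", "id": 1, "status": "open"},
    {"commit_ts": "2023-11-02T00:00:00", "op": "update", "id": 1, "status": "closed"},
    {"commit_ts": "2023-11-03T00:00:00", "op": "insert", "id": 2, "status": "open"},
]

def replay_since(log, since: str):
    """Return raw events committed after `since`, preserving commit order."""
    return [e for e in log if e["commit_ts"] > since]

# Rebuild downstream state from mid-day Nov 1 onward, e.g. after a bad deploy:
events = replay_since(bronze_log, "2023-11-01T12:00:00")
print(len(events))  # 2 events to reprocess
```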
AAG then uses Onehouse streams to create tables in the silver layer, queryable by analysts and all downstream users. The silver streams do a lot of work: schema validation, timestamp validation, and transformations such as flattening complex struct fields, deriving new columns, adding current timestamp values, and other steps as needed.
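Two of the silver-layer transformations mentioned above – flattening nested struct fields and stamping records with a processing timestamp – can be sketched in plain Python (field names here are illustrative assumptions):

```python
# Silver-layer transformation sketch: flatten nested structs into dotted
# column names and add an ingestion timestamp. Record shape is hypothetical.

from datetime import datetime, timezone

def flatten(record: dict, prefix: str = "") -> dict:
    """Recursively flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

raw = {"user": {"id": 42, "city": "Mumbai"}, "event": "job_view"}
silver = flatten(raw)
silver["ingested_at"] = datetime.now(timezone.utc).isoformat()

print(sorted(silver))  # ['event', 'ingested_at', 'user.city', 'user.id']
```

In a real pipeline this logic runs inside the managed stream rather than standalone Python, but the shape of the transformation is the same.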
Clickstream data is received as a complex JSON object with events of multiple types, so it is flattened and fanned out. Data received from CDC goes through schema validation; it is then filtered, flattened, and finally materialized into an up-to-date version of the source table from MongoDB or Postgres. At the same time, the pipeline registers the tables in the Hive metastore and exposes the same data in BigQuery/BigLake as external tables. This allows for data checking and validation with BigQuery, with no need to write and deploy a Spark job to do this work.
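The fan-out step can be sketched as splitting one multi-event payload into per-type record streams, so each event type can land in its own silver table. The payload shape below is a hypothetical example, not Apna's actual clickstream schema:

```python
# Fan-out sketch: split one clickstream payload carrying events of several
# types into per-type streams. Payload and field names are assumptions.

payload = {
    "session_id": "s-123",
    "events": [
        {"type": "page_view", "page": "/jobs"},
        {"type": "job_apply", "job_id": 7},
        {"type": "page_view", "page": "/profile"},
    ],
}

def fan_out(payload: dict) -> dict:
    """Group events by type, copying session context onto each record."""
    by_type: dict = {}
    for event in payload["events"]:
        record = {"session_id": payload["session_id"], **event}
        by_type.setdefault(event["type"], []).append(record)
    return by_type

streams = fan_out(payload)
print({t: len(rs) for t, rs in streams.items()})  # {'page_view': 2, 'job_apply': 1}
```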
Exposing data in BigQuery using the BigQuery-Hudi sync preserves familiarity for analysts in the organization, with the ability to migrate users to Presto or Spark if desired over time.
dbt on Hudi/Spark is then used to create a gold layer on top of the silver layer, with Presto serving as an ad hoc query engine that can query data from the silver and gold layers as needed. This allows Apna to use managed services for part of their data needs, while leveraging the power of open source for other use cases.
In the talk, Sarfaraz provides many additional specifics: handling records that fail schema matching in a quarantine table; partitioning data on ingestion date to simplify reprocessing whenever needed; and how compaction and clustering are used to maintain silver tables, including very large tables with billions of records. Aggregated tables, such as feature stores for machine learning or precomputed tables for business intelligence, are kept up-to-date as desired, usually refreshed on a daily cadence. Keeping these tables precomputed avoids many heavy joins at query time.
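The quarantine pattern mentioned above can be sketched as validation that routes failing records to a side table instead of failing the whole pipeline. The expected schema below is an illustrative assumption, not Apna's actual schema:

```python
# Quarantine-table sketch: records failing schema validation go to a side
# collection rather than aborting the stream. EXPECTED is a hypothetical schema.

EXPECTED = {"id": int, "event": str}

def is_valid(record: dict) -> bool:
    """Return True if the record has every expected field with the right type."""
    return all(
        key in record and isinstance(record[key], typ)
        for key, typ in EXPECTED.items()
    )

incoming = [{"id": 1, "event": "view"}, {"id": "bad", "event": "apply"}]
silver_rows, quarantine = [], []
for rec in incoming:
    (silver_rows if is_valid(rec) else quarantine).append(rec)

print(len(silver_rows), len(quarantine))  # 1 1
```

Quarantined records can later be inspected, repaired, and replayed, which keeps bad data out of silver without losing it.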
There are many benefits to the new architecture.