April 17, 2024

Scaling and Governing Robinhood’s Data Lakehouse

In a detailed and deeply technical talk given at the Open Source Data Summit 2023, we learned how Apache Hudi supports critical processes and governance at scale in Robinhood’s data lakehouse. View the full talk or read below for an overview. 

In their talk, Robinhood team members Balaji Varadarajan, Senior Staff Engineer, and Pritam Dey, Tech Lead, described their company’s data lakehouse implementation. We learned how Robinhood’s lean data team has been relying heavily on Apache Hudi and related OSS services to handle exponential growth into the multi-petabyte scale. 

Key takeaways include details of their tiered architecture implementation; how the same architecture can be applied to track metadata and meet related SLAs (such as for data freshness); and how GDPR compliance and other data governance processes can be efficiently implemented at scale.

View highlights below or on YouTube, and visit the Open Source Data Summit site for the full talk.

Implementing the Robinhood Data Lakehouse

Pritam Dey started by giving us an overview of the Robinhood Data Lakehouse and overarching data ecosystem. The ecosystem supports more than ten thousand data sources, processes multi-petabyte datasets, and handles use cases that vary wildly in data freshness patterns (from near-real-time streams to static), data criticality, traffic patterns, and other factors.

Figure 1: High-level overview of the Robinhood Data Lakehouse

The Robinhood Data Lakehouse ingests data from many disparate sources: streams of real-time app events and experiments, third-party data available on various schedules via API, and online RDBMSes such as Postgres. This data must then be made available to many consumer types and use cases—including for both high-criticality use cases, such as fraud detection and risk evaluation, and for lower-criticality ones, such as analytics, reporting, and monitoring. 

Dey explained that support for all of the various use cases at Robinhood is built on top of a multi-tiered architecture, with highest-criticality data being processed in Tier 0, and subsequent tiers used to process data with lower constraints. To illustrate how the architecture addresses Robinhood’s needs, Dey drilled down to the implementation of a critical tier within Robinhood’s lakehouse.

Figure 2: Sample high-level architecture overview for a critical data tier 

Data processing in each tier starts with a data source—for this example, Debezium is monitoring a relational database service (RDS), such as Postgres. Before a tier can fire up, a one-time bootstrap process completes, making sure initial target tables and schemas are defined in the data lakehouse—anticipating a Debezium-driven change data capture (CDC) stream. Once the tables are in place, a multistep process is fired up and kept alive for the lifetime of the tier:

  1. Data is written into an RDS from any upstream application, API, or other data source, potentially in real-time and at high volume.
  2. Debezium uses one of the many predefined connectors to monitor the RDS and detect data changes (writes and updates). It then packages the data changes into CDC packages and publishes them to Kafka streams or topics.
  3. An instance of Apache Hudi’s DeltaStreamer application (powered by Spark) processes the CDC packages—either batching them or streaming them, as appropriate.
  4. As a side effect of its operation, the DeltaStreamer produces Hive schema and metadata updates—tracking data freshness, storage and processing costs, access controls, etc.
  5. After processing, incremental data updates and checkpoints are written to a data lake or object store such as Amazon S3.

Freshness Tracking for Critical Metadata at Scale

To show us how the architecture above generalizes and can be expanded on, Dey demonstrated how a critical metadata property—freshness—is maintained.

Figure 3: Freshness Tracking system architecture

Internally produced metadata used for tracking data freshness (from Debezium and Apache Hudi sources) is looped back through the infrastructure described in steps 2 and 3 of the process above (i.e., the Debezium-fed Kafka streams and the DeltaStreamer), and then fed to step 4. That is, Hive metadata stores are updated with changes in Debezium state and other freshness metrics produced by the DeltaStreamer.
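In essence, a freshness metric compares when a change was committed at the source with when it landed in the lakehouse, and checks the lag against the tier's SLA. A minimal sketch of that check (function and parameter names are illustrative, not Robinhood's actual API):

```python
# Hypothetical freshness check: compare the source-side commit time captured
# in the CDC metadata with the time the record was committed to the lakehouse.
def freshness_lag_seconds(source_commit_ts: float, lakehouse_commit_ts: float) -> float:
    """Lag between a change at the source and its arrival in the lakehouse."""
    return lakehouse_commit_ts - source_commit_ts

def meets_freshness_sla(lag_seconds: float, sla_seconds: float) -> bool:
    return lag_seconds <= sla_seconds

lag = freshness_lag_seconds(1_700_000_000.0, 1_700_000_090.0)
print(lag, meets_freshness_sla(lag, sla_seconds=300))  # 90.0 True
```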

Dey demonstrated that support for this kind of tiered architecture is built on a mix of core features of Apache Hudi and other OSS components of the lakehouse. Key features that a tiered architecture relies on include:

  • The ability to distinguish between tables at different tiers based on metadata, which Hudi supports with its storage layer abstractions
  • Resource isolation via Debezium connector isolation and dedicated compute and storage swim lanes, supported by Hudi features and by Postgres replication slots
  • Various SLA guarantees, including critical freshness-focused ones, which are supported by various flexible features built into Apache Hudi—such as ACID guarantees on transactions, near-real-time data ingestion, flexible/incremental ingestion of data at various points in the pipeline, and an extremely efficient ETL process downstream
  • Decoupled storage and processing, powering auto-scaling, supported by Apache Hudi
  • Apache Hudi’s powerful, serverless, transactional layer, available across data lakes, supporting advanced storage strategies such as copy-on-write and merge-on-read
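The first bullet, distinguishing tables by tier, can be pictured as a simple metadata lookup that routes each table to an isolated swim lane with its own SLA. The tier names and pool assignments below are assumptions for illustration, not Robinhood's actual configuration:

```python
# Illustrative routing of tables to isolated "swim lanes" (separate compute
# and storage pools) based on a tier tag in the table's metadata.
TIER_POOLS = {
    0: {"cluster": "critical", "freshness_sla_min": 15},
    1: {"cluster": "standard", "freshness_sla_min": 60},
    2: {"cluster": "batch", "freshness_sla_min": 24 * 60},
}

def route(table_meta: dict) -> dict:
    """Pick the compute pool and SLA for a table; default to the lowest tier."""
    return TIER_POOLS[table_meta.get("tier", 2)]

print(route({"name": "orders", "tier": 0})["cluster"])  # critical
```

Because the tier lives in metadata rather than in pipeline code, moving a table between tiers changes its isolation and SLA without touching the ingestion logic.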

Data Governance and GDPR Compliance at Scale

For his part, Balaji Varadarajan, Senior Staff Engineer at Robinhood, gave us a deep and nuanced look at how Robinhood uses the tiered architecture of its data lakehouse to address data governance and GDPR-related requirements. 

Data governance at scale is complex, with multiple objectives:

  • Tracking data and its flow
  • Keeping the lakehouse up-to-date with new and evolving regulations
  • Maintaining access controls and oversight over data assets
  • Obfuscating and updating personally identifiable information (PII) as required
  • Improving data quality over time

Robinhood addresses these objectives at scale (the Robinhood Data Lakehouse stores more than 50,000 datasets) by organizing the lakehouse into disparate zones.

Figure 4: Lakehouse organization for GDPR compliance and at-scale governance uses

Zone tags and related metadata are used to track and propagate information about the disparate zones throughout the lakehouse. Robinhood’s team has implemented a central metadata service to support the zones. The service is built on the same kind of tiered architecture we saw above for the freshness metadata. 

Varadarajan explained that tagging is done both by hand and automatically in the system (including programmatically at the source-code level), with tag creation being co-located with schema management work. Any changes to tagging are enforced, tracked, and monitored via Lint checks in the system, as well as with automated data classification tools, which help cross-check the tags and detect any data leaks or aberrations. 
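One way to picture such a lint check: cross-reference each column's tags against the zone its table is assigned to, and flag any PII-tagged column that sits outside the sensitive zone. The tag and zone names below are illustrative assumptions:

```python
# Hypothetical lint-style check: any column tagged "pii" must live in a
# table assigned to the sensitive zone; everything else is a violation.
def lint_zone_tags(tables: list[dict]) -> list[str]:
    violations = []
    for t in tables:
        has_pii = any("pii" in col["tags"] for col in t["columns"])
        if has_pii and t["zone"] != "sensitive":
            violations.append(t["name"])
    return violations

tables = [
    {"name": "users", "zone": "sensitive",
     "columns": [{"name": "email", "tags": ["pii"]}]},
    {"name": "trades", "zone": "general",
     "columns": [{"name": "ssn", "tags": ["pii"]}]},
]
print(lint_zone_tags(tables))  # ['trades']
```

Running this kind of check alongside schema changes is what lets tag enforcement stay co-located with schema management, as Varadarajan described.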

Efficient and Scalable Masking and Deletion of PII

To demonstrate how powerful the resulting system is, Varadarajan walked us through Robinhood’s efficient implementation of the PII delete operation, which supports the “right to be forgotten” required by the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). 

Figure 5: Masking and deletion implementation for PII delete operation support

Support for PII tracking and masking, as required for an efficient, GDPR-compliant implementation of PII deletion, is difficult to deliver in a lakehouse as large and complex as Robinhood’s. It needs to be possible to delete all of a single user’s PII on demand over the entire multi-petabyte-sized data lakehouse. And this has to be done quickly, efficiently, and without impacting other users. Varadarajan explained that Robinhood’s implementation relies on just two (tricky to implement) metadata services:

  • An ID Mapping service, which performs a complex replacement of all user identifiers in the system with a unique, user-specific Lakehouse ID 
  • A Mask to PII service, which maps PII to masks that are consistent per user (the associated mapping data is stored in the lakehouse’s sensitive zone)

The two metadata mappings (ID and mask) are applied and tracked ubiquitously across the lakehouse. As a result, the PII delete operation can be implemented as a standard Apache Hudi delete operation, which is efficient, fast, and operates over the entire lakehouse. 
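The interplay of the two services can be sketched in a few lines: user identifiers are replaced by a stable lakehouse ID, PII values are replaced by per-user-consistent masks, and the reverse mapping lives only in the sensitive zone. Deleting that one mapping (a standard Hudi delete in the real system) effectively forgets the user everywhere, without rewriting the multi-petabyte lakehouse. All names here are illustrative, not Robinhood's actual services:

```python
import uuid

# Illustrative state for the two hypothetical services.
lakehouse_ids: dict[str, str] = {}   # external user id -> lakehouse id
pii_vault: dict[str, dict] = {}      # lakehouse id -> {mask -> real PII value}

def lakehouse_id(user_id: str) -> str:
    """ID Mapping service: replace a user identifier with a stable lakehouse ID."""
    return lakehouse_ids.setdefault(user_id, uuid.uuid5(uuid.NAMESPACE_OID, user_id).hex)

def mask_pii(user_id: str, field: str, value: str) -> str:
    """Mask-to-PII service: per-user-consistent mask; mapping kept in the sensitive zone."""
    lid = lakehouse_id(user_id)
    mask = f"mask:{lid}:{field}"
    pii_vault.setdefault(lid, {})[mask] = value
    return mask

def forget_user(user_id: str) -> None:
    """The 'right to be forgotten': delete only the mask->PII mapping."""
    pii_vault.pop(lakehouse_ids.get(user_id, ""), None)

record = {"user": lakehouse_id("u42"), "email": mask_pii("u42", "email", "a@b.com")}
forget_user("u42")
print(record["email"] in pii_vault.get(lakehouse_id("u42"), {}))  # False
```

The design point is that every table stores only masks and lakehouse IDs; once the sensitive-zone mapping row is deleted, the masks scattered across the lakehouse are unresolvable, so no full-lakehouse rewrite is needed.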

“Apache Hudi is the central component of our data lakehouse. It enabled us to operate efficiently, meet our SLAs, and achieve GDPR compliance.” — Balaji Varadarajan, Senior Staff Engineer, Robinhood


We were shown the many benefits of building a tiered architecture on top of an Apache Hudi-driven, OSS-powered data lakehouse. Specifically:

  • CDC-based, tiered pipelines, built with Debezium on top of Apache Hudi, efficiently scale to support more than 10,000 data sources and process multi-petabyte data streams under exponential growth.
  • Apache Hudi and related OSS projects (Debezium, Postgres, Kafka, Spark) support effective resource isolation, separation of storage and compute, and other core technical requirements for building tiered processing pipelines in a data lakehouse.
  • The system scales well, so production systems can be built, scaled, and managed by a lean team.
  • Robinhood’s tiered architecture generalizes. In addition to data processing at scale, it also supports critical metadata use cases—such as for data freshness, cost management, access controls, data isolation, and related SLAs.
  • Data governance and GDPR use cases are well-supported with the same architecture.
  • Critical compliance actions, such as PII deletion, are easily supported by Robinhood’s implementation and perform well at scale.

Dey and Varadarajan make a convincing case that Robinhood’s reliance on Apache Hudi and related OSS projects gives its data lakehouse a strategic advantage. They’ve implemented reliable data governance mechanisms, are maintaining GDPR compliance efficiently and at scale, and have been able to handle exponential data and processing growth. They can also support various metadata, tracking, and other SLAs, such as for data freshness. For an in-depth understanding, view their complete talk.
