March 14, 2024

Diving into Uber's Cutting-Edge Data Infrastructure 

Diving into Uber's Cutting-Edge Data Infrastructure 

This blog post is a summary of a talk by Girish Baliga, Director of Engineering at Uber at Open Source Data Summit 2023. Visit the event page to learn more or watch the talk here

Uber is a global brand that operates in more than 10,000 cities worldwide. The business has a large operational footprint, servicing over 137 million monthly users and 25 million trips every day. And data drives — no pun intended — every action taken by passengers, drivers, and those who run the business. At this scale of data, the technical challenge of turning the raw data from all this activity into business insights is particularly difficult — especially doing so in a performant and reliable way. 

Uber is also central to the Onehouse origin story. The Hudi project, which was then called a “transactional data lake,” was started at Uber by Onehouse Founder and CEO Vinoth Chandar. The project was then taken on by the Apache Software Foundation in 2016, and Vinoth went on to found and run Onehouse. 

In his talk, Girish Baliga, Director of Engineering at Uber, shared how the company has built and evolved its data infrastructure to fulfill Uber's mission to help people go anywhere and get anything. For context, the Data Infrastructure team at Uber supports a broad base of internal customers with different data needs: 

  • Community Operations teams work with local authorities and provide reports to ensure the business runs smoothly
  • Machine learning engineers and data scientists use company data to improve the product's matching algorithms and connections
  • Growth Marketing teams use data insights to drive incentives to grow the business

All these internal customers have the same basic goal: they want to get business insights from enormous troves of company data. But they don't have the same expectations in terms of data freshness, scale, or software integrations. Some customers need insights in real-time or near real-time, with data that's updated frequently (e.g. data freshness of less than one minute). Others can accommodate a longer wait, up to a day, for instance when running scheduled Uber Eats reports for restaurant owners. 

Data Analytics Challenges at Uber

Uber's data infrastructure team receives four main types of analytics requests. Each one represents a distinct use case with different implications from a technical standpoint. To illustrate, Girish shared the following graphic.

Figure 1. The four kinds of analytical use cases that Uber's Data Infrastructure team supports.

Streaming Analytics
This category demands extremely fresh data, typically requiring updates in less than a minute. A prime example at Uber is addressing surge pricing imbalances, where immediate adjustments to pricing algorithms are necessary. These applications are often integrated with real-time systems, such as Kafka topics, to facilitate the rapid processing and circulation of data.

Real-time Analytics
UberEats offers a feature that displays popular orders in a user's vicinity. This feature is a key example of real-time analytics, where data is refreshed rapidly  but not as frequently as streaming analytics requires. However, this category of applications is more traffic-intensive, with queries sometimes reaching 2000 per second. These applications typically interact with a backend through an RPC (Remote Procedure Call) interface that queries an analytics engine. 

Interactive Analytics
A typical use case here involves generating daily summary reports for restaurant managers, detailing their UberEats orders and revenue. For interactive analytics, the data freshness is about a day, and the scale is on the order of multiple reports per restaurant owner. These queries are executed by an orchestrator or query runner that handles the automation. 

Batch Analytics 

Batch analytics is used to examine historical data, such as order trends over the past year. Interactive tools such as query builders enable users to explore and analyze data with ease. These applications run automated queries on a predefined schedule. 

A Unified Data Analytics Framework

In this architecture, an incoming data stream serves both real-time and batch cases. For the real-time case, a streaming analytics engine transports data from the data stream into a real-time data store. The data is then exposed to end-users through a query interface. For the batch case, the same data stream is ingested, but it goes into a data lake, where custom analytics and transforms are executed on it. Then, the engine creates data models from this data pipeline. The data is then exposed to users for reporting and further analysis. 

Figure 2. An illustration of Uber's data ecosystem that highlights the Lambda architecture and its important components.

To accommodate the range of internal customer needs, Uber built a data platform that follows the Lambda architecture, a widely adopted architecture in the data analytics space. This approach handles both low-latency streaming workloads as well as batch workloads. Consequently, Uber's data infrastructure platform can manage all four primary analytics use cases — streaming, real-time, batch, and interactive analytics — with a single design.

In this architecture, an incoming data stream serves both real-time and batch cases. For the real-time case, a streaming analytics engine transports data from the data stream into a real-time data store. The data is then exposed to end-users through a query interface. For the batch case, the same data stream is ingested, but it goes into a data lake, where custom analytics and transforms are executed on it. Then, the engine creates data models from this data pipeline. The data is then exposed to users for reporting and further analysis. 

Uber used open-source technology as the foundation of their Lambda architecture. Apache Hudi is a prominent pillar of this setup. Developed to address the challenge of managing data at scale, Hudi can reduce upsert times to 10 mins and shrink end-to-end data freshness from 24 hours to just one hour. Before Hudi existed, the company was limited by how quickly they could re-ingest data, which was often slow. Hudi improved efficiency by allowing the team to incrementally process new data with low latency.

For batch workloads, Uber runs ingestion jobs on Spark. Parquet is used for file management and Hadoop as a storage layer. Hive jobs ingest data from the data lake and build the data model using a very similar stack. 

On the streaming front, Uber uses Apache Kafka for the data stream and Flink for analytics. Real-time data is served on Pinot. And on top of Pinot, the team built a custom Presto query interface that allows users to write Presto SQL and run their queries on Pinot in real time as if it were a conventional production backend system.

Empowering users to query data at different levels

The Lambda architecture describes how data is transported through different analytics engines. But once the appropriate data is available, how do internal customers query the data to gain valuable business insights?

Figure 3: Presto SQL serves as the primary query language at Uber for a variety of use cases.

The Data Infrastructure team supports three query languages to meet customer needs — from a high-level, common SQL approach to more customizable, low-level support for advanced users:

  1. Presto SQL
    Uber's data platform supports Presto SQL as its default query language, serving thousands of internal users with a broad range of use cases, from generating reports to enhancing product features. Users craft and refine their queries in QueryBuilder (a tool akin to a local IDE for code development), and then deploy them via the Universal Workflow Orchestrator (uWorc) for production use.
  2. Custom SQL
    For more specialized requirements that Presto SQL cannot accommodate, such as the need for custom user-defined functions (UDFs), or tuning compute resources to support very large queries, Uber offers Flink SQL and Spark SQL. These SQL variants cater to hundreds of internal customers, offering extended capabilities for data engineering tasks, including ETL jobs and data modeling.
  3. Programmatic APIs
    For the most complex scenarios, Uber's Data Platform provides programmatic APIs. These low-level APIs with domain-specific libraries (e.g. Java, Scala, Python, etc.) give advanced users the ability to develop custom programs for their use cases based on Flink and Spark. Flink addresses the offline needs of real-time product use cases such as ETAs, surge pricing, and metrics, while Spark handles offline-only use cases such as ingestion, ETLs, and model training. 

Technical Innovations at Uber Scale

Girish highlighted several technical customizations implemented by the Uber Data Infrastructure team to boost reliability, enhance performance, and facilitate multi-cloud deployments.

Reliability and Scale Improvements

Examining a subset of the data infrastructure stack — specifically, the HDFS clusters (data storage) and Presto clusters (data reads and data writes) — Girish highlighted a few innovations that help Uber achieve reliability and scale:

  • Smart routing of queries, so heavy queries are run in different clusters than light queries
  • Multi-region deployment: Hive Sync is used to copy data from primary to secondary regions. In case of a regional failure, a standby Presto cluster handles high-priority jobs that need to run right away and other jobs run at a degraded SLA. 
  • Using Hudi’s Record Level Index: an advanced way to build a transactional layer on top of Apache Hudi, this eliminates the need for a side key-value store such as HBase.
  • Automatic retries that run in case of errors (for example, during cluster deployment or a restart)
  • Multiple copies of data are stored, so if one copy is corrupted, healthy stores of data exist.
  • New replicas can be added to support hot shards so the system can handle high levels of read traffic.
  • Erasure coding and tiered storage are used for older data in order to achieve higher efficiency at scale.
  • Separate clusters are used for interactive versus offline, scheduled queries.

Performance Optimizations

Uber's Data Infrastructure team designed and executed the following optimizations to enhance performance:

  • Presto optimizations for reading data including a vectorized reader, nested columns, and predicated push-downs.
  • Integrating the Alluxio library within Presto workers, which makes the local SSD available to cache data. Affinity scheduling is used to ensure the cache is properly exploited. 
  • On the storage side (HDFS), Alluxio local SSDs are used to cache for faster retrieval. Copies of all the hot data are kept so that most reads run very fast.

Multi-Cloud Improvements

Uber operates in a hybrid data environment. Traditionally, the team used on-premises deployments of their stack. But they are currently building their cloud data on Google Cloud, using HiveSync to copy data from HDFS to Google Cloud Object Storage.

Two key improvements stand out here:

  • Customized the Presto router for query analysis: When queried, the system uses Hive Metastore to determine the location of queried tables — on-prem or in the cloud — to optimize query execution and inform smart placement of data across storage locations.

  • Presto on top of Google Cloud Object Storage: By using a custom HDFS client, Presto interacts with Google Cloud Object Storage as if it were querying HDFS, thereby enhancing performance.

Collectively, these innovations highlight Uber's commitment to leveraging open-source technology to provide a robust data platform that is scalable and reliable across its global data infrastructure.

To watch the talk directly, you can access the recording here

Read More:

Subscribe to the Blog

Be the first to read new posts

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.