Scale Without Silos: Customer-Facing Analytics on Open Data
Customer-facing analytics is your competitive advantage, but ensuring high performance and scalability often comes at the cost of data governance and creates new data silos. The open data lakehouse offers a solution, but how do you power low-latency, high-concurrency queries at scale while maintaining an open architecture?
In this talk, we’ll dive into the core query engine innovations that make customer-facing analytics on an open lakehouse possible. We’ll cover:
- Key challenges of customer-facing analytics at scale
- Query engine essentials for achieving fast, concurrent queries without sacrificing governance
- Real-world case studies, including how industry leaders like TRM Labs are moving their customer-facing workloads to the open lakehouse
Join us to explore how you can unlock the full potential of customer-facing analytics—without compromising on governance, flexibility, or cost efficiency.
Transcript
AI-generated, accuracy is not 100% guaranteed.
Adam - 00:00:07
Sida, are you with us?
Sida Shen - 00:00:08
Yeah. Here I am. Hey, how are you doing?
Adam - 00:00:12
I'm doing very well. Remind me, where are you located?
Sida Shen - 00:00:15
Menlo Park, in the Bay Area.
Adam - 00:00:18
Menlo Park. In the Bay Area.
Sida Shen - 00:00:20
Yep.
Adam - 00:00:20
Join us. We're gonna have a nice conference, a mini conference, unconference, whatever. It'd be nice to see you there.
Sida Shen - 00:00:28
Yeah, absolutely.
Adam - 00:00:30
You have about 10 minutes. I want to make sure that you have all the time that you need. So take it away, man.
Sida Shen - 00:00:37
Thank you. Appreciate that. Appreciate that. So today we're gonna talk about customer facing analytics. We're not only gonna talk about how to make it fast, but also how to make it not break your data governance or your whole data pipeline. So first, what is customer facing analytics? Customer facing analytics directly serves your analytics to your end user. On the right is a screenshot of our YouTube Studio page, where you can see the real-time number of views, watch time, and subscribers for a particular video. That is customer facing analytics right there. One characteristic is that it's extremely high stakes. It's typically your revenue driver, your competitive advantage against your competitors. Example industries include MarTech and fraud detection, or, in general, any external facing dashboards powered by OLAP-style queries.
Sida Shen - 00:01:36
One characteristic of customer facing analytics that differs from internal facing analytics is the impossible SLAs. The SLA requirements are really, really high. You have to deliver very low latency even under a crazy amount of load: thousands or sometimes even hundreds of thousands of concurrent users. For some of our users, QPS can reach the thousands or even tens of thousands for OLAP queries, and there's absolutely no room for error. It's also very difficult technically. Your workload changes; you don't know when your customers are gonna issue a whole bunch of queries that overwhelm your cluster. You also have infrastructure failures, because the SLAs for a lot of customer facing workloads are even stricter than those of the cloud infrastructure they run on. If your cloud infrastructure fails, you need something to fall back to. And you're not serving just one customer, you're serving many of them, so you run into resource contention issues. How do you make sure all of your customers have the appropriate amount of resources to correctly run their business? When data gets updated or new data is added, you get cache misses, your indexes get outdated, your statistics get outdated, and your query planner takes a hit. Those are the main sources of slow queries, which are simply not acceptable for customer facing workloads. When it fails, you run into slow dashboards, timeouts, and breached SLAs. If your system doesn't scale well, you have to overprovision for those rare peaks, and your team ends up stuck firefighting instead of actually building and serving your customers.
Sida Shen - 00:03:26
Customer facing analytics is very difficult to do. People have been doing it on proprietary systems all along, and for a long time those seemed like the only way. A proprietary data warehouse directly serving those customer facing dashboards does solve a lot of the performance challenges, but it comes with a whole new set of challenges around cost. On the right is an oversimplified picture of what your architecture might look like: a data lake as your source of truth, with a data warehouse on top for query acceleration. First, consider the cost of maintaining that proprietary data warehouse. It stores a lot of data, uses a lot of compute, and is very expensive. Then consider the cost of data ingestion pipelines, not only into your data lake but also into your data warehouse, often spanning thousands of tables. Ingestion pipelines over petabytes of data are really expensive to maintain. You face challenges matching schemas, data types, and SQL dialects between your source of truth and your data warehouse, which is very difficult. Schema design and new index design are challenging too. Most importantly, you run into data governance problems: you no longer have a single source of truth anywhere. You're basically copying the same data to 15 different places just for query acceleration, which leads to a data governance nightmare.
Sida Shen - 00:04:59
Why is running high-performance queries on the lakehouse so challenging today? Because many query engines used on data lakes and lakehouses are not built for consistently fast query performance. Many are optimized for classic data lake workloads, long-running batch ETL jobs, not low latency, high concurrency queries. So what are the essentials for low latency, high concurrency performance on open data? First, a hierarchical caching framework is absolutely necessary. Table formats such as Apache Iceberg work well with cloud object storage, but the response time of cloud object storage like AWS S3 can be upward of 100 milliseconds. A single query from the engine's perspective is not one scan but many scans, and if each takes 100 milliseconds to respond, that's not fast enough for customer facing workloads. You want something built for low latency joins and high cardinality aggregation queries.
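To make the hierarchy concrete, here is a minimal read-through cache sketch in Python. It assumes a boto3-style S3 client, and all class and method names are illustrative; this is a simplified picture of the idea, not StarRocks' actual caching code.

```python
# Minimal sketch of a hierarchical read-through cache: memory first,
# then local disk, then cloud object storage as the slow source of truth.
import os

class HierarchicalCache:
    def __init__(self, s3_client, bucket, disk_dir, mem_capacity=1024):
        self.mem = {}                # tier 1: in-memory, microseconds
        self.mem_capacity = mem_capacity
        self.disk_dir = disk_dir     # tier 2: local disk, sub-millisecond
        self.s3 = s3_client          # tier 3: object storage, ~100 ms per request
        self.bucket = bucket

    def get(self, key: str) -> bytes:
        # Tier 1: in-memory hit, the fast path every query wants.
        if key in self.mem:
            return self.mem[key]
        # Tier 2: local disk hit, still far cheaper than S3.
        path = os.path.join(self.disk_dir, key.replace("/", "_"))
        if os.path.exists(path):
            with open(path, "rb") as f:
                data = f.read()
            self._promote(key, data)
            return data
        # Tier 3: fall back to object storage, then populate both tiers
        # so the next scan of this block avoids the ~100 ms round trip.
        data = self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()
        with open(path, "wb") as f:
            f.write(data)
        self._promote(key, data)
        return data

    def _promote(self, key, data):
        if len(self.mem) >= self.mem_capacity:
            self.mem.pop(next(iter(self.mem)))  # naive eviction, for brevity
        self.mem[key] = data
```

The point of the tiers is that every level a block climbs saves the next query a round trip: a memory hit costs microseconds, a disk hit a fraction of a millisecond, and only a cold miss pays the full object storage latency.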
Sida Shen - 00:05:57
Cost-based optimization is important. The optimizer, or query planner, is probably the most important part of the query engine: it dictates the query path, and an optimized query plan can be hundreds of times faster. You want an MPP engine, massively parallel processing, that shuffles data between the memory of your nodes and is optimized for low latency joins rather than batch workloads. You want something written for performance in a lower-level language like C++ that fully utilizes CPU instruction sets, processing data in parallel and in batches instead of row by row. The more you process in batches, the faster your OLAP queries will run, as the sketch below illustrates.
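Here is a small, self-contained illustration of that batch-versus-row point. It uses NumPy rather than a vectorized C++ engine, but the effect is the same in kind: one operation over a whole column keeps the CPU pipeline and caches busy, while a per-row loop pays dispatch overhead on every single value.

```python
# Row-by-row vs. batch (vectorized) aggregation over the same data.
# A vectorized engine applies one operation to an entire column at a
# time, letting the CPU use SIMD instructions and stay in cache; this
# NumPy comparison mirrors that idea at a much smaller scale.
import time
import numpy as np

values = np.random.rand(10_000_000)

# Row-at-a-time: one interpreted operation per value.
start = time.perf_counter()
total = 0.0
for v in values:
    total += v
row_time = time.perf_counter() - start

# Batch: one operation over the whole column.
start = time.perf_counter()
batch_total = values.sum()
batch_time = time.perf_counter() - start

print(f"row-by-row: {row_time:.2f}s  batch: {batch_time:.4f}s")
```

On typical hardware the batch version is orders of magnitude faster, which is the same gap a vectorized execution engine opens up over a row-at-a-time one.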
Sida Shen - 00:07:43
With all of those, getting data warehouse-like performance on the lakehouse is not only possible, it's quite easy. Here is StarRocks, a Linux Foundation open source lakehouse query engine built for low latency, high concurrency workloads. It has an MPP architecture that shuffles data in memory between nodes, a C++ execution engine that's fully vectorized, and a cost-based optimizer to ensure all joins and aggregations are optimal. It can deliver subsecond query latency at very high concurrency on open table formats like Apache Iceberg, Delta Lake, and Apache Hudi, as well as on Apache Parquet files. StarRocks' base performance is very good: around 4.6 times faster than Trino on the TPC-DS benchmark at one terabyte.
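Because StarRocks speaks the MySQL wire protocol, any MySQL client can issue queries against it. The sketch below uses placeholder hosts, credentials, table names, and metastore settings; the catalog properties follow the pattern in the StarRocks documentation, but check the docs for your metastore type before relying on them.

```python
# Register an Iceberg catalog in StarRocks and query it in place,
# with no ingestion pipeline into a separate warehouse.
import pymysql

# 9030 is the default StarRocks FE query port; all values are placeholders.
conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="root", password="")
with conn.cursor() as cur:
    # Point StarRocks at an existing Iceberg catalog.
    cur.execute("""
        CREATE EXTERNAL CATALOG IF NOT EXISTS iceberg_lake
        PROPERTIES (
            "type" = "iceberg",
            "iceberg.catalog.type" = "hive",
            "hive.metastore.uris" = "thrift://metastore.example.com:9083"
        )
    """)
    # Run full SQL, including joins and aggregations, on the Iceberg data.
    cur.execute("""
        SELECT c.customer_id, COUNT(*) AS views
        FROM iceberg_lake.analytics.page_views v
        JOIN iceberg_lake.analytics.customers c
          ON v.customer_id = c.customer_id
        GROUP BY c.customer_id
        ORDER BY views DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```

The key design point here is that the data never leaves the lakehouse: the same Iceberg tables remain the single source of truth for every engine that reads them.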
Sida Shen - 00:08:40
One thing I don't like about performance benchmarks is that they don't tell the full story. They tell a good story about base performance, but the most important thing for customer facing analytics is how you solve slow queries. We've talked a lot about slow queries and caching: why caching is important and how a cache miss can produce slow queries. We want all data scanned to come from cache, so we built proactive cache warmup to make sure even your first query is fast. We have compute replicas for the cache, so if one compute node fails you have another copy of the cache to fall back on for best performance. We segment the LRU cache so a single big query scan doesn't invalidate all the hot data in cache. We have low latency lakehouse statistics collection and low cardinality dictionaries, data warehouse-like features ported to open storage. We have smart metadata handling tailored to each open table format. Take Apache Iceberg metadata: when it grows to terabytes, how do you read that metadata in parallel, and how do you handle equality deletes? How do you accelerate merging equality delete files with data files? Your query engine can help a lot there. We ensure query plan stability to eliminate slow queries, adaptive parallelism to allocate the right resources for each workload, and query cost prediction to identify and isolate big queries so they don't affect other queries or your cluster. We have task retry and phased scheduling for better fault tolerance.
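The segmented-LRU idea is worth a quick sketch. In the two-segment variant below (an illustration of the concept, not StarRocks' implementation), blocks enter a probationary segment on first touch and are only promoted to a protected segment on a second hit, so a one-off full-table scan churns probation without evicting hot data.

```python
# Sketch of a segmented LRU: new blocks enter a probationary segment;
# only blocks hit a second time are promoted to the protected segment.
# A big one-off scan therefore cycles through probation but cannot
# evict the hot data living in the protected segment.
from collections import OrderedDict

class SegmentedLRU:
    def __init__(self, probation_size=4, protected_size=4):
        self.probation = OrderedDict()   # first-touch blocks
        self.protected = OrderedDict()   # re-referenced (hot) blocks
        self.probation_size = probation_size
        self.protected_size = protected_size

    def access(self, key, value=None):
        if key in self.protected:            # hot hit: refresh recency
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probation:            # second touch: promote
            val = self.probation.pop(key)
            if len(self.protected) >= self.protected_size:
                # Demote the coldest protected block back to probation.
                old_key, old_val = self.protected.popitem(last=False)
                self._insert_probation(old_key, old_val)
            self.protected[key] = val
            return val
        # Miss: admit into probation only.
        self._insert_probation(key, value)
        return value

    def _insert_probation(self, key, value):
        if len(self.probation) >= self.probation_size:
            self.probation.popitem(last=False)
        self.probation[key] = value

cache = SegmentedLRU()
for block in ["h1", "h1", "h2", "h2"]:           # hot blocks, hit twice
    cache.access(block, value=b"...")
for block in [f"scan{i}" for i in range(100)]:   # big one-off scan
    cache.access(block, value=b"...")
assert "h1" in cache.protected                   # hot data survives the scan
```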
Sida Shen - 00:10:23
Here are some examples of customer facing analytics running on open data on the open lakehouse. First, TRM Labs, a blockchain intelligence platform that helps secure the cryptocurrency ecosystem. They presented this at Iceberg Summit 2025. They run customer facing analytics on 100 terabytes of data growing 25 to 45% annually, with complex joins and high cardinality aggregations under a three-second P95 SLA requirement. They were on Presto and ran into performance bottlenecks, so they moved to Apache Iceberg and picked a query engine to run on top of their Iceberg lakehouse. They considered Trino, StarRocks, and DuckDB, and picked StarRocks, achieving a 50% improvement in their P95 latency and a 54% reduction in query timeout errors.
Sida Shen - 00:11:24
Next is HerdWatch, which provides external facing dashboards for livestock such as cows and goats. They were running Athena on top of Apache Iceberg, but they outgrew Athena as their business grew and latency became too high. They moved to StarRocks and dropped latency from minutes to around one second. They kept their single source of truth on Apache Iceberg with simplified governance, because there was no need to ingest into another data warehouse for query acceleration, and they ended up with better performance.
Sida Shen - 00:12:00
This is all I have. If any of that sounds interesting, be sure to join the StarRocks Slack channel using that link or the QR code right there.