Scale Without Silos: Customer-Facing Analytics on Open Data

May 21, 2025
Speaker
Sida Shen
Product Manager

Customer-facing analytics is your competitive advantage, but ensuring high performance and scalability often comes at the cost of data governance and creates new data silos. The open data lakehouse offers a solution, but how do you power low-latency, high-concurrency queries at scale while maintaining an open architecture?

In this talk, we’ll dive into the core query engine innovations that make customer-facing analytics on an open lakehouse possible. We’ll cover:

  • Key challenges of customer-facing analytics at scale
  • Query engine essentials for achieving fast, concurrent queries without sacrificing governance
  • Real-world case studies, including how industry leaders like TRM Labs are moving their customer-facing workloads to the open lakehouse

Join us to explore how you can unlock the full potential of customer-facing analytics—without compromising on governance, flexibility, or cost efficiency.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Speaker 1   00:00:07    
Sida, are you with us?  

Speaker 2   00:00:08    
Yeah. Yeah. Here I am. Hey, how are you doing?  

Speaker 1    00:00:12    
Sida? I'm doing very well. Remind me, where are you located?  

Speaker 2   00:00:15    
Uh, Menlo Park, uh, in the Bay Area.  

Speaker 1   00:00:18    
Menlo Park. In the Bay Area.  

Speaker 2    00:00:20    
Yep.  

Speaker 1    00:00:20    
Join us. We're gonna have a nice conference, a mini conference, unconference, whatever. It'd be nice to see you there.  

Speaker 2    00:00:28    
Yeah, yeah, absolutely.  

Speaker 1   00:00:30    
You have about 10 minutes. I wanna make sure that you have all the time that you need. So take it away, man.  

Speaker 2   00:00:37    
Thank you. Appreciate that. So today we're gonna talk about customer-facing analytics. We're not only gonna talk about how to make it fast, but also how to make it not break your data governance, not break your whole data pipeline, right? So let's get started. First, what is customer-facing analytics? Customer-facing analytics is basically directly serving your analytics to your end users, right? On the right is a screenshot of our YouTube Studio page, where you can see the real-time number of views, watch time, and subscribers for that particular video. That is customer-facing analytics right there. One characteristic is that it's extremely high stakes. It's typically your revenue driver, your competitive advantage against your competitors. Some example industries are MarTech and fraud detection, or just external-facing dashboards in general that are powered by OLAP-style queries.  

Speaker 2   00:01:36    
So one characteristic of customer-facing analytics that's different from internal-facing analytics is the impossible SLAs, right? The SLA requirements are really, really high. You have to deliver very low latency even under crazy amounts of load: thousands, or sometimes even hundreds of thousands, of concurrent users. For some of our users, their QPS number can be somewhere in the thousands, or even tens of thousands, of QPS for OLAP queries, and there's absolutely no room for error, right? And it's also very difficult to do technically. First, your workload changes: you don't know when your customers are gonna issue a whole bunch of queries that will overwhelm your whole cluster. And you also have infrastructure failures, because the SLAs for a lot of customer-facing workloads are so strict that they're even stricter than those of the cloud infrastructure they're running on, right?  
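
To make the SLA framing concrete, here's a minimal sketch (not from the talk) of checking a tail-latency target against a stream of query timings. The 500 ms target, the sample distribution, and the helper name are all illustrative assumptions:

```python
import random

def p95(latencies_ms: list[float]) -> float:
    """Return the 95th-percentile latency via nearest-rank."""
    ordered = sorted(latencies_ms)
    rank = max(0, int(len(ordered) * 0.95) - 1)
    return ordered[rank]

# Simulated query timings: mostly fast, plus a tail of slow queries
# (e.g., cache misses). It's this tail that breaches customer-facing SLAs.
random.seed(42)
samples = [random.gauss(120, 30) for _ in range(950)] + \
          [random.gauss(900, 200) for _ in range(50)]

SLA_MS = 500  # illustrative target, not a number from the talk
print(f"P95 = {p95(samples):.0f} ms, "
      f"SLA {'missed' if p95(samples) > SLA_MS else 'met'}")
```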

Speaker 2    00:02:34    
So if your cloud infrastructure fails, you need to have something to fall back to, right? Also, you're not only serving one customer, you're serving many of them, so you run into resource contention issues. How do you make sure that all of your customers have the appropriate amount of resources to correctly run their business? And also, when data gets updated or new data gets added, you get cache misses, your indexes get outdated, your statistics get outdated, so your query planner takes a hit, right? Those are the main generators of slow queries, which are just not acceptable for customer-facing workloads. And when it fails, you run into slow dashboards, timeouts, breached SLAs, right? And if your system doesn't scale that well, you have to overprovision for those rare peaks, right?  

Speaker 2    00:03:26    
And your team ends up stuck firefighting instead of actually building and serving your customers, right? So yeah, customer-facing analytics is very difficult to do. And people have been doing customer-facing analytics on proprietary systems all the time, and they thought a proprietary system was the only way, right? And yeah, using something like a proprietary data warehouse to directly serve those customer-facing dashboards does solve a lot of the performance challenges. But it comes with a whole new set of challenges, namely cost, right? On the right, we have an oversimplified picture of what your architecture might look like: you have your data lake as the source of truth, and then you have a data warehouse on top just for query acceleration, right?  

Speaker 2   00:04:11    
But first you have to consider the cost of actually maintaining that proprietary data warehouse. It stores a lot of data, it uses a lot of compute, and it's very expensive. You also have to consider the cost of the data ingestion pipelines, not only into your data lake systems but also into your data warehouse. Not one table, but most often thousands of tables; running those ingestion pipelines is really expensive to maintain for petabytes of data, right? And you also have the challenge of matching the schemas, data types, and SQL from your source of truth to your data warehouse, and that's very, very difficult to do, right? Schema design and new index designs, those are really challenging. And most importantly, you're gonna run into data governance challenges: you don't really have your single source of truth anywhere anymore, right?  
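
As a toy illustration of the schema-matching burden described here (the table definitions and types below are hypothetical, not from any real pipeline), a drift check between a lake schema and its warehouse copy might look like this, multiplied across thousands of tables:

```python
# Hypothetical schemas: column name -> type. In practice these would come
# from catalog APIs; keeping thousands of them in sync is the real cost.
lake_schema = {"user_id": "BIGINT", "event_ts": "TIMESTAMP",
               "amount": "DECIMAL(18,2)"}
warehouse_schema = {"user_id": "BIGINT", "event_ts": "DATETIME",
                    "amount": "DECIMAL(18,2)"}

def schema_drift(source: dict, copy: dict) -> list[str]:
    """Report columns that are missing or whose types diverge."""
    issues = []
    for col, typ in source.items():
        if col not in copy:
            issues.append(f"missing column: {col}")
        elif copy[col] != typ:
            issues.append(f"type mismatch on {col}: {typ} vs {copy[col]}")
    return issues

print(schema_drift(lake_schema, warehouse_schema))
# ['type mismatch on event_ts: TIMESTAMP vs DATETIME']
```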

Speaker 2    00:04:59    
You're basically copying the same data to 15 different places just for query acceleration purposes, and that's gonna run you into a data governance nightmare, right? So why does running high-performance queries on the lakehouse seem so challenging today? Because a lot of the query engines used for the data lake or data lakehouse today are not really built for consistently fast query performance. A lot of them are still optimized for classic data lake workloads, long-running, batch-like ETL workloads, not low-latency, high-concurrency queries, right? So what are some of the essentials for low-latency, high-concurrency query performance to power your customer-facing OLAP workloads on open data? First, a hierarchical caching framework is absolutely necessary, right? Data lakehouse formats such as Apache Iceberg do really well with cloud object storage.
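
A hierarchical cache can be pictured as memory sitting in front of local disk, in front of object storage. Here's a minimal sketch of that read path under those assumptions; the class, tiers, and API are invented for illustration, not StarRocks' actual implementation:

```python
class HierarchicalCache:
    """Toy read path: memory -> local disk -> object storage.
    Every hit at a faster tier avoids the ~100 ms object-storage trip."""

    def __init__(self, object_store):
        self.memory: dict[str, bytes] = {}
        self.disk: dict[str, bytes] = {}  # stand-in for a local SSD cache
        self.object_store = object_store  # stand-in for S3 (slowest tier)

    def read(self, key: str) -> bytes:
        if key in self.memory:            # ~sub-millisecond
            return self.memory[key]
        if key in self.disk:              # ~milliseconds
            data = self.disk[key]
        else:                             # ~100 ms round trip
            data = self.object_store[key]
            self.disk[key] = data         # populate the slower tier
        self.memory[key] = data           # promote to the fastest tier
        return data

s3 = {"part-0001.parquet": b"...column data..."}
cache = HierarchicalCache(s3)
cache.read("part-0001.parquet")  # cold: falls through to object storage
cache.read("part-0001.parquet")  # warm: served from memory
```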

Speaker 2   00:05:57    
But the response time of a cloud object storage service such as AWS S3 can be upwards of a hundred milliseconds, right? And when we issue a query from the query engine side, that's not just one scan, that's many scans, each of which takes a hundred milliseconds to respond, and that's just not fast enough for customer-facing workloads, right? You also want to find something that's built for low-latency joins and high-cardinality aggregation queries. So first is cost-based optimization. The optimizer, or query planner, is probably one of the most important parts of the query engine, because it dictates which query path we take. An optimized versus an unoptimized query plan can be a difference of hundreds of times, right? And you also wanna find something that's MPP, massively parallel processing, right?  
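
To see why the ~100 ms round trip matters, here's some back-of-the-envelope arithmetic (the scan count, parallelism, and cache-hit time are illustrative assumptions, not numbers from the talk). A query touching hundreds of file ranges pays that latency many times over unless scans are parallelized and cached:

```python
S3_RTT_MS = 100    # rough object-storage response time cited in the talk
CACHE_HIT_MS = 1   # assumed local-cache read time (illustrative)
num_scans = 400    # hypothetical file/range reads issued by one query
parallelism = 32   # concurrent in-flight reads

sequential = num_scans * S3_RTT_MS                # 40,000 ms
parallel = num_scans / parallelism * S3_RTT_MS    # 1,250 ms
cached = num_scans / parallelism * CACHE_HIT_MS   # ~12 ms

print(f"sequential: {sequential} ms, parallel: {parallel:.0f} ms, "
      f"parallel + cached: {cached:.1f} ms")
```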

Speaker 2    00:06:45    
Something that does data shuffling, and not just data shuffling, but data shuffling between the memory of your nodes, to be optimized for low-latency joins instead of batch kinds of workloads, right? And we also wanna find something that's optimized for performance, something written in a lower-level language such as C++, to fully utilize SIMD instruction sets, so we can process data in parallel and in batches instead of row by row. The more you batch, the faster your OLAP queries are gonna go, right? So actually, with all of those, getting data warehouse-like performance on the lakehouse is not only possible, it's actually quite easy to do. So here is StarRocks. StarRocks is a Linux Foundation open source lakehouse query engine. It's a query engine built for low-latency, high-concurrency workloads. It has an MPP architecture that does data shuffling in memory, between node memory, right?  
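
The batch-versus-row point is easy to demonstrate in miniature: NumPy's vectorized sum (which runs tight, SIMD-friendly native loops) against a Python row-by-row loop. The mechanism is analogous to what a vectorized C++ engine does, though this sketch and its numbers are only illustrative:

```python
import time
import numpy as np

values = np.random.rand(10_000_000)

t0 = time.perf_counter()
total_rows = 0.0
for v in values:             # row-by-row: one value per iteration
    total_rows += v
row_time = time.perf_counter() - t0

t0 = time.perf_counter()
total_batch = values.sum()   # vectorized: whole batches per instruction
batch_time = time.perf_counter() - t0

print(f"row-by-row: {row_time:.2f}s, vectorized: {batch_time:.4f}s "
      f"(~{row_time / batch_time:.0f}x faster)")
```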

Speaker 2    00:07:43    
It also has a C++, SIMD-optimized execution engine that's fully vectorized as well, and a cost-based optimizer to ensure that all of your joins and aggregations are optimal, right? So you can run sub-second query latency with very high concurrency for those data warehouse-like workloads directly on open table formats such as Apache Iceberg, Delta Lake, Apache Hudi, or even raw Parquet files. So, StarRocks' base performance is very good: around 4.6 times faster than Trino on TPC-DS at one terabyte, right? So the base performance is really good. Actually, one thing I don't really like about performance benchmarks is that they just don't tell the full story. They tell a good story about the base performance of your system, but the most important thing for customer-facing analytics is actually how to solve those slow queries, right?  

Speaker 2    00:08:40    
We talked a lot about slow queries and caching, why caching is important and how a cache miss can produce slow queries, right? So we absolutely want all of the data that's gonna be scanned to be scanned from a cache. That's why we built proactive cache warmup, to make sure that even your first query is fast, right? And we also have compute replicas for the cache, to make sure that even if one of your compute nodes fails, you have another copy of the cache to fall back to, to ensure the best possible performance. And a segmented LRU, to make sure that a single big query scan is not going to evict all of the hot data in your cache, right? And also low-latency statistics collection on the lakehouse, and low-cardinality dictionaries, right?  
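
A segmented LRU keeps a probationary segment for first-time entries and a protected segment for re-referenced ones, so a one-off big scan can only churn the probationary side. Here's a minimal sketch under those assumptions (a generic SLRU, not StarRocks' actual cache code; sizes and names are invented):

```python
from collections import OrderedDict

class SegmentedLRU:
    """Toy SLRU: new keys enter 'probation'; a second hit promotes them
    to 'protected'. A one-off scan only evicts probationary entries."""

    def __init__(self, probation_size: int, protected_size: int):
        self.probation: OrderedDict = OrderedDict()
        self.protected: OrderedDict = OrderedDict()
        self.probation_size = probation_size
        self.protected_size = protected_size

    def access(self, key: str) -> None:
        if key in self.protected:
            self.protected.move_to_end(key)        # refresh recency
        elif key in self.probation:
            del self.probation[key]                # promote on second hit
            self.protected[key] = True
            if len(self.protected) > self.protected_size:
                # demote the coldest protected entry back to probation
                demoted, _ = self.protected.popitem(last=False)
                self.probation[demoted] = True
        else:
            self.probation[key] = True             # first touch
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)     # evict coldest newcomer

cache = SegmentedLRU(probation_size=3, protected_size=3)
for key in ["hot", "hot", "scan1", "scan2", "scan3", "scan4"]:
    cache.access(key)
print("hot" in cache.protected)  # True: the scan didn't evict hot data
```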

Speaker 2   00:09:29    
Those are data warehouse-like features we're porting to open data, to open storage, right? And also smart metadata handling, specifically tailored to each one of the open table formats. For example, when your Apache Iceberg metadata grows to terabytes, how do you read that metadata in parallel? How do you handle equality deletes, and how do you accelerate the merge of the equality delete files and your data files? That's something your query engine can actually help with a lot, right? And also, to ensure query plan stability and eliminate all of the slow queries: adaptive parallelism, to automatically allocate the right amount of resources for your kind of workload, and query cost prediction, to identify and isolate those big queries so they don't affect the other queries on your cluster, right?  
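
One way to picture query cost prediction and isolation: estimate each query's cost up front and route predicted-expensive queries to a separate resource group so they can't starve interactive traffic. The heuristic, threshold, and field names below are all made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class QueryPlan:
    scanned_bytes: int  # from table/partition statistics
    join_count: int

BIG_QUERY_BYTES = 50 * 2**30  # illustrative threshold: 50 GiB

def predict_cost(plan: QueryPlan) -> float:
    # Made-up heuristic: scan volume dominates, joins multiply it.
    return plan.scanned_bytes * (1 + 0.5 * plan.join_count)

def route(plan: QueryPlan) -> str:
    """Send predicted-expensive queries to an isolated resource group."""
    if predict_cost(plan) > BIG_QUERY_BYTES:
        return "big_query_group"    # bounded CPU/memory, can't hurt SLAs
    return "interactive_group"      # low-latency customer-facing traffic

print(route(QueryPlan(scanned_bytes=200 * 2**30, join_count=4)))  # big
print(route(QueryPlan(scanned_bytes=1 * 2**30, join_count=1)))    # small
```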

Speaker 2    00:10:23    
And also task retry and phased scheduling, for better fault tolerance as well, right? So here are some examples of customer-facing analytics running on open data, on the open lakehouse. First, TRM Labs. TRM Labs is a blockchain intelligence platform that helps secure cryptocurrencies, and this is from their Iceberg Summit 2025 talk, right? They run their customer-facing analytics on a hundred terabytes of data that's growing 25 to 45% annually, with complex joins and high-cardinality aggregations, under a three-second P95 SLA requirement. They were on BigQuery, and they were running into performance bottlenecks. So they moved to Apache Iceberg, and then they were picking a query engine on top of their Apache Iceberg lakehouse. They considered Trino, StarRocks, and DuckDB, and they picked StarRocks and saw a 50% performance improvement in their P95 and a 54% reduction in query timeout errors.  

Speaker 2    00:11:24    
And the next one is Herdwatch. It's external-facing dashboards for livestock, for cows and goats, right? Before, they were running on Athena on top of Apache Iceberg, and they just grew out of Athena as their business grew; the latency was just too high, they hit bottlenecks, right? So they moved to StarRocks and dropped the latency from minutes to around one second. They kept their single source of truth on Apache Iceberg with even simpler governance, because there's no more need to ingest into another data warehouse for performance acceleration, and they ended up getting better performance afterwards. All right, this is all I have. If any of that sounds interesting, be sure to join the StarRocks Slack channel using that link or the QR code right there.