Open Data using Onehouse Cloud

May 21, 2025
Speaker
Chandra Krishnan
Solutions Engineer

If you've ever tried to build a data lakehouse, you know it's no small task. You've got to tie together file formats, table formats, storage platforms, catalogs, compute, and more. But what if there was an easy button?

Join this session to see how Onehouse delivers the Universal Data Lakehouse that is:

Fast - Ingest and incrementally process data from streams, operational databases, and cloud storage with minute-level data freshness.

Efficient - Innovative optimizations ensure that you squeeze every bit of performance out of your resources with a runtime optimized for lakehouse workloads.

Simple - Onehouse is delivered as a fully managed cloud service, so you can spin up a production-ready lakehouse in days or less.

The session will include a live demo. Attendees will be eligible for up to $1,000 in free credits to try Onehouse for their organization.

Transcript

AI-generated; accuracy is not 100% guaranteed.

Speaker 1    00:00:07    
There we go. Cameron and Chandra, where you all at? I'm gonna leave it to you. This is the last you'll see of me, for better or worse. I'm gonna sign off.

Speaker 2    00:00:20    
Man. You're the best host ever, man. I've been digging watching <inaudible>. Here we go.  

Speaker 1    00:00:26    
I'll tell you a little secret. Today was my birthday. I couldn't have thought of a better way to spend it than with everybody here enjoying and learning a ton. It's a little bit of edutainment we got going on.  

Speaker 3    00:00:38    
Thank you so much, Demetrios. Alright, for sure. Excited to be here, everyone, and thank you for sticking around for a little bit. Brief introductions before we get started: I'm Chandra, and I'm on the solutions team here at Onehouse. I want to introduce my colleague Cameron; do you want to take a second to quickly introduce yourself?

Speaker 2    00:00:56    
Cameron O'Rourke, and I'm with the product marketing team.

Speaker 3    00:01:02    
Awesome. So we're really excited to come in and show you a little bit about the Onehouse platform: why the company was formed, why we built the platform, and what we do. As Cameron mentioned, there's a lot of cool work going on around combining databases and data lakes, and making analytics, data science, machine learning, and all of these really exciting things we've talked about all day available for everyone. With that, why don't we get started? One of the things we wanted to start with is the set of problems that come with building data platforms. Cameron, I know you've been in the data space for quite a few years.

Speaker 3    00:01:45    
I've also been working in data for the last several years, and if there's one takeaway I've gotten from all of it, it's that it's not easy. I'm sure all of us in the room can agree: there are lots of challenges that come with building a data platform. It takes a long time, it can often be expensive, and it takes a lot of people working together and working really hard to get it done. If anyone has specific challenges they've encountered and had to work through, drop them in the comments. We'd love to hear about the challenges people out in the community are working on solving.

Speaker 3    00:02:27    
The other thing is that a lot of data platforms are out there, and they do help solve a lot of these issues around time and resources. But often what we've found is that when you adopt a platform, it can lock you into that platform's formats and compute. Your data gets loaded into the platform, and it becomes difficult to move it around. You've heard a lot of exciting talks today about open table formats and what they do to make that data open and interoperable.

Speaker 3    00:03:19    
At Onehouse, we wanted to expand on that and make those capabilities available to everyone. That brings us to how Onehouse came about and what its goal is. Earlier in the day you had the chance to hear from our founder and CEO, Vinoth, the original creator of the Apache Hudi project: the original data lakehouse project, designed around making transactional data available in an open format with high update throughput. The Onehouse platform is meant to do that for everyone. We want to get data ingested from your sources, as Cameron will show you in a second, landed on top of your open table formats, optimized once it's there, and have those tables fully managed: properly cleaned, compacted, and with the files all sized correctly.
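
For context, here is a minimal sketch of what writing to an Apache Hudi table with upsert support looks like from PySpark. The table name, path, and key fields are illustrative, and the Hudi Spark bundle is assumed to be on the classpath; this is generic Hudi usage, not Onehouse-specific code.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # Kryo serialization is recommended for Hudi workloads.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# A small batch of changed records (e.g. from a CDC feed).
updates = spark.createDataFrame(
    [(1, "promo_a", "2025-05-21"), (2, "promo_b", "2025-05-21")],
    ["promo_id", "name", "event_date"],
)

hudi_options = {
    "hoodie.table.name": "promotions",
    "hoodie.datasource.write.recordkey.field": "promo_id",     # record key
    "hoodie.datasource.write.precombine.field": "event_date",  # latest record wins on conflict
    "hoodie.datasource.write.operation": "upsert",             # updates/inserts/deletes handled efficiently
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lakehouse/promotions"))
```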

Speaker 3    00:04:28    
You can also bring all of your ETL logic, business requirements, and transformation capabilities to the platform and execute them really efficiently. And finally, all of that happens on top of the great openness and interoperability we've seen come out of this lakehouse phenomenon: once the data is created, it's available for all the use cases you might need, whether that's BI and analytics, machine learning, data science, or even vector embeddings and AI. That's why the Onehouse product was created and how we hope to deliver impact through it.

Speaker 3    00:05:17    
Inside the product itself we have several core components, and you'll see all of them in action in our workshop here. The first piece is data ingestion. Wherever your data is being created, we want you to be able to take that data and, with a few clicks, get it ingested lightning fast and landed on top of your open table formats. And as you probably know from experience, and feel free to chime in in the comments with anything you've had to work on around this, once those open table formats are created, they need to be maintained and optimized.

Speaker 3    00:06:07    
The tables need to be managed, so we provide an experience that makes that happen quickly and seamlessly. From there, we wanted to say: we've got these open table formats, our data is expressed across Hudi, Delta, and Iceberg in the platform, so let's make it available for anyone to use, consume, and take advantage of for their use cases. So we built OneSync, our catalog sync, which syncs the tables across all the different catalogs you might have, so the data is available via those catalog integrations in all of these different engines. And in our most recent launch, just a few months ago, we added the Open Engines capability: if you want to do BI analytics with Trino, or machine learning and data science on top of Ray, that infrastructure can be spun up quickly and seamlessly in just a few clicks.

Speaker 3    00:07:15    
That way you can run those use cases faster and with less effort from your teams on the infrastructure side. This last piece is what I'm really excited to talk about, and you'll see me demo it in a bit: our transformation capabilities. As people use data, data engineers need to be able to transform it, create specific views of it, and perform aggregations, joins, and all of these complex queries and transformations. So recently we launched the ability to run Spark SQL, as well as Spark jobs, directly on top of that data. You can hook in whatever tools you're using right now for those capabilities, whether that's Python jobs or SQL orchestrated with dbt or something like that.

Speaker 3    00:08:21    
Have those run directly on top of Onehouse. What makes Onehouse really exciting is that all of these things happen on a shared compute platform optimized for your lakehouse workloads. We call it the Onehouse Compute Runtime. On top of that, for Spark and Spark SQL specifically, we have our recently released Quanton engine, which adds further accelerations. So all of this runs directly on OCR, the Onehouse Compute Runtime, which is specifically optimized for lakehouse operations. You get vectorization of operations on top of the lakehouse, advanced multiplexing and job scheduling, and compute management that's entirely serverless inside the platform, to really maximize the compute efficiency you're able to get.

Speaker 3    00:09:27    
And some of the really exciting benchmark results we've seen: on top of OCR, queries often ran 30x or more faster, and write operations were sped up by as much as 10x, which really makes your platform run more efficiently. With that, I want to hand it back to Cameron to talk a bit about the platform and show you what's going on under the hood.

Speaker 2    00:10:02    
Awesome. Thanks, Chandra. If we could switch over to my screen, that'd be super cool. I'm going to be showing you the Onehouse open data lakehouse, at least the UI, the part you can see; there's so much more that goes on behind the scenes. I'm going to focus this demo on two things. And again, we'd love to do a full workshop with you, because we've actually built this whole diagram. I know it looks a little overwhelming, but we have all of this running; we just don't have time to show every little piece, so we're really interested in having a full workshop experience. But, Chandra, I really want to drill into the speed at which people can provision a really world-class data lakehouse implementation and get it up and running quickly.

Speaker 2    00:10:56    
Because that's one of the things I've noticed. And then, just what it looks like to use an open data lakehouse, have it all be open, and the different ways you can use the data, just like you've laid out, in a real practical sense. I'll refer back to this diagram in a little bit to point out different pieces, but let's head right on over into the Onehouse UI. I want to start down here on the usage page, and the reason is that I want to be sure everyone understands that what we're actually provisioning, all the servers and the storage, every component of this data lakehouse, goes in your cloud account. You own it. So here, going back for a second, I have some tables over here, some silver tables.

Speaker 2    00:11:48    
If I go over to Amazon S3, these are the S3 buckets where the data is actually living. This is one of our cloud demo accounts; it's my login, my account. So this is data that I control and own. In other words, I'm not uploading my data to a third party, to another vendor, or really making a copy of it at all. That's a huge difference in how you use the Onehouse data lakehouse, with it being open, compared to what you might be used to seeing. The other thing we see here on the usage page is the OCU, the Onehouse Compute Unit. This allows you to limit and throttle the usage of the resources within your cloud account.

Speaker 2    00:12:35    
And we also use it as a way to bill for our management services. We have a control plane, and we watch your system and look at the metadata. We never touch your data, and we don't look at your data, but we look at the metadata and keep everything running smoothly. We've got a bunch of people with pagers who are on top of it if there's ever a hiccup. We can then take these OCUs and customize how they're allocated across the different types of compute clusters that we support. Those include, as you can see down here, managed clusters, and you could have a separate one for different teams. Looks like my UI is busy; yeah, I was running a bunch of queries earlier.

Speaker 2    00:13:18    
So we have the managed clusters; we have the SQL clusters, which give you an endpoint for external tools to tap into the Onehouse services; and then we have Open Engines, the cluster type that lets you provision open source compute engines that work with your data lakehouse very easily. Now, to provision new tables in the data lake, and keep in mind that all of these steps can be automated through an API, you don't have to do any of this manually, there are basically three steps. The first step is to set up your metadata catalogs. These are the catalogs you want to populate and keep in sync so you can use the data with different systems and tools. We have a number of catalogs you can populate, including Snowflake and Databricks.

Speaker 2    00:14:12    
But in particular, I want to point out this one here, OneTable. This is our implementation of Apache XTable, which Onehouse created and donated, and it's being used by several players in the industry right now. This is what gives you access everywhere: it generates metadata for Hudi, Delta Lake, and Iceberg, which makes sure your data can be used everywhere, and we're going to see that in just a minute. So you set up the catalogs, which is pretty simple, fill in the blanks. The very next thing you do is define your data sources. We have a number of data sources, and we're pretty focused on streaming data sources, because our platform is uniquely able to handle data streams and do incremental ingestion so that your data is fresh.

Speaker 2    00:15:03    
We try to stay away from the overnight batch approach and keep the data coming in concurrent with whatever's going on in your business systems. This really just involves, let's see, I can try to add a new data source, let me just refresh this. Here we go. This really just involves giving your credentials and the location of the data, so it's a simple fill-in-the-blanks kind of thing, just like that. And then the very last thing you do is a stream capture. I'm actually going to create a new stream right now. You can see that we have four of them running; I have four tables in my bronze, or raw data, section. Right now I'm going to add a new stream so you can see exactly what that looks like.

Speaker 2    00:15:55    
I'll pick a data source here; I'm going to grab this from Confluent Cloud. We have Confluent running over here, and I have some messages coming in, but we're going to create a new stream. What I have is something that's updating a Postgres database, and with these steps it's going to go out, set up Debezium, grab that data off the Postgres database, create a Confluent topic, and create everything else that's needed. In short, it's going to create the whole data pipeline for you. This is something that would take you days and days to set up if you were doing it manually. So I'm basically going to say I want it from Confluent Cloud CDC, and I can choose: do I want it to be append-only, or mutable?

Speaker 2    00:16:43    
And this is a big difference with Onehouse: our data lake can support changes very efficiently. That's one of the big advantages of the Hudi table format, it can handle updates, inserts, and deletes very efficiently. So we want to sync every minute. Here's the table we're going to grab, and I'm just going to configure it and show you some of the options. Quarantine means that if you have records that don't meet validation, you can put them into a separate table and deal with them later, which is pretty cool. Transformations are applied during ingestion; these are low-code or no-code transformations, and they work on the incremental data coming in. You can see we have one applied here; it was done for us automatically, but I have a few others.

Speaker 2    00:17:31    
Plus, you can create your own. Here's one that I wrote and added to the system that does a bunch of string operations, but you can write whatever you need here, and they can get quite robust. Then we get down to basic things: do I want to choose a validation, what are my key fields, things like that. And finally, where is the data going to be located, which data lake and database does it go in, and which catalogs do I want to populate? So I'm creating this new table, and even if I do schema migrations, add columns, things like that, it will keep all of this synchronized. Let's say I'm already populating Glue; I'm also going to push the metadata out to Snowflake, send it to Databricks, and do the metadata format conversion into Delta Lake and Iceberg as well.
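
To make the idea of a custom ingestion transformation concrete, here is a rough sketch of the kind of string-cleanup logic such a transformation might apply to each incremental batch. This is plain PySpark with hypothetical column names, not the Onehouse transformation API itself.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def clean_promotions(batch: DataFrame) -> DataFrame:
    """Normalize string columns on an incoming micro-batch and flag invalid rows."""
    return (
        batch
        .withColumn("name", F.trim(F.lower(F.col("name"))))
        .withColumn("promo_code", F.upper(F.trim(F.col("promo_code"))))
        # Rows failing this check could be routed to a quarantine table
        # rather than silently dropped.
        .withColumn("is_valid", F.col("promo_code").isNotNull() & (F.length("promo_code") > 0))
    )
```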

Speaker 2    00:18:25    
And let's just give this guy a name, so we'll call it the CDC promotions table, and we'll get it going. While that's running, let's go look at some data we already have. We can see, well, now it's five, so it has already started to create the table space for the table we just provisioned. But you can see the other tables we had streams for right here. If I click into one of them, I can look at metrics for the table, see how many rows there are, and see a bunch of information about the data coming in. We can clearly see the inserts and deletes happening on this table every day. This table is a little bit larger, and we can see more data coming in; in particular, there are quite a few upserts on this table as well, so it's not just insert activity. The other thing we can see is all the table services running here: the cleaning, clustering, and compaction services that keep the data organized, compact, and performing well on disk, as well as the metadata sync process.
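
For reference, when you manage an Apache Hudi table yourself, these services map to standard Hudi write options like the ones below; in the demo, Onehouse's table services tune and run them automatically. The values shown are only illustrative.

```python
# Standard Apache Hudi options controlling cleaning, clustering, and compaction;
# they would be passed alongside the usual write options (e.g. via .options(**...)).
table_service_options = {
    "hoodie.clean.automatic": "true",                # remove obsolete file versions
    "hoodie.cleaner.commits.retained": "10",         # how much commit history to keep
    "hoodie.clustering.inline": "true",              # rewrite small files into larger ones
    "hoodie.clustering.inline.max.commits": "4",     # cluster every N commits
    "hoodie.compact.inline": "true",                 # compact merge-on-read tables inline
    "hoodie.compact.inline.max.delta.commits": "5",  # compact every N delta commits
}
```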

Speaker 2    00:19:56    
So, let's see, we're still waiting for this one to provision; it takes a few minutes. But remember the metadata catalogs we were looking at a little bit earlier? Let's move from provisioning to showing how we can use this data with some other tools and see how that works. Let's pop over to Databricks here, and we can see that these tables are being populated automatically. Now, we're not actually moving the data into Databricks; we're just putting a reference here. Everything we do is by reference, pointing back to the Onehouse data, and then I can run my queries here. I can do whatever I need to do in Databricks, whether it's machine learning or anything else, using that same copy of the data that's in Onehouse.
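
As an illustration of what "by reference" means here: because metadata for the table is maintained in multiple formats over the same files, different engines can read the same storage location directly. The sketch below shows this with generic Spark readers; the path is hypothetical, and reading as Delta assumes the Delta Lake package is on the classpath and that Delta metadata has been generated for the table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-by-reference").getOrCreate()
base_path = "s3://my-bucket/lakehouse/promotions"  # same files, no copies

# Read the table through its Hudi metadata...
hudi_df = spark.read.format("hudi").load(base_path)

# ...or through Delta Lake metadata generated for the same underlying files.
delta_df = spark.read.format("delta").load(base_path)

hudi_df.show(5)
delta_df.show(5)
```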

Speaker 2    00:20:50    
I don't have to move it; it's all by reference. Same here in Snowflake: here we have the tables, all of which got populated automatically, I can run queries against the data over here, and it all just works. I could do whatever I needed to do in Snowflake, again against that same copy of the data. And you may have noticed these silver tables that I have here. How did those get there? Well, I've also got dbt Cloud running, and, remember the SQL endpoint I mentioned earlier, I'm using that endpoint to tap into Onehouse and run these models in dbt to create my silver, or refined, data lakehouse tables. Those get created and put right back here in Onehouse.

Speaker 2    00:21:45    
And the really cool thing about this is that these tables also participate in all the Onehouse services: they get table services, they get metadata sync, which is how they show up here, and then those exact same tables get pushed out to all the places we want to use them. So it's a really tight system. You can do all your data prep and ETL right in the data lakehouse, on one copy of the data, without doing extracts or replication or anything like that, and then use it in all the workloads you might have in your environment. And that's just amazing. We've been waiting a long time in the industry for something like this. I'm very old, Chandra, I've been in the industry a long time. We've been waiting a long time to have a really integrated system like this, where we can do all our workloads on one copy of the data.
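
The silver tables described here come from models along the lines of the sketch below: a join and aggregation over the raw (bronze) tables, materialized as a refined table. It's shown as Spark SQL issued from PySpark rather than as a dbt model file, and the schema, table, and column names are illustrative; the catalog and Hudi Spark SQL extensions are assumed to be configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver-model-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.daily_promo_orders
    USING hudi
    AS
    SELECT
        o.order_date,
        p.name        AS promotion,
        COUNT(*)      AS orders,
        SUM(o.amount) AS revenue
    FROM bronze.orders o
    JOIN bronze.promotions p
      ON o.promo_id = p.promo_id
    GROUP BY o.order_date, p.name
""")
```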

Speaker 2    00:22:45    
Let's see, what else? That's mostly what I wanted to show you. Of course, you can also get at it through AWS and all the rest. There's so much more I'd love to show you; we just don't have time. Hopefully this showed you how quickly you can provision a data lakehouse, how it can handle all your use cases, and that you can do it all with one copy of the data. One other thing I want to mention: we used SQL to do our data preparation, but sometimes there are data transformations or preparation steps that you really can't do easily with SQL, where you need imperative code. For example, complex string processing, text, recursive or graph structures, or feature engineering; there are a lot of examples. For that, Chandra is going to show you a new feature that lets you submit code right into the Onehouse ecosystem. So I'll let Chandra take it away and show you that.

Speaker 3    00:23:52    
Of course. Thanks, Cameron; super exciting, and thanks for the demo. I'm going to take over the screen share now. All right, we can flip it over to what I've got. Perfect, thank you. So yeah, as Cameron mentioned, this is something new that we've released inside the Onehouse platform: the ability to run Spark jobs directly on top of the compute that Onehouse provisions and manages. As we all know, Spark has become a really powerful framework for data transformations. How many of you out there are running Spark jobs right now on other platforms? Maybe you have some EMR Spark jobs, or maybe you're doing it in GCP Dataproc or something like that.

Speaker 3    00:24:43    
Drop it in the comments: if you're running Spark jobs, what are you using to run them right now? Maybe you're hosting it yourself, running it on Kubernetes or something like that; it's always an exciting infrastructure challenge to do something like that. What we want to enable with this is: if you have a Spark job you're already running, or you want to write a new Spark job, you can take advantage of all the accelerations we get from our Quanton engine and our lakehouse integrations, plus the ability to run table services and management and have that data interoperable across all the other tools in your ecosystem.

Speaker 3    00:25:25    
That's where this comes in, and it's really straightforward. You have your existing code somewhere, maybe a PySpark job, or a compiled Java JAR or something like that. Basically, you go in, you give the job a name, you specify whether it's a JAR or Python code, and then you assign it to a compute cluster. So let's say I've got a JAR; I assign it to a compute cluster. This is one of the compute clusters Cameron mentioned earlier, under the clusters tab where he created a cluster; one of the options is that you can create a Spark cluster.

Speaker 3    00:26:06    
So you go and create that Spark cluster, and it will show up here as something you can assign the job to. Then you just pass in your spark-submit args, the same arguments you'd give to any spark-submit, regardless of where you're running it. You give it your class name or your Python file, the Spark configs you want it to run with, and you hit Create Job. I've already got a bunch of jobs created here, including this one that I made for the conference demo. You can see I've specified a JAR and given it some configs; I'm using a lot of default Spark configs here, but I'm also passing some custom ones around the table path and things like that.

Speaker 3    00:26:53    
It lets you see the past runs of the job, so I can say, oh, my last run failed, let me go look at why. I can go in and take a look at the driver logs for the job and say, okay, it looks like I forgot to set my database, or forgot to set defaults on the database somewhere inside the job. So I can go back, fix it, and rerun the job. I can also open the Spark UI directly from here and start analyzing what's going on inside the job, what stage it's at, things like that. And to run the job, it's as simple as going to the job, hitting Run, and having it start.

Speaker 3    00:27:36    
It takes a second to spin up the compute needed and provision resources from the compute cluster, so it will queue up and then it will be running, and it will tell you whether the job failed or succeeded. If it fails, we'll let you know; you'll get a notification as part of the platform notifications. What you'll see here, though, is that my last run succeeded. What I do inside this JAR is read a few existing tables, do some simple aggregations, and write a new Hudi table. The really powerful thing is that all of this happens inside the Onehouse ecosystem, and that Hudi table I'm writing, by the way, is written to my S3 bucket sitting in my AWS account.
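
Here's a minimal sketch of the kind of job described here: read an existing table, aggregate it, and write a new Hudi table back to S3. The paths, table, and column names are illustrative, not taken from the demo environment.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("employee-aggregation-job").getOrCreate()

# Read an existing lakehouse table.
employees = spark.read.format("hudi").load("s3://my-bucket/lakehouse/employees")

# A simple aggregation, standing in for the job's business logic.
summary = (
    employees
    .groupBy("department")
    .agg(
        F.count("*").alias("headcount"),
        F.avg("salary").alias("avg_salary"),
    )
)

# Write the result as a new Hudi table; once it lands in storage, the
# platform can discover it, run table services, and sync catalogs.
(summary.write.format("hudi")
    .option("hoodie.table.name", "employees_by_department")
    .option("hoodie.datasource.write.recordkey.field", "department")
    .option("hoodie.datasource.write.precombine.field", "headcount")
    .mode("overwrite")
    .save("s3://my-bucket/lakehouse/employees_by_department"))
```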

Speaker 3    00:28:24    
And that Hudi table automatically gets synced up here. Under the Data tab, I can go in and see my Hudi table, the employees table from this demo; it's basically some simple aggregations on top of some synthetic employee data. I can see the records; I've aggregated a lot, so there are only five records in there. But I get all the same rich metrics that come from the stream captures and everything else Cameron showed you. I also get all of those table services: Onehouse will automatically recognize, okay, this is a Hudi table, I can run table services on it, let me try to sync it to the catalogs that are configured.

Speaker 3    00:29:12    
And you can edit that manually as well: make sure the table gets cleaned up, make sure it's a merge-on-read table so the compaction service can run, things like that come in seamlessly through the platform for all of the tables produced by these jobs. All of this, and I want to come back to this, is built on top of our Quanton engine and the Onehouse Compute Runtime, where you're able to take advantage of performance accelerations. The numbers we showed are just an illustration from some sample customer workloads where, because of the compute accelerations inside the platform, the infrastructure spend gets reduced pretty significantly. Again, this is a somewhat theoretical illustration, but you'll see that infrastructure spend go down, and you'll be able to take advantage of some exciting cost-performance benefits.

Speaker 3    00:30:23    
That's what we wanted to cover in our workshop and show you. Thank you for sticking around a little bit longer and spending some time with Cameron and me; I know we enjoyed getting to show you all of this. There are lots of ways we can continue to work together between Onehouse and the use cases you're running. As you saw today, we have some pretty exciting capabilities across ingest, cost performance, and table optimizations, keeping your data open, interoperable, and available for the use cases you want in your organizations. So definitely reach out; you've got my email and Cameron's email right there, so let's stay in touch. And again, as Cameron said, if you want to get your hands on the product and try it, let us know. Drop a comment here or send one of us an email, and we'll work together to get you onboarded, trying out the product, and seeing what we're able to build with it together.

Speaker 2    00:31:34    
Do we have time for questions, Chandra?

Speaker 3    00:31:37    
I think we've got maybe a minute or two. Yeah,  

Speaker 2    00:31:40    
I don't know if we have any questions; I can't see them here for some reason.

Speaker 3    00:31:43    
Yeah, I'm not able to see the questions either. If one of the hosts wants to, maybe let us know whether there are questions.

Speaker 2    00:31:58    
Maybe not. Oh, here we go.  

Speaker 3    00:32:00    
Oh, Demetrios,

Speaker 2    00:32:01    
He's coming back.  

Speaker 1    00:32:03    
You guys are making me work. <laugh> My video is off, to be honest, or, well, it's very dark. <laugh> But the questions are so many; we've got a whole different platform that I'm going to give you all a link to so you can check them out. But I'll drop a few in here. Given that queries seem to be much faster than writes, are there certain use cases you immediately think of for this technology?

Speaker 2    00:32:50    
Well, the big use case for the data lakehouse, I think, is getting all of your data acquisition and data preparation off of the more expensive platforms and into something that runs faster, where you do more things with fewer copies of the data. It's really a cost argument and an architectural argument. You're still going to have your platforms that do specialized things for machine learning, data science, analytics, dashboarding, all of that; you've still got to do the analysis. But this just makes it so much more cost efficient. I don't know, Chandra, what would you add to that?

Speaker 3    00:33:43    
Yeah, I think there are tons of really exciting advantages, and you highlighted a few. Some of the things I'd add are around scale. These platforms, especially the ones built around the lakehouse and the way we've been able to operate it at Onehouse, get battle tested at some of the largest scales we've seen out there. That's really exciting too; seeing these platforms sing at scale is a rare thing, and it's something that certainly gets me fired up. Yeah.

Speaker 2    00:34:23    
Yeah. If you think about some of the things we're also dabbling with, like doing vector embeddings for large

Speaker 2    00:34:32
language models, and you think about the amount of data you hold in the data lakehouse versus the amount you either can or would want to put in a specialized platform, again, it's a cost savings thing. So it's not a question of doing new things; it's a question of offloading those specialized systems, which are expensive and limited in terms of volume, onto the data lakehouse, which can just handle more. I did see a question about it being on-prem or in the cloud, and I just want to mention that no, this is completely cloud. If you think about being able to pick up the data from Databricks or Snowflake, dashboard it in Superset, and then use dbt Cloud, it all needs to be in the cloud, right? That's how you integrate all these different workloads. So no, it's cloud only.

Speaker 3    00:35:39    
Yeah, definitely. And to whoever asked that question: if you've got use cases, let's chat afterwards. Let's get in touch, and we can talk about how you might want to think about this from that standpoint, cloud versus on-prem. Michael, I know you've got some questions around cost, speed, and accelerations; we can take this offline and connect one-on-one as well, but there's a whole variety of ways we see performance accelerations on these jobs. Most recently, if you're running Spark jobs, I'll see if I can drop a link to our most recent blog here.

Speaker 3    00:36:26    
At the end of that blog, along with some benchmarks, you'll find a workload cost calculator. It's a cost predictor where we estimate how much we think the Onehouse Quanton engine might be able to speed up your workload across the extract, load, and transform phases. So it's definitely something to check out; you'll find it right on our homepage under the blog. Feel free to fill out the form there, and it should give you back some results on what we think our performance accelerations could do for you.

Speaker 2    00:37:26    
Sorry, I'm seeing just sort of random questions popping up here. Somebody asked, does it have data quality built in? Yeah, you can apply data validations to the data coming in, and remember, you can move records that don't pass validation over to the quarantine table so you can deal with them later. But the real beauty of this is that you could plug in any third-party data quality solution. The data is so open, just like I was preparing data with <inaudible>, whatever data quality or customer management system you use, you could actually plug it in. So it's very flexible that way. Some of these questions I just can't understand what they're asking. Somebody asked, what software do you think is most important for me to dive into?

Speaker 3    00:38:26    
I think you're on the right track right now. You've got a lot of good experience across SQL and Python and on the data visualization side. I'd love to have you start taking a look at some Spark capabilities and seeing how Spark can help out there. David, that would be really exciting, especially, you know, as

Speaker 2    00:38:51    
Platform. Yeah, I think data, I, I think,  

Speaker 3    00:38:53    
Yeah,  

Speaker 2    00:38:55    
I mean, in addition to all of those, quite often it's the people who can really put a whole system together. Of course, we're making that easier, but even the notion of using a data lakehouse is still new to people. A lot of people think, oh, we're just going to throw our data in a database or whatever, and they don't really think about using a data lakehouse and getting the scale, efficiency, cost reduction, and just the openness that we can provide. So I think that's a good thing to consider as well.

Speaker 3    00:39:35    
Yeah, yeah, definitely. Um, perfect. Uh, I know, I know we're a little over time here, so we,  

Speaker 2    00:39:43    
Yeah,  

Speaker 3    00:39:44    
We wanna  

Speaker 2    00:39:45    
Feel free to, you know, send us an email; we have our emails up there. But yeah.

Speaker 3    00:39:54    
Yeah, certainly. Well, thank you everyone for sticking around for a little bit. I know we had a lot of fun, as I said earlier, so thanks for hanging around, and we're looking forward to hearing from all of you and staying in touch.