Open Data using Onehouse Cloud

May 21, 2025
Speaker
Chandra Krishnan
Solutions Engineer
Onehouse

If you've ever tried to build a data lakehouse, you know it's no small task. You've got to tie together file formats, table formats, storage platforms, catalogs, compute, and more. But what if there was an easy button?

Join this session to see how Onehouse delivers the Universal Data Lakehouse that is:

Fast - Ingest and incrementally process data from streams, operational databases, and cloud storage with minute-level data freshness.

Efficient - Innovative optimizations ensure that you squeeze every bit of performance out of your resources with a runtime optimized for lakehouse workloads.

Simple - Onehouse is delivered as a fully managed cloud service, so you can spin up a production-ready lakehouse in days or less.

The session will include a live demo. Attendees will be eligible for up to $1,000 in free credits to try Onehouse for their organization.

Transcript

AI-generated, accuracy is not 100% guaranteed.

Demetrios - 00:00:07  

There we go, Cameron and Chandra, where you all at? So I'm gonna leave it to you. This is the last that you will see of me. For better or worse, I'm gonna sign off.  

Cameron O'Rourke - 00:00:20  

Man. You're the best host ever, man. I've been digging watching. Here we go.  

Demetrios - 00:00:26  

I'll tell you a little secret. Today was my birthday. I couldn't have thought of a better way to spend it than with everybody here enjoying and learning a ton. It's a little bit of edutainment we got going on.  

Chandra Krishnan - 00:00:38  

Thank you so much, Demetrios. Alright, for sure. Excited to be here, everyone. Thank you for sticking around for a little bit. Brief introductions before we get started. I'm Chandra, I'm on the solutions team here at Onehouse. I want to introduce my colleague Cameron here, if you want to take a second to quickly introduce yourself.  

Cameron O'Rourke - 00:00:56  

I'm Cameron O'Rourke, and I'm with the product marketing team.  

Chandra Krishnan - 00:01:02  

Awesome. So we're really excited to come in, show you guys a little bit about the Onehouse platform, why the company was formed, why we built the platform and kind of what we do. And as Cameron's mentioning, a lot of the cool work that's going on around combining databases and data lakes and making analytics and data science and machine learning, and all of these really exciting things that we've talked about all day here, available for everyone. With that, why don't we get started? One of the things that we wanted to immediately start with is talking a little bit about the problems that come with building data platforms. Cameron, I know you've been in the data space for quite a few years here.  

Chandra Krishnan - 00:01:45  

I've also been working in data for the last several years. And if there's one takeaway I've gotten from all of this, it's that it's not easy, right? I'm sure all of us in the room here can agree, right? There's lots of challenges that come with building a data platform. It takes a long time, it can often be expensive, and it takes a lot of people working together and working really hard to get it done. We'd love to hear if anyone's got specific challenges that they encountered, found really interesting, or had to work through. Drop 'em in the comments too. We'd love to hear about the challenges that people out there in the community are working on solving.  

Chandra Krishnan - 00:02:27  

And the other thing is a lot of data platforms are out there and they help solve a lot of these issues around things, maybe taking time or taking a lot of resources. But oftentimes what we found is when you adopt a platform, they can kind of lock you into using maybe the formats and the compute of that platform and things like that. So your data is kind of loaded into this platform here. And we find that it's oftentimes difficult to move your data around. You've heard a lot of exciting talks today about the open table formats and what they do to make that data open and interoperable.  

Chandra Krishnan - 00:03:19  

And at Onehouse, we just wanted to expand on that and make those capabilities available to everyone. That brings us to how Onehouse came about and what the goal of Onehouse is. Earlier in the day you had the chance to hear from our founder and CEO Vinoth, who was the original creator of the Apache Hudi project, the original data lakehouse project out there, designed around making transactional data available in an open format with high update throughput. The Onehouse platform is really meant to do that for everyone. We want to get that data available from your data sources, as Cameron's gonna show you in a second here, ingested, landed on top of your open table formats, optimized once it's there, and have those tables fully managed and properly cleaned and compacted, with the files all sized correctly.  

Chandra Krishnan - 00:04:28  

And also let you bring all of the ETL logic and the business requirements and capabilities around your transformations to the platform and being able to execute all of those really efficiently. And finally doing all of that on top of the great openness and interoperability that we've seen come out of this Lakehouse phenomenon, where the data, once it's created, is available for all the use cases that you might need. Whether it's BI analytics, machine learning, data science, or even vector embeddings and AI and things like that. It's kind of why the Onehouse product was created and how we hope to deliver impact through the product.  
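
As a rough illustration of the capability Chandra describes, writing and upserting records into a Hudi table from plain PySpark might look like the sketch below. This is not Onehouse-specific; the bucket, table, and field names are hypothetical, and the Hudi Spark bundle is assumed to be on the classpath.

```python
# Minimal sketch: upserting into an Apache Hudi table with plain PySpark.
# Paths, table name, and fields are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # Hudi's Spark bundle must be available, e.g. via --packages
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

updates = spark.createDataFrame(
    [(1, "shipped", "2025-05-21T10:00:00")],
    ["order_id", "status", "updated_at"],
)

# Upsert semantics: existing keys are updated in place, new keys are inserted.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lakehouse/orders"))
```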

Chandra Krishnan - 00:05:17  

Inside of the product itself, we have several core components that make up what we do as a part of the Onehouse product. You'll see all of these components in action in our workshop here. The first piece is around data ingestion. We want to make sure that wherever your data is being created, you're able to take that data and, with the click of a few buttons, get it ingested lightning fast and landed on top of your open table formats. As you all probably know from experience (and feel free to chime in in the comments with anything you've had to work on around this), once those open table formats are created, they need to be maintained and optimized. The tables need to be managed. So we provide an experience around getting that to happen really quickly and seamlessly.  

Chandra Krishnan - 00:06:07  

From there, we wanted to say, okay, we've got these open table formats. Our data is expressed across Hudi, Delta, and Iceberg in the platform. Let's make that available to anyone to use and consume and really take advantage of for their use cases. So we built One Sync, our catalog sync, which syncs the data across all the different catalogs that you might have. So the data is now available via the catalog integrations in all of these different exciting engines. Also, in our most recent launch a few months ago, we added the Open Engines capability, where if you wanted to do maybe some BI analytics with Trino or some machine learning or data science use cases on top of Ray, that infrastructure can be spun up quickly and seamlessly in just a few clicks.  
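
For instance, once a table has been synced to a catalog and a Trino engine is available (via Open Engines or otherwise), it could be queried with the standard `trino` Python client. The host, catalog, schema, and table names below are assumptions for illustration only.

```python
# Hypothetical sketch: querying a catalog-synced lakehouse table through Trino.
import trino

conn = trino.dbapi.connect(
    host="trino.mycompany.internal",  # wherever the Trino coordinator runs
    port=8080,
    user="analyst",
    catalog="hive",                   # a catalog backed by the synced metastore
    schema="bronze",
)

cur = conn.cursor()
cur.execute("SELECT status, count(*) FROM orders GROUP BY status")
for row in cur.fetchall():
    print(row)
```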

Chandra Krishnan - 00:07:15  

So that you're able to do those use cases faster and easier on your teams from an infrastructure perspective. This last piece, which I'm really excited to talk about and you'll see me chatting about in a bit, is our transformations capability. We found that as people use data, data engineers need to be able to transform the data, create specific views of their data, and perform aggregations and joins and all of these complex queries and transformations. Recently we launched the ability to run Spark SQL as well as Spark jobs directly on top of that data, or hook in whatever tools you're using right now for those capabilities, whether it's Python jobs or orchestrating SQL with DBT or something like that.  

Chandra Krishnan - 00:08:21  

Have those run directly on top of Onehouse. What makes Onehouse really exciting is that all of these things happen on top of a shared compute platform that's optimized for your lakehouse workloads. We call it the Onehouse Compute Runtime. On top of that, for Spark and Spark SQL specifically, we have our Quant Engine, which was just recently released and performs additional accelerations. So all of this runs directly on top of OCR, the Onehouse Compute Runtime, which applies lakehouse-optimized operations to the capabilities inside the product. So you get vectorization of operations that happen on top of the lakehouse. We've got advanced multiplexing, job scheduling, and compute management that's entirely serverless inside the platform to maximize the compute efficiency that you're able to get.  
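
As a hedged sketch of the kind of Spark SQL transformation described here, independent of how Onehouse schedules it, a join-plus-aggregation that materializes a derived table might look like this. Databases, tables, and columns are made up for illustration.

```python
# Sketch of a silver-layer transformation expressed as Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver-transform-sketch").getOrCreate()

order_totals = spark.sql("""
    SELECT o.customer_id,
           c.region,
           SUM(o.amount) AS total_amount,
           COUNT(*)      AS order_count
    FROM bronze.orders o
    JOIN bronze.customers c
      ON o.customer_id = c.customer_id
    GROUP BY o.customer_id, c.region
""")

# Persist the result as a silver table (the 'silver' database is assumed to
# exist; in a lakehouse it would typically land in an open table format).
order_totals.write.mode("overwrite").saveAsTable("silver.order_totals")
```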

Chandra Krishnan - 00:09:27  

Some of the really exciting benchmark results we were able to get on top of OCR: we saw queries often run 30x or more faster, write operations speed up by as much as 10x, and the platform overall run more efficiently. With that, I want to hand it back to Cameron to talk a bit about the platform and show you guys what's going on under the hood.  

Cameron O'Rourke - 00:10:02  

Awesome. Thanks, Chandra. If we could switch over to my screen, that'd be super cool. I'm gonna be showing you the Onehouse open data lakehouse UI, the part that you can see. There's so much more that goes on behind the scenes. I'm gonna really focus this demo on two things. We'd love to do a full workshop with you guys because we've actually built this whole diagram. I know this diagram looks a little overwhelming, but we actually have all this stuff running. We built it, we just don't have time to show every little piece. So we're really interested in having a full workshop experience. I want to drill into the speed at which people can provision a world-class data lakehouse implementation and get that up and running really quickly.  

Cameron O'Rourke - 00:10:56  

That's one of the things I've noticed. Then just what it looks like to use an open data lakehouse, have this all be open, and the different ways that you can use the data, just like you've laid out in a real practical sense. I'm gonna refer back to this diagram in a little bit to point out different pieces, but let's just head right on over into the Onehouse UI. I want to start down here on the usage page. The reason I want to do that is I want to be sure everyone understands that what we're actually provisioning, all the servers and the storage and everything, all the components of this data lakehouse, are going in your cloud account. You own it. So here if I click over to Amazon S3, here we have some tables, some silver tables.  

Cameron O'Rourke - 00:11:48  

If I go over here to Amazon S3, these are the S3 buckets where the data is actually living. This is in one of our cloud demo accounts, it's my login, my account. So this is data that I control, I own it. I'm not having to upload my data to a third party or another vendor, or really make a copy of it at all. That's a huge difference in how you use the Onehouse data lakehouse, and how open it is, compared to what you might be used to seeing. The other thing we see here on the usage page is this OCU, the Onehouse Compute Unit. This allows you to limit and throttle the usage of the resources within your cloud account.  

Cameron O'Rourke - 00:12:35  

We also use it as a way to bill for our management services. We have a control plane, we watch your system, look at the metadata. We never touch your data, and we don't look at your data, but we look at the metadata and keep everything running smoothly. We've got a bunch of guys with pagers on top of that if there's ever a hiccup. We can take these OCUs and customize how they're allocated across the different types of compute clusters that we support. These include managed clusters. You could have a separate one for different teams. Looks like I was running a bunch of queries earlier.  

Cameron O'Rourke - 00:13:18  

We have managed clusters, SQL clusters which give you an endpoint for external tools to tap into the Onehouse services, and then we have Open Engines, which let you provision open source compute engines that work with your data lakehouse very easily. To provision new tables in the data lake, and keep in mind all these steps could be automated through an API, you don't have to do any of this manually. You can automate the whole thing. There are basically three steps. The first step is to set up your metadata catalogs, which you want to populate and keep in sync so you can use the data with different systems and tools. We have different things you can populate including Snowflake and Databricks.  

Cameron O'Rourke - 00:14:12  

I want to point out this one here, which is One Table. This is our implementation of Apache XTable, which Onehouse created and donated. It's being used by several players in the industry. This gives you access and generates metadata for Hudi, Delta Lake, and Iceberg. This makes sure your data can be used everywhere. We're gonna see that in just a minute. You set up the catalogs, it's pretty simple, fill in the blanks. The next thing you do is define your data sources. We have a number of data sources; we're pretty into stream data sources because our platform uniquely handles data streams and does incremental ingestion so your data is fresh.  
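
To make that "usable everywhere" point concrete: once XTable has generated Hudi, Delta Lake, and Iceberg metadata over the same files, the single physical copy can in principle be read through any of the three formats. A hedged sketch follows, with a hypothetical path; each format's Spark connector must be configured, and in practice you would usually go through a synced catalog rather than raw paths.

```python
# Sketch: reading one physical copy of data through three table formats,
# assuming XTable-generated metadata exists for each format at the same path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xtable-read-sketch").getOrCreate()
path = "s3://my-bucket/lakehouse/orders"

as_hudi    = spark.read.format("hudi").load(path)
as_delta   = spark.read.format("delta").load(path)
as_iceberg = spark.read.format("iceberg").load(path)  # path-based read; a catalog may be needed in practice

print(as_hudi.count(), as_delta.count(), as_iceberg.count())
```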

Cameron O'Rourke - 00:15:03  

We try to stay away from the overnight batch thing and keep the data coming in concurrent with whatever's going on in your business systems. This involves giving your credentials and the location of the data. That's a simple fill in the blanks kind of thing. The last thing you do is a stream capture. I'm actually gonna create a new stream now. You can see we have four of them running. I have four tables in my bronze or raw data section. I'm gonna add a new stream so you can see what that looks like.  

Cameron O'Rourke - 00:15:55  

I'll pick a data source here. I'm going to grab this from Confluent Cloud. We have Confluent running over here. I have some messages coming in. We're gonna create a new stream. I have something updating a Postgres database, and with these steps, it's gonna set up Debezium, grab that data off the Postgres database, create a Confluent topic, and create the things needed. In short, it's gonna create the whole data pipeline for you. This would take days to set up manually. I'm gonna say I want it from Confluent Cloud CDC, and choose if I want it append only or mutable.  

Cameron O'Rourke - 00:16:43  

This is a big difference with Onehouse. Our data lake can support changes very efficiently. That's one of the big advantages of the Hudi table format: it can handle updates, inserts, and deletes very efficiently. We want to sync every minute. Here's the table we're gonna grab. I'll configure this and show you some options. Quarantine is if you have records that don't meet validation, you can put those into a separate table to deal with later. Transformations are applied during ingestion. These are low code or no code transformations that work on the incremental data coming in. We have one applied here automatically, but I have a few others.  

Cameron O'Rourke - 00:17:31  

You can create your own. Here's one I wrote and added that does a bunch of string operations. You can write whatever you need, and they can get quite robust. Then basic things like validation, key fields, and where the data will be located, what data lake, what database it goes in, and the catalogs to populate. I'm creating this new table, and even if I do schema migrations, add columns, it'll keep all this synchronized. If I'm populating Glue, I'll also push metadata out to Snowflake, send it to Databricks, and do metadata format conversion into Delta Lake and Iceberg as well.  

Cameron O'Rourke - 00:18:25  

Let's give this a name. We'll call this CDC Promotions table. We'll get that going. While that's going, let's see what some data looks like that we already have. Here are the tables we had streams for. If I click into one, I can look at metrics and see how many rows, information about data coming in. We can see inserts and deletes happening every day. This table is larger with quite a few upserts. We can see all the table services running here: cleaning, clustering, compaction services to keep data organized and performing well, as well as metadata sync.  
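
For readers who want to picture what such a stream capture boils down to, here is a hedged Structured Streaming sketch of the same idea: Debezium-style change events from a Confluent/Kafka topic, a validation split into a quarantine path, and a roughly once-a-minute upsert into a Hudi table. The topic name, schema, paths, and credentials handling are all assumptions, and the change payload is assumed to be already flattened.

```python
# Sketch: CDC ingestion with a quarantine split, expressed as plain
# Spark Structured Streaming. All names and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("cdc-ingest-sketch").getOrCreate()

schema = T.StructType([
    T.StructField("promo_id",   T.LongType()),
    T.StructField("discount",   T.DoubleType()),
    T.StructField("updated_at", T.StringType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "pkc-xxxx.confluent.cloud:9092")
       .option("subscribe", "postgres.public.promotions")
       # SASL/API-key auth options for Confluent Cloud omitted here
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
          .select("r.*"))

valid = events.filter(F.col("promo_id").isNotNull())
bad   = events.filter(F.col("promo_id").isNull())   # records failing validation

hudi_opts = {
    "hoodie.table.name": "promotions",
    "hoodie.datasource.write.recordkey.field": "promo_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert valid changes into the Hudi table about once a minute.
(valid.writeStream.format("hudi").options(**hudi_opts)
     .option("checkpointLocation", "s3://my-bucket/chk/promotions")
     .outputMode("append")
     .trigger(processingTime="60 seconds")
     .start("s3://my-bucket/lakehouse/promotions"))

# Land invalid records in a quarantine location to deal with later.
(bad.writeStream.format("parquet")
    .option("checkpointLocation", "s3://my-bucket/chk/promotions_quarantine")
    .option("path", "s3://my-bucket/lakehouse/promotions_quarantine")
    .outputMode("append")
    .trigger(processingTime="60 seconds")
    .start())

spark.streams.awaitAnyTermination()
```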

Cameron O'Rourke - 00:19:56  

We're still waiting for this to provision, it takes a few minutes. Remember the metadata catalogs we looked at earlier. Let's move from provisioning to showing how we could use this data with other tools. Let's pop over to Databricks. These tables are populated automatically. We're not moving data into Databricks, just putting a reference. Everything is by reference, pointing back to Onehouse data. I can run queries in Databricks for machine learning or whatever using the same copy of data in Onehouse.  

Cameron O'Rourke - 00:20:50  

Same in Snowflake. Tables got populated automatically. I can run queries against the data there. You may notice these silver tables. How did those get there? I have DBT Cloud running, using the SQL endpoint I mentioned earlier to tap into Onehouse and run models in DBT to create silver tables or refined data lakehouse tables. Those get created and put right back in Onehouse.  
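
What DBT is doing through that SQL endpoint is conceptually just running SQL over a connection. Purely as an illustration, and assuming a HiveServer2-compatible endpoint (an assumption for this sketch, not something shown in the demo), a script could create a silver table much the way a DBT model would.

```python
# Hypothetical sketch: creating a silver table over a SQL endpoint.
# The PyHive client and endpoint protocol are assumptions; verify what the
# actual endpoint speaks before reusing this.
from pyhive import hive

conn = hive.connect(host="sql-endpoint.mycompany.internal", port=10000, username="dbt_user")
cur = conn.cursor()

# Roughly what a DBT model materializing a silver table compiles down to.
cur.execute("""
    CREATE TABLE IF NOT EXISTS silver.daily_promotions AS
    SELECT promo_id, date(updated_at) AS day, max(discount) AS max_discount
    FROM bronze.promotions
    GROUP BY promo_id, date(updated_at)
""")
```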

Cameron O'Rourke - 00:21:45  

These also participate in all Onehouse services, get table services, metadata sync, that's how they get created here. The same tables get pushed out to all the places we want to see them so we can use them. It's a tight system. You can do all your data prep ETL in the data lakehouse with one copy of it, no extracts or replication, and use it in all workloads in your environment. That's amazing. We've been waiting a long time in the industry for something integrated like this where we can do all workloads on one copy of data.  

Cameron O'Rourke - 00:22:45  

That's mostly what I wanted to show. You can get at it through AWS and the rest. There's so much more I'd love to show, but we don't have time. Hopefully this showed how you can quickly provision a data lakehouse and handle all your use cases with one copy of data. One other thing: we use SQL for data prep, but sometimes data transformations require imperative code, like complex string processing, text, recursive or graph structures, feature engineering. Chandra's gonna show a new feature that allows you to submit code right into the Onehouse ecosystem. I'll let Chandra take it away.  

Chandra Krishnan - 00:23:52  

Of course. Thanks Cameron. Super exciting. Thanks for the demo. I'm gonna take over the screen share now. Alright, we can flip it over to what I've got. Perfect. Thank you guys. As Cameron mentioned, this is something new we've released in the Onehouse platform: the ability to run Spark jobs directly on top of the compute that Onehouse is provisioning and managing.  

Chandra Krishnan - 00:24:43  

Spark has become a powerful framework for data transformations. How many of you are running Spark jobs on other platforms? Maybe EMR Spark jobs or GCP Dataproc? Drop in the comments where you run Spark jobs. Maybe hosting yourself, running on Kubernetes or something. It's always an exciting infrastructure challenge. What we want to do is if you have a Spark job you're running or want to write one, take advantage of all the accelerations we have from our Quant Engine, lakehouse integrations, and table services and management, and have that data interoperable across all tools in your ecosystem.  

Chandra Krishnan - 00:25:25  

It's straightforward. You have your existing code somewhere, maybe a PySpark job or a compiled Java jar. You go in, give the jar name, specify if it's a jar or Python code, assign it to a compute cluster. One of the compute clusters Cameron mentioned earlier under the clusters tab is a Spark cluster. You create that Spark cluster, and it shows up as something you can assign the job to. Then you pass your spark submit args, the same arguments you'd give to any spark submit regardless of where you're running it. You give your class name or your Python file, spark configs, and hit create job. I've got a bunch of jobs created here, including one for the conference demo.  

Chandra Krishnan - 00:26:53  

You can see the past runs of the job. If the last run failed, you can look at the driver logs to debug. For example, I forgot to set my database defaults inside the job. I can fix it and rerun. You can also look at the Spark UI directly from here to analyze the job stages. To run the job, just hit run. It takes a second to spin up the compute and provision resources from the compute cluster. It queues up and runs, and tells you if the job failed or succeeded. If it fails, you get a notification.  

Chandra Krishnan - 00:27:36  

My last run succeeded. Inside this jar, I read some downstream tables, do simple aggregations, and write a new Hudi table. The powerful thing is that because this happens inside the Onehouse ecosystem, the Hudi table I'm writing goes to my S3 bucket in my AWS account. That Hudi table automatically gets synced here. Under the data tab, I can see open X data, my Hudi table, some simple aggregations on synthetic employee data. I see the records aggregated, only five records. I get all the rich metrics from the stream captures Cameron showed. I also get all the table services, so Onehouse automatically realizes this is a Hudi table, runs table services, and syncs it to the configured catalogs.  
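
A minimal sketch of a job like the one Chandra describes here (read a table, aggregate, write a new Hudi table back to S3) might look like the following PySpark file. The table, columns, bucket, and key choices are hypothetical.

```python
# Sketch of a submitted Spark job: read, aggregate, write a Hudi table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("employee-agg-job").getOrCreate()

employees = spark.read.format("hudi").load("s3://my-bucket/lakehouse/employees")

agg = (employees
       .groupBy("department")
       .agg(F.count("*").alias("headcount"),
            F.avg("salary").alias("avg_salary"))
       .withColumn("updated_at", F.current_timestamp()))

# Write the result as a new Hudi table; it lands in your own bucket.
(agg.write.format("hudi")
    .option("hoodie.table.name", "employee_stats")
    .option("hoodie.datasource.write.recordkey.field", "department")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-bucket/lakehouse/employee_stats"))
```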

Chandra Krishnan - 00:29:12  

You can manually edit this as well, make sure the table is cleaned up; it's a merge-on-read table, so the compaction service can run, and all of that comes in seamlessly through the platform. All of these jobs, and the tables they produce, run on top of our Quant Engine and the Onehouse Compute Runtime, so you take advantage of the performance accelerations. These are illustrations from sample customer workloads where compute accelerations reduce infrastructure spend significantly. These are theoretical scenarios, but infrastructure spend goes down and you get exciting cost-performance benefits.  

Chandra Krishnan - 00:30:23  

That's what we wanted to talk about for our workshop and show you guys. Thanks for sticking around and spending time with Cameron and me. We enjoyed showing you all this. Lots of ways to continue working together between Onehouse and your use cases. As you saw today, we have exciting capabilities across ingest, cost performance, table optimizations, keeping your data open, interoperable, and available for your use cases in your organizations. Reach out, got my email and Cameron's email right there. Let's stay in touch. If you want to try the product, let us know. Drop a comment or send us an email, and we'll work together to get you on and trying the product and see what we can build together.  

Cameron O'Rourke - 00:31:34  

Do we have time for questions, Chandra?  

Chandra Krishnan - 00:31:37  

I think we've got maybe a minute or two.  

Cameron O'Rourke - 00:31:40  

I don't know if we have any questions, I can't see them here for some reason.  

Chandra Krishnan - 00:31:43  

Yeah, I'm not able to see the questions either. If someone wants to, I know one of the hosts, you guys want to let us know if there are questions.  

Cameron O'Rourke - 00:31:58  

Maybe not. Oh, here we go.  

Chandra Krishnan - 00:32:00  

Oh, Demetrios,  

Cameron O'Rourke - 00:32:01  

He's coming back.  

Demetrios - 00:32:03  

Work. You guys are making me work. My video is off, to be honest, or it's very dark, but the questions are so many. We've got a whole different platform that I'm gonna give you all a link to so you can check them out. But basically I will drop a few in here. Given that queries seem to be much faster than writes, are there certain use cases you immediately think of for this technology?  

Cameron O'Rourke - 00:32:50  

Well, the big use case for the data lakehouse, I think, is just getting all of your data acquisition and data preparation off of those other platforms and into something that runs faster and does more with fewer copies of the data. It's really a cost argument and an architectural argument. You're still gonna have platforms that do specialized things for machine learning, data science, analytics, dashboarding. You still gotta do analysis, but this makes it so much more cost efficient. Chandra, how would you add to that?  

Chandra Krishnan - 00:33:43  

I think there's tons of really exciting advantages. You highlighted a few. Some things I'd want to talk about are around scale. These platforms, especially built around lakehouse and the way we've been able to operate this at Onehouse, get battle tested at some of the largest scales we've seen out there. That's really exciting to see these platforms sing at scale. It gets me fired up.  

Cameron O'Rourke - 00:34:23  

Yeah. We're also dabbling with vector embeddings for large language models. The data lakehouse holds a lot of data compared to what you can or want to put on a specialized platform. Again, it's a cost savings thing. It's not about doing new things, it's about offloading specialized systems that are expensive and limited in volume; the data lakehouse can handle more. I saw a question about on-prem or cloud. This is completely cloud. If you think about picking up data from Databricks or Snowflake, dashboards, Superset, and using DBT Cloud, it all needs to be in the cloud. That's how you integrate all these workloads. So it's cloud only.  

Chandra Krishnan - 00:35:39  

Yeah, definitely. If you've got use cases, let's chat afterwards. Let's get in touch and talk about cloud versus on-prem. Michael, I know you have questions around cost, speed, and accelerations. We can take this offline one-on-one. There are ways we see performance accelerations on these jobs. If you're running Spark jobs, I'll see if I can drop a link to our most recent blog. At the end of that blog, there's a workload cost calculator, some benchmarks, and a cost predictor where we estimate how much the Onehouse Quant Engine might speed up your workload across the extract, load, and transform phases. Check it out in the blogs on our homepage. Feel free to fill out the form and get results on performance accelerations.  

Cameron O'Rourke - 00:37:26  

Sorry, seeing random questions popping up. Someone asked if it has data quality built in. Yes, we apply data validations on incoming data. You can move records that don't pass validation to a quarantine table to deal with later. The real beauty is you can plug in any third-party data quality solution. The data's so open. Just like you prepare data with whatever industry or customer management system, you can plug that in. It's very flexible. Some questions I can't understand. Someone asked what software is most important to dive into.  

Chandra Krishnan - 00:38:26  

I think you're on the right track now. You have good experience across SQL, Python, data visualization. I'd love to have you start looking at some Spark capabilities and see how Spark can help.  

Cameron O'Rourke - 00:38:51  

Yeah, I think in addition to all those, for people who can put a system together, even the notion of using a data lakehouse is still new. Many think to throw data in a database and don't think about using a data lakehouse to get the scale, efficiency, cost reduction, and openness we provide. So that's a good thing to consider.  

Chandra Krishnan - 00:39:35  

Yeah, definitely. I know we're a little over time here.  

Cameron O'Rourke - 00:39:43  

Yeah,  

Chandra Krishnan - 00:39:44  

We want to  

Cameron O'Rourke - 00:39:45  

Feel free to send us an email; our emails are up there.  

Chandra Krishnan - 00:39:54  

Certainly. Thank you everyone for sticking around a little bit. We had a lot of fun. Thanks for hanging around and looking forward to hearing from all of you and staying in touch.