June 8, 2023

Building a Data Lakehouse: Managed vs. Do-it-Yourself

Intro

Your data platform team is kicking off an ambitious new initiative to build a data lakehouse. The objectives are clear: modernize your data stack to save on the costs of your data warehouse, improve data freshness, and bring order to the data swamp that is your lake. You'll start with an initial proposal, then assemble a proficient team of engineers to construct the lakehouse. Before long, you'll find yourselves navigating the vast array of open-source capabilities and staring at a timeline that stretches six months or more just to get your lakehouse functional.

This process is too slow, costly, and downright agonizing, so we crafted Onehouse – a solution designed to accelerate how you get started, automate the complexities, and keep you at the forefront of the ever-evolving data ecosystem. This article sheds light on the trade-offs between building a Do-It-Yourself (DIY) data lakehouse and adopting a streamlined, managed solution like Onehouse.

What is the Data Lakehouse?

Before we dive in, let's review what a data lakehouse is. The data lakehouse architecture was created at Uber in 2016 to solve the challenges posed by large-scale, real-time data pipelines. The data lakehouse marries the concepts of a data lake and a data warehouse, providing the strengths of both in a single platform. Fast forward to today: the lakehouse has exploded in popularity, with thousands of companies adopting the architecture (Walmart, Zoom, and Robinhood, to name a few).

The lakehouse has grown on the backs of three open source projects: Apache Hudi, Apache Iceberg, and Delta Lake. These projects each offer their own benefits, but they also share many core attributes that define the lakehouse. The common attributes, illustrated in the sketch after this list, include:

  1. ACID transactions on the data lake
  2. Incremental processing of new data from sources, avoiding expensive rewrites
  3. Low-latency ingestion of real-time data for faster insights
  4. A unified storage layer for all your data (batch and streaming, structured and unstructured) that prevents data silos
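
To make the first two attributes concrete, here's a minimal PySpark sketch against Apache Hudi, one of the three projects above. It's an illustration under assumptions rather than a production recipe: it assumes a Spark session with the Hudi bundle on the classpath, and the bucket path, table name, columns, and commit timestamp are hypothetical.

```python
# A minimal sketch of attributes 1 and 2, assuming a Spark session with the
# Apache Hudi bundle on the classpath. The bucket path, table name, columns,
# and commit timestamp below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

table_path = "s3://my-bucket/lakehouse/orders"  # hypothetical location

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # Non-partitioned table keeps the example small.
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

# 1. ACID transactions: this upsert lands as a single atomic commit.
updates = spark.createDataFrame(
    [(1, "shipped", "2023-06-08 10:00:00")],
    ["order_id", "status", "updated_at"],
)
updates.write.format("hudi").options(**hudi_options).mode("append").save(table_path)

# 2. Incremental processing: read only records committed after a given instant,
# instead of rescanning and rewriting the whole table.
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230601000000")
    .load(table_path)
)
incremental.show()
```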

At Onehouse, we've helped countless data teams embark on the journey to build their data lakehouse using these open source technologies. While teams have found success with open source technology, achieving long-term success requires a significant upfront investment, deep expertise to optimize the lakehouse, and ongoing maintenance. We built Onehouse to address each of these phases in the journey.

Phase 1: Getting your Lakehouse off the ground

Getting your lakehouse off the ground is a daunting task for new teams, with hurdles ranging from the initial analysis paralysis during the proposal stage, to setting in motion a multi-month project for a group of engineers to operationalize the lakehouse.

Initial Investment Overhead

Building a lakehouse with open-source technology requires setting up a plethora of tooling - everything from creating Airflow jobs and configuring CI/CD deployments to scaling Kubernetes compute. Companies often spin up a team of 3-6 full-time engineers to set up the lakehouse over the course of 6+ months. Building the lakehouse in a reusable way is even more difficult: organizations end up hiring a new team and repeating the 6+ month process each time they expand to a new department (such as integrating data for a new Sales organization).

Using Onehouse, you can cut the cost of building the lakehouse down to just one part-time engineer (as opposed to 3-6 full-time engineers) and tap into the value of your data within days to weeks (rather than 6+ months). Our platform automates the tactical work involved in constructing a well-operating lakehouse, freeing your team to focus on what they do best: producing and analyzing data.

Avoiding Analysis Paralysis

Apart from the technical tasks involved in building a lakehouse, teams often spend a substantial amount of time and energy just planning their lakehouse. Teams may spend months on internal debates and proofs of concept (POCs) concerning compliance, security, and open-source project selection.

Onehouse helps you skip the POC process by incorporating industry best practices for compliance and providing easy configurations to meet your company's requirements.

Furthermore, Onetable from Onehouse allows you to query data in any data lakehouse format (Apache Hudi, Apache Iceberg, Delta Lake) through any query engine (Apache Spark, Google BigQuery, Databricks, Snowflake, etc.), eliminating the headache of choosing the right open-source project and ensuring your platform is future-proof. You'll get industry-leading efficiency for writing data to the lakehouse while ensuring your data is accessible where and when you need it.
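
As a rough sketch of what that interoperability looks like in practice: once Onetable has synced a table's metadata into multiple formats, the same underlying files can be read through more than one format reader. The example below assumes a Spark session with both the Hudi and Delta Lake readers available and a hypothetical table path; it's one illustration, not the only way to consume a synced table.

```python
# Minimal sketch: after an Onetable metadata sync, the same data files can be
# read through different table formats. The path is hypothetical, and a Spark
# session with both the Hudi and Delta Lake readers is assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("onetable-sketch").getOrCreate()

table_path = "s3://my-bucket/lakehouse/orders"

as_hudi = spark.read.format("hudi").load(table_path)    # read via Hudi metadata
as_delta = spark.read.format("delta").load(table_path)  # read via Delta metadata

# Same rows either way; only the metadata layer differs.
print(as_hudi.count(), as_delta.count())
```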

Phase 2: Optimizing your Lakehouse

Building an efficient data platform yourself requires deep expertise. You can either recruit experienced lakehouse engineers or build up your existing team's expertise over several months. Both these approaches are costly and introduce dependencies on the knowledge of a few engineers.

Mastering Data Modeling

Proper data flow modeling is a fundamental component in building a high-performance lakehouse. While building your lakehouse, data modeling questions will arise, like:

"Given a Kafka topic for each customer, how should we structure our tables for the sales team? Should we store the raw data and use transformations to materialize the final tables, or should we land the data directly in its final format to save on cost?"

Teams may spend months testing one data model, only to discover they need a different model due to unsatisfactory performance. With Onehouse, you can cut through the noise and blueprint your lakehouse alongside our expert team, who have built data platforms from the ground up at scale. You'll get it right on the first attempt, saving you the ordeal of researching, experimenting, and building multiple iterations of your data pipelines. Onehouse also makes it easy to quickly prototype and deploy, so you can test and iterate much faster to fine-tune your pipeline designs.

Transitioning from Batch to Streaming

Another challenge we've observed is moving from batch to streaming pipelines. When teams first build their data lakehouse, they usually start with straightforward batch pipelines. As teams move towards real-time data ingestion, it becomes necessary to use advanced features like Merge-On-Read, tune configurations like async compaction for optimal performance, and resize clusters based on fluctuating data volumes.
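
To give a flavor of what that shift involves, here's a minimal PySpark Structured Streaming sketch that writes a Merge-On-Read Hudi table with asynchronous compaction. The Kafka brokers, topic, schema, and paths are hypothetical, and the option names assume a recent Apache Hudi release; tuning them well for fluctuating data volumes is exactly the expertise this section is about.

```python
# Minimal sketch: stream from Kafka into a Merge-On-Read Hudi table with async
# compaction. Brokers, topic, schema, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "customer_events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

hudi_options = {
    "hoodie.table.name": "customer_events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    # Merge-On-Read keeps write latency low by appending row-based log files...
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # ...and compacts them into columnar base files off the write path.
    "hoodie.datasource.compaction.async.enable": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

query = (
    events.writeStream.format("hudi")
    .options(**hudi_options)
    .option("checkpointLocation", "s3://my-bucket/checkpoints/customer_events")
    .outputMode("append")
    .start("s3://my-bucket/lakehouse/customer_events")
)
query.awaitTermination()
```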

Onehouse offers these capabilities out-of-the-box as a managed service, empowering you to build streaming pipelines that fit your latency and cost requirements seamlessly.

Phase 3: Sustained Maintenance of Your Lakehouse

Once your data lakehouse is up and running, it requires ongoing, careful maintenance, including data integrity checks, debugging pipeline failures, and continuous updates to keep up with the latest technology.

Ensuring Data Integrity

While most DIY lakehouse teams are aware of the importance of integrating data integrity checks into their pipelines, these checks often get deprioritized or inadequately addressed amidst a mountain of other priorities.

Rather than relying on piecemeal solutions, Onehouse provides automated data integrity checks to maintain data consistency between the source and the data lakehouse, preventing problems such as silent data loss or duplicate records in a table.
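
For a sense of the kind of check in question, here's a minimal PySpark sketch that compares row counts between a source table and its lakehouse copy and verifies the record key is unique. It's an illustration only, not Onehouse's implementation, and the connection details and table names are hypothetical.

```python
# Minimal sketch of two common integrity checks: no silent data loss (row counts)
# and no duplicate records (unique record key). Connection details, paths, and
# table names are hypothetical; this is not Onehouse's implementation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("integrity-sketch").getOrCreate()

source = spark.read.format("jdbc").options(
    url="jdbc:postgresql://source-db:5432/app",  # hypothetical source database
    dbtable="public.orders",
    user="reporting",
    password="...",  # placeholder; supply via a secret manager in practice
).load()
lakehouse = spark.read.format("hudi").load("s3://my-bucket/lakehouse/orders")

# Silent data loss: the lakehouse table should not have fewer rows than the source.
assert lakehouse.count() >= source.count(), "rows missing from the lakehouse table"

# Duplicate records: the record key should remain unique after upserts.
dupes = lakehouse.groupBy("order_id").count().filter(F.col("count") > 1)
assert dupes.count() == 0, "duplicate order_id values found"
```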

Operating and Debugging Pipelines

Having built data platforms to handle petabytes of data, we understand that debugging is more complex than coding, and operations are harder than debugging. Building the lakehouse is just the first step, followed by countless nights of on-call alerts and developing substantial infrastructure to ensure smooth operation. We've baked decades of learning into Onehouse to provide a seamless, pain-free experience by automatically operating your lakehouse at scale and providing monitoring dashboards out-of-the-box.

Keeping up with New Tech

[Image: FirstMark MAD Landscape]

The data landscape is rapidly evolving, with new technologies emerging and existing ones continuing to innovate. When you build a DIY lakehouse, you'll need to track the latest innovations and continuously set aside cycles to integrate them into your data platform.

For instance, a new Spark version release might require months of updating and testing your pipelines to leverage the latest features. With Onehouse, you gain automatic access to the latest technologies without the hassle of constant updates and system overhauls.

Why do teams choose to DIY?

While managed solutions offer convenience, there are often trade-offs that lead teams towards DIY solutions instead of adopting a managed product. At Onehouse, we've designed our product to deliver the benefits of a managed solution without the typical downsides.

Preserving Data Privacy

A common concern with managed software is the need to send data outside your cloud account to a third-party provider. This issue even plagues popular providers like Snowflake and Fivetran. Instead, Onehouse employs a privacy-first architecture, enabling you to enjoy the advantages of a managed lakehouse without your data ever leaving your cloud account.

Enabling Interoperability

Vendor lock-in can be a significant pain point when using managed software, and it's especially prevalent in data warehouses. We believe in the freedom of data teams to choose their tools, and in facilitating access to the data ecosystem rather than restricting it. With Onehouse (and Onetable), you can access your data using preferred query engines, catalogs, and more.

Built on Apache Hudi, an open-source project with a thriving ecosystem, Onehouse has pledged a commitment to openness. Your lakehouse is your lakehouse, so if you ever decide Onehouse no longer fits your goals, you can turn off the service and retain a fully functioning Apache Hudi lakehouse in your cloud provider account.

Implementing Cost Controls

Cost is often the top priority when building a data platform. Unfortunately, many managed products provide insufficient cost controls for users in the name of simplicity. Onehouse strikes a balance, providing an easy-to-use experience while empowering users to navigate tradeoffs between data latency and cost. Onehouse automatically cuts out wasteful spending with autoscaling, while allowing users to set limits to ensure they stay within their team's allotted budget. We routinely find that Onehouse users save significant costs versus other vendors and even DIY buildouts.

Conclusion

Building a data lakehouse from scratch can work well for determined teams, but it presents complexities and demands significant resources. Onehouse provides a compelling alternative, blending convenience and control without compromising on data privacy, interoperability, or cost management.

By choosing Onehouse, you sidestep the overwhelming setup process, effortlessly optimize your data pipelines, and maintain your lakehouse without the burden of constant updates. Our solution frees your team to concentrate on what matters most: driving insights and value from your data.

At Onehouse, we're committed to guiding your data journey. We leverage our expertise to offer a solution that adapts to your needs and grows with your team. With Onehouse, your data platform becomes a powerhouse for insights and innovation. Reach out to us at gtm@onehouse.ai, or sign up through our product listing on the AWS marketplace today!
