Created at Uber in 2016 to bring data warehouse capabilities to the data lake for near real-time data, Apache Hudi (“Hudi” for brevity) pioneered the transactional data lake architecture, which has now seen mainstream adoption across all industry verticals. Over the last 5 years, a rich community has developed around the project and has innovated rapidly. Hudi has brought data warehouse- and database-like functionality to data lakes, making capabilities such as minute-level data freshness, optimized storage, and self-managing tables possible directly on data lakes. Many organizations from across the world have contributed to Hudi, and the project has grown 7x in under two years to nearly 1 million monthly downloads. I am humbled to now see enterprises like Amazon, Bytedance, Disney+ Hotstar, GE Aviation, Robinhood, Walmart, and many more adopt and build exabyte-scale data lakes on Apache Hudi to power their business-critical applications.
Jumping on the bandwagon, I am thrilled to share what we have been building using Hudi over the last few months - Onehouse. To kick-start our adventure, we raised an $8MM seed round from Greylock Ventures and Addition - investment firms with stellar track records and deep experience in nurturing enterprise data startups. Below is the story of our journey and vision for the future.
Working with different companies in the Hudi community, we noticed a common pattern. They often start with a data warehouse (“warehouse” for short) for traditional BI/Analytics, primarily because it’s easy to use and often fully managed. Then, as they grow, so do the complexity and scale of their workloads, which leads to an exponential increase in cost. Rising costs, along with more advanced data science workloads that are unachievable on their warehouse, drive them to invest in a data lake (“lake” for short). The investment in a lake comes with a whole new set of challenges around concurrency, performance, and a lack of mature data management.
Most companies end up living between a rock and a hard place, juggling data across both a lake and a warehouse. However, in the last few years, emerging technologies like Hudi have provided the means to solve some of the problems above by adding critical warehouse features like transactions, indexing, and scalable metadata to data lakes. Recently, the term “Lakehouse” has been popularized as a new type of lake that supports both kinds of workloads. The term is new, but it captures the essence of why we originally built Hudi at Uber.
Even though the technologies exist, a lakehouse still needs to be built by highly skilled, expensive engineering teams, using various open source tools together. Engineers need to deeply understand at least 3 to 4 distributed systems or databases and build everything including CDC ingestion, data deletion/masking jobs, file size control, and data layout optimizations from the ground up. In the five years of engaging with the Hudi community, I have seen this routinely take anywhere from a few months to over a year depending on the data scale and complexity. In most cases, companies are rebuilding fragmented pieces of the same data infrastructure.
Unlike other efforts out there, Hudi recognized this problem from the very start and provides a rich set of open services that ingest streaming data, reclaim storage space, and optimize tables for performance. For example, we have seen a steady stream of companies rely on Hudi’s streamer tool to build their lake ingestion, which drives code-level standardization. However, companies still need to build operational excellence with these services and their interplay for their lakes to reach their full potential. Operating a lakehouse is challenging and can become even more daunting when you have real-time streaming and transactional data sources that require complex change data capture pipelines.
In fact, we built a ton of operational systems around Hudi at Uber, which made it possible to offer the lake as a service to a massive, global organization with 20,000+ employees. Having spent countless hours in the last decade resolving production outages and restoring system stability across five large-scale distributed databases, including Voldemort, ksqlDB, and of course, Hudi, I can safely say that operational excellence is the most important aspect of successful data infrastructure. Many lake projects never reach their full potential due to a lack of standardized, high-quality data infrastructure around the lakehouse technologies, and that’s what we are going to fix.
We rethought the entire set of data architectures at play here through the eyes of the user. For example, if I were to join the next LinkedIn or Uber, how would I set it up for success with data? What lessons have we learned, and what would we change? We believe that data should not be locked into a particular query or compute engine, but should be universally accessible across different BI and AI tools and frameworks, sitting on vendor-neutral, standardized data infrastructure that does not take another 3-4 years of investment to build. That’s how Onehouse was born.
Onehouse is a cloud-native, managed foundation for your lakehouse that automatically ingests, manages, and optimizes your data for faster processing. Onehouse is not another query engine, but a self-managing data layer that seamlessly interoperates with any of the popular query engines, data/table formats, and vendors out there, so you can pick what best suits your needs. By combining breakthrough technology with a fully managed, easy-to-use service, organizations can build data lakes in minutes, not months, realize large cost savings, and still own their data in open formats. Onehouse aims to be the bedrock of your data infrastructure as the one home for all of your data. We are getting started by tackling challenges broadly in the following categories.
Continuous Data Delivery: Built on Hudi’s incremental storage and processing capabilities, Onehouse will replace old-school batch processing with incremental pipelines. Processing only the data that changes will result in massive cost savings and low-latency pipelines that keep your data always up to date.
Automagic Data Infrastructure: Onehouse delivers automagic performance at scale with no tuning required. Automate away tedious data chores including clustering, caching, small-file compaction, catalog syncing, and scaling table metadata, allowing data engineers/scientists to focus on creating direct business impact.
Truly Open and Interoperable: Sometimes you need Spark, sometimes you need Presto, and sometimes you still need a warehouse. The modern data ecosystem is evolving at such a rapid pace that interoperability with many engines, at the same levels of performance and functionality, is now the only scalable model. While open formats are the necessary first step, without open data services to manage that data, users run the same risk of being locked into the few vendors who provide such services. By reusing Hudi's open services, Onehouse delivers true openness and flexibility.
Unlock Savings at Scale: Instead of retrofitting for advanced workloads much later, laboring through data migration projects while footing costly data infrastructure bills until then, Onehouse helps companies get going with a future-proof architecture. Onehouse provides ease of use early on, as teams embark on their analytics journey, while scaling cost-effectively as data volumes grow or complexity increases.
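To make the incremental idea behind Continuous Data Delivery a little more concrete, here is a toy sketch in plain Python. It is not Hudi or Onehouse code: the `Change` record, the integer `commit_time`, and the `IncrementalTotals` class are all hypothetical names invented for illustration. It only shows the general pattern of remembering a checkpoint and processing what arrived after it, rather than recomputing everything on each run.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """One change record. All fields are hypothetical, for illustration only."""
    commit_time: int  # assumed to increase monotonically with each commit
    key: str
    amount: float


@dataclass
class IncrementalTotals:
    """Keeps per-key totals plus a checkpoint of the last commit processed."""
    checkpoint: int = 0
    totals: dict = field(default_factory=dict)

    def apply(self, log):
        """Fold in only changes newer than the checkpoint; return how many."""
        # This toy version still scans the in-memory list to find new changes;
        # the point is that only the new ones are aggregated.
        new_changes = [c for c in log if c.commit_time > self.checkpoint]
        for change in new_changes:
            self.totals[change.key] = self.totals.get(change.key, 0.0) + change.amount
        if new_changes:
            self.checkpoint = max(c.commit_time for c in new_changes)
        return len(new_changes)


def batch_totals(log):
    """Old-school batch: recompute every total from the full log each time."""
    totals = {}
    for change in log:
        totals[change.key] = totals.get(change.key, 0.0) + change.amount
    return totals


log = [Change(1, "a", 10.0), Change(2, "b", 5.0)]
state = IncrementalTotals()
first_run = state.apply(log)    # processes both existing changes

log.append(Change(3, "a", 2.5))
second_run = state.apply(log)   # processes only the newly committed change

# Both approaches agree on the result; the incremental one did less work.
assert state.totals == batch_totals(log)
```

The second run touches one record instead of three. At real data volumes that difference is where the cost and latency savings would come from, assuming the storage layer can hand over just the changed records.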
So, where does this leave Hudi? Actually, better than ever! We are not here to fork an enterprise version of Hudi. With the funding, we can now bring the energy of a full-time, dedicated team of engineers to the Hudi community. Having supported the Hudi community for over four years now, I feel Hudi’s tremendous growth has driven user support, developer engagement, and community expectations well beyond what volunteer engineers or individual engineering teams at different companies can sustain. We will be avid Hudi users and active contributors to the community, and we will remain champions for the project. We plan to make more foundational open source contributions from Onehouse to help make Hudi's already great platform services even better. At Onehouse, our focus is going to be on helping companies who cannot afford such large engineering investments, leveraging our collective operational experience with large-scale data systems. We transparently share more about our commitment towards openness in this dedicated blog post.
We have been working on an initial iteration of this vision over the past few months. If you are on the cusp of building out your lakehouse or are actively looking to future-proof your data architecture, then we would love to partner with you to bring this product and platform to life. Engage with us on one of these next steps:
Finally, I want to take this opportunity to thank our investors for their unwavering support as I underwent a year-and-a-half-long, arduous journey to obtain my green card. As someone who has poured four years of weekends/nights into the community on a work visa, I cannot ask for a better outcome - being able to work full-time on making Hudi and data lakes better.
Back to building...