Earlier we announced our new company Onehouse, which provides a managed data lakehouse foundation built on top of Apache Hudi ("Hudi" for brevity). In this blog, our founder and CEO Vinoth Chandar, who is also the creator and PMC Chair of Hudi, would like to transparently share our principles and plans to continue contributing to the Hudi community in a meaningful and uninterrupted way.
To the unfamiliar reader, Hudi is governed by the Apache Software Foundation and no one individual or organization can exercise unjust control over the project. Hudi has long adopted open, transparent governance processes across code reviews, contributions, RFC/design proposal process, releases, user support and roadmapping, which will continue to drive the project. I have now spent ten years working on open-source software as my day job and nothing is going to shake that commitment.
To put things into perspective, 13 of the 16 members of the Hudi PMC currently work at companies other than Onehouse. Our PMC members and committers come from different organizations across the world, all of which depend on Hudi in mission-critical ways. Onehouse will join that long list of companies, leveraging Hudi to build out services in our platform, while giving back by investing our resources into strengthening the community.
Having supported the Hudi community for over four years now, I feel Hudi has become a victim of its own success, where the tremendous growth has driven user support, developer engagement and expectations well beyond what volunteer engineers can sustain. Onehouse will dedicate a team of full-time engineers, product managers and support engineers to the Hudi community.
As we researched the market and interacted with different data companies, we noticed a trend where table formats were treated as new leverage to “lock-in” data and then upsell proprietary services on top of them or optimize specifically for a query engine. Most lake offerings are limited to query engines or authoring pipelines, without mature, automated data management functionality. Without any open services to manage this data, users are still locked-in even on open table formats or forced to further invest their own engineering resources on piecemeal solutions. We think this is the single biggest problem preventing organizations from operationalizing their lakes.
As a project, Hudi has always been much more than a format, and this principle goes all the way back to Hudi's origins at Uber, where even the very first version shipped with automatic rollbacks, indexing and cleaning. Users routinely pick Hudi over other choices due to a rich set of such open services: clustering, snapshot/restore, compaction, file-sizing, streaming ingest, incremental ETL, CDC sources and much, much more. Having them available in open source, hardened by the large organizations that run them, reduces duplication of engineering efforts across the board.
At Onehouse, we want to uphold these principles and contribute even more foundational data lakehouse components like a caching service or a standalone meta-server. Onehouse's mission is to provide our customers with an open, interoperable data plane across the numerous lake engines, warehouses and other ML/AI data frameworks out there. While this may require us to build interop layers with other formats and systems, we firmly continue to believe Hudi is the most versatile toolkit available today to ingest, manage and optimize data on the lake.
One question that surfaced often over the last two years is: "What is your commercialization strategy for open source?" There wasn't one. I never had any strategy here; the Hudi community just happened. The one thing we never wanted to do was create some kind of "enterprise" version that would lock away all the useful features. Studying different open source companies, we noticed that the ones beloved by customers and open-source users alike supported seamless switching between their open-source and commercial/managed offerings.
Typically, when companies start building out for their early analytics needs, they lack the engineering resources needed to build a data lake on top of different open source projects. Instead, they choose to start with vertically integrated and fully managed data solutions, which are typically closed. Eventually they adopt a more open data lake later in the life cycle as workload complexity, cost and scale of data increase, once they can justify expanding their data engineering teams significantly. So, we thought: why not start open and stay open? What's preventing this?
A wise man once said, "debugging is harder than coding, operations is harder than debugging". At Uber, we built a ton of infrastructure around Hudi to operate it as a platform for the entire company at planet scale. Onehouse will offer a simpler path for companies to adopt a data lake without investing in such infrastructure upfront, so they can enjoy open data formats and services in Hudi from the start. If companies outgrow Onehouse, have mandates for in-house data operations, or for any other reason, they would be able to migrate off Onehouse to plain Apache Hudi operated by their own team. We believe this is the true freedom that should come with infrastructure services built around open source software.
What I hope you will take away from this is that Hudi and open source projects remain close to our hearts at Onehouse. While we have come far on our open source journey, we are still learning and discovering on this new journey as a company. What we do know is that we want to open source even more tools or services around the broader data ecosystem, as we hit the road. We encourage you to reach out to us at firstname.lastname@example.org with any follow-ups or ideas. Finally, if you are not a part of the Hudi community yet, join us on Slack or learn how to get involved.