First, I want to thank Ryan for writing this blog. It finally allows me to respond publicly to FUD I have been hearing through users' questions, mumblings in podcasts, hallway conversations in some offices, and even some analysts. The easier thing would have been a direct video call with at least one Hudi PMC member, or even just following up with the Hudi committers who responded directly in these various places.
I would have loved to write a long line-by-line response, since I have a lot to say as well. But at the current time we are focused on some very exciting things we are launching for the open source community at large, including the first-ever Open Source Data Summit and open-sourcing a new interoperability project, “Onetable”.
Even though this write-up is shorter, we will address the blog in full, since we know misinformation travels fast. The statements in the blog are factually inaccurate, and we firmly stand behind the atomicity, consistency, isolation and durability (ACID) guarantees of Hudi. The claims in the blog can be roughly categorized into the following types of statements:
Much of the blog talks about instant collisions, and covering that first will dramatically reduce the volume of text to address. We will separate the discussion into design and implementation. The Hudi spec clearly states that instant times are monotonically increasing, so I don't understand the need for simulating collisions on the same millisecond epoch.
Action timestamp: Monotonically increasing value to denote strict ordering of actions in the timeline. This could be provided by an external token provider or rely on the system epoch time at millisecond granularity.
Writer will pick a monotonically increasing instant time from the latest state of the Hudi timeline (action commit time) and will pick the last successful commit instant (merge commit time) to merge the changes to. If the merge succeeds, then action commit time will be the next successful commit in the timeline.
When it comes to implementation, I understand the criticism as “what if you end up with collisions, since Hudi uses wall-clock time?”. The response for this is already summarized here in a FAQ. Additionally, I will share my own practical perspectives. I had the good fortune to operate an actual low-latency data store based on vector clocks at LinkedIn scale. Our key-value store at LinkedIn used vector clocks, while Apache Cassandra showed that timestamps are good enough to resolve conflicts, reducing the need for a complex external token provider. Data lake transactions last much, much longer than operational ones, and timestamp collisions are very rare. With minute-level commits, the probability of two writers independently landing on the same millisecond is about 1/60,000 (≈0.0017%); this has not been an issue at all in practice, and is easily fixable with a salt if needed. However, we understand this is a touchy, emacs-vs-vim-like topic, so Hudi already has a time generator that is pluggable and skew-adjusting, like Spanner's TrueTime APIs. Even there, implementing TrueTime-like timestamp generation is much, much easier, since we can afford to wait several hundred milliseconds to ensure timestamps only move forward (the same amount of waiting would kill performance in an operational system).
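To make the “wait to ensure time only moves forward” idea concrete, here is a minimal sketch in Python with hypothetical names (Hudi's actual pluggable time generator is a Java interface; this only illustrates the idea): never hand out the same millisecond twice, waiting out the clock briefly if it has not advanced.

```python
import threading
import time


class MonotonicMillisGenerator:
    """Illustrative sketch only: issue strictly increasing millisecond
    instants, waiting briefly when the wall clock has not advanced."""

    def __init__(self, max_wait_ms: int = 500):
        self._lock = threading.Lock()
        self._last_ms = 0
        self._max_wait_ms = max_wait_ms

    def next_instant_ms(self) -> int:
        with self._lock:
            waited = 0
            now = int(time.time() * 1000)
            # Wait while the clock has not moved past the last issued
            # instant (same-millisecond callers, small backward skew).
            while now <= self._last_ms and waited < self._max_wait_ms:
                time.sleep(0.001)
                waited += 1
                now = int(time.time() * 1000)
            # Fall back to last + 1 if the clock is still behind.
            self._last_ms = max(now, self._last_ms + 1)
            return self._last_ms


gen = MonotonicMillisGenerator()
instants = [gen.next_instant_ms() for _ in range(5)]
assert all(b > a for a, b in zip(instants, instants[1:]))
```

Writers sharing such a generator (or each waiting out the maximum expected skew, TrueTime-style) cannot collide on an instant time, which is exactly why a few hundred milliseconds of slack is an acceptable price on a data lake where transactions run for minutes.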
The remainder of the blog is summarized into five bullets at the end, which we will now talk about one by one.
Here, again, there are clear FAQs (1, 2), so it's as much a revelation as it is just understanding the expectation. As mentioned there, the conflict detection mechanism is pluggable and can be used to implement a check for uniqueness violations (under the same lock discussed in the blog). So why not do it? Because it adds an extra read of keys for each commit, which is not desirable for the majority of our users, who are writing much lower-latency pipelines. By the way, this is also how conflict resolution is defined for a third referenceable system, Delta Lake, and I am admittedly unclear on Iceberg's strategy here. We have already written a pretty in-depth blog about how keys help our users, by de-duping and event-time-based merging.
To address any concerns raised around GDPR, CCPA, et al.: by default (e.g., in the default Simple Index and Bloom Index), any subsequent updates and deletes are applied to all copies of the same primary key, because the indexing phase identifies the records of a primary key in all locations. So deletes in Hudi remove all copies of the same primary key, i.e., duplicates, and comply with GDPR and CCPA requirements. Here are two examples [1, 2] demonstrating that duplicates are properly deleted from a Hudi table. Hudi 0.14 has now added an auto key generation scheme, which removes the burden of unique key generation from the user.
First, the copious sentences about a lack of clarity around lock provider requirements can be condensed by reading the concurrency control page, which lists the different deployment modes. Our Iceberg friends may not buy that single-writer, concurrency-avoiding operation is doable, but Hudi has built-in, automatic table services (so you don't always have to buy them from a vendor). Taking a step back, we need to really ask ourselves whether the lakehouse is an RDBMS, or whether the same concurrency control techniques that apply to OLTP transactional databases are applicable to high-throughput data pipeline writers. Hudi takes a less enthusiastic view of OCC, and has built out things like MVCC-based non-blocking async compaction (which the commit-time decision significantly aids), so writers can work non-stop with compactions running in the background. Single-writer mode is a very smart choice to avoid the horrible livelock issues with optimistic concurrency control (a common pain we hear from prominent Iceberg users), while letting many of the use cases that need concurrent writers, like table maintenance, be taken care of by the system itself.
Getting to the topic of concurrent writers and table versions, let me clarify that nothing affects snapshot queries (batch reads) on the table with concurrent writers. Let's say a commit with a greater instant time (say C2) finishes before a commit with a lesser instant time (say C1), leaving C1 as an inflight write transaction in the Hudi timeline. It may be tempting to think this violates some serializability principle for snapshot queries, but this scenario is no different than two Iceberg snapshots (uuid1 and uuid2, uuid2 > uuid1) committing the snapshot of uuid2 first. In the Hudi example, whether or not C1 successfully commits depends on whether there are conflicts between the concurrent commits (C1 and C2), inflight compactions, or other actions on the timeline. If there are, Hudi aborts the commit with the earlier instant time (C1), and the writer has to retry with a new instant time later than C2, similar to other OCC implementations. When discussing retries in this scenario, the blog conveniently misses the early conflict detection that was added well over a year ago to abort early and avoid recomputation.
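The C1/C2 flow can be sketched as follows (hypothetical names, not Hudi's actual APIs): under the table lock, the earlier-instant commit checks whether any file group it wrote was also touched by a commit that completed after it began, and aborts with a fresh, later instant time if so.

```python
def has_conflict(candidate_files, commits_completed_since_start):
    """Illustrative OCC check: conflict iff the candidate's written file
    groups overlap those of any commit that finished after it started."""
    touched = set()
    for commit in commits_completed_since_start:
        touched.update(commit["files"])
    return bool(set(candidate_files) & touched)


# C1 starts at instant t1, C2 at t2 > t1, and C2 finishes first.
timeline = [{"instant": "t2", "files": {"file-group-1"}}]

# Disjoint writes: C1 may still commit, even with the earlier instant.
assert not has_conflict({"file-group-2"}, timeline)

# Overlapping writes: C1 aborts and retries with a new instant > t2.
assert has_conflict({"file-group-1", "file-group-3"}, timeline)
```

The point of the early conflict detection mentioned above is simply to run a check like this before the expensive write work finishes, rather than only at commit time.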
Now, that brings us to the discussion on incremental read patterns. Incremental queries need special handling, since Hudi currently chooses to use the instant time from the start of the transaction for incremental queries. When a case like the above happens, the incremental query and source do not serve data from any inflight instant or beyond, so there is no data loss or dropped record. This test demonstrates how the Hudi incremental source stops proceeding until C1 completes; Hudi favors safety and sacrifices liveness in such a case. The blog brings up the case of C2 being handed out to incremental reads while C1 is chosen in memory but does not yet have a requested file on the timeline. Cases like network errors, file creation errors, etc. will ultimately fail the creation of the requested file, and C1 will be retried with a different commit time.
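The safety-over-liveness choice amounts to capping incremental reads at the first non-completed instant on the timeline. A minimal sketch, with hypothetical names:

```python
def incremental_read_bound(timeline):
    """Illustrative sketch: return the latest instant an incremental
    query may serve, stopping before the earliest inflight instant so a
    later-finishing earlier commit (C1) can never be skipped."""
    bound = None
    for instant, state in sorted(timeline):
        if state != "completed":
            break
        bound = instant
    return bound


# C2 completed but the earlier C1 is still inflight: serve nothing yet.
assert incremental_read_bound([("c1", "inflight"), ("c2", "completed")]) is None

# Once C1 completes (or aborts and retries beyond C2), reads reach C2.
assert incremental_read_bound([("c1", "completed"), ("c2", "completed")]) == "c2"
```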
The blog is quick to point out one FAQ that acknowledges the guards are not implemented for a specific version, suggests an ad-hoc approach to concurrency design, and stops just short of calling Hudi contributors names. It conveniently ignores the FAQ right below it, which brings up a very important reason why. Community over Code!
When we added concurrent writing in 0.7, my smart then-co-workers at Uber engineering also implemented, in parallel, a change to the timeline that addresses this even without the liveness sacrifice. Standard practice for Hudi upgrades is to first upgrade the reader (query engine) and then the writer (see release notes). However, a certain very popular hosted query engine did not upgrade for an entire year, and a bunch of users would have had their production broken if we had shipped it. If I have to choose between appearing smart and helping users, or at the very least not causing them pain, I'll always pick the latter. In this case, the vast majority of the community agreed as well. The blog once again fails to mention how this has now changed in the 0.14 release.
We are comparing apples to oranges here. Hudi's read-optimized query targets the MOR table only, with guidance around how compaction should be run to achieve predictable results. It provides users with a practical tradeoff, decoupling write performance from query performance. MOR compaction, which by default runs every few commits (or “deltacommits”, to be exact, for the MOR table), merges the base (parquet) file and corresponding log files into a new base file within each file group, so that a snapshot query serving the latest data immediately after compaction reads the base files only. This periodically aligns the “state” boundary that is pointed out in the blog. To read the latest snapshot of a MOR table, a user should use the snapshot query. Popular engines including Spark, Presto, Trino, and Hive already support snapshot queries on MOR tables. We have some of the largest data lakes leveraging this in a safe, smart way to save millions of dollars, getting both columnar read performance when data freshness does not matter (say you want to batch deletes once every X hours) and faster write performance from just logging deltas. Note that read-optimized queries have nothing to do with right-to-be-forgotten enforcement; there are typical SLAs within which records need to be forgotten, and running a full compaction within that time frame is how organizations accomplish this.
So, is the issue simply that we expose this flexibility to the user? Are you wondering why on earth you would optimize for performance here over consistency? See PACELC, which calls out how important these tradeoffs are in thinking about system design, and popular systems like Cassandra, which allow exactly these tradeoffs.
In all, the only constructive feedback I could derive was that the spec could have been clearer or more expansive in places. More on this later.
In this section, we will use some quotes from the blog and put Iceberg in front of a mirror, as an exercise in empathy building, if not anything else.
"But the primary problem is that this check isn’t required, nor is the issue mentioned in the Hudi spec."
I sincerely admire this passion for keeping a piece of documentation up to date. But when it comes to concurrency and all the different aspects we discussed, the real thing to consider is a TLA+ spec, if we are going for a pedantic show of correctness.
"The problem is that locking is entirely optional"
The funny thing is, we now have production Iceberg experience from Onetable, and the one issue we routinely hit is a dangling pointer from the metadata file to a snapshot, which renders the table unusable without manual intervention. The Iceberg spec has zero mentions of the word “catalog”, despite how critical it is to safely operating on a table concurrently. Moreover, the Spark write configs seem to default to a “null” isolation level, with no checks? Or isn't this the same as an Iceberg user running the Hadoop Catalog and then not realizing it's not truly safe?
"Reading a Hudi table is similar to reading from Hive, except for using the timeline to select only live files"
Couldn't this be rewritten for Iceberg as “except for reading some avro files first to select the live files”? The tie-in with Hive feels like a convenient “kill two birds with one stone” play using some clever word association, while completely ignoring a direct FAQ.
"Hive’s biggest flaw was tracking table state using a file system. We knew that we wouldn’t achieve strong ACID guarantees by maintaining that compatibility"
Since Hive as a table format does not have a spec, it's again left to everyone's own interpretation what the term even means. While Iceberg takes pride in not relying on file naming conventions to track state, it completely glosses over practical considerations, like the fact that losing one manifest/metadata file to an operational error will render the table useless. On the other hand, the blog is quick to unilaterally decide why Hudi designers chose to add this information to the file names: compatibility with Hive. Again, see the FAQ above for how we differ, and maybe read more FAQs to fully understand what this buys us. (Hint: it's something Iceberg cannot do.)
"Iceberg considers consistency guarantees the responsibility of the query engine to declare"
"The Iceberg spec is straightforward, complete, easier to implement, and delivers the guarantees that these engines rely on."
Now, on to the more complex stuff. Maybe it's me, but I can't tell what delivers what by reading these two statements. Going by the first statement, and the fact that “serializable” is a documented write operation, I take it that Iceberg on Spark guarantees serializability? If so, we actually find that it violates serializability on a straightforward test that mimics, literally, the example from the Postgres docs.
Screencast showing the Iceberg test live in action. Gist here.
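For readers who skip the screencast, the shape of that Postgres-docs write-skew scenario (which the gist runs against Iceberg on Spark) can be sketched in plain Python, with no Spark or Iceberg APIs, purely to show the anomaly itself:

```python
# Two transactions start from the same snapshot of a two-class table.
rows = [("class1", 10), ("class1", 20), ("class2", 100), ("class2", 200)]
snapshot = list(rows)


def sum_and_insert(snap, read_class, write_class):
    """Read one class under the snapshot, insert its sum into the other."""
    total = sum(v for c, v in snap if c == read_class)
    return (write_class, total)


w1 = sum_and_insert(snapshot, "class1", "class2")  # inserts ("class2", 30)
w2 = sum_and_insert(snapshot, "class2", "class1")  # inserts ("class1", 300)
rows += [w1, w2]

# Under any serializable schedule, one transaction would have observed
# the other's insert in its sum. If both commit with the values below,
# that is classic write skew: snapshot isolation, not serializability.
assert w1 == ("class2", 30) and w2 == ("class1", 300)
```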
"reading the same instant at different times can give different results."
This one is very interesting. The Iceberg spec tracks a “timestamp-ms” value for each snapshot, denoting when the snapshot was created. Let's now take the same example of a timestamp-ms collision, i.e., two snapshots S1 and S2 with the same timestamp-ms value. I can't find anything in the spec that forbids this. In fact, I can't even find any mention of how time-travel or change capture queries work in general on the format. It's left to the reader's imagination, even as the blog was nitting on omissions in the Hudi spec.
Now, trying to understand how a time-travel query using “as of timestamp” works, I peeked into the implementation, which picks the latest snapshot with the matching timestamp. Unless there are undocumented implementation paths (other than what I can see from the Spark docs) that add artificial waits to ensure the timestamps are monotonically increasing (déjà vu?), this will produce different results for the same time-travel query, as illustrated below.
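Here is a tiny self-contained illustration of the ambiguity (hypothetical structures, not Iceberg's actual metadata classes): with two snapshots sharing a timestamp-ms, “latest snapshot at or before the timestamp” depends on snapshot ordering, not on the timestamp alone.

```python
snapshots = [
    {"snapshot_id": "s1", "timestamp_ms": 1_700_000_000_000},
    {"snapshot_id": "s2", "timestamp_ms": 1_700_000_000_000},  # collision
]


def as_of(snaps, ts_ms):
    """Illustrative 'as of timestamp': latest eligible snapshot wins."""
    eligible = [s for s in snaps if s["timestamp_ms"] <= ts_ms]
    return eligible[-1]["snapshot_id"]


# The same time-travel query resolves to different snapshots depending
# purely on how the colliding entries happen to be ordered.
assert as_of(snapshots, 1_700_000_000_000) == "s2"
assert as_of(list(reversed(snapshots)), 1_700_000_000_000) == "s1"
```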
And we have not even opened any of those forbidden doors around Iceberg system design and performance, which are usually waved away by claiming “it's a spec”. There is a fundamental mismatch between what the Hudi and Iceberg communities value, and I think it's more constructive to embrace that diversity rather than force-fit the same subjective notion of correctness everywhere.
Now that we have discussed the technical issues, it's time to talk about how this is not going down as it should. So why am I saying all this only now, after writing a rebuttal blog? Pot calling the kettle black? No. If I didn't, I'd be letting the Hudi community down by allowing false narratives to spread. The Hudi community has made some amazing contributions (even introducing this very architecture and terms like CoW/MoR) and deserves more credit than “squashed a lot of bugs”.
Open source projects should be first and foremost focused on community over code. Our managed services will win or lose based on which is better. Healthy competition should be embraced as a catalyst for innovation, and having Hudi, Iceberg, and Delta all rapidly growing benefits all developers in the community. A very important aspect of this is recognizing the diversity of ideas. As an example, back to the evergreen topic of specs: the Iceberg spec is now almost an ebook at 20 pages. While I totally respect the Iceberg community's spec-first approach, we believe it's better to provide APIs in the languages we deem users want (given our designs and experiences on Parquet). Both communities should be able to approach these without spec-policing or name-calling.
Next is realizing that we are saying things that now affect at least a few thousand developers. What if all these subjective opinions cause people to do things differently at work, unnecessarily, and impact their careers or projects? Then there are all these off-hand remarks like “Hudi is like Hive”, “More complexity means more issues.....”, “...you don't know what you don't know. There are always more flaws lurking.”, and “simple vs complex”, which are really uncalled for, in my opinion. Simple vs complex can also be basic vs advanced. Who's to decide?
There are already three major projects, Delta Lake, Hudi, and Iceberg, with thousands of users each. From a data OSS community perspective, we're way past having a single standard open table format; what's important now is to make progress and move the industry forward on interoperability. Pushing an agenda of one common table format and making it sound like a panacea sounds to me more like a personal or company goal to gather users and market share than like helping users and communities navigate these complexities. We are doing our bit here by building a brand new open-source project, OneTable, to accelerate collaboration across the aisles and provide a neutral zone for building bridges. It makes it seamless for developers in both of our communities to take advantage of the best of our projects.
We are busy working on the next generation of Hudi, probably the only open data lake platform that aims to go beyond table formats and bring much-needed automation and platformization to users. Hearty congratulations on Snowflake picking Iceberg, but I derive more happiness from seeing some amazing companies build great data experiences using Hudi, and from a community that keeps growing in spite of marketing headwinds.
As open source leaders, we bear a responsibility to promote healthy discussions and debates grounded in a foundation of truth-seeking. It is sad to see misinformation spread about Hudi's ACID guarantees on the basis of subjective interpretation. In the same spirit of truth-seeking, Ryan, I have two invitations for you. First, if you truly want to encourage the data community to learn, please cross-link this blog and let your readers decide. Second, I or any Apache Hudi PMC member would be more than happy to have a friendly conversation to discuss these concepts in depth, so you can understand the Hudi community's firm stance on our support for ACID guarantees.