With the proliferation of cloud computing, modern data lakehouse platforms are primarily deployed on cloud infrastructure. Lakehouse storage formats such as Apache Hudi™ and Apache Iceberg™ have evolved rapidly over the past few years, and Apache Spark has become the de-facto data processing engine, powering most of the ETL workloads in lakehouses.
On-prem deployments of Spark have heavily relied on YARN, which provides coarse-grained scheduling to share compute resources across Spark jobs. To fully exploit cloud elasticity, it’s important to ensure Spark clusters can dynamically scale up and down with load. In this blog, we call out the limitations of Spark’s default implementation of autoscaling, called dynamic allocation. Dynamic allocation scales up with task parallelism, not with resource utilization or data volumes (load), and is thus fundamentally ill-suited to scaling as data volumes rise or fall.
We argue for the importance of solving autoscaling for Spark in data lakehouse workloads, which are primarily characterized by the data volumes they handle. We believe this is crucial to ensure higher performance while keeping infrastructure costs under control. We’ll dig into the design challenges of Spark autoscaling and explain why it’s fundamentally harder than the traditional autoscaling problems many cloud platforms solve. We conclude the blog by showcasing our optimized autoscaler, deployed as part of the Onehouse Compute Runtime, and how it performs against the default autoscaler.
At Onehouse, our vision is to deliver a fully managed open data platform built on open lakehouse technologies—without the associated complexity of stitching together multiple OSS projects. The platform must support diverse ETL/ELT workloads, each with unique performance and cost requirements.
Moreover, Onehouse operates within customer cloud accounts and needs to support two seemingly contradictory goals around cloud compute resource usage: delivering the performance and latency guarantees each workload demands, while keeping cloud compute spend to a minimum.
Achieving both requires a compute runtime that adapts to workload needs. While Spark vendors like Databricks, AWS EMR, and GCP Dataproc provide enhanced dynamic allocation, their autoscalers lack full awareness of workload characteristics. For example, an incremental ETL job with strict latency SLAs has very different scale-up and scale-down requirements than a batch ETL job optimized for cost efficiency. Similarly, an autoscaler that bases decisions solely on task backlog for an ETL job streaming data from Kafka into the lake risks unbounded lag as Kafka messages accumulate. In addition, ETL jobs often repeat similar tasks across runs; leveraging this historical information can significantly improve autoscaler performance.
Moreover, these Spark platforms typically run one ETL job per cluster, simplifying operations but incurring additional startup and JVM memory overhead per run. Multiplexing ETLs over shared Spark jobs and shared Kubernetes clusters allows resources freed by one job to be reused by another whose demand is rising. Autoscaling needs to account for these deployment models to truly deliver these cost savings.
By aligning autoscaling with workload characteristics, the Onehouse lakehouse platform transforms elasticity and cost efficiency from competing priorities into complementary strengths. We now highlight the key autoscaling requirements, drawing from workload behavior as well as production deployments at Onehouse.
Some of the most demanding ETL jobs we run come from customers in blockchain analytics. One such customer operates ~10 ETL pipelines with strict freshness guarantees of 5–10 minutes. These pipelines perform upserts—requiring indexing and compaction operations to apply updates. In steady state, the data volume is predictable and can be handled by a fixed compute footprint.
The real challenge comes during regular backfills (every few days or so). Here, data volumes can surge 100X beyond steady state, with the number of updates spiking up to 200X. Despite this massive increase in workload, the data freshness or latency requirement remains the same. To meet it, the platform must dynamically scale up compute during the backfill—and just as importantly, scale down once the job completes to avoid runaway costs.
The figures below illustrate one such instance: during the early hours of a day with backfill activity, traffic spiked dramatically compared to steady-state volumes. This highlights the dual requirement for elastic scalability: the ability to scale up compute resources quickly to handle bursts, and to scale back down once the surge subsides.
Onehouse’s ingestion platform, OneFlow, multiplexes ingestion jobs on a single Spark cluster using an efficient bin-packing algorithm. Many of our customers use it for CDC ingestion, syncing tables from Postgres or MySQL into data lake tables, often on varying schedules depending on their ETL frequency.
In practice, data volumes vary—both across different tables and over time for the same table. The figure on the left shows the percentage of compute resources required over time for tables to meet their configured latency targets, while the figure on the right shows the percentage distribution of average daily data volume across tables. The left figure shows high variance in the compute requirements of a single job over time; sharing resources lets one job borrow compute from another and vice versa. Provisioning separate Spark clusters for each job leads to over-provisioning and inefficiencies, since every job must independently determine its own resource requirements.
Statistical multiplexing dramatically reduces infrastructure costs, but it still requires an always-on Spark cluster that can dynamically scale resources in response to workload fluctuations—whether driven by day/night cycles, peak loads, or periods of low activity. The following plot shows the data volume (in GB) processed over a week for one of our customers, which runs 1,000+ ETLs that sync data from operational databases in near real-time.
Onehouse’s Spark & SQL clusters powered by the Quanton engine enables running several Spark jobs over a shared, managed Kubernetes cluster. As jobs are submitted and completed, resource demands can shift quickly. We plot the number of executor pods added or removed over time in the figure below for one of our production deployments. Each color code represents a different spark job.
If Spark executors don’t autoscale proportionally to job load, then nodes remain idle or underutilized—erasing the very benefits of cluster consolidation and cost efficiency that the shared cluster should provide. Kubernetes autoscalers do re-balance executor pods to increase node utilization by killing pods on lightly loaded nodes. However, we had to disable re-balancing by setting safe-to-evict = false, since Kubernetes is not aware of Spark shuffle data and cached dataframes. The Kubernetes Cluster Autoscaler and Karpenter both ended up being very aggressive in killing executor pods, causing a lot of intermediate data to be re-processed and, in some instances, even causing job failures. While Kubernetes should handle node additions and removals, it's best if executor (pod) level auto-scaling decisions are handled at the Spark layer.
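For reference, a minimal sketch of how this opt-out can be applied in a Spark-on-Kubernetes deployment, assuming pod annotations are set through Spark's Kubernetes configs; the annotation shown targets the Kubernetes Cluster Autoscaler, and Karpenter exposes an analogous "do not disrupt" annotation:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: annotate Spark pods so the Kubernetes Cluster Autoscaler does not
// evict them while consolidating nodes. Assumes Spark on Kubernetes; any pod
// annotation can be set via spark.kubernetes.{driver,executor}.annotation.<name>.
val spark = SparkSession.builder()
  .appName("etl-with-eviction-protection") // hypothetical app name
  .config("spark.kubernetes.executor.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict", "false")
  .config("spark.kubernetes.driver.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict", "false")
  .getOrCreate()
```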
We highlighted these scenarios to emphasize how Spark deployments have evolved: from running large batch jobs on dedicated clusters to powering more complex pipelines that are incremental, streaming-based, and executed on shared clusters in modern lakehouse environments. In such deployments, autoscaling Spark executors is no longer optional—it is a fundamental requirement to balance cost efficiency with the performance guarantees expected by both users and pipelines.
Apache Spark provides the ability to dynamically increase or decrease the number of executors between a configured min and max range. Its primary purpose is to scale the resources based on the task parallelism, and not the CPU resource utilization or data volumes processed. Without it, users must manually configure a fixed number of executors, a process that often involves extensive trial and error for each Spark job.
The core logic is implemented in the class ExecutorAllocationManager. Every 100 milliseconds, Spark computes the target number of executors (numTargetExecs) based on the maximum needed executors (maxNeededExecs). maxNeededExecs is calculated from the current workload — essentially the maximum achievable parallelism given the pending and active tasks.
If numTargetExecs is below maxNeededExecs, Spark increases it gradually: starting with +1 executor, and then doubling the increase on subsequent intervals (exponentially) until it reaches maxNeededExecs. New executors are requested only if there is a sustained backlog (controlled by sustainedSchedulerBacklogTimeout) of pending tasks. Spark tracks this backlog of pending tasks using listeners that monitor task and stage events (task start/end and stage submission/completion).
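The scale-up behavior can be summarized with a small sketch; this is illustrative logic in Scala, not the actual ExecutorAllocationManager code, and the names are simplified:

```scala
// Illustrative sketch of the scale-up behavior described above; not Spark's
// actual implementation. `maxNeededExecs` is derived from pending and running
// tasks, and `backlogSustained` reflects the sustained-backlog check.
var numTargetExecs = 1
var executorsToAdd = 1

def maybeScaleUp(maxNeededExecs: Int, backlogSustained: Boolean): Unit = {
  if (backlogSustained && numTargetExecs < maxNeededExecs) {
    // Request a batch of executors, capped at what the workload can use.
    numTargetExecs = math.min(numTargetExecs + executorsToAdd, maxNeededExecs)
    // Double the batch size for the next interval (1, 2, 4, 8, ...).
    executorsToAdd *= 2
  } else {
    // No sustained backlog, or target already met: reset the ramp.
    executorsToAdd = 1
  }
}
```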
When numTargetExecs exceeds maxNeededExecs, the target is immediately reduced to match maxNeededExecs. However, Spark does not kill executors immediately at that moment. Instead, actual executor removal is deferred to a separate idle-timeout mechanism.
This means that even when the workload decreases, the number of active executors may remain higher than what the workload actually requires. Executor removal is governed by three configurable timeouts: executorIdleTimeout for executors with no tasks and no state, cachedExecutorIdleTimeout for executors holding cached data, and shuffleTrackingTimeout for executors holding shuffle outputs.
This layered mechanism ensures executors holding important intermediate shuffle data or cached data are not removed prematurely, while executors with no state can be aggressively removed.
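For reference, this is a typical way these knobs are set when enabling dynamic allocation; the values below are purely illustrative, not tuning recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative dynamic allocation settings; values are examples only.
val spark = SparkSession.builder()
  .appName("etl-with-dynamic-allocation") // hypothetical app name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "0")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  // Executors with no tasks and no state are removed after this period.
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Executors holding cached blocks get a longer grace period.
  .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "120s")
  // With shuffle tracking (no external shuffle service), executors holding
  // shuffle data are kept until this timeout elapses.
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.timeout", "300s")
  .getOrCreate()
```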
Using Spark's dynamic allocation in production has exposed several shortcomings that cause inefficiencies in resource utilization and significant operational overhead.
The core issue is the efficacy of down-scaling, which a Databricks blog has called out as well. The fundamental problem is that the reduction of the number of active executors is decoupled from the core algorithm that computes numTargetExecs. While not perfect, the value of numTargetExecs does accurately capture the number of executors required based on the current load of the system. However, the number of executors is only reduced when one or more executors are fully idle, i.e., they do not receive any tasks for at least executorIdleTimeout.
As the data volume drops, one or both of the following happen: the rate at which tasks are generated falls, and the time to process each task shrinks. The value of maxNeededExecs should start reducing as the peak load (max tasks) drops. The pending tasks should start draining faster, and there may not always be a sustained backlog of tasks for the sustainedSchedulerBacklogTimeout interval. Subsequently, the value of numTargetExecs should drop from its previous levels. However, the drop in numTargetExecs may or may not impact the number of active executors, since the reduction of executors is controlled by the idle timeout values. Even with reduced data volume, sufficient tasks may be generated to keep the executors active and prevent them from hitting the timeouts. For instance, with an executorIdleTimeout of 60 seconds, it is possible to keep 40 executors active with just 40 tasks generated per minute. We confirmed that the Spark scheduler operates independently of dynamic allocation, often spreading tasks across active executors rather than concentrating them on fewer executors for maximum resource efficiency.
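To make the 40-executor example concrete, here is a back-of-the-envelope check, assuming roughly round-robin task placement:

```scala
// Back-of-the-envelope check for the example above: with roughly round-robin
// task placement, keeping N executors alive only requires that each executor
// sees at least one task within the idle timeout, i.e., ~N tasks per window.
val numExecutors   = 40
val idleTimeoutSec = 60
val tasksPerMinute = 40

// Minimum task rate (tasks/minute) that keeps every executor from idling out.
val minTasksPerMinute = numExecutors * (60.0 / idleTimeoutSec)

println(minTasksPerMinute)                    // 40.0
println(tasksPerMinute >= minTasksPerMinute)  // true: no executor is ever removed
```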
The plots below are taken from one of our customer deployments in production. Around 14:00, there was a sudden surge in the number of upsert records to be ingested. As the number of backlogged tasks grew, numTargetExecs was increased from 1 to about 45. Consequently, the number of actual executors grew from 10 to 24. But as the spike was short-lived, numTargetExecs was quickly adjusted back to 0, ensuring that the number of executors did not grow beyond 24. However, the executor count stayed at 24 for more than 10 hours thereafter. The result was a drop in the average CPU utilization of the Spark cluster from approximately 55% to 15%. The final plot (bottom-right) explains why none of the executors timed out: the average number of tasks generated across the timeout window (shuffle timeout) is significantly higher than the minimum number required to keep all 24 executors active. Note that in this case the timeouts were configured as follows: executorIdleTimeout set to 30 seconds, cachedExecutorIdleTimeout to 1 minute, and shuffleTrackingTimeout to 5 minutes.
Even within a single customer deployment, identifying the right configurations has been a challenge. The optimal settings can vary significantly depending on the context—such as during heavy backfilling or peak traffic periods versus quieter, post-backfill, or non-peak hours. In the sections below, we break down some of the key pain points we’ve encountered.
In our production environments, we have encountered at least two critical issues with Spark’s dynamic allocation that have proven either hard to reproduce or non-trivial to root cause. In essence, besides performance issues, we have faced a few reliability issues in large-scale deployments.
One recurring issue appears to be a potential bug or race condition where the number of executors stops increasing, despite rising data volumes and tasks. In most of our customer deployments, we configure minExecutors=0 to maximize efficiency. However, in this scenario, if executors dropped to zero, the system became completely unavailable.
For example, in one customer deployment (see plot below), executor count became stuck at 8 after 14:00 hrs. As a result, sync latencies increased sharply—from about 5 minutes to over 15 minutes—demonstrating how sensitive ingestion latency can be to autoscaling efficacy.
We also observed a second issue during scale-ups: a subset of newly provisioned executors were killed before processing any tasks. Our current hypothesis is that these executors were terminated by the idle timeout mechanism, as pod initialization was taking roughly one minute, mainly due to the time taken to pull the container image. Increasing the idle timeout does mitigate this, but doing so undermines the effectiveness of downscaling.
To allow executors to shut down without losing the shuffle files they produce, Spark relies on an external shuffle service that runs independently of executors on every worker node in the cluster. This design ensures that shuffle data remains accessible even after executors are killed.
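The two common ways to keep shuffle data available under dynamic allocation are sketched below; this assumes vanilla Spark configuration keys and is illustrative rather than a deployment guide:

```scala
import org.apache.spark.SparkConf

// Option 1: classic external shuffle service. Requires the shuffle service to
// run on every worker node (e.g., as a YARN NodeManager auxiliary service).
val withEss = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")

// Option 2: no shuffle service. Spark tracks which executors hold shuffle data
// and keeps them alive until spark.dynamicAllocation.shuffleTracking.timeout.
val withShuffleTracking = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
```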
There are multiple external shuffle service (ESS) frameworks available for Spark today; some of them, such as Apache Celeborn, work with Kubernetes as well. An ESS is additional infrastructure, with added complexity and operational overhead. With an ESS, dynamic allocation does not need to worry about the presence of shuffle data, but executors may still have cached RDDs or dataframes stored locally. Depending on the specific implementation of the ETL jobs, loss of cached data can also cause DAG re-triggers and reprocessing. Even with an ESS, most of the drawbacks mentioned above still apply; for instance, when the load in a Spark cluster goes down, as long as sufficient tasks are generated within the idle timeout, the cluster will not scale down effectively.
In this section, we look at the autoscaling techniques used across the data stack today. We start with cluster-level autoscalers, which focus on scaling the underlying compute infrastructure. We then cover warehouse autoscalers, which adjust resources for SQL warehouses. Finally, we examine how Spark vendors have built their own autoscalers on top of the OSS implementation, extending or replacing it to better handle production workloads. Together, these approaches capture the current state of the art in autoscaling for modern data platforms.
Most cloud providers (AWS, GCP) offer managed node groups that automatically scale VM instances based on load. By default, autoscaling uses target CPU utilization as the trigger. If average vCPU usage exceeds the threshold, more instances are added.
In GCP, the number of new VMs added is proportional to the ratio of current vs. target CPU utilization. At very high load (~100%), the autoscaler can add ~50% more instances to absorb surges. Scale-out happens quickly, but scale-in is deliberately conservative to avoid instability. Two key scale-in controls are a stabilization window and limits on how many instances can be removed over a given period.
The stabilization window helps by making the autoscaler scale in only to the size needed to serve the peak load observed during the window, so short-lived dips in utilization do not trigger premature capacity removal.
Kubernetes HPA scales pods in or out based on metrics like CPU, memory, or custom/external signals. Implemented as a control loop (default: every 15s), HPA estimates replicas using:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
For example, with CPU autoscaling, if the target is 75% and average usage is higher, more pods are added. When multiple metrics are defined, HPA calculates replicas for each and uses the maximum value.
To avoid thrashing (frequent scale-in/out), HPA applies a stabilization window (default: 5 minutes), similar to cloud autoscalers, ensuring safe and steady scale-down decisions.
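The same calculation, as a small illustrative helper (not the actual HPA controller code):

```scala
// Illustrative version of the HPA replica calculation; not the controller code.
def desiredReplicas(currentReplicas: Int, currentMetric: Double, targetMetric: Double): Int =
  math.ceil(currentReplicas * (currentMetric / targetMetric)).toInt

// Example: 4 pods averaging 90% CPU against a 75% target -> scale out to 5 pods.
val byCpu = desiredReplicas(4, 90.0, 75.0)          // 5
// With multiple metrics, the maximum of the per-metric results wins.
val byQueueDepth = desiredReplicas(4, 120.0, 100.0) // 5 (hypothetical custom metric)
println(Seq(byCpu, byQueueDepth).max)               // 5
```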
Snowflake offers compute in the form of either a single-cluster warehouse or a multi-cluster warehouse. A warehouse is essentially a cluster of compute resources, available in different sizes. Multi-cluster warehouses autoscale by spinning additional clusters up or down as query concurrency changes, but the resources within an individual warehouse remain fixed unless it is manually resized.
We believe this is a critical gap for ETL workloads, which are generally less latency-sensitive than interactive SQL but highly cost-sensitive. For short-lived or interactive queries, it makes sense to keep clusters warm and scale the number of clusters as demand fluctuates. But for ETL, autoscaling within a cluster is essential for efficiency.
Most major Spark providers now ship their own custom autoscalers, which in itself is proof that the open-source (OSS) version of Spark’s dynamic allocation falls short for real-world production use cases.
When building autoscalers, there are common challenges across distributed systems, and then there are Spark-specific complexities. Spark’s execution model, workload variability, and reliance on stateful executors make scaling particularly tricky. Below, we outline the key problem areas.
Collecting stable usage signals: Autoscalers depend on workload metrics—CPU utilization, memory pressure, request throughput, or latency. The challenge is that raw signals are often noisy: short-lived spikes or measurement jitter can mislead the autoscaler, causing it to scale in the wrong direction. Instead of improving efficiency, unstable signals can amplify instability.
For Spark, this problem is compounded because CPU or memory usage alone may not reflect the true bottlenecks. Heavy operations such as large shuffles, disk spills, or writes to cloud storage can saturate I/O, which the autoscaler also needs to account for.
Avoiding thrashing and flip-flopping: A common inefficiency for autoscalers is thrashing: rapidly adding and removing resources in response to transient changes. This wastes resources and slows applications due to repeated provisioning overhead. To prevent this, autoscalers typically employ dampening techniques—cool-down periods, hysteresis, or predictive smoothing—that ensure small fluctuations don’t trigger unnecessary actions.
Safely scaling down: Scaling down is often riskier than scaling up. Remove too many resources, and the cluster may be unable to absorb sudden demand spikes, causing degraded performance or outright failures. An autoscaler must first validate that the reduced cluster size can still handle the observed peak load during a stabilization window before scaling down safely.
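One simple way to encode that check is sketched below; this is purely illustrative and not a specific autoscaler's implementation:

```scala
// Purely illustrative: before scaling down, only shrink to a size that could
// still have served the peak demand observed during the stabilization window.
def safeDownscaleTarget(
    demandSamples: Seq[Int],   // e.g., executors needed, sampled across the window
    currentExecutors: Int,
    minExecutors: Int): Int = {
  val peakDemand =
    if (demandSamples.nonEmpty) demandSamples.max else currentExecutors
  // Never go below the observed peak (or the configured floor); scale-up is
  // handled by a separate path, so never exceed the current size here.
  math.min(currentExecutors, math.max(peakDemand, minExecutors))
}

// Example: current size 24, demand over the window peaked at 9, floor of 0 -> 9.
println(safeDownscaleTarget(Seq(4, 9, 6, 3), currentExecutors = 24, minExecutors = 0))
```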
For Spark, this is especially nuanced. A stage might temporarily exhibit low parallelism, misleading the autoscaler into shrinking the cluster. But if demand ramps back up, scaling back up takes time, potentially delaying job completion. Over-aggressive downscaling can also stress other resources: tighter memory and disk pressure can increase the risk of out-of-memory (OOM) errors or out-of-disk failures, sometimes leading to job crashes.
Handling stateful executors: Unlike many stateless systems, Spark executors maintain important state: cached datasets and intermediate shuffle outputs. Removing executors without careful selection can result in the loss of shuffle data, forcing costly recomputation and slowing downstream stages.
An autoscaler must therefore be state-aware, intelligently choosing which executors to terminate. Poor executor selection can degrade performance or even cause job failures, highlighting the importance of Spark-specific heuristics in scaling decisions.
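A hypothetical sketch of what such a state-aware heuristic might look like; the data structure and ordering here are illustrative only, not the Spark or Onehouse implementation:

```scala
// Hypothetical executor descriptor and removal ordering; illustrative only.
case class ExecutorState(
    id: String,
    idleSeconds: Long,    // how long the executor has been without tasks
    cachedBlocks: Int,    // cached RDD/DataFrame blocks held locally
    shuffleBytes: Long)   // shuffle output still referenced by active stages

// Prefer removing executors with no shuffle data, then ones with no cached
// blocks, and among those, the ones that have been idle the longest.
def pickExecutorsToRemove(executors: Seq[ExecutorState], count: Int): Seq[String] =
  executors
    .sortBy(e => (e.shuffleBytes, e.cachedBlocks, -e.idleSeconds))
    .take(count)
    .map(_.id)
```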
As part of the Onehouse Compute Runtime (OCR), we’ve started rolling out an optimized Spark autoscaler, built to address many of the challenges described above. Unlike the default approach that scales up solely based on task backlog, our autoscaler incorporates multiple signals: CPU and memory utilization, the estimated growth of pending tasks, workload and source statistics, and shuffle data volume.
We ran several A/B experiments comparing OSS Spark’s dynamic allocation with the Onehouse implementation. In one customer's ETL job—which scanned a large table, sorted the data, applied filters, and then wrote the output to a Hudi table—we observed clear differences.
With dynamic allocation, executors scaled up gradually as tasks increased, but they also scaled down too aggressively when task parallelism dropped momentarily. This led to repeated cycles of slow scale-up, with executor counts taking 30–40 minutes to stabilize. During this time, latencies remained elevated.
With the Onehouse autoscaler, executor scale-up was faster, and a stabilization period prevented unnecessary downscales. By also considering compute and memory utilization, scale-down happened steadily after peak load, rather than abruptly. The result: lower latencies after the initial ramp-up and far more predictable performance.
Autoscaling Spark for lakehouse workloads is much harder than just watching CPU and memory. The mix of long-running executors, I/O-heavy pipelines, and job variability means Spark’s default dynamic allocation often thrashes, leaves clusters underutilized, and drives up costs. As ETL shifts from big nightly batches to continuous incremental pipelines across shared clusters, these inefficiencies only get worse.
The Onehouse Compute Runtime (OCR) solves this by rethinking autoscaling for the lakehouse. Instead of relying on generic signals, OCR uses workload-aware heuristics and smarter scaling policies to manage executor lifecycles, prevent thrashing, and adapt to diverse job profiles. The result: faster jobs, better cluster utilization, and meaningful cost savings—especially when workloads are multiplexed across shared Spark clusters.
If you’re running Spark today, a good first step is to run our free Spark Analyzer to see how much idle executor time is inflating your costs. And when you’re ready, Onehouse brings managed ingestion, SQL and Spark jobs, orchestration, and an autoscaler purpose-built for lakehouse workloads—all delivered as an open, fully managed platform.