ON-DEMAND WEBINAR
The Apache™ Spark Job That Wouldn’t Retry
Stop 2 A.M. firefighting caused by brittle retries and surprise downstream triggers, and adopt orchestration techniques that make Spark failures observable, recoverable, and safe to re-execute.
Spark jobs fail. That’s expected.
What isn’t expected is when a failure can’t be safely retried, and every rerun makes things worse.
We’ve been there. We’ve seen Spark pipelines fail mid-run, and what should’ve been a routine retry turned into hours of manual cleanup, missed SLAs, and on-call fatigue. The Spark job was never the culprit. The problem was everything around it: retries handled by scripts, downstream jobs triggering anyway, and no clear notion of what had actually succeeded. When pipelines are stitched together with cron, failure becomes ambiguous. Did the job partially write? Can it be rerun? Should downstream jobs wait? No one really knows… and every retry increases risk.
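To make the risk concrete, here is a minimal PySpark sketch of the pattern (the paths and partition column are illustrative, not taken from the session): an append-mode write duplicates rows on every rerun after a partial failure, while overwriting only the partition being processed makes the retry safe to repeat.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily_events_load").getOrCreate()

    # With dynamic partition overwrite, "overwrite" replaces only the
    # partitions present in this run's data, not the whole output table.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # Hypothetical input path for one day of raw events.
    df = spark.read.parquet("s3://raw/events/dt=2024-01-01/")

    # Risky pattern: append-mode output. If the job dies mid-write and is
    # rerun, whatever rows already landed get written a second time.
    # df.write.mode("append").partitionBy("dt").parquet("s3://curated/events/")

    # Safer pattern: overwrite the affected partition, so a rerun replaces
    # partial output instead of stacking duplicates on top of it.
    df.write.mode("overwrite").partitionBy("dt").parquet("s3://curated/events/")

    spark.stop()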
In this session, we walk through what went wrong and why the fix wasn’t “more Spark” or “faster compute,” but proper orchestration. By modeling the pipeline as a workflow (with explicit state, dependencies, and recovery semantics), we made failures predictable, retries safe, and reruns boring again.
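The session isn’t tied to one tool, but as a rough illustration of what modeling the pipeline as a workflow can look like, here is a minimal Apache Airflow sketch (the DAG name, job scripts, and connection are hypothetical): retries, backoff, and the downstream dependency are declared on the workflow itself instead of being scattered across cron entries and wrapper scripts.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    # Retry policy lives on the workflow, not in a wrapper script.
    default_args = {
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
        "retry_exponential_backoff": True,
    }

    with DAG(
        dag_id="daily_events_pipeline",   # hypothetical pipeline name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        transform = SparkSubmitOperator(
            task_id="transform_events",
            application="/jobs/transform_events.py",  # hypothetical script
            conn_id="spark_default",
        )
        publish = SparkSubmitOperator(
            task_id="publish_marts",
            application="/jobs/publish_marts.py",  # hypothetical script
            conn_id="spark_default",
        )
        # Explicit dependency: publish only runs after transform succeeds,
        # so downstream jobs never fire on a partial upstream result.
        transform >> publish

Because the dependency and retry state are explicit, the scheduler can answer the questions above: what actually succeeded, what is safe to rerun, and which downstream work should wait.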
If you’re running production Spark pipelines and are tired of brittle retries, 2 A.M. pages, and late-night reruns, this one’s for you.
Your Presenters:

Your Moderator: