ON-DEMAND WEBINAR
The Apache™ Spark Job That Wouldn’t Retry
Stop 2 A.M. firefighting caused by brittle retries and surprise downstream triggers, and adopt orchestration techniques that make Spark failures observable, recoverable, and safe to re-execute.
Spark jobs fail. That’s expected.
What isn’t expected is when a failure can’t be safely retried, and every rerun makes things worse.
We’ve been there. We’ve seen Spark pipelines fail mid-run, and what should’ve been a routine retry turned into hours of manual cleanup, missed SLAs, and on-call fatigue. The Spark job was never the culprit. The problem was everything around it: retries handled by scripts, downstream jobs triggering anyway, and no clear notion of what had actually succeeded. When pipelines are stitched together with cron, failure becomes ambiguous. Did the job partially write? Can it be rerun? Should downstream jobs wait? No one really knows… and every retry increases risk.
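To make the risk concrete, here is a minimal PySpark sketch of the pattern (the paths and partition column are illustrative, not taken from the session): an append-mode write duplicates rows on every rerun after a partial failure, while overwriting only the partition being processed makes the retry safe to repeat.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily_events_load").getOrCreate()

    # With dynamic partition overwrite, "overwrite" replaces only the
    # partitions present in this run's data, not the whole output table.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # Hypothetical input path for one day of raw events.
    df = spark.read.parquet("s3://raw/events/dt=2024-01-01/")

    # Risky pattern: append-mode output. If the job dies mid-write and is
    # rerun, whatever rows already landed get written a second time.
    # df.write.mode("append").partitionBy("dt").parquet("s3://curated/events/")

    # Safer pattern: overwrite the affected partition, so a rerun replaces
    # partial output instead of stacking duplicates on top of it.
    df.write.mode("overwrite").partitionBy("dt").parquet("s3://curated/events/")

    spark.stop()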
In this session, we walk through what went wrong and why the fix wasn’t “more Spark” or “faster compute,” but proper orchestration. By modeling the pipeline as a workflow (with explicit state, dependencies, and recovery semantics), we made failures predictable, retries safe, and reruns boring again.
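The session isn’t tied to one tool, but as a rough illustration of what modeling the pipeline as a workflow can look like, here is a minimal Apache Airflow sketch (the DAG name, job scripts, and connection are hypothetical): retries, backoff, and the downstream dependency are declared on the workflow itself instead of being scattered across cron entries and wrapper scripts.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    # Retry policy lives on the workflow, not in a wrapper script.
    default_args = {
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
        "retry_exponential_backoff": True,
    }

    with DAG(
        dag_id="daily_events_pipeline",   # hypothetical pipeline name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        transform = SparkSubmitOperator(
            task_id="transform_events",
            application="/jobs/transform_events.py",  # hypothetical script
            conn_id="spark_default",
        )
        publish = SparkSubmitOperator(
            task_id="publish_marts",
            application="/jobs/publish_marts.py",  # hypothetical script
            conn_id="spark_default",
        )
        # Explicit dependency: publish only runs after transform succeeds,
        # so downstream jobs never fire on a partial upstream result.
        transform >> publish

Because the dependency and retry state are explicit, the scheduler can answer the questions above: what actually succeeded, what is safe to rerun, and which downstream work should wait.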
If you’re running production Spark pipelines and are tired of brittle retries, 2 A.M. pages, and late-night reruns, this one’s for you.
Your Presenters:

Your Moderator: