Graceful Failure: Advanced Error Handling and Retry Logic in workflow.do

In a perfect world, every API call returns a 200 OK, every network connection is stable, and every third-party service has 100% uptime. But we don't build for a perfect world; we build for the real one. In the real world, things fail. The true test of a robust automation system isn't whether it works when everything goes right, but how it behaves when things go wrong.

This is where orchestration shines. While an action.do is the perfect atomic building block for a single task, a workflow.do is the conductor that makes sure the whole symphony plays on, even if one instrument hits a sour note. It's designed with failure in mind, providing powerful tools for advanced error handling and intelligent retry logic right out of the box.

The Fragility of a Simple Script

Imagine a standard user-onboarding script. It’s a simple, linear process:

1. Create a user in the database.
2. Call a payment gateway to process a subscription.
3. Add the user to a CRM.
4. Send a welcome email.

What happens if the payment gateway (Step 2) is temporarily down? The script crashes. You're left with a user in your database but no payment, no CRM entry, and no welcome email. Your data is in an inconsistent state, and your new user is left in limbo. This brittle approach is a maintenance nightmare and a poor user experience.
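To make the failure mode concrete, here is a minimal sketch of such a script. Everything in it is hypothetical: the in-memory arrays stand in for a real database, CRM, and mail service, and the payment function always throws to simulate the outage.

```typescript
// In-memory stand-ins for real systems (all hypothetical).
const database: string[] = [];
const crm: string[] = [];
const sentEmails: string[] = [];

async function createUser(email: string): Promise<void> {
  database.push(email);
}

async function chargeSubscription(_email: string): Promise<void> {
  // Simulate the payment gateway being down at step 2.
  throw new Error("payment gateway unavailable");
}

async function addToCrm(email: string): Promise<void> {
  crm.push(email);
}

async function sendWelcomeEmail(email: string): Promise<void> {
  sentEmails.push(email);
}

// The naive, linear script: no retries, no failure path.
async function naiveOnboarding(email: string): Promise<void> {
  await createUser(email);
  await chargeSubscription(email); // throws, so nothing below runs
  await addToCrm(email);
  await sendWelcomeEmail(email);
}

naiveOnboarding("ada@example.com").catch((error: Error) => {
  console.log(error.message);
  // One user row exists, but there is no CRM entry and no email.
  console.log(database.length, crm.length, sentEmails.length);
});
```

The user record survives the crash while every later step is skipped, which is exactly the inconsistent state described above.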

Robust automation requires more than just executing steps; it requires managing state, handling exceptions, and defining what to do when the unexpected happens.

Build Resilience with workflow.do Retry Logic

workflow.do elevates your automation from a fragile script to a resilient, self-healing service. Instead of just failing, it allows you to define a clear, coded policy for dealing with transient errors.

The most common and effective strategy for handling temporary outages is a "retry with exponential backoff." The idea is simple: if an action fails, wait a bit and try again. If it fails again, wait longer before the next attempt. This gives the failing service time to recover without overwhelming it with constant requests.
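The arithmetic behind the schedule is small enough to sketch directly. The helper below is illustrative only and is not part of any SDK; its name and parameters are assumptions. The wait before retry number n (counting from zero) is the base delay multiplied by the backoff factor raised to the power n.

```typescript
// Hypothetical helper: returns the wait (in milliseconds) before each retry.
function backoffDelays(
  retries: number,
  baseDelayMs: number,
  backoffFactor: number
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    // Retry 0 waits the base delay; each later retry multiplies by the factor.
    delays.push(baseDelayMs * Math.pow(backoffFactor, attempt));
  }
  return delays;
}

console.log(backoffDelays(3, 1000, 2)); // [ 1000, 2000, 4000 ]
```

With a base delay of one second and a factor of two, three retries spread over seven seconds in total, which is often enough for a brief outage to clear.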

Let's see how you can implement this within a workflow.do definition. Building on our onboarding scenario, we can wrap the fallible payment action in a workflow.retry block.

import { workflow } from '@do-sdk/core';
// Your predefined actions, including the fallback notification used in the catch block
import { createUser, chargeFee, sendWelcomeEmail, sendPaymentFailedNotification } from './actions';

const userSignupWorkflow = workflow.create({
  id: 'user-signup-workflow',
  description: 'Handles new user signups with resilient payment processing.',
  execute: async ({ userDetails }) => {
    // This step is unlikely to fail, but could also be wrapped in a try/catch
    const user = await createUser.execute(userDetails);

    try {
      // Attempt the payment action with a built-in retry policy
      const paymentResult = await workflow.retry(
        () => chargeFee.execute({ userId: user.id, amount: 29.99 }),
        {
          retries: 3,        // Attempt up to 3 times AFTER the initial failure
          delay: 1000,       // Wait 1 second before the first retry
          backoffFactor: 2   // Double the wait time on each subsequent retry (1s, 2s, 4s)
        }
      );
    } catch (error) {
      // This block executes only if all retries fail
      console.error(`Permanent payment failure for user ${user.id}:`, error);
      
      // Execute a defined fallback plan
      await sendPaymentFailedNotification.execute({ email: userDetails.email });
      
      // Fail the workflow gracefully, leaving the system in a known state
      throw new Error('Payment processing failed permanently.');
    }

    // This part only runs if the 'try' block succeeds
    await sendWelcomeEmail.execute({ 
      email: userDetails.email, 
      name: userDetails.name 
    });

    return { success: true, userId: user.id, status: 'active' };
  }
});

In this example, if the chargeFee action fails, the workflow doesn’t immediately crash. It automatically waits one second and tries again. If that fails, it waits two seconds, then four. Only after the initial attempt and all three retries have failed does it enter the catch block. This simple addition transforms a brittle process into one that can automatically recover from temporary service glitches.
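For intuition about what a retry helper of this kind does, here is a standalone sketch with the same three options. It is not the SDK's implementation; the function name, the RetryOptions type, and the injectable sleep parameter are all assumptions made for illustration.

```typescript
interface RetryOptions {
  retries: number;       // retries allowed after the initial attempt
  delay: number;         // wait before the first retry, in milliseconds
  backoffFactor: number; // multiplier applied to the wait after each retry
}

// Injectable so callers (and tests) can replace real waiting.
type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  task: () => Promise<T>,
  { retries, delay, backoffFactor }: RetryOptions,
  sleep: Sleep = realSleep
): Promise<T> {
  let wait = delay;
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      // The initial attempt plus all retries have failed: surface the error.
      if (attempt >= retries) throw error;
      await sleep(wait);
      wait *= backoffFactor;
    }
  }
}

// Usage: a task that fails twice, then succeeds on the third attempt.
let calls = 0;
const flaky = async (): Promise<string> => {
  calls++;
  if (calls < 3) throw new Error("temporarily unavailable");
  return "charged";
};

retryWithBackoff(flaky, { retries: 3, delay: 10, backoffFactor: 2 }).then(
  (result) => console.log(result, "after", calls, "attempts")
);
```

Rethrowing the last error once the retry budget is spent is what lets the surrounding try/catch take over, as it does in the workflow above.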

Beyond Retries: Defining Custom Failure Paths

Retries are for transient errors. But what about permanent ones? A try/catch block within workflow.do is your tool for defining explicit business logic for handling unrecoverable failures.

This is where you turn a catastrophic error into a manageable business event. Instead of the process simply dying, you can codify your response:

Notify Humans: Send a message to a Slack channel or create a ticket in Jira for manual review.
Compensating Actions: Undo or amend a previous step so the system is left in a known state. In our example, you might update the user's status in the database from pending to payment_failed.
Inform the User: Trigger an action to send an email to the user explaining that their payment could not be processed and asking them to update their billing information.
Fallback Workflows: Trigger an entirely different workflow.do to place the user into a "lite" or "free" plan.
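A failure path combining the first three options might look like the sketch below. All names here are hypothetical stand-ins rather than real actions or SDK calls; each stub only records what it would do, so the order of steps is easy to inspect.

```typescript
type UserStatus = "pending" | "active" | "payment_failed";

interface User {
  id: string;
  email: string;
  status: UserStatus;
}

// Hypothetical stand-ins for real actions; each one only records its intent.
const auditLog: string[] = [];

async function markPaymentFailed(user: User): Promise<void> {
  user.status = "payment_failed"; // compensating action: leave a known state
  auditLog.push(`status:${user.id}:payment_failed`);
}

async function notifyTeam(message: string): Promise<void> {
  auditLog.push(`notify:${message}`); // e.g. a chat message or a ticket
}

async function emailUser(user: User): Promise<void> {
  auditLog.push(`email:${user.email}`); // ask the user to update billing info
}

// Runs the charge and, if it fails permanently, walks the failure path in order.
async function chargeOrHandleFailure(
  user: User,
  charge: () => Promise<void>
): Promise<void> {
  try {
    await charge();
    user.status = "active";
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    await markPaymentFailed(user);
    await notifyTeam(`Payment failed for ${user.id}: ${reason}`);
    await emailUser(user);
    // Still fail the overall process, but only after the state is consistent.
    throw new Error(`Payment processing failed permanently: ${reason}`);
  }
}
```

Running the compensating action first means the stored status is already correct even if a later notification step fails.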

By defining these paths in code, you are practicing Business as Code. Your company's policies for handling exceptions are no longer just a paragraph in a dusty handbook; they are living, breathing, and executable parts of your automated services.

Why This is Essential for Agentic Workflows

This level of resilience is non-negotiable for building sophisticated agentic workflows. An autonomous agent tasked with managing customer subscriptions can't call a developer every time a payment API times out. It needs the intelligence to handle these situations on its own.

Autonomy: Retry logic allows an agent to be self-sufficient, resolving common issues without human intervention.
Predictability: try/catch blocks give the agent a deterministic response to failure, so it follows a pre-defined path when things go wrong instead of improvising.
Reliability: By composing stateless, single-purpose action.do steps within a stateful, resilient workflow.do orchestrator, you build systems that are reliable by design.