Self-Healing Workflows

Self-healing workflows detect failed steps and recover through retry, fallback, re-planning, or escalation.

Intent

Use this pattern to keep a long-running agentic workflow honest when a step fails. A self-healing workflow does not blindly try again. It records the failure, classifies it, chooses a recovery action, updates state, and stops when recovery would become unsafe or wasteful.

The goal is not perfect uptime. The goal is controlled recovery with evidence.

Scenario

An account-support agent gathers customer context, checks policy, drafts a resolution, and updates a CRM. The CRM write fails after the draft is created. A weak workflow would rerun the whole task and risk duplicate messages or duplicate records. A self-healing workflow classifies the failure as a partial side effect, keeps the successful draft, retries only the CRM write with an idempotency key, and escalates if the retry still fails.

The important design choice is recovery scope. Recover the failed step, not the entire workflow, unless the state proves the plan is stale.
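As a minimal sketch of that recovery scope, the snippet below retries only the CRM write and reuses one idempotency key. `FakeCrm`, `idempotencyKeyFor`, and the key format are hypothetical stand-ins for illustration, not part of the pattern's source code or any real CRM API.

```typescript
// Hypothetical in-memory stand-in for a CRM API that honors idempotency keys.
type CrmWriteResult = { recordId: string; duplicate: boolean };

class FakeCrm {
  private readonly recordsByKey = new Map<string, { recordId: string; note: string }>();

  // A repeated key returns the stored record instead of creating a second one.
  write(idempotencyKey: string, note: string): CrmWriteResult {
    const existing = this.recordsByKey.get(idempotencyKey);
    if (existing !== undefined) {
      return { recordId: existing.recordId, duplicate: true };
    }
    const recordId = `rec-${this.recordsByKey.size + 1}`;
    this.recordsByKey.set(idempotencyKey, { recordId, note });
    return { recordId, duplicate: false };
  }

  get recordCount(): number {
    return this.recordsByKey.size;
  }
}

// Assumption: the key is derived from the workflow and step, never from the
// attempt number, so every retry of the same step presents the same key.
function idempotencyKeyFor(workflowId: string, stepId: string): string {
  return `${workflowId}:${stepId}`;
}

const crm = new FakeCrm();
const crmKey = idempotencyKeyFor("wf-42", "crm-write");
const firstWrite = crm.write(crmKey, "Refund approved"); // suppose this response was lost
const retryWrite = crm.write(crmKey, "Refund approved"); // recovery retries only this step
console.log(firstWrite.recordId === retryWrite.recordId, crm.recordCount);
```

The draft from the earlier step is never regenerated here. Only the failed write is repeated, and the repeated key keeps the record count at one.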

Use When

Failures are expected and can be classified before recovery starts.
Steps have observable inputs, outputs, side effects, and stop reasons.
Retries are idempotent, or compensation exists for partial side effects.
Fallbacks reduce authority, cost, or risk instead of hiding the failure.
Re-planning uses new evidence, changed state, or a confirmed stale plan.
Human escalation has an owner, message, and handoff packet.

Avoid When

The workflow cannot tell transient, fatal, policy, and partial-side-effect failures apart.
Retrying a step can repeat an irreversible external action.
The system would call the model again without new evidence.
The recovery action is more dangerous than the original failure.
The user or operator cannot inspect why the workflow recovered or stopped.

Architecture

Read this diagram as a system boundary, not only as a code shape. The key ownership rule is that the loop controller owns progress, budgets, stop conditions, and recovery state.

Self-healing workflow recovery loop

Read it as a recovery state machine: every retry, fallback, re-plan, compensation, escalation, and stop reason must be explicit and traceable.

Decision Rules

Classify before recovery. The controller should choose from a small recovery vocabulary and refuse ambiguous recovery.

Failure Class	Examples	Recovery Action	Stop Condition
Transient tool failure	timeout, temporary 5xx, connection reset	retry same step with backoff and idempotency key	retry budget exhausted
Rate limit or quota	429, token quota, per-account throttle	wait, use lower-cost fallback, or reschedule	deadline or budget exhausted
Missing evidence	source unavailable, weak retrieval, incomplete tool output	fetch more evidence or ask for clarification	no new source can change the decision
Stale plan	dependency changed, previous step invalidated	re-plan from current durable state	repeated stale-plan loop
Policy denial	unsafe tool request, forbidden data movement	block, explain, and escalate if needed	always stop autonomous recovery
Partial side effect	draft created but CRM write failed	compensate, resume failed step, or escalate	compensation unavailable
Fatal domain error	account closed, item unavailable, invalid recipient	stop and return a clear failure	always stop retry loop

flowchart TD
  F[Step failed] --> C{Classified failure?}
  C -->|no| E[Escalate with replay packet]
  C -->|yes| P{Policy denial or fatal?}
  P -->|yes| S[Stop and explain]
  P -->|no| B{Budget left?}
  B -->|no| E
  B -->|yes| I{Side effect happened?}
  I -->|yes| K{Compensation available?}
  K -->|yes| X[Compensate then resume]
  K -->|no| E
  I -->|no| R{New evidence needed?}
  R -->|yes| N[Fetch evidence or re-plan]
  R -->|no| T[Retry with backoff]
  X --> O[Observe result]
  N --> O
  T --> O
  O --> D{Recovered?}
  D -->|yes| G[Continue workflow]
  D -->|no| F

This graph is the production contract: every edge needs a trace event, budget check, and stop reason.
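The first branch of the graph, "Classified failure?", could be sketched as a small rule-based classifier that refuses to guess. The `RawToolError` shape, the `POLICY_DENIED` code, and the rule order are assumptions made for illustration. A real classifier would be driven by the actual error surface of each tool.

```typescript
type FailureClass =
  | "transient"
  | "rate_limit"
  | "missing_evidence"
  | "stale_plan"
  | "policy_denied"
  | "partial_side_effect"
  | "fatal";

// Hypothetical shape for a raw tool error; real tools will report more fields.
type RawToolError = { httpStatus?: number; code?: string; sideEffectId?: string };

type Classification = {
  failureClass: FailureClass | "unclassified";
  evidence: string;
};

function classifyFailure(err: RawToolError): Classification {
  // Assumption: a known side effect outranks every other signal, because it
  // must be compensated or resumed no matter why the step failed.
  if (err.sideEffectId !== undefined) {
    return { failureClass: "partial_side_effect", evidence: `side effect ${err.sideEffectId}` };
  }
  if (err.code === "POLICY_DENIED") {
    return { failureClass: "policy_denied", evidence: "policy engine denied the request" };
  }
  if (err.httpStatus === 429) {
    return { failureClass: "rate_limit", evidence: "HTTP 429" };
  }
  if (err.code === "ETIMEDOUT" || err.code === "ECONNRESET") {
    return { failureClass: "transient", evidence: `network error ${err.code}` };
  }
  if (err.httpStatus === 502 || err.httpStatus === 503 || err.httpStatus === 504) {
    return { failureClass: "transient", evidence: `HTTP ${err.httpStatus}` };
  }
  // No rule matched: report ambiguity so the controller escalates.
  return { failureClass: "unclassified", evidence: "no classification rule matched" };
}
```

Returning `"unclassified"` instead of defaulting to `"transient"` is the design choice that matters: an unknown failure takes the escalation edge rather than the retry edge.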

System Shape

Component	Owns	Must Emit
Workflow controller	current step, progress, budgets, stop conditions	selected step, attempt number, remaining budget
Failure classifier	failure class, severity, retryability	class, evidence, confidence, ambiguity
Recovery policy	retry, fallback, re-plan, compensate, escalate, stop	decision, reason, allowed authority
Idempotency layer	duplicate suppression for side effects	key, target, previous result
Compensation handler	undo or repair of partial external actions	compensation action, result, residual risk
Replay builder	incident packet for debugging and evals	inputs, state, tool outputs, policy version
Escalation channel	human owner and user-facing status	owner, summary, next action, deadline

The controller owns the loop. Tools do not decide to retry themselves, and model calls do not silently re-plan after failure. Recovery is a policy decision over durable state.

Contract

The smallest useful contract separates step result, failure class, recovery decision, and trace evidence.

type FailureClass =
  | "transient"
  | "rate_limit"
  | "missing_evidence"
  | "stale_plan"
  | "policy_denied"
  | "partial_side_effect"
  | "fatal";

type RecoveryAction =
  | "retry"
  | "fallback"
  | "replan"
  | "compensate"
  | "escalate"
  | "stop";

type RecoveryDecision = {
  action: RecoveryAction;
  reason: string;
  retryAfterMs?: number;
  requiresNewEvidence: boolean;
  idempotencyKey?: string;
};

type RecoveryTraceEvent = {
  workflowId: string;
  stepId: string;
  attempt: number;
  failureClass: FailureClass;
  decision: RecoveryDecision;
  budgetRemaining: number;
  stopReason?: string;
};

This contract prevents a common production bug: treating every failure as a retryable exception.
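One way to make that guarantee concrete is to let the compiler enforce it. The sketch below repeats the contract types so it stands alone, then adds a hypothetical `defaultAction` table and `toDecision` helper. The mapping and the placeholder delay are illustrative assumptions, not the policy from the source code.

```typescript
type FailureClass =
  | "transient"
  | "rate_limit"
  | "missing_evidence"
  | "stale_plan"
  | "policy_denied"
  | "partial_side_effect"
  | "fatal";

type RecoveryAction = "retry" | "fallback" | "replan" | "compensate" | "escalate" | "stop";

type RecoveryDecision = {
  action: RecoveryAction;
  reason: string;
  retryAfterMs?: number;
  requiresNewEvidence: boolean;
  idempotencyKey?: string;
};

// Typing the table as Record<FailureClass, ...> makes the compiler reject a
// new failure class that has no recovery action assigned.
const defaultAction: Record<FailureClass, RecoveryAction> = {
  transient: "retry",
  rate_limit: "fallback",
  missing_evidence: "replan",
  stale_plan: "replan",
  policy_denied: "stop",
  partial_side_effect: "compensate",
  fatal: "stop",
};

function toDecision(failureClass: FailureClass, idempotencyKey: string): RecoveryDecision {
  const action = defaultAction[failureClass];
  const decision: RecoveryDecision = {
    action,
    reason: `${failureClass} maps to ${action}`,
    requiresNewEvidence: action === "replan",
  };
  // Only retries carry the key and a delay; other actions leave them unset.
  if (action === "retry") {
    decision.idempotencyKey = idempotencyKey;
    decision.retryAfterMs = 1_000; // placeholder delay, not a recommendation
  }
  return decision;
}
```

With this shape, exactly one of the seven classes resolves to `"retry"`, so a generic catch-and-retry wrapper has no place to hide.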

Core Protocol

Persist workflow state before each step that can create an external side effect.
Execute the step through a bounded tool, worker, or model route.
If the step succeeds, store evidence and continue.
If the step fails, classify the failure with concrete evidence.
Check policy, retry budget, time budget, cost budget, and no-progress breaker.
Choose one recovery action: retry, fallback, re-plan, compensate, escalate, or stop.
Emit a trace event before executing the recovery action.
Resume from durable state, not from an optimistic in-memory plan.
Convert unresolved incidents into regression evals.
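The budget check in the protocol above can be sketched as a single function that returns the first exhausted budget as a stop reason. The `Budgets` fields, their units, and the check order are assumptions for illustration.

```typescript
// Hypothetical budget record; field names and units are illustrative.
type Budgets = {
  attemptsLeft: number;
  deadlineMs: number; // absolute time in epoch milliseconds
  costLeftUsd: number;
  repeatedFailures: number; // consecutive failures with no state change
  noProgressLimit: number;
};

// Returns a stop reason for the first exhausted budget, or null when
// recovery may proceed. The current time is passed in so the check is testable.
function firstExhaustedBudget(budgets: Budgets, nowMs: number): string | null {
  if (budgets.attemptsLeft <= 0) return "retry budget exhausted";
  if (nowMs >= budgets.deadlineMs) return "time budget exhausted";
  if (budgets.costLeftUsd <= 0) return "cost budget exhausted";
  if (budgets.repeatedFailures >= budgets.noProgressLimit) return "no-progress breaker tripped";
  return null;
}
```

A controller would call this before choosing a recovery action and copy a non-null result into the trace event as the stop reason.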

Workflow Transition Map

From	Event	To	Required Evidence
running	step succeeded	running or complete	output, validation result, updated state
running	transient failure	retry_wait	failure class, attempt count, backoff
retry_wait	timer elapsed	running	same idempotency key and unchanged target
running	fallback selected	running	fallback reason and reduced authority
running	stale plan detected	replanning	changed state or new evidence
replanning	valid plan produced	running	plan diff and validation result
running	partial side effect detected	compensating	side-effect ID and compensation rule
compensating	compensation succeeded	running or escalated	compensation result and residual risk
running	policy denial or fatal error	stopped	policy rule or fatal domain evidence
any active state	budget exhausted	escalated	budget values and replay packet
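The transition map can be held as data so that illegal transitions are rejected rather than silently accepted. This is a sketch under assumptions: the phase and event identifiers are hypothetical spellings of the table rows, and evidence validation is left out.

```typescript
type WorkflowPhase =
  | "running" | "retry_wait" | "replanning" | "compensating"
  | "stopped" | "escalated" | "complete";

type WorkflowEvent =
  | "step_succeeded" | "transient_failure" | "timer_elapsed" | "fallback_selected"
  | "stale_plan_detected" | "valid_plan_produced" | "partial_side_effect_detected"
  | "compensation_succeeded" | "policy_denial_or_fatal" | "budget_exhausted";

// Phases that count as "any active state" in the last table row.
const ACTIVE_PHASES: readonly WorkflowPhase[] = ["running", "retry_wait", "replanning", "compensating"];

// One entry per table row; each event lists the phases it may lead to.
const TRANSITIONS: { [P in WorkflowPhase]?: { [E in WorkflowEvent]?: readonly WorkflowPhase[] } } = {
  running: {
    step_succeeded: ["running", "complete"],
    transient_failure: ["retry_wait"],
    fallback_selected: ["running"],
    stale_plan_detected: ["replanning"],
    partial_side_effect_detected: ["compensating"],
    policy_denial_or_fatal: ["stopped"],
  },
  retry_wait: { timer_elapsed: ["running"] },
  replanning: { valid_plan_produced: ["running"] },
  compensating: { compensation_succeeded: ["running", "escalated"] },
};

function allowedTargets(from: WorkflowPhase, event: WorkflowEvent): readonly WorkflowPhase[] {
  if (event === "budget_exhausted") {
    return ACTIVE_PHASES.indexOf(from) !== -1 ? ["escalated"] : [];
  }
  return TRANSITIONS[from]?.[event] ?? [];
}

function canTransition(from: WorkflowPhase, event: WorkflowEvent, to: WorkflowPhase): boolean {
  return allowedTargets(from, event).indexOf(to) !== -1;
}
```

A controller could call `canTransition` before persisting a new phase and treat a `false` result as a bug worth escalating, since it means the loop attempted a move the map does not define.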

Implementation Notes

Keep retry policy per step. A cheap read call and an outbound payment update should not share a retry rule.
Use idempotency keys for every side-effecting call, including emails, CRM writes, ticket updates, calendar edits, and file mutations.
Use backoff and jitter for transient infrastructure failures, not for policy failures.
Re-plan only when state changed or new evidence arrived. Re-planning with the same facts usually burns tokens without changing the outcome.
Prefer lower-authority fallbacks: draft instead of send, read-only source instead of write source, cached summary instead of live mutation.
Build replay packets automatically. An operator should not have to reconstruct state from scattered logs.
Treat compensation as a first-class step with its own success, failure, and escalation path.
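Two of the notes above, per-step retry policy and backoff with jitter, can be sketched together. The policy values and step names are placeholders, and the "full jitter" variant (a uniform delay below an exponential ceiling) is one common choice rather than the approach the source code takes.

```typescript
type RetryPolicy = { maxAttempts: number; baseDelayMs: number; maxDelayMs: number };

// Hypothetical per-step policies: a cheap read may retry more often and
// sooner than an outbound payment update.
const retryPolicies = {
  crmRead: { maxAttempts: 5, baseDelayMs: 200, maxDelayMs: 10_000 },
  paymentUpdate: { maxAttempts: 2, baseDelayMs: 1_000, maxDelayMs: 30_000 },
};

// Full jitter: a uniform delay in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
// `attempt` is the number of failed attempts so far. Returns null when the
// retry budget is exhausted, so the caller escalates instead of sleeping.
function backoffWithJitter(
  policy: RetryPolicy,
  attempt: number,
  random: () => number = Math.random
): number | null {
  if (attempt >= policy.maxAttempts) return null;
  const ceiling = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Injecting `random` keeps the function deterministic in tests. This helper should only be reached for failures already classified as transient, never for policy failures.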

Failure Modes

The loop retries a policy denial until the budget is exhausted.
A partial side effect repeats because the retry lacks an idempotency key.
Re-planning hides the original failure instead of preserving the trace.
A fallback returns lower-quality data without marking the answer as degraded.
The controller escalates without enough state for a human to continue.
No-progress loops consume budget because every iteration looks slightly different.
Compensation fails and the system keeps acting as if rollback succeeded.
Recovery policy lives in prompt text only and cannot be audited.
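The no-progress failure mode above is easier to catch when failures are compared by a normalized fingerprint rather than by raw message. The sketch below is one hypothetical way to do that. The normalization rules (lowercasing, masking UUIDs and numbers) are assumptions and would need tuning against real error text.

```typescript
// Strip volatile details so failures that merely look different compare equal.
function failureFingerprint(stepId: string, failureClass: string, message: string): string {
  const normalized = message
    .toLowerCase()
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/g, "<uuid>")
    .replace(/\d+/g, "<n>");
  return `${stepId}|${failureClass}|${normalized}`;
}

class NoProgressBreaker {
  private readonly threshold: number;
  private lastFingerprint = "";
  private streak = 0;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Returns true once the same fingerprint has been seen `threshold` times
  // in a row; any different fingerprint restarts the streak at one.
  record(fingerprint: string): boolean {
    this.streak = fingerprint === this.lastFingerprint ? this.streak + 1 : 1;
    this.lastFingerprint = fingerprint;
    return this.streak >= this.threshold;
  }
}
```

When `record` returns `true`, the controller would stop the loop and emit a replay packet instead of spending the remaining budget.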

Review Checklist

Use this review checklist before moving a recovery loop past the prototype stage.

Every failure class has an owner and recovery action.
Every retryable side effect has an idempotency key.
Every compensation path has a residual-risk message.
Every stop condition produces a user-facing or operator-facing explanation.
Every production incident can become a replayable regression eval.

Evaluation Strategy

Test recovery as behavior, not as exception handling.

Eval Case	Expected Result
transient read timeout	retries with backoff, same inputs, trace event recorded
repeated timeout	stops or escalates when retry budget is exhausted
policy-denied write	does not retry, records policy rule, returns blocked status
partial CRM write	compensates or resumes from idempotent state without duplicate write
stale plan	re-plans only after evidence changes
degraded fallback	marks output as fallback-derived and lower confidence
bad classifier	fails eval because recovery action does not match failure class
no-progress loop	breaker stops after threshold and emits replay packet

Measure completion rate, recovered-run rate, unsafe retry rate, duplicate side-effect rate, mean recovery latency, budget burn, escalation quality, and replay success rate.
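Several of these metrics can be computed from per-run records. The sketch below assumes a hypothetical `RunRecord` shape and one plausible definition for each rate. The source does not define the formulas, so treat the denominators as assumptions to adapt.

```typescript
// Hypothetical per-run record produced from recovery trace events.
type RunRecord = {
  completed: boolean;
  recoveryAttempts: number;
  unsafeRetries: number; // retries of a side-effecting step without an idempotency key
  duplicateSideEffects: number;
  recoveryLatenciesMs: number[];
};

type RecoveryMetrics = {
  completionRate: number;
  recoveredRunRate: number;
  unsafeRetryRate: number;
  duplicateSideEffectRate: number;
  meanRecoveryLatencyMs: number;
};

const ratio = (numerator: number, denominator: number): number =>
  denominator === 0 ? 0 : numerator / denominator;

const sum = (values: number[]): number => values.reduce((a, b) => a + b, 0);

function summarizeRuns(runs: RunRecord[]): RecoveryMetrics {
  const withRecovery = runs.filter((r) => r.recoveryAttempts > 0);
  const latencies = runs.reduce<number[]>((all, r) => all.concat(r.recoveryLatenciesMs), []);
  return {
    // Completed runs over all runs.
    completionRate: ratio(runs.filter((r) => r.completed).length, runs.length),
    // Of the runs that needed recovery, the share that still completed.
    recoveredRunRate: ratio(withRecovery.filter((r) => r.completed).length, withRecovery.length),
    // Unsafe retries over all recovery attempts.
    unsafeRetryRate: ratio(sum(runs.map((r) => r.unsafeRetries)), sum(runs.map((r) => r.recoveryAttempts))),
    // Runs with at least one duplicate side effect over all runs.
    duplicateSideEffectRate: ratio(runs.filter((r) => r.duplicateSideEffects > 0).length, runs.length),
    meanRecoveryLatencyMs: ratio(sum(latencies), latencies.length),
  };
}
```

Empty inputs return zero rates rather than `NaN`, which keeps dashboards readable before the first recovery has happened.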

Production Checklist

Define failure classes in code, not only in prompt instructions.
Set retry, cost, time, and no-progress budgets per workflow and per step.
Persist state before externally visible side effects.
Store idempotency keys with action targets and results.
Require policy denial to stop autonomous recovery.
Emit structured trace events for failure class, recovery decision, budget, and stop reason.
Generate replay packets for escalations and failed recoveries.
Add dashboards for retry rate, fallback rate, compensation rate, escalation rate, and duplicate-prevention hits.
Turn resolved incidents into regression evals before widening automation.
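The replay-packet items in this checklist could start from something as small as the sketch below. The `ReplayPacket` fields follow the "Must Emit" column for the replay builder, while the function names and the handoff check are hypothetical.

```typescript
type ReplayPacket = {
  workflowId: string;
  stepId: string;
  policyVersion: string;
  stopReason: string;
  inputs: Record<string, unknown>;
  state: Record<string, unknown>;
  toolOutputs: unknown[];
};

// Snapshot the incident. A JSON round-trip deep-copies plain data, so later
// mutation of live state cannot change the packet. Assumption: every field
// is JSON-serializable (no functions, dates, or cyclic references).
function buildReplayPacket(source: ReplayPacket): ReplayPacket {
  return JSON.parse(JSON.stringify(source)) as ReplayPacket;
}

// List the fields a human would need but that are empty, so the controller
// can refuse to escalate with a packet that is too thin to act on.
function missingForHandoff(packet: ReplayPacket): string[] {
  const missing: string[] = [];
  if (packet.policyVersion === "") missing.push("policyVersion");
  if (packet.stopReason === "") missing.push("stopReason");
  if (packet.toolOutputs.length === 0) missing.push("toolOutputs");
  return missing;
}
```

Because the packet is plain JSON, the same object can be stored with the escalation and later loaded as the fixture for a regression eval.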

Code Walkthrough

Read the excerpt as the smallest executable expression of the pattern. The surrounding chapter explains the design constraints; the code shows where those constraints become concrete interfaces, state, validation, or control flow.

Source Code

These excerpts show the implementation shape. The complete code is available in the download bundle and repository source.

`self-healing-workflow-agent-pattern/autogen_typescript_example/self_healing_workflow.ts`

type FailureClass =
  | "transient"
  | "rate_limit"
  | "missing_evidence"
  | "stale_plan"
  | "policy_denied"
  | "partial_side_effect"
  | "fatal";

type RecoveryAction = "retry" | "fallback" | "replan" | "compensate" | "escalate" | "stop";

type StepFailure = {
  class: FailureClass;
  message: string;
  sideEffectId?: string;
};

type StepResult =
  | { ok: true; value: string }
  | { ok: false; failure: StepFailure };

function isStepFailure(result: StepResult): result is { ok: false; failure: StepFailure } {
  return result.ok === false;
}

type WorkflowState = {
  workflowId: string;
  stepId: string;
  attempt: number;
  maxAttempts: number;
  budgetRemaining: number;
  idempotencyKey: string;
  trace: string[];
};

type RecoveryDecision = {
  action: RecoveryAction;
  reason: string;
  retryAfterMs?: number;
};

function decideRecovery(state: WorkflowState, failure: StepFailure): RecoveryDecision {
  if (failure.class === "policy_denied") {
    return { action: "stop", reason: "Policy denials are never retried." };
  }

  if (failure.class === "fatal") {
    return { action: "stop", reason: "Fatal domain failure cannot be healed." };
  }

  if (failure.class === "partial_side_effect") {
    return failure.sideEffectId
      ? { action: "compensate", reason: `Compensate partial side effect ${failure.sideEffectId}.` }
      : { action: "escalate", reason: "Partial side effect has no compensation handle." };
  }

  if (state.attempt >= state.maxAttempts || state.budgetRemaining <= 0) {
    return { action: "escalate", reason: "Recovery budget exhausted." };
  }

  if (failure.class === "stale_plan" || failure.class === "missing_evidence") {
    return { action: "replan", reason: "Recovery needs changed state or new evidence." };
  }

  if (failure.class === "rate_limit") {
    return { action: "fallback", reason: "Use lower-cost fallback after quota failure." };
  }

  return {
    action: "retry",
    reason: "Transient failure can retry with same idempotency key.",
    retryAfterMs: Math.min(30_000, 2 ** state.attempt * 250)
  };
}

async function runSelfHealingStep(
  state: WorkflowState,
  step: (idempotencyKey: string) => Promise<StepResult>
): Promise<StepResult> {
  while (state.budgetRemaining > 0) {
    const result = await step(state.idempotencyKey);
    if (result.ok) return result;
    if (!isStepFailure(result)) return result;

    const decision = decideRecovery(state, result.failure);
    state.trace.push(`${state.stepId} attempt ${state.attempt}: ${decision.action} - ${decision.reason}`);

    if (decision.action === "retry") {
      state.attempt += 1;
      state.budgetRemaining -= 1;

Excerpt truncated for readability. Download the bundle or open the source file for the complete implementation.

`self-healing-workflow-agent-pattern/langgraph_python_example/self_healing_workflow.py`