SELF-HEALING WORKFLOW REVIEW CHECKLIST

Workflow:
Owner:
Reviewer:
Date:

1. Failure Classification
[ ] Retryable failures are named.
[ ] Fatal failures are named.
[ ] Policy denials are not retried.
[ ] Partial side effects are named separately from transient failures.
[ ] Stale plans and missing evidence have separate recovery outcomes.
[ ] Missing evidence, ambiguity, and unsafe tool proposals have explicit outcomes.
Evidence:

2. Recovery State Machine
[ ] The chapter or design doc includes a recovery state machine or equivalent decision graph.
[ ] Every graph edge has a trace event, budget check, and stop reason.
[ ] Retry, fallback, re-plan, compensation, escalation, and stop are distinct states or decisions.
[ ] Ambiguous classification escalates instead of choosing autonomous recovery.
Evidence:

3. Recovery Policy
[ ] Retry limits are defined per step and per run.
[ ] Backoff and jitter are defined for transient failures.
[ ] Fallbacks reduce risk or authority.
[ ] Re-planning requires new evidence or changed state.
[ ] Escalation has owner and message.
Evidence:

4. State and Idempotency
[ ] Workflow state is persisted after meaningful steps.
[ ] Side-effect steps have idempotency keys.
[ ] Idempotency keys include action target, attempt, and previous result.
[ ] Compensation is defined for partially completed external actions.
[ ] Duplicate retry cannot repeat irreversible work.
Evidence:

5. Breakers and Budgets
[ ] No-progress breaker is defined.
[ ] Tool failure breaker is defined.
[ ] Cost, latency, and iteration limits are defined.
[ ] Breaker events include trigger, threshold, observed value, and action.
Evidence:

6. Observability and Replay
[ ] Trace records decision, action, observation, recovery decision, and stop reason.
[ ] Failed runs produce replay packets.
[ ] Replay can run with mocked tools or sandboxed side effects.
[ ] Incidents become regression evals.
[ ] Escalation packets include current state, failure class, attempted recovery, residual risk, and owner.
Evidence:

7. Evaluation
[ ] Happy path is tested.
[ ] Transient tool failure is tested.
[ ] Fatal tool failure is tested.
[ ] Repeated failure stops safely.
[ ] Partial side effect is compensated or escalated.
[ ] Bad recovery decision fails the eval.
[ ] No-progress loops trip the breaker.
[ ] Policy-denied actions do not retry.
Evidence:

8. Final Decision
[ ] Prototype only
[ ] Internal pilot
[ ] Production candidate
[ ] Not safe to automate recovery

Blocking gaps:
Next actions:
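
REVIEWER NOTE: ILLUSTRATIVE SKETCH

A reviewer may find it useful to compare the workflow under review against a small reference shape for the items in sections 1, 3, and 5: policy denials never retried, ambiguous classes escalated, retry limits per step and per run, backoff with jitter, and a no-progress breaker whose event carries trigger, threshold, observed value, and action. The Python below is only a minimal sketch under assumptions. The failure class names, the default limits, the full-jitter backoff formula, the choice to stop on fatal failures, and every function and field name are hypothetical and are not taken from any particular framework.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical failure classes; a real workflow names its own.
RETRYABLE = {"timeout", "rate_limited"}
NEVER_RETRY = {"policy_denied", "fatal"}


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter (assumed policy).

    Returns a delay drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def recovery_decision(failure_class: str, step_attempts: int, run_attempts: int,
                      max_per_step: int = 3, max_per_run: int = 10) -> str:
    """Map a classified failure to exactly one recovery decision."""
    if failure_class in NEVER_RETRY:
        return "stop"          # policy denials and fatal failures are not retried
    if failure_class not in RETRYABLE:
        return "escalate"      # unknown or ambiguous class: no autonomous recovery
    if step_attempts >= max_per_step or run_attempts >= max_per_run:
        return "escalate"      # per-step or per-run retry budget exhausted
    return "retry"


@dataclass
class BreakerEvent:
    trigger: str
    threshold: int
    observed: int
    action: str


def no_progress_breaker(state_hashes: List[str], threshold: int = 3) -> Optional[BreakerEvent]:
    """Trip when the last `threshold` persisted state hashes are identical."""
    recent = state_hashes[-threshold:]
    if len(recent) == threshold and len(set(recent)) == 1:
        return BreakerEvent("no_progress", threshold, len(recent), "stop_and_escalate")
    return None
```

In this sketch, recovery_decision("policy_denied", 0, 0) yields "stop" and a retryable failure that has used its per-step budget yields "escalate". Keeping the decision function pure (no sleeping, no tool calls) is a deliberate choice here: it makes each recovery decision easy to record in the trace and to replay against mocked tools.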