RUNTIME SLO AND INCIDENT REVIEW WORKSHEET

Use this worksheet before a production agent handles real users, private data, external tools, money movement, infrastructure, durable memory, or unattended events.

1. Runtime scope

Agent or workflow:
Route:
Owner:
On-call owner:
Reviewer:
Date:
Release or design version:

Autonomy level:
Risk class:
Execution mode:
[ ] synchronous request
[ ] async job
[ ] durable workflow
[ ] event-triggered run
[ ] human-gated run

Highest-risk action:
Rollback or disable path:

2. Service-level objectives

Define SLOs for the route. Fill in concrete numeric targets before launch, even if they are conservative.

Availability target:
Latency target:
Cost target:
Trace coverage target:
Policy-decision coverage target:
Approval-latency target:
Eval-gate pass target:
Incident-response target:
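
One way to keep these targets reviewable is to record them as data next to the route. The sketch below is only an illustration: the class name `RouteSLO`, its field names, the units, and every number in `example` are assumptions made for this example, not recommended values.

```python
from dataclasses import dataclass, fields

# Fields expressed as a fraction of runs; everything else is a positive quantity.
RATIO_FIELDS = {
    "availability",
    "trace_coverage",
    "policy_decision_coverage",
    "eval_gate_pass_rate",
}


@dataclass(frozen=True)
class RouteSLO:
    """Hypothetical record of the section 2 targets for one route."""

    availability: float              # fraction of runs that finish successfully
    p95_latency_s: float             # seconds
    max_cost_per_run_usd: float      # US dollars
    trace_coverage: float            # fraction of runs with a complete trace
    policy_decision_coverage: float  # fraction of risky actions with a decision
    p95_approval_latency_s: float    # seconds a run waits for a human
    eval_gate_pass_rate: float       # fraction of gated evals that must pass
    incident_response_s: float       # seconds from detection to first action

    def problems(self) -> list[str]:
        """Return a message for each target that is not a usable number."""
        found = []
        for f in fields(self):
            value = getattr(self, f.name)
            if f.name in RATIO_FIELDS:
                if not 0.0 < value <= 1.0:
                    found.append(f"{f.name} must be in (0, 1]")
            elif value <= 0:
                found.append(f"{f.name} must be positive")
        return found


# Placeholder numbers only; replace with the route's real targets.
example = RouteSLO(
    availability=0.99,
    p95_latency_s=20.0,
    max_cost_per_run_usd=0.50,
    trace_coverage=1.0,
    policy_decision_coverage=1.0,
    p95_approval_latency_s=900.0,
    eval_gate_pass_rate=1.0,
    incident_response_s=1800.0,
)
print(example.problems())
```

An empty list from `problems()` only means every target is a plausible number. It says nothing about whether the target is right for the route.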

3. Error budget and burn rules

What consumes the error budget?

[ ] failed runs
[ ] missing stop reason
[ ] missing trace
[ ] policy decision missing before risky action
[ ] approval missing before high-risk side effect
[ ] cost budget exceeded
[ ] latency budget exceeded
[ ] duplicate side effect
[ ] unsupported final answer
[ ] memory write policy violation

Investigate when:
Pause rollout when:
Roll back when:
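
The three "when" fields above can be stated as burn-rate thresholds. The sketch below uses the common definition of burn rate (observed bad-run fraction divided by the fraction the SLO allows). The function names and the default thresholds 1.0, 2.0, and 10.0 are assumptions chosen for illustration; pick thresholds that fit the route's risk class.

```python
def burn_rate(bad_runs: int, total_runs: int, slo_target: float) -> float:
    """Return how fast the error budget is burning in one window.

    1.0 means bad runs arrive exactly at the rate the SLO allows.
    A value above 1.0, if sustained, exhausts the budget early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_runs <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    allowed_bad_fraction = 1.0 - slo_target
    return (bad_runs / total_runs) / allowed_bad_fraction


def burn_action(
    rate: float,
    investigate_at: float = 1.0,   # hypothetical threshold
    pause_at: float = 2.0,         # hypothetical threshold
    roll_back_at: float = 10.0,    # hypothetical threshold
) -> str:
    """Map a burn rate to the strongest action whose threshold it reaches."""
    if rate >= roll_back_at:
        return "roll back"
    if rate >= pause_at:
        return "pause rollout"
    if rate >= investigate_at:
        return "investigate"
    return "ok"


# 30 bad runs out of 1000 against a 99% target burns about 3x the allowed rate.
print(burn_action(burn_rate(bad_runs=30, total_runs=1000, slo_target=0.99)))
```

A "bad run" here is any run that hits one of the checked budget consumers above, so a run with a missing trace counts even if it returned a correct answer.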

4. Runtime dashboard

Dashboard or query links:

[ ] active runs by route, tenant, risk, state, age, and owner
[ ] stop reasons
[ ] failed runs
[ ] waiting approvals
[ ] policy denials
[ ] tool errors
[ ] retry count
[ ] cost per route
[ ] latency by step
[ ] trace completeness
[ ] eval-gate status
[ ] rollback and kill-switch status

Missing dashboard panels:

5. Incident triage

Incident ID:
Detected by:
Detected at:
Severity:
Affected route:
Affected tenants or users:
Active version set:
User-visible impact:
External side effects:

First action:
[ ] disable route
[ ] disable tool
[ ] force human approval
[ ] pin model or prompt
[ ] tighten policy
[ ] drain queue
[ ] cancel active runs
[ ] roll back workflow
[ ] escalate to incident owner

6. Trace review

Can the team reconstruct the failed run from the trace alone? Check each item the trace records.

[ ] actor or event source
[ ] tenant or scope
[ ] goal
[ ] route
[ ] risk class
[ ] model version
[ ] prompt version
[ ] policy version
[ ] tool schema version
[ ] retriever or memory version
[ ] context packet
[ ] model proposal
[ ] policy decision
[ ] approval state
[ ] tool call
[ ] side-effect record
[ ] stop reason
[ ] rollback action

Trace gaps:
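
The checklist above can also be run mechanically against a stored trace. This is a minimal sketch that assumes a trace is available as a flat dictionary; the key names are hypothetical renderings of the checklist items, not a real tracing schema.

```python
# Hypothetical key names, one per checklist item in section 6.
REQUIRED_TRACE_FIELDS = (
    "actor_or_event_source",
    "tenant_or_scope",
    "goal",
    "route",
    "risk_class",
    "model_version",
    "prompt_version",
    "policy_version",
    "tool_schema_version",
    "retriever_or_memory_version",
    "context_packet",
    "model_proposal",
    "policy_decision",
    "approval_state",
    "tool_call",
    "side_effect_record",
    "stop_reason",
    "rollback_action",
)


def trace_gaps(trace: dict) -> list[str]:
    """Return the required fields that are absent, None, or an empty string.

    Assumption: a run with no rollback records an explicit value such as
    "none", so an absent rollback_action is treated as a gap.
    """
    return [
        name
        for name in REQUIRED_TRACE_FIELDS
        if trace.get(name) in (None, "")
    ]


# A trace that records everything except how the run stopped and was undone.
sample = {name: "recorded" for name in REQUIRED_TRACE_FIELDS}
del sample["stop_reason"]
sample["rollback_action"] = ""
print(trace_gaps(sample))  # ['stop_reason', 'rollback_action']
```

The returned list can be pasted into "Trace gaps:" as is, and its length divided by the number of required fields gives a rough per-run input for the trace completeness panel.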

7. Incident-to-eval conversion

Should this become a regression fixture?

[ ] yes, blocking eval
[ ] yes, warning eval
[ ] yes, exploratory eval
[ ] no, trace evidence is enough

Fixture owner:
Fixture location:
Expected behavior:
Forbidden behavior:
Required trace spans:
Change types this fixture should gate:
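
If the answer is yes, the fields above map naturally onto a small fixture record and a check. The sketch below is one possible shape under stated assumptions: `RegressionFixture`, `evaluate`, `blocks_release`, the incident ID, and the action and span names are all hypothetical, and the sketch assumes a replayed run yields a set of action names and a set of span names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionFixture:
    """Hypothetical record of one incident turned into an eval."""

    incident_id: str
    level: str                        # "blocking", "warning", or "exploratory"
    expected_actions: frozenset[str]
    forbidden_actions: frozenset[str]
    required_spans: frozenset[str]
    gated_change_types: frozenset[str]


def evaluate(
    fixture: RegressionFixture,
    observed_actions: set[str],
    observed_spans: set[str],
) -> list[str]:
    """Compare one replayed run against the fixture and list every failure."""
    failures = []
    for action in sorted(fixture.expected_actions - observed_actions):
        failures.append(f"missing expected action: {action}")
    for action in sorted(fixture.forbidden_actions & observed_actions):
        failures.append(f"forbidden action taken: {action}")
    for span in sorted(fixture.required_spans - observed_spans):
        failures.append(f"missing trace span: {span}")
    return failures


def blocks_release(fixture: RegressionFixture, failures: list[str]) -> bool:
    """Only a failing fixture marked blocking stops a release."""
    return fixture.level == "blocking" and bool(failures)


# Illustrative fixture; every identifier here is made up for the example.
fixture = RegressionFixture(
    incident_id="INC-EXAMPLE",
    level="blocking",
    expected_actions=frozenset({"request_approval"}),
    forbidden_actions=frozenset({"send_payment"}),
    required_spans=frozenset({"policy_decision", "stop_reason"}),
    gated_change_types=frozenset({"prompt", "policy", "tool_schema"}),
)
failures = evaluate(
    fixture,
    observed_actions={"send_payment"},
    observed_spans={"policy_decision"},
)
print(failures, blocks_release(fixture, failures))
```

Keeping `gated_change_types` on the fixture lets a release pipeline select only the fixtures relevant to a given change instead of replaying every incident on every change.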

8. Follow-up decision

[ ] keep running
[ ] keep running with reduced authority
[ ] force human approval
[ ] pause rollout
[ ] roll back
[ ] deprecate route

Blocking gaps:
Accepted residual risks:
Next review date: