# Production Evaluation Feedback Loops Review Checklist

Use this checklist when operating evals after launch.

## Feedback Sources

- [ ] Incidents, near misses, human corrections, overrides, high-cost outliers, and regression reports are reviewed for eval value.
- [ ] Each serious incident produces at least one minimal replayable eval, or a written justification for not adding one.
- [ ] Production traces are redacted before becoming fixtures.
- [ ] Fixtures reproduce the failure mode, not the entire incident archive.
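
The redaction and minimization items above can be sketched as follows. This is only an illustration under stated assumptions: the trace shape (`request`, `steps`), the `to_fixture` helper, and the `sk_`/`tok_` secret format are hypothetical, and a real pipeline would need redaction rules that match its own data.

```python
import re
from typing import Any

# Assumption: emails and "sk_"/"tok_" style secrets are the sensitive values.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
TOKEN_RE = re.compile(r"\b(?:sk|tok)_[A-Za-z0-9]{8,}\b")


def redact_text(text: str) -> str:
    """Replace emails and token-like strings with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return TOKEN_RE.sub("<TOKEN>", text)


def redact(value: Any) -> Any:
    """Recursively redact every string inside dicts and lists."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def to_fixture(trace: dict, keep_steps: list[int]) -> dict:
    """Keep only the steps that reproduce the failure, then redact them."""
    steps = [trace["steps"][i] for i in keep_steps]
    return redact({"request": trace["request"], "steps": steps})


trace = {
    "request": "Refund jane.doe@example.com using sk_abcdef123456",
    "steps": [
        {"tool": "lookup", "output": "ok"},
        {"tool": "refund", "output": "sent to jane.doe@example.com"},
        {"tool": "log", "output": "done"},
    ],
}
fixture = to_fixture(trace, keep_steps=[1])
```

Selecting steps before redacting keeps the fixture small and limits how much production data is ever copied out of the trace store.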

## Fixture Contract

- [ ] Fixtures include ID, incident link, owner, severity, creation date, and, for temporary fixtures, an expiry date.
- [ ] Inputs include redacted request, event, state, evidence, memory, and mocked tool outputs as needed.
- [ ] The expected trajectory names required tools, forbidden tools, policy decisions, approval state, stop reason, and retry behavior.
- [ ] The expected result names final status, structured fields, evidence requirements, and user-visible constraints.
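
One possible way to encode this contract is a small typed record plus a checker that compares a run against it. The sketch below covers only part of the contract; the class names, field names, and `check_run` helper are hypothetical, and policy decisions, approval state, and structured fields are omitted for brevity.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ExpectedTrajectory:
    required_tools: set[str] = field(default_factory=set)
    forbidden_tools: set[str] = field(default_factory=set)
    stop_reason: Optional[str] = None
    max_retries: int = 0


@dataclass
class Fixture:
    fixture_id: str
    incident_link: str
    owner: str
    severity: str
    created: date
    expires: Optional[date]  # set only for temporary fixtures
    inputs: dict             # redacted request, state, mocked tool outputs
    trajectory: ExpectedTrajectory
    expected_status: str


def check_run(fixture: Fixture, tools_called: list[str], stop_reason: str,
              retries: int, status: str) -> list[str]:
    """Return human-readable violations; an empty list means the run passed."""
    expected = fixture.trajectory
    called = set(tools_called)
    violations = []
    for tool in sorted(expected.required_tools - called):
        violations.append(f"missing required tool: {tool}")
    for tool in sorted(expected.forbidden_tools & called):
        violations.append(f"forbidden tool called: {tool}")
    if expected.stop_reason is not None and stop_reason != expected.stop_reason:
        violations.append(f"stop reason {stop_reason!r} != {expected.stop_reason!r}")
    if retries > expected.max_retries:
        violations.append(f"retries {retries} > {expected.max_retries}")
    if status != fixture.expected_status:
        violations.append(f"status {status!r} != {fixture.expected_status!r}")
    return violations


fixture = Fixture(
    fixture_id="fx-001",
    incident_link="<incident link>",
    owner="payments-team",
    severity="high",
    created=date(2024, 1, 5),
    expires=None,
    inputs={"request": "<redacted>"},
    trajectory=ExpectedTrajectory(
        required_tools={"lookup_order", "request_approval"},
        forbidden_tools={"issue_refund"},
        stop_reason="awaiting_approval",
        max_retries=1,
    ),
    expected_status="pending",
)
bad_run = check_run(fixture, ["lookup_order", "issue_refund"], "completed", 0, "refunded")
good_run = check_run(fixture, ["lookup_order", "request_approval"], "awaiting_approval", 1, "pending")
```

Returning a list of violations rather than a single boolean makes a failed replay easier to triage, since one run can break several parts of the contract at once.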

## Release Gates

- [ ] Blocking evals protect safety, privacy, policy, and known incidents.
- [ ] Warning evals cover quality, tone, and non-critical edge cases.
- [ ] Gates map change type to the relevant eval subset.
- [ ] Model, prompt, policy, tool, retrieval, memory, and topology changes replay known production failures.
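
The gate logic above might look like the following sketch. The tag names, the `SUITES_BY_CHANGE` mapping, and the choice to run every eval for an unmapped change type are assumptions made for illustration, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    name: str
    tier: str              # "blocking" or "warning"
    tags: frozenset[str]
    passed: bool


# Hypothetical mapping from change type to the eval tags it must replay.
SUITES_BY_CHANGE = {
    "prompt": {"known_incident", "policy", "tone"},
    "tool": {"known_incident", "safety"},
    "retrieval": {"known_incident", "privacy", "quality"},
}


def gate(change_type: str, results: list[EvalResult]) -> tuple[bool, list[str], list[str]]:
    """Return (release_allowed, failed_blocking, failed_warning)."""
    wanted = SUITES_BY_CHANGE.get(change_type)
    if wanted is None:
        # Assumption: an unmapped change type fails safe and runs everything.
        relevant = list(results)
    else:
        relevant = [r for r in results if r.tags & wanted]
    blockers = [r.name for r in relevant if not r.passed and r.tier == "blocking"]
    warnings = [r.name for r in relevant if not r.passed and r.tier == "warning"]
    return (not blockers, blockers, warnings)


results = [
    EvalResult("refund_incident_replay", "blocking", frozenset({"known_incident"}), False),
    EvalResult("tone_check", "warning", frozenset({"tone"}), False),
    EvalResult("pii_leak", "blocking", frozenset({"privacy"}), False),
]
allowed, blockers, warnings = gate("prompt", results)
```

In this sketch every change type includes the `known_incident` tag, which is one way to make sure each kind of change replays known production failures.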

## Ownership And Metrics

- [ ] Each important eval has an owner, reason for existing, severity, maintenance rule, and release decision role.
- [ ] Evals expire once they go stale, for example after a migration, policy sunset, or tool replacement, or when the customer-specific incident behind them no longer applies.
- [ ] Metrics track incident-to-eval conversion, eval catch rate, recurrence rate, flaky eval rate, and time from incident to regression test.
- [ ] Canary thresholds and rollback targets are documented for prompts, policies, tools, model routes, memory rules, and retrieval indexes.
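
Three of the metrics above can be computed from a simple incident log, as in the sketch below. The `Incident` record and the exact metric definitions (conversion as the share of incidents with an eval, time to regression test as median days from incident to eval) are assumptions; teams may define them differently.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class Incident:
    incident_id: str
    opened: date
    eval_added: Optional[date]  # None if no eval was created
    recurred: bool


def loop_metrics(incidents: list[Incident]) -> dict:
    """Summarize how well incidents are being turned into evals."""
    if not incidents:
        raise ValueError("need at least one incident")
    converted = [i for i in incidents if i.eval_added is not None]
    days = [(i.eval_added - i.opened).days for i in converted]
    return {
        "incident_to_eval_conversion": len(converted) / len(incidents),
        "recurrence_rate": sum(i.recurred for i in incidents) / len(incidents),
        "median_days_to_regression_test": median(days) if days else None,
    }


# Illustrative data only.
incidents = [
    Incident("inc-1", date(2024, 1, 1), date(2024, 1, 3), recurred=False),
    Incident("inc-2", date(2024, 1, 10), date(2024, 1, 14), recurred=True),
    Incident("inc-3", date(2024, 1, 20), None, recurred=False),
    Incident("inc-4", date(2024, 2, 1), date(2024, 2, 7), recurred=False),
]
metrics = loop_metrics(incidents)
```

Eval catch rate and flaky eval rate need release and CI history rather than incident records, so they are left out of this sketch.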