PATTERN EVALUATION SCORECARD

Use this scorecard before choosing a pattern, composing patterns, or promoting an agentic workflow toward production.

Pattern:
System or workflow:
Reviewer:
Date:
Release mode under review: prototype | internal_pilot | controlled_production | production

SCORING

0 = Missing or only prompt-level.
1 = Described but not implemented or tested.
2 = Implemented but weakly tested or hard to inspect.
3 = Implemented, tested, traceable, and owned.

Do not average away hard failures. A pattern that touches money, private data, infrastructure, customer communication, or durable memory must not score 0 or 1 on security, tools, approvals, side effects, evaluation, or observability.

SCORECARD

Area: Goal
Score:
Owner:
Evidence:
Reviewer notes:
Question: What user or system goal does this pattern own?
Required evidence: task contract, success criteria, refusal criteria, and owner.

Area: Boundary
Score:
Owner:
Evidence:
Reviewer notes:
Question: What is outside the pattern's responsibility?
Required evidence: handoff contract, caller contract, fallback, or escalation path.

Area: Autonomy Split
Score:
Owner:
Evidence:
Reviewer notes:
Question: What does the model decide, and what does software decide?
Required evidence: proposal, validation, execution, approval, and stop boundaries.

Area: Tools And Side Effects
Score:
Owner:
Evidence:
Reviewer notes:
Question: What can the pattern read, write, send, change, delete, or trigger?
Required evidence: tool allowlist, schemas, authorization checks, risk class, approval rule, idempotency, timeout, and audit record.

Area: State And Memory
Score:
Owner:
Evidence:
Reviewer notes:
Question: What state is read, written, persisted, retrieved, corrected, or deleted?
Required evidence: state owner, schema, replay behavior, retention rule, memory write policy, and correction path.

Area: Context And Evidence
Score:
Owner:
Evidence:
Reviewer notes:
Question: What evidence enters the working set, and why should the model trust it?
Required evidence: source eligibility, freshness rule, retrieval trace, context budget, and prompt-injection controls.

Area: Security
Score:
Owner:
Evidence:
Reviewer notes:
Question: What can untrusted input influence?
Required evidence: threat model, sandbox boundary, credential policy, tenant isolation, policy denial tests, and approval gates.

Area: Evaluation
Score:
Owner:
Evidence:
Reviewer notes:
Question: What failure must be caught before release?
Required evidence: happy path cases, edge cases, adversarial cases, trajectory evals, mocked tools, and regression fixtures.

Area: Observability
Score:
Owner:
Evidence:
Reviewer notes:
Question: Can a failed run be explained later?
Required evidence: trace ID, model spans, tool spans, decision records, policy denials, costs, latency, and stop reason.

Area: Operations
Score:
Owner:
Evidence:
Reviewer notes:
Question: Can the pattern be disabled, rolled back, replayed, or degraded?
Required evidence: versioned prompts, tool manifests, model routes, feature flags, rollback plan, circuit breakers, and incident runbook.

RELEASE DECISION

Any score is 0:
[ ] Yes -> Block release.
[ ] No

Any score is 1 on security, tools, approvals, side effects, evaluation, or observability:
[ ] Yes -> Block production release.
[ ] No

Average score:

Recommended release mode:
[ ] Prototype only
[ ] Internal or low-risk pilot
[ ] Controlled production candidate
[ ] Staged production rollout
[ ] Production for stated scope

Blocking gaps:
-
-
-

Accepted risks:
-
-
-

Required next evidence:
-
-
-

Final decision:
[ ] Approve for stated scope
[ ] Approve with limits
[ ] Pilot only
[ ] Block

Decision rationale:

FOLLOW-UP

Owner for missing evidence:
Date for rescore:
Regression eval to add:
Rollback or disable path:
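
EXAMPLE: AUTOMATING THE RELEASE GATE

The two blocking rules in the release decision can be checked mechanically before anyone looks at the average. The following is a minimal sketch, not part of the scorecard itself. The function name release_gate, the decision labels, and the CRITICAL_AREAS mapping are hypothetical. In particular, the scorecard has no standalone "approvals" area, so this sketch assumes approvals are covered by the Tools And Side Effects and Security areas.

```python
from statistics import mean

# Assumption: the rule names "security, tools, approvals, side effects,
# evaluation, or observability". This sketch maps those onto the four
# matching scorecard areas; approvals are treated as evidence inside
# "Tools And Side Effects" and "Security".
CRITICAL_AREAS = {
    "Tools And Side Effects",
    "Security",
    "Evaluation",
    "Observability",
}


def release_gate(scores: dict[str, int]) -> dict:
    """Apply the two blocking rules first, then report the average."""
    for area, score in scores.items():
        if score not in (0, 1, 2, 3):
            raise ValueError(f"{area}: score must be 0, 1, 2, or 3, got {score!r}")

    # Rule 1: any score of 0 blocks release.
    zero_areas = sorted(a for a, s in scores.items() if s == 0)
    # Rule 2: a score of 1 on a critical area blocks production release.
    weak_critical = sorted(
        a for a, s in scores.items() if s == 1 and a in CRITICAL_AREAS
    )

    if zero_areas:
        decision = "block_release"
    elif weak_critical:
        decision = "block_production_release"
    else:
        decision = "no_hard_failure"

    return {
        "decision": decision,
        "blocking_gaps": zero_areas + weak_critical,
        # The average is informational only; it never overrides a block.
        "average": round(mean(scores.values()), 2),
    }


# Hypothetical scores: strong everywhere except Security.
example = {
    "Goal": 3, "Boundary": 3, "Autonomy Split": 3, "Tools And Side Effects": 3,
    "State And Memory": 3, "Context And Evidence": 3, "Security": 1,
    "Evaluation": 3, "Observability": 3, "Operations": 3,
}
result = release_gate(example)
print(result["decision"], result["average"])  # block_production_release 2.8
```

The example shows why the scorecard says not to average away hard failures: an average of 2.8 looks healthy, yet the single 1 on Security still blocks production release. A "no_hard_failure" result only means neither blocking rule fired; the reviewer still chooses the release mode and records the rationale.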