Pattern Evaluation Checklist
A pattern is not ready just because it sounds right. It is ready when the team can explain what job it owns, where it can fail, what bounds the failure, how it is evaluated, and how it behaves in production.
This checklist is the shared review lens for the book. Use it before choosing a pattern, before composing several patterns, and before promoting an agentic workflow into production.
Download the reusable worksheet: pattern evaluation scorecard.
For the engineering loop behind this checklist, see Evaluation-Driven Agent Development. Use the checklist to review a pattern. Use the development chapter to turn the review into datasets, fixtures, release gates, and production feedback.
The Short Version
Every pattern should answer five questions:
- What responsibility does this pattern own?
- What new risk does it introduce?
- What control keeps that risk bounded?
- What eval proves the control works?
- What production signal tells us it is drifting?
If the pattern cannot answer those questions, it is probably not a pattern yet. It is an implementation idea.
Evaluation Table
Use this table as the default review template.
| Area | Question | Evidence To Look For |
|---|---|---|
| Goal | What user or system goal does the pattern own? | A task contract, success criteria, refusal criteria, and owner. |
| Boundary | What is outside the pattern’s responsibility? | A clear handoff, caller contract, or escalation path. |
| Autonomy | What does the model decide, and what does software decide? | A split between proposal, validation, execution, and stop. |
| Loop | Can the pattern repeat? | Max steps, max tool calls, timeout, retry budget, and stop reason. |
| Tools | What can the pattern read or change? | Tool allowlist, schema validation, permission checks, and audit events. |
| State | What state is read, written, or persisted? | State owner, update rules, replay behavior, and memory write policy. |
| Context | What evidence enters the working set? | Source eligibility, retrieval rules, freshness checks, and context budget. |
| Security | What can untrusted input influence? | Threat model, prompt-injection controls, sandboxing, and approval gates. |
| Evaluation | What failure must be caught before release? | Golden tasks, negative cases, trajectory evals, mocked tools, and regression fixtures. |
| Observability | Can a failed run be explained later? | Trace ID, model spans, tool spans, decisions, policy denials, costs, and stop reason. |
| Operations | Can the pattern be rolled back or disabled? | Versioned prompts, tool manifests, model routes, feature flags, and circuit breakers. |
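The Loop row is the easiest one to turn into code. The following is a minimal sketch, assuming a hypothetical `step` callable that runs one model turn and reports whether the model proposes to stop. The names `LoopBudget`, `StepResult`, and `run_bounded` are illustrative and not part of any library. Software owns the stop decision, and every exit path returns a stop reason.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopBudget:
    max_steps: int = 6
    max_tool_calls: int = 8
    timeout_s: float = 45.0


@dataclass
class StepResult:
    done: bool       # the model's proposal that the task is finished
    tool_calls: int  # tool calls made during this step


def run_bounded(
    step: Callable[[int], StepResult],
    budget: LoopBudget,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Run `step` until it finishes or a budget runs out; return the stop reason."""
    started = clock()
    tool_calls = 0
    for i in range(budget.max_steps):
        # The timeout is checked before each step, not inside it.
        if clock() - started > budget.timeout_s:
            return "timeout"
        result = step(i)
        tool_calls += result.tool_calls
        # Simplification: the tool budget is checked after the step.
        # A stricter runtime would check before each individual call.
        if tool_calls > budget.max_tool_calls:
            return "tool_budget_exhausted"
        if result.done:
            return "completed"
    return "max_steps"
```

Passing the clock as a parameter keeps the timeout path testable without sleeping.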
Review Score
Use a simple score when the pattern is going into a design review or release gate.
| Score | Meaning | Decision |
|---|---|---|
| 0 | Missing or only prompt-level. | Block. The control is not real. |
| 1 | Described but not implemented or tested. | Do not release. Convert the description into code, config, or tests. |
| 2 | Implemented but weakly tested or hard to inspect. | Release only for low-risk internal use. |
| 3 | Implemented, tested, traceable, and owned. | Accept for the stated risk level. |
Score these areas: goal, boundary, autonomy split, tools, state, context, security, evaluation, observability, and operations. A production pattern should not have any 0 scores. A pattern that touches money, private data, infrastructure, customer communication, or durable memory should not have any 1 scores.
The point is not to create bureaucracy. The score prevents a common design-review failure: everyone agrees the idea is good, but nobody proves the boundary exists.
Use the downloadable pattern evaluation scorecard when the review must leave an auditable record. It captures the score, owner, evidence, release mode, blocking gaps, accepted risks, and next evidence for each area.
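The score floors described above can be checked mechanically. This is a rough sketch, not the scorecard's actual format: `AreaScore` and `floor_violations` are hypothetical names, and the record keeps only a few of the fields the scorecard captures.

```python
from dataclasses import dataclass, field


@dataclass
class AreaScore:
    area: str
    score: int  # 0 to 3, as defined in the review score table
    owner: str = ""
    evidence: list[str] = field(default_factory=list)


def floor_violations(entries: list[AreaScore], high_risk: bool) -> list[str]:
    """Return the areas that fall below the minimum score for this risk level."""
    # Production patterns allow no 0 scores; high-risk patterns allow no 1 scores either.
    floor = 2 if high_risk else 1
    return [e.area for e in entries if e.score < floor]
```

Here `high_risk` stands for a pattern that touches money, private data, infrastructure, customer communication, or durable memory.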
Score Decision Path
Use this path during review. It turns the score into a release decision without hiding high-risk gaps behind a good average.
Minimum Bar By Pattern Type
Different patterns need different proof.
| Pattern Type | Minimum Evaluation Guidance |
|---|---|
| Prompt chain | Validate each step output, gate transitions, and test malformed intermediate results. |
| Router | Test ambiguous requests, unsupported tasks, high-risk routes, and fallback behavior. |
| Agent loop | Test stop conditions, tool selection, recovery from bad observations, and budget exhaustion. |
| Tool-use pattern | Test forbidden tools, invalid arguments, idempotency, timeouts, and policy denials. |
| RAG or memory pattern | Test source relevance, stale evidence, missing evidence, citation coverage, and unsafe memory writes. |
| Evaluator or reflection pattern | Test false approvals, overcorrection, rubric ambiguity, and disagreement handling. |
| Multi-agent pattern | Test context isolation, permission isolation, merge accuracy, worker failure, and final accountability. |
| Human approval pattern | Test escalation criteria, approver visibility, timeout behavior, and audit records. |
| Production runtime pattern | Test replay, rollback, canary gates, incident-to-eval conversion, and operator diagnosis. |
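As one illustration of the tool-use row, the sketch below pairs a small software-side gate with the negative tests that exercise it. Everything here is assumed for the example: the allowlist contents, the `check_tool_call` function, and the denial reason strings are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical allowlist: tool name -> required argument names.
ALLOWED_TOOLS = {"orders.get_order": {"order_id"}}


@dataclass
class Decision:
    allowed: bool
    reason: str


def check_tool_call(name: str, args: dict) -> Decision:
    """Validate a model-proposed tool call before software executes it."""
    if name not in ALLOWED_TOOLS:
        return Decision(False, "tool_not_allowed")
    missing = ALLOWED_TOOLS[name] - set(args)
    if missing:
        return Decision(False, "invalid_arguments")
    return Decision(True, "ok")


def test_forbidden_tool_is_denied():
    decision = check_tool_call("refunds.issue_refund", {"order_id": "o-1"})
    assert not decision.allowed
    assert decision.reason == "tool_not_allowed"


def test_invalid_arguments_are_denied():
    decision = check_tool_call("orders.get_order", {})
    assert not decision.allowed
    assert decision.reason == "invalid_arguments"


def test_valid_call_is_allowed():
    assert check_tool_call("orders.get_order", {"order_id": "o-1"}).allowed
```

The tests target the gate, not the model. That keeps them deterministic and lets them run against mocked tools in a regression suite.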
A Small Review Contract
For lightweight design reviews, keep the contract short:
pattern: tool_using_agent
owned_goal: "Investigate refund eligibility from approved business systems."
model_decides:
  - "which allowed read tool to call next"
  - "whether evidence is sufficient for a recommendation"
software_decides:
  - "which tools exist"
  - "whether the caller is authorized"
  - "whether a side effect requires approval"
  - "when the run stops"
controls:
  max_steps: 6
  max_tool_calls: 8
  timeout_ms: 45000
  forbidden_tools:
    - refunds.issue_refund
    - support.send_customer_email
evals:
  blocking:
    - "does not issue refunds directly"
    - "returns needs_human when evidence is missing"
    - "cites policy before recommending refund"
operations:
  trace_fields:
    - task_id
    - trace_id
    - tool_calls
    - policy_denials
    - stop_reason
review_score:
  goal: 3
  boundary: 3
  autonomy_split: 2
  tools: 3
  state: 2
  context: 2
  security: 3
  evaluation: 2
  observability: 3
  operations: 2
release_decision: "internal pilot only until trajectory evals and replay are stronger"
The contract is intentionally plain. It should be easy to review in a pull request, easy to turn into tests, and easy to compare against a production trace.
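To show what comparing the contract against a production trace might look like, here is a minimal sketch. It assumes the contract has been loaded into a dictionary and that a trace is a dictionary whose `tool_calls` entries each carry a `tool` name. Both shapes, and the `check_trace` function, are assumptions made for this example.

```python
# Subset of the review contract, written as a plain dictionary.
CONTRACT = {
    "controls": {
        "max_tool_calls": 8,
        "forbidden_tools": ["refunds.issue_refund", "support.send_customer_email"],
    },
    "operations": {
        "trace_fields": ["task_id", "trace_id", "tool_calls", "policy_denials", "stop_reason"],
    },
}


def check_trace(contract: dict, trace: dict) -> list[str]:
    """Return the contract violations found in one recorded run."""
    violations = []
    controls = contract["controls"]
    called = [call["tool"] for call in trace.get("tool_calls", [])]
    for tool in called:
        if tool in controls["forbidden_tools"]:
            violations.append(f"forbidden tool called: {tool}")
    if len(called) > controls["max_tool_calls"]:
        violations.append("max_tool_calls exceeded")
    # Every field the contract promises must be present in the trace.
    for field_name in contract["operations"]["trace_fields"]:
        if field_name not in trace:
            violations.append(f"missing trace field: {field_name}")
    return violations
```

An empty list means the run stayed inside the contract. Any other result is a candidate regression fixture.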
Release Decision Rules
Use these rules after scoring:
- Any 0: block release.
- Any 1 on security, tools, approvals, or side effects: block production release.
- Average below 2: keep it as a prototype.
- Average 2-2.5: allow internal or low-risk pilot only.
- Average above 2.5 with no high-risk gaps: allow staged production rollout.
- Any incident involving the pattern: add a regression eval before expanding rollout.
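The scoring rules can be expressed as a small function so that a good average cannot hide a blocking gap. This is a sketch under assumptions: the decision labels are invented for the example, and `HIGH_RISK_AREAS` maps the high-risk rule onto the `security` and `tools` score areas only, because approvals and side effects are not separate entries in the score list.

```python
# Assumption: approvals and side effects are reviewed under these two score areas.
HIGH_RISK_AREAS = {"security", "tools"}


def release_decision(scores: dict[str, int]) -> str:
    """Map area scores (0 to 3) to a release decision, checking blockers first."""
    if any(score == 0 for score in scores.values()):
        return "block"
    if any(scores.get(area) == 1 for area in HIGH_RISK_AREAS):
        return "block_production"
    average = sum(scores.values()) / len(scores)
    if average < 2:
        return "prototype"
    if average <= 2.5:
        return "internal_pilot"
    return "staged_rollout"
```

The blocking checks run before the average is computed, which is the point of the rules. A pattern that averages 2.5, for example, lands in the internal pilot band.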
If the team disagrees about a score, record the disagreement. A disagreement usually means the ownership boundary, evidence, or risk tolerance is still unclear.
Common Failure Smells
Watch for these smells during pattern selection:
- The pattern has no single owner.
- The model owns permission checks.
- The loop stops only when the model says it is done.
- The tool list is broader than the task.
- Memory writes happen as a side effect of conversation.
- The eval checks only the final answer, not the trajectory.
- The trace cannot show why a tool was called.
- Multi-agent routing is used to hide unclear responsibilities.
- The fallback path is “ask the model again.”
- Rollback requires manual reconstruction of prompts, tools, or policies.
These are not style problems. They are architecture problems.
Design Rule
Choose the simplest pattern whose risks you can bound and whose behavior you can evaluate. If you cannot test the boundary, the boundary is not real yet.