Production Runtime Overview
Production agentic systems need more than prompts, tools, and a good model. They need a runtime that owns the execution boundary.
The core rule is the same throughout this book: the model proposes, the runtime decides. The model can propose a plan, a tool call, a memory write, a retrieval query, a reply, or a stop condition. The runtime owns whether that proposal is valid, allowed, affordable, observable, durable, and safe to execute.
This is where many agent projects fail. They treat the model call as the system, then add logging, retries, approvals, evals, and rollback after the first incident. A production runtime inverts that order. It gives the agent a controlled place to operate.
This chapter owns execution control across runs: admission, state, policy, budgets, tool execution, recovery, rollout, and operator visibility. It does not own product UX, domain policy content, model training, or the internal implementation of tools and services.
Read this after Agent Harnesses. The harness explains the working environment around one agent; this chapter explains what has to be true when that work runs as a production system with queues, budgets, retries, state, rollout, rollback, and operators.
For a step-by-step path from lab to release, use Deployment Walkthrough. For framework setup and decision templates, use Real Framework Setup Notes and Templates and Worksheets.
Download the operating worksheet: runtime SLO and incident review worksheet.
Use this diagram as a control-plane map. Each box names a runtime responsibility that should have an owner, a trace record, and a rollback path before production traffic.
What The Runtime Owns
The production runtime is the control plane around model judgment.
| Runtime Concern | What It Owns |
|---|---|
| State | Goal, run status, workflow step, attempts, tool results, approvals, stop reason. |
| Policy | Permissions, risk classification, denial, approval, escalation, audit requirements. |
| Budgets | Tokens, model calls, tool calls, retries, delegations, wall-clock time, cost. |
| Tools | Schemas, permissions, timeouts, idempotency, side-effect records. |
| Memory and retrieval | Source eligibility, freshness, access control, evidence references, memory writes. |
| Observability | Trace IDs, spans, costs, latency, model and tool events, replay metadata. |
| Evaluation | Runtime checks, incident fixtures, release gates, regression datasets. |
| Recovery | Retries, fallbacks, circuit breakers, checkpoints, compensation, rollback. |
The runtime does not remove autonomy. It makes autonomy bounded enough to trust.
Runtime Responsibility Matrix
The runtime should make ownership explicit. If nobody owns one of these rows, the model will eventually own it by accident.
| Responsibility | Runtime decision | Failure when missing |
|---|---|---|
| Run admission | Should this request start, wait, refuse, or route elsewhere? | Unsafe or unsupported work enters the loop. |
| State ownership | Which state is durable, temporary, visible, or deleted? | Lost progress, hidden memory, non-replayable failures. |
| Execution mode | Should the task run synchronously, asynchronously, or as a durable workflow? | Timeouts, zombie runs, blocked users, partial side effects. |
| Proposal validation | Is the model proposal valid for schema, policy, budget, and state? | Tool calls execute because the model sounded confident. |
| Tool execution | Which credentials, timeout, idempotency key, and side-effect record apply? | Duplicate writes, leaked credentials, unclear ownership. |
| Approval state | What waits for human review, who approved it, and what exact action was approved? | Approval becomes broad permission instead of a bounded gate. |
| Cancellation | What stops immediately, what drains, and what must be compensated? | Cancelled runs keep spending or continue side effects. |
| Rollout | Which model, prompt, policy, tool schema, retriever, or harness version is active? | Regressions ship globally with no quick rollback. |
| Operations | Who is paged, what is disabled, and what evidence is preserved? | Incidents turn into manual archaeology. |
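The approval row is the easiest one to get subtly wrong, so here is a minimal sketch of exact-action binding. Everything in it is an assumption made for illustration: the `PendingAction` and `ApprovalRecord` shapes, the function names, and the choice to fingerprint a flat argument object with sorted keys. A real runtime would likely store the record durably and hash a richer canonical form.

```typescript
// Hypothetical types; flat argument values only, to keep the fingerprint simple.
type PendingAction = {
  tool: string;
  args: Record<string, string | number | boolean>;
};

type ApprovalRecord = {
  approvalId: string;
  approvedBy: string;
  actionFingerprint: string; // what exactly was approved
  expiresAtMs: number; // approvals should not live forever
};

// Canonical string with sorted keys, so argument order does not change the result.
function fingerprintAction(action: PendingAction): string {
  const sortedArgs = Object.keys(action.args)
    .sort()
    .map((key) => [key, action.args[key]]);
  return JSON.stringify([action.tool, sortedArgs]);
}

function recordApproval(
  approvalId: string,
  approvedBy: string,
  action: PendingAction,
  expiresAtMs: number,
): ApprovalRecord {
  return { approvalId, approvedBy, actionFingerprint: fingerprintAction(action), expiresAtMs };
}

// An approval covers an action only if it is unexpired and the action is unchanged.
function approvalCovers(record: ApprovalRecord, action: PendingAction, nowMs: number): boolean {
  return nowMs < record.expiresAtMs && record.actionFingerprint === fingerprintAction(action);
}
```

With this shape, an approval for a 50-unit refund does not cover a 500-unit refund on the same order, and a stale approval cannot resume a run after its expiry.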
The Runtime Loop
A production runtime loop is not just observe, decide, act. It is closer to:
- receive a request, event, schedule, webhook, or workflow command;
- authenticate the caller and load task class, risk class, state, policy, budget, and trace ID;
- assemble the working set: goal, constraints, evidence, allowed tools, memory, and stop rules;
- ask the model or deterministic router for a bounded proposal;
- validate the proposal against schema, policy, budget, state, evidence, and approval rules;
- execute one bounded step through a tool, workflow, retrieval service, evaluator, or approval gate;
- checkpoint state, trace events, cost, latency, side effects, and stop reason;
- decide whether to continue, fall back, wait, escalate, refuse, compensate, or complete;
- convert important failures and near misses into eval cases.
This loop is what separates an agentic product from a demo. The model can still be creative and adaptive, but the system knows what happened and why.
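The validate-then-execute part of the loop can be sketched as a single bounded step. This is an illustration under stated assumptions, not a reference implementation: `StepProposal`, `StepPolicy`, `validateProposal`, and `runStep` are hypothetical names, and the three checks stand in for real schema, policy, budget, and approval services.

```typescript
// Hypothetical shapes for illustration; not taken from any specific framework.
type StepProposal = { tool: string; args: Record<string, unknown>; estimatedCost: number };

type StepPolicy = {
  allowedTools: string[];
  approvalRequired: string[]; // tools that must wait for a human
  remainingBudget: number;
};

type StepDecision =
  | { outcome: "execute" }
  | { outcome: "wait_for_approval" }
  | { outcome: "refuse"; reason: string };

// The runtime, not the model, decides whether a proposal may run.
function validateProposal(proposal: StepProposal, policy: StepPolicy): StepDecision {
  if (!policy.allowedTools.includes(proposal.tool)) {
    return { outcome: "refuse", reason: "tool_not_allowed" };
  }
  if (proposal.estimatedCost > policy.remainingBudget) {
    return { outcome: "refuse", reason: "budget_exhausted" };
  }
  if (policy.approvalRequired.includes(proposal.tool)) {
    return { outcome: "wait_for_approval" };
  }
  return { outcome: "execute" };
}

type StepRecord = {
  tool: string;
  decision: StepDecision;
  result?: unknown;
  remainingBudget: number;
};

// Executes at most one step and returns a record suitable for checkpointing.
function runStep(
  proposal: StepProposal,
  policy: StepPolicy,
  execute: (proposal: StepProposal) => unknown,
): StepRecord {
  const decision = validateProposal(proposal, policy);
  if (decision.outcome !== "execute") {
    // Nothing ran, so nothing is spent; the decision itself is still recorded.
    return { tool: proposal.tool, decision, remainingBudget: policy.remainingBudget };
  }
  const result = execute(proposal);
  return {
    tool: proposal.tool,
    decision,
    result,
    remainingBudget: policy.remainingBudget - proposal.estimatedCost,
  };
}
```

The point of the shape is that refusals and approval waits produce a record too. A trace that only contains executed steps cannot explain why the agent stopped.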
Runtime Boundaries
A good runtime creates explicit boundaries:
- Decision boundary: the model can suggest actions, but software validates them.
- Authority boundary: side effects require tool schemas, permissions, budgets, and approval rules.
- State boundary: durable workflow state is separate from model context.
- Context boundary: the model sees a selected working set, not every available document or memory.
- Cost boundary: every loop, tool, retry, model call, and delegation spends from a budget.
- Policy boundary: denial and escalation are runtime outcomes, not prompt preferences.
- Recovery boundary: retries, fallbacks, replay, and rollback are designed before production traffic.
When these boundaries are missing, the model becomes the control plane by accident.
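As one concrete case, the cost boundary can be sketched as a small ledger that every model call, tool call, and retry must charge before it runs. The `RunBudget` class and the four budget kinds are assumptions chosen for illustration; a production ledger would usually be persisted with the run and include wall-clock time and money.

```typescript
// Hypothetical budget kinds; real systems often add wall-clock time and cost.
type BudgetKind = "tokens" | "modelCalls" | "toolCalls" | "retries";
type BudgetLimits = Record<BudgetKind, number>;

class RunBudget {
  private limits: BudgetLimits;
  private spent: BudgetLimits = { tokens: 0, modelCalls: 0, toolCalls: 0, retries: 0 };

  constructor(limits: BudgetLimits) {
    this.limits = limits;
  }

  // Returns false and spends nothing if the charge would exceed the limit.
  trySpend(kind: BudgetKind, amount: number): boolean {
    if (this.spent[kind] + amount > this.limits[kind]) {
      return false;
    }
    this.spent[kind] += amount;
    return true;
  }

  remaining(kind: BudgetKind): number {
    return this.limits[kind] - this.spent[kind];
  }
}
```

A caller treats a `false` return as a runtime outcome (stop, degrade, or escalate), not as an error to retry.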
Execution Modes
Do not run every agent the same way. Match the runtime mode to the work.
| Mode | Use when | Runtime requirements |
|---|---|---|
| Synchronous request | The task is short, read-heavy, and safe to fail fast. | Tight timeout, small budget, no irreversible side effects, complete trace. |
| Async job | The task may take seconds or minutes but does not need complex compensation. | Queue, status record, cancellation, retries, idempotency, progress events. |
| Durable workflow | The task spans approvals, external systems, retries, or long-running state. | Checkpoints, resumability, compensation, replay, versioned workflow state. |
| Event-triggered run | The task starts from webhook, schedule, stream, or system event. | Deduplication, event identity, ordering policy, backpressure, audit trail. |
| Human-gated run | The task can prepare work but needs approval before execution. | Approval record, exact-action binding, pause and resume semantics. |
The wrong mode creates production bugs. A refund workflow should not depend on one HTTP request staying alive. A short classifier should not pay the complexity cost of a durable workflow engine.
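A mode decision like this can live in code at admission time rather than in a design document. The sketch below is only one possible ordering of the checks: the `TaskProfile` fields, the five-second threshold, and the precedence (approval first, then compensation, then event origin) are all assumptions, and a real system may combine modes, for example a durable workflow that is also human-gated.

```typescript
type ModeName = "sync" | "async_job" | "durable_workflow" | "event_triggered" | "human_gated";

// Hypothetical task description supplied by the route or task class.
type TaskProfile = {
  expectedSeconds: number;
  hasSideEffects: boolean;
  needsApproval: boolean;
  needsCompensation: boolean;
  startedByEvent: boolean;
};

// Assumed threshold for what counts as "short"; tune per product.
const SYNC_MAX_SECONDS = 5;

function chooseExecutionMode(task: TaskProfile): ModeName {
  if (task.needsApproval) return "human_gated";
  if (task.needsCompensation) return "durable_workflow";
  if (task.startedByEvent) return "event_triggered";
  if (task.expectedSeconds <= SYNC_MAX_SECONDS && !task.hasSideEffects) return "sync";
  return "async_job";
}
```

Under these assumptions a short, read-only classifier stays synchronous, while a refund that needs compensation never runs inside a single request.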
Queues, Backpressure, And Concurrency
Agents consume scarce resources: model quota, tool capacity, human approval time, database connections, browser workers, and money. The runtime should control admission and concurrency before the loop starts spending.
Useful controls include per-tenant queues, route-level concurrency limits, model-provider rate limits, tool-specific bulkheads, retry budgets, dead-letter queues, and priority classes. Backpressure is not just an infrastructure concern. It is how the system refuses low-value work before it damages high-value work.
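A per-tenant admission check is one of the simpler controls to sketch. The `AdmissionController` below is a hypothetical, in-memory illustration with assumed limits; it only counts runs, and leaves actual queue storage, priority classes, and dead-lettering out of scope.

```typescript
type AdmissionResult = "start" | "wait" | "refuse";

class AdmissionController {
  private running = new Map<string, number>();
  private queued = new Map<string, number>();
  private maxRunning: number;
  private maxQueued: number;

  constructor(maxRunningPerTenant: number, maxQueuedPerTenant: number) {
    this.maxRunning = maxRunningPerTenant;
    this.maxQueued = maxQueuedPerTenant;
  }

  admit(tenantId: string): AdmissionResult {
    const running = this.running.get(tenantId) ?? 0;
    if (running < this.maxRunning) {
      this.running.set(tenantId, running + 1);
      return "start";
    }
    const queued = this.queued.get(tenantId) ?? 0;
    if (queued < this.maxQueued) {
      this.queued.set(tenantId, queued + 1);
      return "wait";
    }
    // Backpressure: refuse before the run spends anything.
    return "refuse";
  }

  // Call when a run finishes. A queued run, if any, takes over the freed slot.
  release(tenantId: string): "promoted_queued_run" | "slot_freed" {
    const queued = this.queued.get(tenantId) ?? 0;
    if (queued > 0) {
      this.queued.set(tenantId, queued - 1);
      return "promoted_queued_run";
    }
    const running = this.running.get(tenantId) ?? 0;
    this.running.set(tenantId, Math.max(0, running - 1));
    return "slot_freed";
  }
}
```

Because limits are tracked per tenant, one tenant hitting `refuse` does not stop another tenant's runs from starting.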
Concurrency also affects correctness. Two runs should not issue the same refund, update the same ticket, rewrite the same memory, or deploy the same service without coordination. Use locks, version checks, idempotency keys, or workflow state transitions where duplicate work would be harmful.
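Of these options, idempotency keys are the easiest to show in a few lines. The sketch below is an assumption-heavy illustration: `IdempotentExecutor` is a hypothetical name, it keeps results in process memory, and it is synchronous. A production version would need a durable store with an atomic insert-if-absent so that two workers cannot both run the action.

```typescript
class IdempotentExecutor<T> {
  private results = new Map<string, T>();

  // Runs `action` at most once per key; later calls replay the stored result.
  // If `action` throws, nothing is stored, so a retry with the same key runs it again.
  run(idempotencyKey: string, action: () => T): { result: T; replayed: boolean } {
    if (this.results.has(idempotencyKey)) {
      return { result: this.results.get(idempotencyKey) as T, replayed: true };
    }
    const result = action();
    this.results.set(idempotencyKey, result);
    return { result, replayed: false };
  }
}
```

The key should identify the intended side effect (for example, the run and the order being refunded), not the attempt, so that a retry maps to the same key.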
Rollout And Rollback
Production agents change in more ways than normal services. A release may change model, prompt, tool schema, retriever, memory policy, approval rule, sandbox profile, evaluator, or workflow code. The runtime should version those pieces and record the active version set on every run.
Rollout should be gradual for high-risk agents:
- start with offline evals;
- run shadow or replay tests where possible;
- enable a small tenant, route, or percentage;
- compare traces, costs, stop reasons, policy denials, and user-visible outcomes;
- keep a rollback path for each changed component.
Rollback must be operational, not theoretical. Operators should be able to disable a tool, pin a model, revert a prompt, tighten a policy, stop a route, drain a queue, or force human approval without redeploying the whole product.
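One way to make that possible is to resolve operator overrides at the start of every run, so a change to stored configuration takes effect without a deploy. The sketch below assumes hypothetical `OperatorOverrides` and `RunConfig` shapes and a trimmed-down version set; where the overrides are stored and how they are audited is left out.

```typescript
type VersionSet = { model: string; prompt: string; policy: string; toolSchema: string };

// Hypothetical operator-controlled switches, loaded fresh for each run.
type OperatorOverrides = {
  disabledTools: string[];
  pinned: Partial<VersionSet>;
  forceApproval: boolean;
  stoppedRoutes: string[];
};

type RunConfig = {
  route: string;
  versions: VersionSet;
  allowedTools: string[];
  requiresApproval: boolean;
};

// Returns the effective config for this run, or null if the route is stopped.
function applyOverrides(base: RunConfig, overrides: OperatorOverrides): RunConfig | null {
  if (overrides.stoppedRoutes.includes(base.route)) {
    return null;
  }
  return {
    route: base.route,
    // Pinned versions replace whatever the release shipped.
    versions: { ...base.versions, ...overrides.pinned },
    allowedTools: base.allowedTools.filter((tool) => !overrides.disabledTools.includes(tool)),
    // Overrides can only tighten approval, never loosen it.
    requiresApproval: base.requiresApproval || overrides.forceApproval,
  };
}
```

Recording the effective config on the run, rather than the shipped one, keeps traces honest about which versions and tools were actually active.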
How The Production Runtime Chapters Compose
Read the production runtime section as one operating model:
- Durable Workflows owns long-running state, retries, checkpoints, approvals, compensation, and resumability.
- Observability and Evals records what happened and turns behavior into something engineers can inspect.
- Production Evaluation Feedback Loops converts production failures into regression cases and release gates.
- Cost Controls and Runtime Budgets defines how much autonomy, spend, time, and human attention a run may consume.
- Policy Enforcement keeps permission, risk, and compliance decisions outside the model.
- Event-Triggered Agents shows how agents respond to events without losing idempotency, state, and auditability.
- Mastra Runtime maps these production concerns into a concrete runtime style.
The chapters are separate because each boundary deserves attention. In a real system, they should work together.
Launch Evidence Map
Before launch, each runtime concern should produce evidence that another engineer can inspect. A green build is useful, but it does not prove the agent can be operated.
| Concern | Evidence Artifact | Release Question |
|---|---|---|
| Admission | Route table with task class, risk class, tenant scope, and refusal path. | Which requests can start, wait, route elsewhere, or refuse? |
| State | Run-state schema, checkpoint example, and deletion rule. | Can the team reconstruct what the agent believed and where it stopped? |
| Policy | Decision matrix, reason codes, and approval rules. | Can software stop an unsafe proposal before it executes? |
| Budget | Per-route budget policy and exhaustion behavior. | What happens when the task is no longer worth more spend? |
| Tools | Tool manifest with schemas, permissions, timeouts, and side-effect class. | Can a tool call be validated without trusting model prose? |
| Memory and retrieval | Source policy, freshness rule, citation rule, and memory retention class. | Can the agent explain what evidence it used and why it was allowed? |
| Observability | Trace contract and dashboard links. | Can an operator move from symptom to trace to owner? |
| Evaluation | Blocking eval list, incident fixtures, and release-gate output. | Which known failures would block this release? |
| Recovery | Runbook with retry, fallback, compensation, and rollback actions. | What can operators disable or restore without a full redeploy? |
The evidence does not need to be elaborate. It needs to be real, current, and linked from the release record.
Minimal Runtime Contract
Every production run should be able to produce a contract like this:
```typescript
type RuntimeRun = {
  // Identity: who started the run and how to find it again.
  runId: string;
  traceId: string;
  requestId: string;
  actorId: string;
  tenantId: string;
  route: string;
  goal: string;

  // Authority and execution shape.
  autonomyLevel: "advisory" | "drafts_for_review" | "executes_after_approval" | "bounded_autonomous";
  riskClass: "low" | "medium" | "high";
  executionMode: "sync" | "async_job" | "durable_workflow" | "event_triggered";
  status: "queued" | "running" | "waiting" | "succeeded" | "failed" | "refused" | "cancelled";

  // Component versions active for this run.
  versionSet: {
    model: string;
    prompt: string;
    policy: string;
    toolSchema: string;
    retriever?: string;
    harness: string;
  };
  budgetPolicyVersion: string;
  policyVersion: string;

  // Progress, side-effect, and termination fields.
  workflowStep?: string;
  allowedTools: string[];
  idempotencyKey?: string;
  approvalId?: string;
  checkpointRef?: string;
  stopReason?: string;
};
```
This is not enough to implement a full platform, but it is enough to make the hidden parts visible. If a run does not have actor, tenant, route, trace ID, risk class, autonomy level, execution mode, version set, budget policy, policy version, allowed tools, status, and stop reason, it will be hard to operate.
For high-risk work, this contract should be stored before the first model call. The run may change state, but the runtime should never be guessing who started it, what authority it has, what version is active, or why it stopped.
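The "why it stopped" part can be enforced with a small guard on status changes. The sketch below reuses the seven status values from the contract; the transition table, the `transitionRun` name, and the rule that every terminal status requires a stop reason are assumptions made for illustration.

```typescript
type RunStatus = "queued" | "running" | "waiting" | "succeeded" | "failed" | "refused" | "cancelled";

// Assumed transition table; terminal statuses have no outgoing transitions.
const allowedTransitions: Record<RunStatus, RunStatus[]> = {
  queued: ["running", "refused", "cancelled"],
  running: ["waiting", "succeeded", "failed", "cancelled"],
  waiting: ["running", "failed", "cancelled"],
  succeeded: [],
  failed: [],
  refused: [],
  cancelled: [],
};

type RunState = { status: RunStatus; stopReason?: string };

// Returns the next state, or throws if the change is illegal or unexplained.
function transitionRun(run: RunState, next: RunStatus, stopReason?: string): RunState {
  if (!allowedTransitions[run.status].includes(next)) {
    throw new Error(`illegal transition: ${run.status} -> ${next}`);
  }
  const isTerminal = allowedTransitions[next].length === 0;
  if (isTerminal && !stopReason) {
    throw new Error(`terminal status ${next} requires a stop reason`);
  }
  return isTerminal ? { status: next, stopReason } : { status: next };
}
```

With a guard like this, a finished run cannot quietly resume, and a run cannot finish without saying why.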
Operating Dashboard View
The runtime dashboard should show control, not just activity.
| Panel | Shows | Operator Action |
|---|---|---|
| Active runs | Route, tenant, risk class, status, workflow step, age, and owner. | Cancel, pause, or escalate stuck work. |
| Budget pressure | Token spend, tool spend, retry spend, and runs near exhaustion. | Lower concurrency, require approval, or stop low-priority work. |
| Policy decisions | Denies, approvals, escalations, false allows, and override rate. | Tighten rules, review exceptions, or add eval fixtures. |
| Tool health | Timeout rate, retry rate, side-effect failures, and idempotency conflicts. | Disable a tool, drain a queue, or switch fallback path. |
| Trace quality | Missing spans, missing stop reasons, redaction failures, and replay gaps. | Block release until evidence is complete. |
| Release versions | Active model, prompt, policy, retriever, workflow, and tool schema versions. | Pin, roll back, or canary a component. |
| Eval status | Blocking failures, flaky cases, incident fixture failures, and gate history. | Stop rollout or assign fixture repair. |
If the dashboard cannot answer “what should an operator do now?”, it is a reporting page, not a runtime control surface.
Runtime SLO And Incident Loop
Service-level objectives (SLOs) make runtime quality explicit. Do not define them only for uptime. Agentic systems also need SLOs for trace coverage, policy-decision coverage, approval latency, eval-gate health, cost, and stop-reason completeness.
Use this loop after launch and during canaries. A good operating review can answer which SLO moved, which route changed, which trace proves the failure, which component can be disabled, and which eval now prevents recurrence.
| Runtime SLO | Example Target | Investigate When | Roll Back Or Pause When |
|---|---|---|---|
| Trace coverage | 99% of high-risk runs have complete run, route, policy, tool, approval, and stop-reason spans. | Any high-risk trace lacks policy or stop reason. | A release changes risky behavior without trace coverage. |
| Policy-decision coverage | 100% of write, send, memory, and external-action paths record a policy decision. | Policy span is missing or detached from tool span. | A side effect executes without policy evidence. |
| Approval latency | 95% of approval waits resolve or expire inside the business SLA. | Approval waits age without owner or expiry. | Stale approvals can resume changed actions. |
| Cost budget | Route stays within agreed per-run or per-completed-task budget. | Cost per completed task spikes after model, prompt, or retrieval change. | Budget exhaustion does not stop or degrade safely. |
| Eval-gate health | Blocking evals pass before prompt, model, policy, tool, memory, or workflow changes. | Warning eval failures cluster around one boundary. | A known incident fixture fails or becomes flaky without owner. |
| Stop-reason completeness | Every terminal run records completed, refused, failed, cancelled, timed out, blocked, or needs approval. | Terminal states contain generic done or missing reason. | Operators cannot distinguish success, refusal, failure, or partial side effect. |
The numbers will vary by product. The important part is that each SLO names an owner and an action. A threshold without an operator decision is only a chart.
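As an illustration of turning one row into a check, the trace-coverage SLO can be computed from trace summaries. The `TraceSummary` shape, the span names, and the `traceCoverage` function are hypothetical; the required-span list mirrors the example target in the table and should be replaced by the team's own trace contract.

```typescript
// Hypothetical summary of one run's trace, as an exporter might produce it.
type TraceSummary = {
  runId: string;
  riskClass: "low" | "medium" | "high";
  spans: string[];
};

// Assumed span names, mirroring the example target for high-risk runs.
const requiredHighRiskSpans = ["run", "route", "policy", "tool", "approval", "stop_reason"];

function traceCoverage(traces: TraceSummary[]): { coverage: number; incompleteRunIds: string[] } {
  const highRisk = traces.filter((trace) => trace.riskClass === "high");
  const incomplete = highRisk.filter(
    (trace) => !requiredHighRiskSpans.every((span) => trace.spans.includes(span)),
  );
  // With no high-risk runs in the window, coverage is reported as fully met.
  const coverage =
    highRisk.length === 0 ? 1 : (highRisk.length - incomplete.length) / highRisk.length;
  return { coverage, incompleteRunIds: incomplete.map((trace) => trace.runId) };
}
```

Returning the incomplete run IDs alongside the ratio gives the operator something to open, which is what separates an SLO from a chart.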
Runtime Checklist
Before a production agent handles real work, answer:
- What owns the active goal?
- Where is durable run state stored?
- Which component validates model proposals?
- Which execution mode fits this task?
- Which tools are allowed for this task class?
- Which actions require approval?
- What budget applies to the run?
- What happens when the budget is exhausted?
- What trace events are required?
- What side effects need idempotency or compensation?
- What breaker, fallback, or escalation path exists?
- What evals block release?
- What queue, concurrency limit, or backpressure policy applies?
- What component versions are recorded for every run?
- What can be rolled back without redeploying the whole system?
If those answers are vague, the system is still a prototype, even if it is already serving users.
Failure Modes
- The model owns state transitions because the runtime has no workflow state.
- Policy lives in the prompt instead of a runtime enforcement layer.
- Tool calls happen before budget, permission, schema, or approval checks.
- Retry logic repeats side effects without idempotency keys.
- Observability records final answers but not proposals, validation decisions, tool calls, and stop reasons.
- Evals test final prose while the runtime path remains untested.
- Operators cannot disable a risky tool, prompt version, model route, or workflow step quickly.
- The system can continue spending tokens, tool calls, and human attention after the task is no longer worth it.
- A synchronous request hides a long-running workflow until the first timeout or duplicate retry.
- Queues grow without backpressure, priority, cancellation, or dead-letter handling.
- Cancelled runs stop the UI but not the queued tool call or external workflow.
- A model or prompt change ships without versioned traces, targeted evals, or a rollback plan.
- Partial failure looks like success because the runtime records the final answer but not the failed side effect.
- Durable state exists, but the model context and workflow state disagree about what step is active.
Design Rule
Production runtime is where agentic architecture becomes honest. If the runtime cannot explain, bound, replay, and stop the agent, the model is not the only risk. The architecture is one too.
Continue with Durable Workflows for resumable execution, Observability and Evals for operational evidence, and Policy Enforcement for authority decisions.