Building a Minimal Agent Runtime
This chapter explains the small runtime behind the from-scratch mini-framework labs. The point of building it is not that production teams should avoid mature frameworks. The point is that a tiny runtime makes the real architecture visible.
Frameworks change APIs and vocabulary. The same responsibilities keep coming back: state, decisions, tools, policy, context, traces, evals, and stop conditions. Once you can build those primitives in a small runtime, LangGraph, Mastra AI, AutoGen-style systems, CrewAI, MCP, A2A, and custom harnesses are easier to evaluate.
What You Should Be Able To Do
After this chapter, you should be able to:
- identify the runtime primitives hidden inside agent frameworks;
- explain why the model proposes decisions but the runtime executes them;
- build a tiny loop with state, policy, tools, context, traces, and stop reasons;
- compare framework features by responsibility instead of vocabulary;
- know when the learning runtime should give way to production infrastructure.
Why Build One
Most agent failures are not mysterious model failures. They are runtime failures.
The system did not know who owned state. The tool list was too broad. Policy lived in a prompt. The loop had no stop reason. Context was assembled by dumping everything into the model. A final answer looked acceptable, but the path used a forbidden tool. None of those problems is fixed by switching to a different framework.
A minimal runtime teaches the control boundary:
goal
-> build context
-> ask for a decision
-> validate the decision
-> check policy
-> execute allowed work
-> record observation
-> evaluate stop condition
That is the shape hidden under many framework abstractions.
Use this diagram to keep the runtime responsibilities separate. The model proposes a decision, but the runtime owns validation, policy, execution, observation, and stop conditions.
What This Runtime Is Not
This is not a production framework. It does not try to solve deployment, streaming, distributed execution, persistence, authentication, workflow queues, model adapters, UI integration, tracing backends, or memory stores.
Use it as a learning scaffold. Use mature frameworks when you need production durability, operational integrations, concurrency, checkpoints, retries, hosted observability, and ecosystem support.
Runtime Readiness Record
Before adapting the mini-runtime to a real system, write down which primitives are learning-only and which are production-owned.
runtime_readiness:
  purpose: "learning scaffold"
  state:
    owner: "mini-runtime"
    production_gap: "no durable checkpoints"
  tools:
    owner: "tool registry"
    production_gap: "no tenant-scoped authorization service"
  policy:
    owner: "local policy gate"
    production_gap: "no central audit trail"
  context:
    owner: "context builder"
    production_gap: "no retrieval freshness or redaction pipeline"
  trace:
    owner: "local trace events"
    production_gap: "no hosted observability backend"
  decision: "use for labs only; migrate to mature runtime before real side effects"
The record prevents the common mistake: treating a useful educational runtime as if it already has production durability, security, and operations.
Primitive 1: State
State is the source of truth for the run. The transcript is not enough. A transcript says what was said; state says what the system is trying to do, what happened, what was observed, what remains, and why the run stopped.
type StopReason =
  | "success"
  | "blocked"
  | "approval_required"
  | "budget_exhausted"
  | "invalid_decision"
  | "policy_denied"
  | "tool_failure";

type Observation = {
  id: string;
  kind: "model" | "tool" | "policy" | "human" | "system";
  summary: string;
  data?: unknown;
};

type AgentState = {
  runId: string;
  goal: string;
  steps: number;
  maxSteps: number;
  observations: Observation[];
  stopReason?: StopReason;
};
Good state lets you resume, replay, debug, evaluate, and explain a run. Bad state forces operators to infer behavior from final text.
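As a minimal sketch of how that state might be created and updated, assuming the types above: `createState`, `recordObservation`, and `explainRun` are hypothetical helper names chosen for illustration, not functions from the labs.

```typescript
// Types repeated from the chapter so the sketch is self-contained.
type StopReason =
  | "success"
  | "blocked"
  | "approval_required"
  | "budget_exhausted"
  | "invalid_decision"
  | "policy_denied"
  | "tool_failure";

type Observation = {
  id: string;
  kind: "model" | "tool" | "policy" | "human" | "system";
  summary: string;
  data?: unknown;
};

type AgentState = {
  runId: string;
  goal: string;
  steps: number;
  maxSteps: number;
  observations: Observation[];
  stopReason?: StopReason;
};

// Hypothetical helper: start a run with an explicit step budget.
function createState(runId: string, goal: string, maxSteps: number): AgentState {
  return { runId, goal, steps: 0, maxSteps, observations: [] };
}

// Hypothetical helper: append an observation with a run-scoped id.
function recordObservation(
  state: AgentState,
  kind: Observation["kind"],
  summary: string,
  data?: unknown,
): Observation {
  const observation: Observation = {
    id: `${state.runId}-obs-${state.observations.length + 1}`,
    kind,
    summary,
    data,
  };
  state.observations.push(observation);
  return observation;
}

// Hypothetical helper: explain the run from state alone, without a transcript.
function explainRun(state: AgentState): string {
  const status = state.stopReason ?? "still running";
  return (
    `${state.goal}: ${state.steps}/${state.maxSteps} steps, ` +
    `${state.observations.length} observations, ${status}`
  );
}
```

The design choice worth noticing is that `explainRun` never reads model text. Everything it reports comes from fields the runtime itself wrote.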
Primitive 2: Decision
A model response is a proposal. The runtime turns that proposal into a typed decision before it can affect tools, users, durable state, or external systems.
type Decision =
  | { kind: "answer"; text: string }
  | { kind: "tool"; name: string; input: unknown }
  | { kind: "ask_human"; question: string }
  | { kind: "stop"; reason: StopReason };
This is the most important split in the runtime: the model can suggest, but software validates and executes.
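One way to enforce that split is to parse raw model output into a `Decision` before anything else sees it. The sketch below assumes the model is asked to reply with a JSON object. `parseDecision` and `isStopReason` are hypothetical names, and falling back to a stop with `"invalid_decision"` is one possible choice, not the only one.

```typescript
// Types repeated from the chapter so the sketch is self-contained.
type StopReason =
  | "success"
  | "blocked"
  | "approval_required"
  | "budget_exhausted"
  | "invalid_decision"
  | "policy_denied"
  | "tool_failure";

type Decision =
  | { kind: "answer"; text: string }
  | { kind: "tool"; name: string; input: unknown }
  | { kind: "ask_human"; question: string }
  | { kind: "stop"; reason: StopReason };

const STOP_REASONS: readonly string[] = [
  "success",
  "blocked",
  "approval_required",
  "budget_exhausted",
  "invalid_decision",
  "policy_denied",
  "tool_failure",
];

function isStopReason(value: unknown): value is StopReason {
  return typeof value === "string" && STOP_REASONS.indexOf(value) !== -1;
}

// Hypothetical validator: turn raw model text into a typed Decision.
// Anything malformed becomes an explicit stop with "invalid_decision".
function parseDecision(raw: string): Decision {
  const invalid: Decision = { kind: "stop", reason: "invalid_decision" };
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return invalid;
  }
  if (typeof value !== "object" || value === null) return invalid;
  const v = value as Record<string, unknown>;
  const text = v["text"];
  const name = v["name"];
  const question = v["question"];
  const reason = v["reason"];
  switch (v["kind"]) {
    case "answer":
      return typeof text === "string" ? { kind: "answer", text } : invalid;
    case "tool":
      return typeof name === "string" ? { kind: "tool", name, input: v["input"] } : invalid;
    case "ask_human":
      return typeof question === "string" ? { kind: "ask_human", question } : invalid;
    case "stop":
      return isStopReason(reason) ? { kind: "stop", reason } : invalid;
    default:
      return invalid;
  }
}
```

Note that the tool `input` stays `unknown` here. Checking it against the tool's own expectations is left to the tool or the registry.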
Primitive 3: Loop
The loop owns progress. On each step it builds context, asks for a decision, validates that decision, executes allowed work, records the observation, and checks whether the run should stop.
async function runAgent(
  state: AgentState,
  decide: (context: ContextPacket) => Promise<Decision>,
): Promise<AgentState> {
  // The step budget, not the model, bounds the loop.
  while (state.steps < state.maxSteps) {
    const context = buildContext(state);
    const decision = await decide(context);
    // handleDecision validates, checks policy, and executes allowed work.
    const result = await handleDecision(state, decision);
    state.observations.push(result.observation);
    state.steps += 1;
    if (result.stopReason) {
      state.stopReason = result.stopReason;
      return state;
    }
  }
  // The budget ran out before any step produced a stop reason.
  state.stopReason = "budget_exhausted";
  return state;
}
The loop should never keep running just because the model keeps asking. It runs because the runtime still has budget, policy allows the next step, and stop conditions have not been met.
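To see the budget end a run, the loop can be exercised with stand-ins for the pieces it calls. In this sketch `buildContext` and `handleDecision` are hypothetical stubs, `ContextPacket` is reduced to three fields, `StepResult` is an assumed return shape, and mapping `ask_human` to `"blocked"` is an assumption made for illustration.

```typescript
// Types repeated (and trimmed) from the chapter so the sketch is self-contained.
type StopReason =
  | "success" | "blocked" | "approval_required" | "budget_exhausted"
  | "invalid_decision" | "policy_denied" | "tool_failure";
type Observation = {
  id: string;
  kind: "model" | "tool" | "policy" | "human" | "system";
  summary: string;
};
type AgentState = {
  runId: string;
  goal: string;
  steps: number;
  maxSteps: number;
  observations: Observation[];
  stopReason?: StopReason;
};
type Decision =
  | { kind: "answer"; text: string }
  | { kind: "tool"; name: string; input: unknown }
  | { kind: "ask_human"; question: string }
  | { kind: "stop"; reason: StopReason };
// Assumption: a reduced context shape; the chapter's ContextPacket has more fields.
type ContextPacket = { runId: string; goal: string; observations: string[] };
// Assumption: the shape handleDecision returns to the loop.
type StepResult = { observation: Observation; stopReason?: StopReason };

function buildContext(state: AgentState): ContextPacket {
  return {
    runId: state.runId,
    goal: state.goal,
    observations: state.observations.map((o) => o.summary),
  };
}

// Hypothetical handler standing in for validation, policy, and tool execution.
async function handleDecision(state: AgentState, decision: Decision): Promise<StepResult> {
  const id = `${state.runId}-obs-${state.observations.length + 1}`;
  switch (decision.kind) {
    case "answer":
      return { observation: { id, kind: "model", summary: decision.text }, stopReason: "success" };
    case "tool":
      // A real handler would check policy and call the tool registry here.
      return { observation: { id, kind: "tool", summary: `called ${decision.name}` } };
    case "ask_human":
      return { observation: { id, kind: "human", summary: decision.question }, stopReason: "blocked" };
    case "stop":
      return { observation: { id, kind: "system", summary: "model requested stop" }, stopReason: decision.reason };
  }
}

// The loop from the chapter, with the same control flow.
async function runAgent(
  state: AgentState,
  decide: (context: ContextPacket) => Promise<Decision>,
): Promise<AgentState> {
  while (state.steps < state.maxSteps) {
    const context = buildContext(state);
    const decision = await decide(context);
    const result = await handleDecision(state, decision);
    state.observations.push(result.observation);
    state.steps += 1;
    if (result.stopReason) {
      state.stopReason = result.stopReason;
      return state;
    }
  }
  state.stopReason = "budget_exhausted";
  return state;
}

// A scripted stand-in for the model that requests the same tool forever.
const alwaysTool = async (): Promise<Decision> => ({
  kind: "tool",
  name: "search_policy_docs",
  input: {},
});
```

Running `runAgent` with `alwaysTool` and `maxSteps: 3` performs three tool steps and then stops with `"budget_exhausted"`, even though the scripted model never stops asking.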
Primitive 4: Tool Registry
Tools are capabilities. A registry defines the capabilities the runtime can expose.
type ToolResult =
  | { status: "ok"; data: unknown }
  | { status: "refused"; reason: string }
  | { status: "error"; reason: string };

type ToolDefinition = {
  name: string;
  description: string;
  sideEffect: "read" | "draft" | "write";
  execute(input: unknown): Promise<ToolResult>;
};
Keep tools narrow. Prefer lookup_order_summary over run_sql, draft_refund_request over post_http, and search_policy_docs over unrestricted browser or shell access.
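A registry built on these types could look like the following sketch. `ToolRegistry` and its methods are hypothetical names, and the `lookup_order_summary` implementation returns made-up data purely for illustration.

```typescript
// Types repeated from the chapter so the sketch is self-contained.
type ToolResult =
  | { status: "ok"; data: unknown }
  | { status: "refused"; reason: string }
  | { status: "error"; reason: string };

type ToolDefinition = {
  name: string;
  description: string;
  sideEffect: "read" | "draft" | "write";
  execute(input: unknown): Promise<ToolResult>;
};

// Hypothetical registry: a name-keyed map that rejects duplicates and unknown tools.
class ToolRegistry {
  private readonly tools = new Map<string, ToolDefinition>();

  register(tool: ToolDefinition): void {
    if (this.tools.has(tool.name)) {
      throw new Error(`tool already registered: ${tool.name}`);
    }
    this.tools.set(tool.name, tool);
  }

  // Names only: this is what a context builder might disclose to the model.
  list(): string[] {
    return Array.from(this.tools.keys());
  }

  // Unknown tools are refused; thrown exceptions become "error" results.
  async call(name: string, input: unknown): Promise<ToolResult> {
    const tool = this.tools.get(name);
    if (!tool) return { status: "refused", reason: `unknown tool: ${name}` };
    try {
      return await tool.execute(input);
    } catch (err) {
      return { status: "error", reason: err instanceof Error ? err.message : String(err) };
    }
  }
}

// Hypothetical narrow tool with made-up data; it validates its own input.
const lookupOrderSummary: ToolDefinition = {
  name: "lookup_order_summary",
  description: "Return a short summary for one order id.",
  sideEffect: "read",
  async execute(input: unknown): Promise<ToolResult> {
    const orderId = (input as { orderId?: unknown } | null)?.orderId;
    if (typeof orderId !== "string") {
      return { status: "refused", reason: "orderId must be a string" };
    }
    return { status: "ok", data: { orderId, status: "shipped" } };
  },
};
```

Because `call` always resolves to a `ToolResult`, the loop can record every outcome as an observation instead of handling exceptions from individual tools.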
Primitive 5: Policy Gate
The registry says what exists. Policy decides what is allowed now.
type PolicyDecision =
  | { status: "allow" }
  | { status: "deny"; reason: string }
  | { status: "approval_required"; reason: string };

type PolicyContext = {
  actorId: string;
  route: string;
  approvedActionIds: string[];
  remainingSteps: number;
};
A useful policy gate considers actor, route, tenant, tool, side effect, approval state, data sensitivity, and budget. A prompt that says “do not do dangerous things” is not a policy gate.
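A small gate over the types above might be sketched as follows. It covers only route, side effect, approval state, and budget. `ToolRequest`, `ROUTE_TOOLS`, `checkPolicy`, and the `submit_refund` tool name are all hypothetical, and a real gate would also weigh actor, tenant, and data sensitivity.

```typescript
// Types repeated from the chapter so the sketch is self-contained.
type PolicyDecision =
  | { status: "allow" }
  | { status: "deny"; reason: string }
  | { status: "approval_required"; reason: string };

type PolicyContext = {
  actorId: string;
  route: string;
  approvedActionIds: string[];
  remainingSteps: number;
};

// Hypothetical request shape: which tool, with what side effect, under which action id.
type ToolRequest = {
  actionId: string;
  toolName: string;
  sideEffect: "read" | "draft" | "write";
};

// Hypothetical per-route allowlist; a real system would load this from configuration.
const ROUTE_TOOLS: Record<string, string[]> = {
  support: ["lookup_order_summary", "draft_refund_request", "submit_refund"],
};

function checkPolicy(request: ToolRequest, ctx: PolicyContext): PolicyDecision {
  // Budget first: no tool runs once the step budget is spent.
  if (ctx.remainingSteps <= 0) {
    return { status: "deny", reason: "step budget exhausted" };
  }
  // Route allowlist: tools outside the route are denied outright.
  const allowed = ROUTE_TOOLS[ctx.route] ?? [];
  if (allowed.indexOf(request.toolName) === -1) {
    return { status: "deny", reason: `${request.toolName} is not allowed on route ${ctx.route}` };
  }
  // Writes need a recorded approval for this specific action.
  if (request.sideEffect === "write" && ctx.approvedActionIds.indexOf(request.actionId) === -1) {
    return { status: "approval_required", reason: `write action ${request.actionId} needs approval` };
  }
  return { status: "allow" };
}
```

The gate is plain code with explicit inputs, so the same request and context always produce the same decision, and that decision can be logged and tested.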
Primitive 6: Context Packet
Context is not everything the system knows. It is the working set for one decision.
type ContextPacket = {
  runId: string;
  goal: string;
  stateSummary: string;
  observations: Array<{ id: string; summary: string }>;
  toolsDisclosed: string[];
  evidenceRefs: string[];
  memoryRefs: string[];
  omittedRefs: Array<{ ref: string; reason: string }>;
};
The runtime should be able to explain why each item entered the context and why other available material stayed out.
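One simple way to make omissions explainable is to record them as the packet is built. The sketch below assumes a recency window over observations. `ContextSource` and `buildContextPacket` are hypothetical names, and a real builder would apply richer selection rules than "most recent N".

```typescript
// Type repeated from the chapter so the sketch is self-contained.
type ContextPacket = {
  runId: string;
  goal: string;
  stateSummary: string;
  observations: Array<{ id: string; summary: string }>;
  toolsDisclosed: string[];
  evidenceRefs: string[];
  memoryRefs: string[];
  omittedRefs: Array<{ ref: string; reason: string }>;
};

// Hypothetical input: everything available to the builder for this run.
type ContextSource = {
  runId: string;
  goal: string;
  observations: Array<{ id: string; summary: string }>;
  allowedTools: string[];
  evidenceRefs: string[];
  memoryRefs: string[];
};

// Hypothetical builder: keep only the most recent observations and record
// every dropped one with a reason, so omissions stay explainable.
function buildContextPacket(source: ContextSource, maxObservations: number): ContextPacket {
  const cut = Math.max(0, source.observations.length - maxObservations);
  const dropped = source.observations.slice(0, cut);
  const kept = source.observations.slice(cut);
  return {
    runId: source.runId,
    goal: source.goal,
    stateSummary: `${source.observations.length} observations so far, ${kept.length} included`,
    observations: kept.map((o) => ({ id: o.id, summary: o.summary })),
    toolsDisclosed: source.allowedTools.slice(),
    evidenceRefs: source.evidenceRefs.slice(),
    memoryRefs: source.memoryRefs.slice(),
    omittedRefs: dropped.map((o) => ({ ref: o.id, reason: "older than the observation window" })),
  };
}
```

The packet copies its arrays rather than sharing them, so later changes to the source do not silently alter a context that was already sent to the model.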
Primitive 7: Trace
Traces make behavior reviewable. Without them, debugging collapses into reading final answers and guessing.
type TraceEvent = {
  runId: string;
  step: number;
  type:
    | "context_built"
    | "decision"
    | "policy_decision"
    | "tool_result"
    | "stop";
  data: unknown;
};
Trace events should connect the model decision, policy result, tool call, observation, cost, latency, and stop reason.
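A local recorder for these events could be as small as the sketch below. `TraceRecorder` and its methods are hypothetical names, events are held in memory only, and latency is measured with `Date.now()`, which is coarse but sufficient for illustration.

```typescript
// Type repeated from the chapter so the sketch is self-contained.
type TraceEvent = {
  runId: string;
  step: number;
  type:
    | "context_built"
    | "decision"
    | "policy_decision"
    | "tool_result"
    | "stop";
  data: unknown;
};

// Hypothetical in-memory recorder; a production system would ship events elsewhere.
class TraceRecorder {
  private readonly runId: string;
  private readonly events: TraceEvent[] = [];

  constructor(runId: string) {
    this.runId = runId;
  }

  record(step: number, type: TraceEvent["type"], data: unknown): void {
    this.events.push({ runId: this.runId, step, type, data });
  }

  // Run an async operation and, if it succeeds, record its result with latency.
  async timed<T>(step: number, type: TraceEvent["type"], work: () => Promise<T>): Promise<T> {
    const started = Date.now();
    const result = await work();
    this.record(step, type, { result, latencyMs: Date.now() - started });
    return result;
  }

  // All events for one step, in the order they were recorded.
  forStep(step: number): TraceEvent[] {
    return this.events.filter((e) => e.step === step);
  }

  // One JSON object per line, a common shape for log files.
  toJsonLines(): string {
    return this.events.map((e) => JSON.stringify(e)).join("\n");
  }
}
```

Keying every event by `runId` and `step` is what lets a reviewer line up the decision, the policy result, and the tool result for the same step.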
Primitive 8: Eval Harness
Agent evals should inspect paths, not only answers.
type EvalCase = {
  caseId: string;
  input: string;
  expected: {
    toolsCalled?: string[];
    toolsNotCalled?: string[];
    stopReason: StopReason;
  };
};
Useful evals catch forbidden tools, missing evidence, approval bypasses, invalid decisions, repeated side effects, and budget exhaustion. A plausible final answer is not enough if the trajectory was unsafe.
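A trajectory check over `EvalCase` might be sketched like this. `Trajectory`, `EvalResult`, and `evaluateCase` are hypothetical names, and the sketch assumes the list of called tools has already been extracted from the run's trace.

```typescript
// Types repeated from the chapter so the sketch is self-contained.
type StopReason =
  | "success"
  | "blocked"
  | "approval_required"
  | "budget_exhausted"
  | "invalid_decision"
  | "policy_denied"
  | "tool_failure";

type EvalCase = {
  caseId: string;
  input: string;
  expected: {
    toolsCalled?: string[];
    toolsNotCalled?: string[];
    stopReason: StopReason;
  };
};

// Hypothetical summary of one run's path, for example derived from trace events.
type Trajectory = { toolsCalled: string[]; stopReason: StopReason };

type EvalResult = { caseId: string; passed: boolean; failures: string[] };

// Compare the path a run took against the expectations of one case.
function evaluateCase(evalCase: EvalCase, actual: Trajectory): EvalResult {
  const failures: string[] = [];
  for (const tool of evalCase.expected.toolsCalled ?? []) {
    if (actual.toolsCalled.indexOf(tool) === -1) {
      failures.push(`expected tool was not called: ${tool}`);
    }
  }
  for (const tool of evalCase.expected.toolsNotCalled ?? []) {
    if (actual.toolsCalled.indexOf(tool) !== -1) {
      failures.push(`forbidden tool was called: ${tool}`);
    }
  }
  if (actual.stopReason !== evalCase.expected.stopReason) {
    failures.push(`stop reason was ${actual.stopReason}, expected ${evalCase.expected.stopReason}`);
  }
  return { caseId: evalCase.caseId, passed: failures.length === 0, failures };
}
```

A run that stops with `"success"` but called a tool listed in `toolsNotCalled` fails this check, which is the case a final-answer comparison would miss.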
How This Maps to Frameworks
| Runtime Primitive | LangGraph | Mastra AI | AutoGen-style Systems | CrewAI |
|---|---|---|---|---|
| State | graph state and checkpoints | workflow and memory state | conversation/session state | flow state |
| Decision | node output or router result | agent response or workflow step | agent message | task output |
| Loop | graph traversal | workflow/agent runtime | conversation turn loop | flow execution |
| Tool registry | tools bound to nodes or agents | tools | callable functions/tools | role tools |
| Policy gate | guard node or middleware | workflow/tool policy | manager or wrapper | flow guard or task constraint |
| Context packet | node input state | agent context and memory | message set | task context |
| Trace | callbacks and checkpoints | observability/evals | logs and messages | task and flow logs |
| Eval harness | graph-level tests | eval suites | transcript/trajectory tests | task/flow quality checks |
Frameworks can package these primitives, but they do not remove the need to design them.
The important comparison is responsibility, not API shape:
| Question | If You Build It Yourself | If You Use A Framework |
|---|---|---|
| Who owns state? | Your runtime data model and persistence plan. | The framework may provide state containers or checkpoints, but your application still defines the business state. |
| Who authorizes tools? | Your policy function, approval records, and audit trail. | The framework can expose hooks or middleware, but product policy still belongs outside the prompt. |
| Who assembles context? | Your context builder chooses memory, evidence, tools, and omissions. | The framework can provide memory abstractions, but you still need source, freshness, and privacy rules. |
| Who evaluates behavior? | Your tests inspect decisions, tools, traces, and stop reasons. | The framework can run evals, but you still decide what unsafe or low-quality behavior means. |
| Who handles production failures? | You must add retries, idempotency, durability, alerts, and incident workflow. | Mature runtimes can provide pieces of this, but they must be configured against your risk model. |
What the Labs Do
The mini-framework labs implement the primitives in three passes:
- Lab 09 - Minimal Agent Loop builds state, decisions, loop control, and stop reasons.
- Lab 10 - Tool Registry and Policy Gate adds tools, policy decisions, approval-required outcomes, and refusal paths.
- Lab 11 - Context, Memory, Trace, and Evals adds context packets, scoped memory, trace events, and trajectory evals.
Do the labs if you want implementation intuition. Read this chapter alone if you only need the mental model.
Design Rule
Build the tiny runtime to learn. Ship with mature runtime capabilities when the system must survive real users, real data, real side effects, and real incidents.