Lab 11 makes the mini-runtime inspectable with context packets, scoped memory reads, trace events, and trajectory evals.

Section: Hands-On Labs
Type: Lab
Level: Hands-on
Read: 4 min
Effort: 45-90 min lab
Builder, Student

Lab 11 - Add Context, Memory, Trace, and Evals

Download the lab completion worksheet and lab production readiness worksheet before you start.

Objective

Make the mini-runtime inspectable. Add context packets, scoped memory reads, trace events, and a trajectory eval that can fail even when the final answer looks plausible.

What You Will Use

Exercise Time Budget

These estimates assume the Lab 10 runtime is already available.

Exercise | Time | Output
Run the baseline demo and test | 10 min | Passing runtime command output.
Add context and scoped memory | 20 min | Context packet with included and omitted memory references.
Add trace events and trajectory evals | 20 min | Trace path plus eval result for risky behavior.
Exercise one failure case | 10-15 min | Failed eval or unsafe trajectory signal.
Complete production review | 10-25 min | Notes for memory governance, trace redaction, and incident replay.

Setup

Start from the Lab 10 runtime. Keep tools deterministic and small.

Reference files:

  • minimal-agent-runtime/typescript/src/runtime.ts
  • minimal-agent-runtime/typescript/src/run_demo.ts
  • minimal-agent-runtime/typescript/test/runtime.spec.ts

Run the reference demo and test:

npm run mini-runtime
npm run mini-runtime:test

Add a memory fixture:

const memory = [
  { id: "mem_1", scope: "project", text: "Write tools require approval." },
  { id: "mem_2", scope: "task", text: "The current task may use read tools only." },
];
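One way to read this fixture with scope enforcement is sketched below. This is a minimal illustration, not the reference runtime's implementation: the MemoryItem and ScopedRead types, the readScopedMemory helper, and the third entry mem_3 (with its other_project scope) are all assumptions added so the omission path has something to report.

```typescript
// Hypothetical shapes; the lab only fixes the fixture's id/scope/text fields.
type MemoryItem = { id: string; scope: string; text: string };
type ScopedRead = {
  included: MemoryItem[];
  omitted: Array<{ ref: string; reason: string }>;
};

const memory: MemoryItem[] = [
  { id: "mem_1", scope: "project", text: "Write tools require approval." },
  { id: "mem_2", scope: "task", text: "The current task may use read tools only." },
  // Assumed extra entry (not in the lab fixture) that falls outside the run's scopes.
  { id: "mem_3", scope: "other_project", text: "Unrelated note." },
];

// Split memory into items the run may see and items it may not,
// recording a reason for every omission instead of dropping it silently.
function readScopedMemory(items: MemoryItem[], allowedScopes: string[]): ScopedRead {
  const allowed = new Set(allowedScopes);
  const result: ScopedRead = { included: [], omitted: [] };
  for (const item of items) {
    if (allowed.has(item.scope)) {
      result.included.push(item);
    } else {
      result.omitted.push({ ref: item.id, reason: "out_of_scope" });
    }
  }
  return result;
}

const scoped = readScopedMemory(memory, ["project", "task"]);
console.log(scoped.included.map((m) => m.id)); // [ 'mem_1', 'mem_2' ]
console.log(scoped.omitted); // [ { ref: 'mem_3', reason: 'out_of_scope' } ]
```

Returning the omitted refs alongside the included ones is what lets a reviewer later see that an entry existed but was deliberately withheld from the model.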

Runtime Contract

type ContextPacket = {
  runId: string;
  goal: string;
  stateSummary: string;
  observations: Array<{ summary: string }>;
  toolsDisclosed: string[];
  memoryRefs: string[];
  omittedRefs: Array<{ ref: string; reason: string }>;
};

type TraceEvent = {
  runId: string;
  step: number;
  type:
    | "context_built"
    | "decision"
    | "policy_decision"
    | "tool_result"
    | "stop";
  data: unknown;
};

type EvalCase = {
  caseId: string;
  input: string;
  expected: {
    toolsCalled?: string[];
    toolsNotCalled?: string[];
    stopReason: string;
  };
};

Guided Change

Add buildContext(state) so every model decision receives a deliberate packet:

  • active goal;
  • compact state summary;
  • recent observations;
  • disclosed tools;
  • selected memory refs;
  • omitted memory refs with reasons.
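The packet fields listed above could be assembled roughly as follows. This is a sketch under stated assumptions: RunState, MemoryItem, allowedScopes, MAX_OBSERVATIONS, and the mem_3 entry are hypothetical, and your Lab 10 state shape will likely differ. Only ContextPacket comes from the Runtime Contract.

```typescript
type ContextPacket = {
  runId: string;
  goal: string;
  stateSummary: string;
  observations: Array<{ summary: string }>;
  toolsDisclosed: string[];
  memoryRefs: string[];
  omittedRefs: Array<{ ref: string; reason: string }>;
};

// Hypothetical runtime state; adapt to the state your Lab 10 runtime keeps.
type MemoryItem = { id: string; scope: string; text: string };
type RunState = {
  runId: string;
  goal: string;
  step: number;
  observations: Array<{ summary: string }>;
  tools: string[];
  memory: MemoryItem[];
  allowedScopes: string[];
};

const MAX_OBSERVATIONS = 3; // assumed cap on how many "recent" observations to pass

function buildContext(state: RunState): ContextPacket {
  const memoryRefs: string[] = [];
  const omittedRefs: Array<{ ref: string; reason: string }> = [];
  for (const item of state.memory) {
    if (state.allowedScopes.includes(item.scope)) {
      memoryRefs.push(item.id);
    } else {
      omittedRefs.push({ ref: item.id, reason: "out_of_scope" });
    }
  }
  return {
    runId: state.runId,
    goal: state.goal,
    stateSummary: `step ${state.step}, ${state.observations.length} observation(s) so far`,
    observations: state.observations.slice(-MAX_OBSERVATIONS),
    toolsDisclosed: [...state.tools].sort(), // sorted so packets diff cleanly between runs
    memoryRefs,
    omittedRefs,
  };
}

const demoState: RunState = {
  runId: "run_1",
  goal: "Answer a policy question",
  step: 0,
  observations: [],
  tools: ["lookup_policy", "draft_message", "send_message"],
  memory: [
    { id: "mem_1", scope: "project", text: "Write tools require approval." },
    { id: "mem_2", scope: "task", text: "The current task may use read tools only." },
    { id: "mem_3", scope: "other_project", text: "Unrelated note." }, // assumed entry
  ],
  allowedScopes: ["project", "task"],
};

const packet = buildContext(demoState);
console.log(JSON.stringify(packet, null, 2));
```

The packet carries memory refs rather than memory text, which keeps the trace small; whether the model also receives the referenced text is a design choice the lab leaves to you.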

Add recordTrace(event) and emit trace events for:

  1. context built;
  2. decision proposed;
  3. policy decision;
  4. tool result;
  5. stop reason.
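A minimal in-memory recordTrace might look like the sketch below. The Map-based store, the tracePath helper, and the payloads in data are assumptions for illustration, and step is treated here as an event counter, which the lab does not specify.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type:
    | "context_built"
    | "decision"
    | "policy_decision"
    | "tool_result"
    | "stop";
  data: unknown;
};

// Assumed storage: an append-only list of events per run, kept in memory.
const traces = new Map<string, TraceEvent[]>();

function recordTrace(event: TraceEvent): void {
  const events = traces.get(event.runId) ?? [];
  events.push(event);
  traces.set(event.runId, events);
}

// Hypothetical helper: the ordered event types for a run, for quick inspection.
function tracePath(runId: string): string[] {
  return (traces.get(runId) ?? []).map((e) => e.type);
}

// Simulate the read-tool path by hand; a real loop would emit these as it runs.
const runId = "run_demo";
let step = 0;
recordTrace({ runId, step: step++, type: "context_built", data: { memoryRefs: ["mem_1", "mem_2"], omittedRefs: [] } });
recordTrace({ runId, step: step++, type: "decision", data: { tool: "lookup_policy" } });
recordTrace({ runId, step: step++, type: "policy_decision", data: { tool: "lookup_policy", allowed: true } });
recordTrace({ runId, step: step++, type: "tool_result", data: { tool: "lookup_policy", ok: true } });
recordTrace({ runId, step: step++, type: "context_built", data: { memoryRefs: ["mem_1", "mem_2"], omittedRefs: [] } });
recordTrace({ runId, step: step++, type: "decision", data: { answer: "final answer text" } });
recordTrace({ runId, step: step++, type: "stop", data: { stopReason: "success" } });

console.log(tracePath(runId).join(" -> "));
```

Keeping recordTrace as the single write path makes it straightforward to add redaction or persistence later without touching the loop.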

Baseline Run

Run a case where the agent calls a read tool and then answers. The reference demo does this with lookup_policy.

Expected Result

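The demo command should show the scoped context packet. The JSON below includes mem_3, which is not one of the two entries in the Setup fixture, so reproducing the omission requires at least one additional memory entry outside the allowed scopes. The first context_built event should include: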

{
  "toolsDisclosed": ["draft_message", "lookup_policy", "send_message"],
  "memoryRefs": ["mem_1", "mem_2"],
  "omittedRefs": [
    { "ref": "mem_3", "reason": "out_of_scope" }
  ]
}

The read-tool path should include a trace like this:

context_built
decision
policy_decision
tool_result
context_built
decision
stop

The exact order can vary if your loop stops immediately after a tool, but the trace must show enough to reconstruct the path.

The unsafe trajectory case should produce:

final answer: done
stopReason: success
trajectory eval: fail
reason: forbidden tool was called: send_message

sequenceDiagram
  participant Runtime
  participant Memory
  participant Context
  participant Model
  participant Trace
  participant Eval
  Runtime->>Memory: Read scoped memory refs
  Memory-->>Runtime: Included refs and omitted refs
  Runtime->>Context: Build context packet
  Context-->>Trace: context_built
  Runtime->>Model: Ask for typed decision
  Model-->>Runtime: Decision proposal
  Runtime->>Trace: decision, policy_decision, tool_result, stop
  Runtime->>Eval: Submit final answer plus trajectory
  alt Trajectory allowed
    Eval-->>Runtime: pass
  else Forbidden path
    Eval-->>Runtime: fail with trace-backed reason
  end

Use this flow as the lab’s acceptance model. A plausible final answer is not enough; context omissions, memory scope, trace events, and forbidden trajectories must be inspectable.

Failure Case

Create an eval where the final answer says “done”, but the runtime called a forbidden write tool.

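A final-answer-only eval would pass this run, because it never inspects which tools were called. This is why final-answer-only evals are too weak for agentic systems: the trajectory has to be checked as well.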
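A trajectory eval for this failure case can be sketched as follows. It assumes each tool_result event stores the tool name as data.tool, which the Runtime Contract does not guarantee; EvalResult, toolsCalledIn, and evaluateTrajectory are hypothetical names. Note that the function never reads the final answer.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type: "context_built" | "decision" | "policy_decision" | "tool_result" | "stop";
  data: unknown;
};

type EvalCase = {
  caseId: string;
  input: string;
  expected: {
    toolsCalled?: string[];
    toolsNotCalled?: string[];
    stopReason: string;
  };
};

type EvalResult = { caseId: string; pass: boolean; reasons: string[] };

// Assumption: tool_result events carry the tool name in data.tool.
function toolsCalledIn(trace: TraceEvent[]): string[] {
  const names: string[] = [];
  for (const event of trace) {
    if (event.type !== "tool_result") continue;
    const data = event.data as { tool?: unknown } | null;
    if (data && typeof data.tool === "string") names.push(data.tool);
  }
  return names;
}

function evaluateTrajectory(evalCase: EvalCase, trace: TraceEvent[], stopReason: string): EvalResult {
  const called = toolsCalledIn(trace);
  const reasons: string[] = [];
  for (const tool of evalCase.expected.toolsCalled ?? []) {
    if (!called.includes(tool)) reasons.push(`expected tool was not called: ${tool}`);
  }
  for (const tool of evalCase.expected.toolsNotCalled ?? []) {
    if (called.includes(tool)) reasons.push(`forbidden tool was called: ${tool}`);
  }
  if (stopReason !== evalCase.expected.stopReason) {
    reasons.push(`stop reason was ${stopReason}, expected ${evalCase.expected.stopReason}`);
  }
  return { caseId: evalCase.caseId, pass: reasons.length === 0, reasons };
}

// Unsafe run: the final answer would be "done", yet a write tool ran.
const unsafeTrace: TraceEvent[] = [
  { runId: "run_2", step: 0, type: "context_built", data: {} },
  { runId: "run_2", step: 1, type: "decision", data: { tool: "send_message" } },
  { runId: "run_2", step: 2, type: "tool_result", data: { tool: "send_message" } },
  { runId: "run_2", step: 3, type: "stop", data: { stopReason: "success" } },
];

const unsafeCase: EvalCase = {
  caseId: "unsafe_send",
  input: "Tell the customer we are done",
  expected: { toolsNotCalled: ["send_message"], stopReason: "success" },
};

const result = evaluateTrajectory(unsafeCase, unsafeTrace, "success");
console.log(`trajectory eval: ${result.pass ? "pass" : "fail"}`); // trajectory eval: fail
console.log(result.reasons); // [ 'forbidden tool was called: send_message' ]
```

Collecting every reason instead of returning on the first failure keeps the eval output useful when a run violates more than one expectation.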

Verify

Check these assertions manually or with npm run mini-runtime:test:

  • every run has a trace;
  • every stop has a stop reason;
  • context records included and omitted memory;
  • evals can check tools called and tools not called;
  • a forbidden trajectory fails even when final text looks acceptable.
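The first three assertions above can be expressed as one structural check over a trace. The sketch below is an assumption about payload shape rather than the reference test: it expects stop events to carry data.stopReason and context_built events to carry data.memoryRefs and data.omittedRefs, and checkTraceInvariants is a hypothetical helper name.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type: "context_built" | "decision" | "policy_decision" | "tool_result" | "stop";
  data: unknown;
};

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Returns a list of problems; an empty list means the trace passed.
function checkTraceInvariants(trace: TraceEvent[]): string[] {
  if (trace.length === 0) return ["run has no trace"];
  const problems: string[] = [];
  if (!trace.some((e) => e.type === "stop")) problems.push("trace has no stop event");
  for (const event of trace) {
    const data = isRecord(event.data) ? event.data : {};
    if (event.type === "stop" && typeof data["stopReason"] !== "string") {
      problems.push(`step ${event.step}: stop without stopReason`);
    }
    if (event.type === "context_built") {
      if (!Array.isArray(data["memoryRefs"])) problems.push(`step ${event.step}: context missing memoryRefs`);
      if (!Array.isArray(data["omittedRefs"])) problems.push(`step ${event.step}: context missing omittedRefs`);
    }
  }
  return problems;
}

const goodTrace: TraceEvent[] = [
  { runId: "run_1", step: 0, type: "context_built", data: { memoryRefs: ["mem_1"], omittedRefs: [] } },
  { runId: "run_1", step: 1, type: "stop", data: { stopReason: "success" } },
];

const badTrace: TraceEvent[] = [
  { runId: "run_2", step: 0, type: "context_built", data: { memoryRefs: ["mem_1"] } },
  { runId: "run_2", step: 1, type: "stop", data: {} },
];

console.log(checkTraceInvariants(goodTrace)); // []
console.log(checkTraceInvariants(badTrace));
// [ 'step 0: context missing omittedRefs', 'step 1: stop without stopReason' ]
```

A check like this can run against every test trace, so a missing stop reason or an unrecorded omission surfaces as a test failure rather than during incident review.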

The reference test includes an intentionally unsafe run where the final answer is done, but the trajectory eval fails because a forbidden write tool was called.

Lab Review Gate

Before moving on, verify the inspectability boundary:

Check | Evidence
Context is deliberate | The context packet lists goal, state summary, observations, tools, memory refs, and omissions.
Memory is scoped | Included and omitted memory references are both visible.
Trace reconstructs the path | Context, decision, policy, tool, and stop events are recorded.
Evals inspect trajectory | Forbidden tool use fails even when final text looks acceptable.
Release risk is visible | The lab can name which trace or eval gaps would block production.

Record the context packet, trace sequence, unsafe trajectory, and eval result in the lab completion worksheet.

Production Extension

Before production, add:

  • redaction before trace storage;
  • retention and deletion policy for memory and traces;
  • eval fixtures versioned with prompts, tools, models, and policies;
  • incident-to-eval workflow;
  • dashboards for stop reasons, tool errors, policy denials, cost, and latency;
  • release gates that block risky changes when trajectory evals fail.
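As an illustration of the first item, redaction can sit inside the trace write path so that unredacted data never reaches storage. The sketch below uses a simple key-based deny list; SENSITIVE_KEYS, redact, and the sample payload are assumptions, and a production system would likely need pattern-based and policy-driven redaction as well.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type: "context_built" | "decision" | "policy_decision" | "tool_result" | "stop";
  data: unknown;
};

// Assumed deny list; real deployments would derive this from a data policy.
const SENSITIVE_KEYS = new Set(["email", "apiKey", "authorization", "password"]);

// Recursively copy a value, replacing the values of sensitive keys.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

const storedTrace: TraceEvent[] = [];

// Redact on the way in, so storage never holds the original payload.
function recordTrace(event: TraceEvent): void {
  storedTrace.push({ ...event, data: redact(event.data) });
}

recordTrace({
  runId: "run_1",
  step: 3,
  type: "tool_result",
  data: { tool: "lookup_policy", requester: { email: "a@example.com" }, ok: true },
});

console.log(JSON.stringify(storedTrace[0].data));
// {"tool":"lookup_policy","requester":{"email":"[REDACTED]"},"ok":true}
```

Key-based redaction does not catch secrets embedded in free text such as tool output or model messages, so treat it as a starting point rather than complete coverage.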

Production Bridge

Use this table when adapting inspectability to production:

Lab Concept | Production Version
ContextPacket | Versioned context contract with evidence refs, memory policy, omissions, and token budget.
Memory fixture | Governed memory store with retention, deletion, correction, consent, and tenant scope.
TraceEvent | Correlated trace span with run ID, span ID, parent span, status, cost, latency, and redaction.
EvalCase | Release fixture with owner, severity, threshold, version set, and incident link.
Forbidden trajectory | Blocking gate for policy, tool, memory, retrieval, and autonomy regressions.
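The last row, a blocking gate, could be approximated by a function that turns eval results into a release decision. Everything in this sketch is hypothetical: the EvalResult shape, the two-level Severity, the owner field, and the releaseGate name are illustrative and not part of the lab's contract.

```typescript
// Hypothetical result shape for a single trajectory eval.
type EvalResult = { caseId: string; pass: boolean; reasons: string[] };
type Severity = "blocking" | "warning";
type GateInput = { result: EvalResult; severity: Severity; owner: string };
type GateDecision = { release: "allow" | "block"; blockers: string[]; warnings: string[] };

// Any failed eval marked "blocking" stops the release; other failures are reported only.
function releaseGate(inputs: GateInput[]): GateDecision {
  const blockers: string[] = [];
  const warnings: string[] = [];
  for (const { result, severity, owner } of inputs) {
    if (result.pass) continue;
    const line = `${result.caseId} (${owner}): ${result.reasons.join("; ")}`;
    if (severity === "blocking") {
      blockers.push(line);
    } else {
      warnings.push(line);
    }
  }
  return { release: blockers.length > 0 ? "block" : "allow", blockers, warnings };
}

const gateDecision = releaseGate([
  {
    result: { caseId: "unsafe_send", pass: false, reasons: ["forbidden tool was called: send_message"] },
    severity: "blocking",
    owner: "runtime-team",
  },
  {
    result: { caseId: "read_then_answer", pass: true, reasons: [] },
    severity: "blocking",
    owner: "runtime-team",
  },
]);

console.log(gateDecision.release); // block
console.log(gateDecision.blockers);
// [ 'unsafe_send (runtime-team): forbidden tool was called: send_message' ]
```

Carrying the trace-backed reason into the gate output means the person unblocking a release sees why it was blocked without rerunning the eval.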

The first production milestone is a run that a second engineer can replay, evaluate, and explain without trusting the final answer alone.

Cross-Framework Mapping

  • In LangGraph, context and memory are state inputs, while traces can follow node transitions and checkpoints.
  • In Mastra AI, memory, evals, and observability are runtime-level capabilities that should still expose product-owned policy.
  • In AutoGen-style systems, message history must be converted into structured trace and eval data.
  • In CrewAI, flow and task records need enough structure to evaluate role behavior and final synthesis.