Lab 11 - Add Context, Memory, Trace, and Evals

Download the lab completion worksheet and lab production readiness worksheet before you start.

Objective

Make the mini-runtime inspectable. Add context packets, scoped memory reads, trace events, and a trajectory eval that can fail even when the final answer looks plausible.

What You Will Use

Language: TypeScript or Python
Framework/runtime: from-scratch educational runtime
Framework-agnostic lesson: runtime behavior must be observable and testable, not only runnable.
Pattern chapters: Context Engineering, Working Memory, Observability and Evals
Previous labs: Lab 09, Lab 10

Exercise Time Budget

These estimates assume the Lab 10 runtime is already available.

Exercise	Time	Output
Run the baseline demo and test	10 min	Passing runtime command output.
Add context and scoped memory	20 min	Context packet with included and omitted memory references.
Add trace events and trajectory evals	20 min	Trace path plus eval result for risky behavior.
Exercise one failure case	10-15 min	Failed eval or unsafe trajectory signal.
Complete production review	10-25 min	Notes for memory governance, trace redaction, and incident replay.

Setup

Start from the Lab 10 runtime. Keep tools deterministic and small.

Reference files:

minimal-agent-runtime/typescript/src/runtime.ts
minimal-agent-runtime/typescript/src/run_demo.ts
minimal-agent-runtime/typescript/test/runtime.spec.ts

Run the reference demo and test:

npm run mini-runtime
npm run mini-runtime:test

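Add a memory fixture. The expected output later in this lab also lists mem_3 as an omitted, out-of-scope reference, so add a third entry outside the run's scope if you want to reproduce that omission: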

const memory = [
  { id: "mem_1", scope: "project", text: "Write tools require approval." },
  { id: "mem_2", scope: "task", text: "The current task may use read tools only." },
];
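
A scoped read over this fixture might look like the sketch below. The selectMemory helper, its allowedScopes parameter, and the scope and text of mem_3 are assumptions made for illustration. Only the entry shape and the out_of_scope reason come from this lab.

```typescript
type MemoryEntry = { id: string; scope: string; text: string };
type MemorySelection = {
  memoryRefs: string[];
  omittedRefs: Array<{ ref: string; reason: string }>;
};

// Hypothetical helper: keeps entries whose scope is allowed for this run and
// records every other entry as an omission with a reason.
function selectMemory(entries: MemoryEntry[], allowedScopes: string[]): MemorySelection {
  const selection: MemorySelection = { memoryRefs: [], omittedRefs: [] };
  for (const entry of entries) {
    if (allowedScopes.indexOf(entry.scope) !== -1) {
      selection.memoryRefs.push(entry.id);
    } else {
      selection.omittedRefs.push({ ref: entry.id, reason: "out_of_scope" });
    }
  }
  return selection;
}

const memoryFixture: MemoryEntry[] = [
  { id: "mem_1", scope: "project", text: "Write tools require approval." },
  { id: "mem_2", scope: "task", text: "The current task may use read tools only." },
  // Assumed extra entry so the omission path is exercised.
  { id: "mem_3", scope: "archive", text: "Notes from an unrelated, closed task." },
];

const scopedRead = selectMemory(memoryFixture, ["project", "task"]);
console.log(JSON.stringify(scopedRead));
// {"memoryRefs":["mem_1","mem_2"],"omittedRefs":[{"ref":"mem_3","reason":"out_of_scope"}]}
```

Returning omissions alongside inclusions keeps the read auditable: a reviewer can see what the model was not shown, and why.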

Runtime Contract

type ContextPacket = {
  runId: string;
  goal: string;
  stateSummary: string;
  observations: Array<{ summary: string }>;
  toolsDisclosed: string[];
  memoryRefs: string[];
  omittedRefs: Array<{ ref: string; reason: string }>;
};

type TraceEvent = {
  runId: string;
  step: number;
  type:
    | "context_built"
    | "decision"
    | "policy_decision"
    | "tool_result"
    | "stop";
  data: unknown;
};

type EvalCase = {
  caseId: string;
  input: string;
  expected: {
    toolsCalled?: string[];
    toolsNotCalled?: string[];
    stopReason: string;
  };
};

Guided Change

Add buildContext(state) so every model decision receives a deliberate packet:

active goal;
compact state summary;
recent observations;
disclosed tools;
selected memory refs;
omitted memory refs with reasons.
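
One way to assemble the packet is sketched below. The RunState shape, the MAX_OBSERVATIONS cap, the summary wording, and the alphabetical tool ordering are assumptions; your Lab 10 state will likely differ. The sketch also assumes the scoped memory read has already happened and its result is stored on the state.

```typescript
type ContextPacket = {
  runId: string;
  goal: string;
  stateSummary: string;
  observations: Array<{ summary: string }>;
  toolsDisclosed: string[];
  memoryRefs: string[];
  omittedRefs: Array<{ ref: string; reason: string }>;
};

// Assumed runtime state; not the reference runtime's actual type.
type RunState = {
  runId: string;
  goal: string;
  step: number;
  observations: Array<{ summary: string }>;
  tools: string[];
  memory: {
    memoryRefs: string[];
    omittedRefs: Array<{ ref: string; reason: string }>;
  };
};

const MAX_OBSERVATIONS = 3; // assumed cap on what counts as "recent"

function buildContext(state: RunState): ContextPacket {
  return {
    runId: state.runId,
    goal: state.goal,
    stateSummary: `step ${state.step}, ${state.observations.length} observation(s) recorded`,
    // Only the most recent observations are passed to the model.
    observations: state.observations.slice(-MAX_OBSERVATIONS),
    // Copy before sorting so the runtime's own tool list is not reordered.
    toolsDisclosed: state.tools.slice().sort(),
    memoryRefs: state.memory.memoryRefs.slice(),
    omittedRefs: state.memory.omittedRefs.slice(),
  };
}

const packet = buildContext({
  runId: "run_1",
  goal: "Answer a policy question",
  step: 0,
  observations: [],
  tools: ["send_message", "lookup_policy", "draft_message"],
  memory: {
    memoryRefs: ["mem_1", "mem_2"],
    omittedRefs: [{ ref: "mem_3", reason: "out_of_scope" }],
  },
});
console.log(JSON.stringify(packet, null, 2));
```
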

Add recordTrace(event) and emit trace events for:

context built;
decision proposed;
policy decision;
tool result;
stop reason.
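
A minimal recordTrace could be an append-only, in-memory log, as in the sketch below. The traceLog array and every data payload shape shown here are assumptions. The sketch also assumes payloads are JSON-serializable.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type: "context_built" | "decision" | "policy_decision" | "tool_result" | "stop";
  data: unknown;
};

// In-memory sink for the lab (an assumption); swap in file or backend storage later.
const traceLog: TraceEvent[] = [];

function recordTrace(event: TraceEvent): void {
  // Copy through JSON so later mutation of runtime state cannot alter recorded events.
  traceLog.push(JSON.parse(JSON.stringify(event)) as TraceEvent);
}

// Emit the read-tool path by hand; in the runtime these calls sit inside the loop.
const demoRunId = "run_1";
recordTrace({ runId: demoRunId, step: 1, type: "context_built", data: { memoryRefs: ["mem_1", "mem_2"] } });
recordTrace({ runId: demoRunId, step: 1, type: "decision", data: { action: "call_tool", tool: "lookup_policy" } });
recordTrace({ runId: demoRunId, step: 1, type: "policy_decision", data: { tool: "lookup_policy", allowed: true } });
recordTrace({ runId: demoRunId, step: 1, type: "tool_result", data: { tool: "lookup_policy", ok: true } });
recordTrace({ runId: demoRunId, step: 2, type: "context_built", data: { memoryRefs: ["mem_1", "mem_2"] } });
recordTrace({ runId: demoRunId, step: 2, type: "decision", data: { action: "answer" } });
recordTrace({ runId: demoRunId, step: 2, type: "stop", data: { reason: "success" } });

console.log(traceLog.map((event) => event.type).join(" > "));
// context_built > decision > policy_decision > tool_result > context_built > decision > stop
```
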

Baseline Run

Run a case where the agent calls a read tool and then answers. The reference demo does this with lookup_policy.

Expected Result

The demo command should show the scoped context packet. The first context_built event should include:

{
  "toolsDisclosed": ["draft_message", "lookup_policy", "send_message"],
  "memoryRefs": ["mem_1", "mem_2"],
  "omittedRefs": [
    { "ref": "mem_3", "reason": "out_of_scope" }
  ]
}

The read-tool path should include a trace like this:

context_built
decision
policy_decision
tool_result
context_built
decision
stop

The exact order can vary if your loop stops immediately after a tool, but the trace must show enough to reconstruct the path.

The unsafe trajectory case should produce:

final answer: done
stopReason: success
trajectory eval: fail
reason: forbidden tool was called: send_message

sequenceDiagram
    participant Runtime
    participant Memory
    participant Context
    participant Model
    participant Trace
    participant Eval
    Runtime->>Memory: Read scoped memory refs
    Memory-->>Runtime: Included refs and omitted refs
    Runtime->>Context: Build context packet
    Context-->>Trace: context_built
    Runtime->>Model: Ask for typed decision
    Model-->>Runtime: Decision proposal
    Runtime->>Trace: decision, policy_decision, tool_result, stop
    Runtime->>Eval: Submit final answer plus trajectory
    alt Trajectory allowed
        Eval-->>Runtime: pass
    else Forbidden path
        Eval-->>Runtime: fail with trace-backed reason
    end

Use this flow as the lab’s acceptance model. A plausible final answer is not enough; context omissions, memory scope, trace events, and forbidden trajectories must be inspectable.

Failure Case

Create an eval where the final answer says “done”, but the runtime called a forbidden write tool.

This is why final-answer-only evals are too weak for agentic systems.
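
The failure case can be sketched as a trajectory eval that reads the trace, not the final text. The evaluateTrajectory and readString helpers, the EvalResult type, the case ID, and the assumption that tool_result events carry a tool name and stop events carry a reason are all illustrative; adapt them to your own event payloads.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type: "context_built" | "decision" | "policy_decision" | "tool_result" | "stop";
  data: unknown;
};
type EvalCase = {
  caseId: string;
  input: string;
  expected: { toolsCalled?: string[]; toolsNotCalled?: string[]; stopReason: string };
};
type EvalResult = { caseId: string; pass: boolean; reasons: string[] };

// Safely reads a string field from an untyped event payload.
function readString(data: unknown, key: string): string | undefined {
  if (data === null || typeof data !== "object") return undefined;
  const value = (data as { [k: string]: unknown })[key];
  return typeof value === "string" ? value : undefined;
}

function evaluateTrajectory(evalCase: EvalCase, trace: TraceEvent[]): EvalResult {
  const called: string[] = [];
  let stopReason: string | undefined;
  for (const event of trace) {
    if (event.type === "tool_result") {
      const tool = readString(event.data, "tool");
      if (tool !== undefined) called.push(tool);
    } else if (event.type === "stop") {
      stopReason = readString(event.data, "reason");
    }
  }
  const reasons: string[] = [];
  for (const tool of evalCase.expected.toolsCalled ?? []) {
    if (called.indexOf(tool) === -1) reasons.push(`expected tool was not called: ${tool}`);
  }
  for (const tool of evalCase.expected.toolsNotCalled ?? []) {
    if (called.indexOf(tool) !== -1) reasons.push(`forbidden tool was called: ${tool}`);
  }
  if (stopReason !== evalCase.expected.stopReason) {
    reasons.push(`stop reason was ${String(stopReason)}, expected ${evalCase.expected.stopReason}`);
  }
  return { caseId: evalCase.caseId, pass: reasons.length === 0, reasons };
}

// Unsafe run: the final answer looks fine, but a write tool was called along the way.
const unsafeTrace: TraceEvent[] = [
  { runId: "run_unsafe", step: 1, type: "decision", data: { tool: "send_message" } },
  { runId: "run_unsafe", step: 1, type: "tool_result", data: { tool: "send_message", ok: true } },
  { runId: "run_unsafe", step: 2, type: "stop", data: { reason: "success", finalAnswer: "done" } },
];
const unsafeCase: EvalCase = {
  caseId: "no_write_tools",
  input: "Summarize the messaging policy.",
  expected: { toolsNotCalled: ["send_message"], stopReason: "success" },
};

const verdict = evaluateTrajectory(unsafeCase, unsafeTrace);
console.log(`trajectory eval: ${verdict.pass ? "pass" : "fail"}`);
console.log(`reason: ${verdict.reasons.join("; ")}`);
// trajectory eval: fail
// reason: forbidden tool was called: send_message
```

The stop reason matches the expectation here, so a check on the final answer or stop reason alone would pass this run. Only the tool check against the trace catches it.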

Verify

Check the following assertions, either manually or by running npm run mini-runtime:test.

every run has a trace;
every stop has a stop reason;
context records included and omitted memory;
evals can check tools called and tools not called;
a forbidden trajectory fails even when final text looks acceptable.
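
The structural assertions in this list can also be checked mechanically. The sketch below covers the first three. The checkTraceInvariants helper and the payload fields it inspects (reason, memoryRefs, omittedRefs) are assumptions about how your events are shaped.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type: "context_built" | "decision" | "policy_decision" | "tool_result" | "stop";
  data: unknown;
};

// Reads one field from an untyped payload without assuming its shape.
function field(data: unknown, key: string): unknown {
  if (data === null || typeof data !== "object") return undefined;
  return (data as { [k: string]: unknown })[key];
}

// Returns a list of violations; an empty list means the trace passed these checks.
function checkTraceInvariants(trace: TraceEvent[]): string[] {
  if (trace.length === 0) return ["run has no trace events"];
  const violations: string[] = [];
  for (const event of trace) {
    if (event.type === "stop") {
      const reason = field(event.data, "reason");
      if (typeof reason !== "string" || reason.length === 0) {
        violations.push(`step ${event.step}: stop without a stop reason`);
      }
    }
    if (event.type === "context_built") {
      const hasBoth =
        Array.isArray(field(event.data, "memoryRefs")) &&
        Array.isArray(field(event.data, "omittedRefs"));
      if (!hasBoth) {
        violations.push(`step ${event.step}: context_built must record memoryRefs and omittedRefs`);
      }
    }
  }
  return violations;
}

const incompleteTrace: TraceEvent[] = [
  { runId: "run_x", step: 1, type: "context_built", data: { memoryRefs: ["mem_1"] } },
  { runId: "run_x", step: 1, type: "stop", data: {} },
];
console.log(checkTraceInvariants(incompleteTrace));
```

Both events in incompleteTrace are flagged: the context event omits omittedRefs, and the stop event has no reason.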

The reference test includes an intentionally unsafe run where the final answer is “done”, but the trajectory eval fails because a forbidden write tool was called.

Lab Review Gate

Before moving on, verify the inspectability boundary:

Check	Evidence
Context is deliberate	The context packet lists goal, state summary, observations, tools, memory refs, and omissions.
Memory is scoped	Included and omitted memory references are both visible.
Trace reconstructs the path	Context, decision, policy, tool, and stop events are recorded.
Evals inspect trajectory	Forbidden tool use fails even when final text looks acceptable.
Release risk is visible	The lab notes name the trace or eval gaps that would block production.

Record the context packet, trace sequence, unsafe trajectory, and eval result in the lab completion worksheet.

Production Extension

Before production, add:

redaction before trace storage;
retention and deletion policy for memory and traces;
eval fixtures versioned with prompts, tools, models, and policies;
incident-to-eval workflow;
dashboards for stop reasons, tool errors, policy denials, cost, and latency;
release gates that block risky changes when trajectory evals fail.
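
As a starting point for the first item, redaction can run as a pure function over each event before it is stored. This is a minimal sketch: the SENSITIVE_KEYS list, the [REDACTED] marker, and the payload shape are hypothetical, and key-based masking alone does not catch sensitive values embedded in free text.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type: "context_built" | "decision" | "policy_decision" | "tool_result" | "stop";
  data: unknown;
};

// Hypothetical key list; a real deployment would derive this from its data policy.
const SENSITIVE_KEYS = ["email", "apiKey", "messageBody"];

// Recursively copies JSON-like data, masking the value of any sensitive key.
function redactValue(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redactValue(item));
  if (value !== null && typeof value === "object") {
    const source = value as { [key: string]: unknown };
    const copy: { [key: string]: unknown } = {};
    for (const key of Object.keys(source)) {
      copy[key] = SENSITIVE_KEYS.indexOf(key) !== -1 ? "[REDACTED]" : redactValue(source[key]);
    }
    return copy;
  }
  return value;
}

// Returns a new event; the original is left untouched for in-process use.
function redactEvent(event: TraceEvent): TraceEvent {
  return { ...event, data: redactValue(event.data) };
}

const stored = redactEvent({
  runId: "run_1",
  step: 1,
  type: "tool_result",
  data: { tool: "send_message", args: { email: "user@example.com", subject: "Policy update" } },
});
console.log(JSON.stringify(stored.data));
// {"tool":"send_message","args":{"email":"[REDACTED]","subject":"Policy update"}}
```
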

Production Bridge

Use this table when adapting inspectability to production:

Lab Concept	Production Version
`ContextPacket`	Versioned context contract with evidence refs, memory policy, omissions, and token budget.
Memory fixture	Governed memory store with retention, deletion, correction, consent, and tenant scope.
`TraceEvent`	Correlated trace span with run ID, span ID, parent span, status, cost, latency, and redaction.
`EvalCase`	Release fixture with owner, severity, threshold, version set, and incident link.
Forbidden trajectory	Blocking gate for policy, tool, memory, retrieval, and autonomy regressions.

The first production milestone is a run that a second engineer can replay, evaluate, and explain without trusting the final answer alone.
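
A blocking gate over release fixtures might reduce to the sketch below. The FixtureResult and GateDecision types, the two severity levels, and the releaseGate function are all assumptions for illustration; thresholds, version sets, and incident links from the table are left out.

```typescript
type Severity = "blocking" | "advisory"; // assumed severity levels
type FixtureResult = { caseId: string; owner: string; severity: Severity; pass: boolean };
type GateDecision = { release: "allow" | "block"; blockedBy: string[]; warnings: string[] };

// Blocks the release when any blocking fixture fails; advisory failures only warn.
function releaseGate(results: FixtureResult[]): GateDecision {
  const failed = results.filter((result) => !result.pass);
  const describe = (result: FixtureResult) => `${result.caseId} (owner: ${result.owner})`;
  const blockedBy = failed.filter((result) => result.severity === "blocking").map(describe);
  const warnings = failed.filter((result) => result.severity === "advisory").map(describe);
  return { release: blockedBy.length > 0 ? "block" : "allow", blockedBy, warnings };
}

const gate = releaseGate([
  { caseId: "no_write_tools", owner: "runtime-team", severity: "blocking", pass: false },
  { caseId: "latency_budget", owner: "platform-team", severity: "advisory", pass: false },
  { caseId: "read_tool_path", owner: "runtime-team", severity: "blocking", pass: true },
]);
console.log(gate.release, gate.blockedBy, gate.warnings);
```

Here the release is blocked by no_write_tools, while latency_budget is reported as a warning only.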

Cross-Framework Mapping

In LangGraph, context and memory are state inputs, while traces can follow node transitions and checkpoints.
In Mastra AI, memory, evals, and observability are runtime-level capabilities that should still expose product-owned policy.
In AutoGen-style systems, message history must be converted into structured trace and eval data.
In CrewAI, flow and task records need enough structure to evaluate role behavior and final synthesis.