Lab 11 - Add Context, Memory, Trace, and Evals
Download the lab completion worksheet and lab production readiness worksheet before you start.
Objective
Make the mini-runtime inspectable. Add context packets, scoped memory reads, trace events, and a trajectory eval that can fail even when the final answer looks plausible.
What You Will Use
- Language: TypeScript or Python
- Framework/runtime: from-scratch educational runtime
- Framework-agnostic lesson: runtime behavior must be observable and testable, not only runnable.
- Pattern chapters: Context Engineering, Working Memory, Observability and Evals
- Previous labs: Lab 09, Lab 10
Exercise Time Budget
These estimates assume the Lab 10 runtime is already available.
| Exercise | Time | Output |
|---|---|---|
| Run the baseline demo and test | 10 min | Passing runtime command output. |
| Add context and scoped memory | 20 min | Context packet with included and omitted memory references. |
| Add trace events and trajectory evals | 20 min | Trace path plus eval result for risky behavior. |
| Exercise one failure case | 10-15 min | Failed eval or unsafe trajectory signal. |
| Complete production review | 10-25 min | Notes for memory governance, trace redaction, and incident replay. |
Setup
Start from the Lab 10 runtime. Keep tools deterministic and small.
Reference files:
minimal-agent-runtime/typescript/src/runtime.ts
minimal-agent-runtime/typescript/src/run_demo.ts
minimal-agent-runtime/typescript/test/runtime.spec.ts
Run the reference demo and test:
npm run mini-runtime
npm run mini-runtime:test
Add a memory fixture:
const memory = [
  { id: "mem_1", scope: "project", text: "Write tools require approval." },
  { id: "mem_2", scope: "task", text: "The current task may use read tools only." },
];
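A scoped read over this fixture could look like the sketch below. It is a minimal illustration, not the reference runtime's code: the function name readScopedMemory, the allowedScopes parameter, and the extra mem_3 item with a "user" scope are all assumptions, added only so that the omission path has something to report.

```typescript
// Hypothetical shapes; only id, scope, and text come from the lab fixture.
type MemoryItem = { id: string; scope: string; text: string };
type OmittedRef = { ref: string; reason: string };

const memory: MemoryItem[] = [
  { id: "mem_1", scope: "project", text: "Write tools require approval." },
  { id: "mem_2", scope: "task", text: "The current task may use read tools only." },
  // Assumption: an extra out-of-scope item, present only to exercise omission.
  { id: "mem_3", scope: "user", text: "Placeholder note outside the current scopes." },
];

// Split memory into refs to include and refs to omit, recording a reason
// for every omission so the context packet can show what was left out.
function readScopedMemory(
  items: MemoryItem[],
  allowedScopes: string[],
): { memoryRefs: string[]; omittedRefs: OmittedRef[] } {
  const memoryRefs: string[] = [];
  const omittedRefs: OmittedRef[] = [];
  for (const item of items) {
    if (allowedScopes.indexOf(item.scope) !== -1) {
      memoryRefs.push(item.id);
    } else {
      omittedRefs.push({ ref: item.id, reason: "out_of_scope" });
    }
  }
  return { memoryRefs, omittedRefs };
}

const scoped = readScopedMemory(memory, ["project", "task"]);
// scoped.memoryRefs  -> ["mem_1", "mem_2"]
// scoped.omittedRefs -> [{ ref: "mem_3", reason: "out_of_scope" }]
```

Returning the omitted refs alongside the included ones keeps the scoping decision visible instead of silently dropping memory.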
Runtime Contract
type ContextPacket = {
  runId: string;
  goal: string;
  stateSummary: string;
  observations: Array<{ summary: string }>;
  toolsDisclosed: string[];
  memoryRefs: string[];
  omittedRefs: Array<{ ref: string; reason: string }>;
};

type TraceEvent = {
  runId: string;
  step: number;
  type:
    | "context_built"
    | "decision"
    | "policy_decision"
    | "tool_result"
    | "stop";
  data: unknown;
};

type EvalCase = {
  caseId: string;
  input: string;
  expected: {
    toolsCalled?: string[];
    toolsNotCalled?: string[];
    stopReason: string;
  };
};
Guided Change
Add buildContext(state) so every model decision receives a deliberate packet:
- active goal;
- compact state summary;
- recent observations;
- disclosed tools;
- selected memory refs;
- omitted memory refs with reasons.
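The packet fields listed above can be assembled in one place. The sketch below is one possible shape, assuming a hypothetical RuntimeState; the Lab 10 state may be organized differently, and the observation window, the summary wording, and the alphabetical tool ordering are illustrative choices rather than requirements.

```typescript
type ContextPacket = {
  runId: string;
  goal: string;
  stateSummary: string;
  observations: Array<{ summary: string }>;
  toolsDisclosed: string[];
  memoryRefs: string[];
  omittedRefs: Array<{ ref: string; reason: string }>;
};

// Assumption: a hypothetical runtime state, not the Lab 10 definition.
type RuntimeState = {
  runId: string;
  goal: string;
  step: number;
  observations: Array<{ summary: string }>;
  tools: string[];
  memory: Array<{ id: string; scope: string; text: string }>;
  allowedScopes: string[];
};

const MAX_OBSERVATIONS = 3; // assumed window for "recent" observations

function buildContext(state: RuntimeState): ContextPacket {
  const inScope = (scope: string) => state.allowedScopes.indexOf(scope) !== -1;
  return {
    runId: state.runId,
    goal: state.goal,
    stateSummary: `step ${state.step}, ${state.observations.length} observation(s) so far`,
    // Keep only the most recent observations to bound packet size.
    observations: state.observations.slice(-MAX_OBSERVATIONS),
    // Sort a copy so the disclosed tool list is stable across runs.
    toolsDisclosed: state.tools.slice().sort(),
    memoryRefs: state.memory.filter((m) => inScope(m.scope)).map((m) => m.id),
    omittedRefs: state.memory
      .filter((m) => !inScope(m.scope))
      .map((m) => ({ ref: m.id, reason: "out_of_scope" })),
  };
}

const packet = buildContext({
  runId: "run_1",
  goal: "Answer a policy question", // hypothetical goal text
  step: 0,
  observations: [],
  tools: ["send_message", "lookup_policy", "draft_message"],
  memory: [
    { id: "mem_1", scope: "project", text: "Write tools require approval." },
    { id: "mem_2", scope: "task", text: "The current task may use read tools only." },
    { id: "mem_3", scope: "user", text: "Hypothetical out-of-scope item." },
  ],
  allowedScopes: ["project", "task"],
});
```

Because buildContext is a pure function of state, the same state always yields the same packet, which makes the packet easy to assert on in tests.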
Add recordTrace(event) and emit trace events for:
- context built;
- decision proposed;
- policy decision;
- tool result;
- stop reason.
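A minimal recordTrace can be an append-only list. The sketch below assumes an in-memory sink and hypothetical data payloads (the lab contract leaves data as unknown); it emits one event per item in the list above for a read-tool run.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type: "context_built" | "decision" | "policy_decision" | "tool_result" | "stop";
  data: unknown;
};

// Assumption: an in-memory, append-only sink; a real runtime might write to a file or a tracing backend.
const trace: TraceEvent[] = [];

function recordTrace(event: TraceEvent): void {
  // Store a shallow copy so reassigning fields on the caller's object
  // does not change what was recorded.
  trace.push({ ...event });
}

// Hypothetical read-tool run: one tool call, then an answer.
const runId = "run_1";
recordTrace({ runId, step: 0, type: "context_built", data: { toolsDisclosed: ["lookup_policy"] } });
recordTrace({ runId, step: 0, type: "decision", data: { action: "call_tool", tool: "lookup_policy" } });
recordTrace({ runId, step: 0, type: "policy_decision", data: { tool: "lookup_policy", allowed: true } });
recordTrace({ runId, step: 0, type: "tool_result", data: { tool: "lookup_policy", ok: true } });
recordTrace({ runId, step: 1, type: "context_built", data: { toolsDisclosed: ["lookup_policy"] } });
recordTrace({ runId, step: 1, type: "decision", data: { action: "answer" } });
recordTrace({ runId, step: 1, type: "stop", data: { stopReason: "success" } });

// The ordered event types are the path a reviewer would reconstruct.
const tracePath = trace.map((e) => e.type);
```

Emitting the policy decision as its own event, separate from the model's proposed decision, is what later lets an eval tell "the model asked for a tool" apart from "the runtime allowed it".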
Baseline Run
Run a case where the agent calls a read tool and then answers. The reference demo does this with lookup_policy.
Expected Result
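The demo command should show the scoped context packet. The omittedRefs entry below assumes a third, out-of-scope memory item (mem_3) that is not part of the two-item fixture from Setup. The first context_built event should include: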
{
  "toolsDisclosed": ["draft_message", "lookup_policy", "send_message"],
  "memoryRefs": ["mem_1", "mem_2"],
  "omittedRefs": [
    { "ref": "mem_3", "reason": "out_of_scope" }
  ]
}
The read-tool path should include a trace like this:
context_built
decision
policy_decision
tool_result
context_built
decision
stop
The exact order can vary if your loop stops immediately after a tool, but the trace must show enough to reconstruct the path.
The unsafe trajectory case should produce:
final answer: done
stopReason: success
trajectory eval: fail
reason: forbidden tool was called: send_message
Use this flow as the lab’s acceptance model. A plausible final answer is not enough; context omissions, memory scope, trace events, and forbidden trajectories must be inspectable.
Failure Case
Create an eval where the final answer says “done”, but the runtime called a forbidden write tool.
This is why final-answer-only evals are too weak for agentic systems.
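One way to express this failure case is an eval that compares the tools actually called against the EvalCase expectations and ignores the final text entirely. The sketch below assumes a hypothetical RunRecord shape and evalTrajectory helper; the case ID and input text are placeholders.

```typescript
type EvalCase = {
  caseId: string;
  input: string;
  expected: {
    toolsCalled?: string[];
    toolsNotCalled?: string[];
    stopReason: string;
  };
};

// Assumption: a hypothetical summary of a finished run, derived from its trace.
type RunRecord = { finalAnswer: string; stopReason: string; toolsCalled: string[] };
type EvalResult = { caseId: string; pass: boolean; reasons: string[] };

function evalTrajectory(evalCase: EvalCase, run: RunRecord): EvalResult {
  const reasons: string[] = [];
  for (const tool of evalCase.expected.toolsCalled ?? []) {
    if (run.toolsCalled.indexOf(tool) === -1) {
      reasons.push(`required tool was not called: ${tool}`);
    }
  }
  for (const tool of evalCase.expected.toolsNotCalled ?? []) {
    if (run.toolsCalled.indexOf(tool) !== -1) {
      reasons.push(`forbidden tool was called: ${tool}`);
    }
  }
  if (run.stopReason !== evalCase.expected.stopReason) {
    reasons.push(`unexpected stop reason: ${run.stopReason}`);
  }
  // The final answer text is deliberately not consulted here.
  return { caseId: evalCase.caseId, pass: reasons.length === 0, reasons };
}

const unsafeCase: EvalCase = {
  caseId: "unsafe_write", // placeholder ID
  input: "Tell the customer about the policy.", // placeholder input
  expected: {
    toolsCalled: ["lookup_policy"],
    toolsNotCalled: ["send_message"],
    stopReason: "success",
  },
};

const unsafeRun: RunRecord = {
  finalAnswer: "done",
  stopReason: "success",
  toolsCalled: ["lookup_policy", "send_message"],
};

const evalResult = evalTrajectory(unsafeCase, unsafeRun);
// evalResult.pass    -> false
// evalResult.reasons -> ["forbidden tool was called: send_message"]
```

The run ends with a plausible answer and the expected stop reason, yet the eval fails because the check is on the trajectory rather than on the text.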
Verify
Check these assertions manually or with npm run mini-runtime:test:
- every run has a trace;
- every stop has a stop reason;
- context records included and omitted memory;
- evals can check tools called and tools not called;
- a forbidden trajectory fails even when final text looks acceptable.
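The trace-level assertions in this list can be automated with a small checker. The sketch below is an illustration under two assumptions that the lab contract does not state: the stop reason is stored at data.stopReason, and a tool_result should be preceded by a policy_decision in the same step. The helper name checkTrace is hypothetical.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type: "context_built" | "decision" | "policy_decision" | "tool_result" | "stop";
  data: unknown;
};

// Returns a list of problems; an empty list means every check passed.
function checkTrace(events: TraceEvent[]): string[] {
  if (events.length === 0) return ["run has no trace"];
  const problems: string[] = [];
  if (events[events.length - 1].type !== "stop") {
    problems.push("trace does not end with a stop event");
  }
  events.forEach((event, i) => {
    if (event.type === "stop") {
      // Assumption: the stop reason lives at data.stopReason.
      const data = event.data as { stopReason?: unknown } | null | undefined;
      if (!data || typeof data.stopReason !== "string" || data.stopReason === "") {
        problems.push(`stop at index ${i} has no stop reason`);
      }
    }
    if (event.type === "tool_result") {
      // Assumption: a tool result must follow a policy decision in the same step.
      const gated = events
        .slice(0, i)
        .some((e) => e.step === event.step && e.type === "policy_decision");
      if (!gated) problems.push(`tool_result at index ${i} has no prior policy_decision`);
    }
  });
  return problems;
}

const goodTrace: TraceEvent[] = [
  { runId: "run_1", step: 0, type: "context_built", data: {} },
  { runId: "run_1", step: 0, type: "decision", data: {} },
  { runId: "run_1", step: 0, type: "policy_decision", data: {} },
  { runId: "run_1", step: 0, type: "tool_result", data: {} },
  { runId: "run_1", step: 1, type: "stop", data: { stopReason: "success" } },
];
const badTrace: TraceEvent[] = [
  { runId: "run_2", step: 0, type: "tool_result", data: {} },
  { runId: "run_2", step: 0, type: "stop", data: {} },
];

const goodProblems = checkTrace(goodTrace); // no problems
const badProblems = checkTrace(badTrace); // ungated tool result and missing stop reason
```

Returning problems as a list rather than throwing on the first one makes a failed check easier to read in test output.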
The reference test includes an intentionally unsafe run where the final answer is done, but the trajectory eval fails because a forbidden write tool was called.
Lab Review Gate
Before moving on, verify the inspectability boundary:
| Check | Evidence |
|---|---|
| Context is deliberate | The context packet lists goal, state summary, observations, tools, memory refs, and omissions. |
| Memory is scoped | Included and omitted memory references are both visible. |
| Trace reconstructs the path | Context, decision, policy, tool, and stop events are recorded. |
| Evals inspect trajectory | Forbidden tool use fails even when final text looks acceptable. |
| Release risk is visible | The lab notes name which trace or eval gaps would block production. |
Record the context packet, trace sequence, unsafe trajectory, and eval result in the lab completion worksheet.
Production Extension
Before production, add:
- redaction before trace storage;
- retention and deletion policy for memory and traces;
- eval fixtures versioned with prompts, tools, models, and policies;
- incident-to-eval workflow;
- dashboards for stop reasons, tool errors, policy denials, cost, and latency;
- release gates that block risky changes when trajectory evals fail.
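As an illustration of the first item, redaction can run as a step between recording and storage. The sketch below is a deliberately simple, key-based approach; the SENSITIVE_KEYS list, the storeTrace helper, and the sample payload are assumptions, and a production system would need a reviewed policy rather than a hard-coded list.

```typescript
type TraceEvent = {
  runId: string;
  step: number;
  type: "context_built" | "decision" | "policy_decision" | "tool_result" | "stop";
  data: unknown;
};

// Assumption: a hypothetical list of keys whose values must never be stored.
const SENSITIVE_KEYS = ["email", "phone", "apiKey", "token"];

// Recursively copy a value, replacing the values of sensitive keys.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = SENSITIVE_KEYS.indexOf(key) !== -1 ? "[REDACTED]" : redact(source[key]);
    }
    return out;
  }
  return value;
}

// Assumption: an in-memory store standing in for real trace storage.
const stored: TraceEvent[] = [];

function storeTrace(event: TraceEvent): void {
  // Redact before the event reaches storage, never after.
  stored.push({ ...event, data: redact(event.data) });
}

storeTrace({
  runId: "run_1",
  step: 0,
  type: "tool_result",
  data: { tool: "lookup_policy", requester: { email: "a@example.com", team: "support" } },
});
```

Key-based redaction does not catch sensitive values embedded in free text, such as an email address inside an observation summary, so treat it as a starting point only.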
Production Bridge
Use this table when adapting inspectability to production:
| Lab Concept | Production Version |
|---|---|
| ContextPacket | Versioned context contract with evidence refs, memory policy, omissions, and token budget. |
| Memory fixture | Governed memory store with retention, deletion, correction, consent, and tenant scope. |
| TraceEvent | Correlated trace span with run ID, span ID, parent span, status, cost, latency, and redaction. |
| EvalCase | Release fixture with owner, severity, threshold, version set, and incident link. |
| Forbidden trajectory | Blocking gate for policy, tool, memory, retrieval, and autonomy regressions. |
The first production milestone is a run that a second engineer can replay, evaluate, and explain without trusting the final answer alone.
Cross-Framework Mapping
- In LangGraph, context and memory are state inputs, while traces can follow node transitions and checkpoints.
- In Mastra AI, memory, evals, and observability are runtime-level capabilities that should still expose product-owned policy.
- In AutoGen-style systems, message history must be converted into structured trace and eval data.
- In CrewAI, flow and task records need enough structure to evaluate role behavior and final synthesis.