Coding Agents

Coding agents operate inside software repositories. They read code, edit files, run commands, inspect failures, produce diffs, and often create commits or pull requests. Codex, Cursor's Agent and Cloud Agents, Claude Code, OpenHands, and similar tools all belong to this architecture class.

The pattern is not “AI autocomplete.” It is a controlled development worker with repository context and execution privileges.

Examples

Codex CLI and Codex IDE extension
Cursor Agent, Plan Mode, and Cloud Agents
Claude Code
OpenHands and OpenHands GitHub

Core Loop

[Figure: coding agent loop]

Surfaces

Local CLI: runs near the repository and can use local tools.
IDE agent: shares editor context, selected files, inline diffs, and local commands.
Cloud or background agent: clones or mounts the repository in an isolated environment and returns a branch, diff, or PR.
CI/review agent: reviews pull requests, comments on diffs, or proposes patches.
Multi-agent workspace: runs several agents on separate branches or worktrees.

Architecture Concerns

Coding agents need unusually clear boundaries because they can change source code and run commands.

Design for:

Repository instructions: coding standards, commands, architecture constraints, and review expectations.
Workspace isolation: branch, worktree, container, or cloud environment per task.
Approval policy: which commands and file edits need human approval.
Test signal: fast checks first, then broader regression checks.
Diff review: humans inspect the changed behavior, not just the agent's final summary.
Secret handling: no credentials in prompts, logs, or generated code.
Dependency policy: explicit approval before adding packages or changing lockfiles.

Coding Agent As A Service

A mature coding agent behaves like a bounded engineering service.

It should have:

a task contract;
a repository working set;
a writable workspace;
a tool permission profile;
a test strategy;
a state record;
a handoff artifact;
a review gate.

For example, a PR review agent may own only review comments. A migration agent may own one branch and one dependency upgrade. A security-fix agent may own one validated finding and one patch. These boundaries matter because coding agents can otherwise become broad agents with shell access and vague goals.

Treat the agent as a service with a contract:

Contract Field	Example
Input	issue, PR, failing test, migration request, security finding.
Allowed files	target package, test files, docs, config.
Disallowed files	secrets, generated assets, unrelated modules, deployment config.
Tools	read files, search, edit, test, typecheck, inspect CI.
Approval required	dependency install, lockfile change, broad refactor, deployment action.
Output	diff, test result, summary, risks, review notes.
Stop condition	tests pass, blocked reason, retry budget exhausted, human approval needed.
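
The contract table can be encoded as data so that a runtime checks each proposed action before it runs. The following is a minimal sketch, assuming scope is expressed as path prefixes. The names `TaskContract`, `EditDecision`, `hasPrefix`, and `checkEdit` are hypothetical, as are the example paths.

```typescript
// Hypothetical encoding of the contract table; field names are illustrative.
interface TaskContract {
  allowedPaths: string[];     // path prefixes the agent may edit
  disallowedPaths: string[];  // path prefixes that are always off limits
  approvalRequired: string[]; // action kinds that need a human decision
}

type EditDecision = "allow" | "deny" | "needs_approval";

function hasPrefix(path: string, prefixes: string[]): boolean {
  return prefixes.some((prefix) => path.indexOf(prefix) === 0);
}

// Deny rules are checked first so a broad allow-list cannot expose a forbidden path.
function checkEdit(contract: TaskContract, path: string, actionKind: string): EditDecision {
  if (hasPrefix(path, contract.disallowedPaths)) return "deny";
  if (!hasPrefix(path, contract.allowedPaths)) return "deny";
  if (contract.approvalRequired.indexOf(actionKind) !== -1) return "needs_approval";
  return "allow";
}

const exampleContract: TaskContract = {
  allowedPaths: ["packages/billing/", "docs/"],
  disallowedPaths: ["packages/billing/secrets/", "deploy/"],
  approvalRequired: ["lockfile_change", "dependency_install"],
};
```

Anything outside the allow-list is denied by default, so a missing rule results in a refusal rather than an unreviewed edit.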

This is where coding agents connect to Agents As Services.

Workspace Isolation

The workspace is the blast-radius boundary.

Use one isolated workspace per task:

branch;
git worktree;
container;
virtual machine;
cloud workspace;
forked repository;
disposable checkout.

The agent should not edit directly on a developer’s dirty working tree unless the user explicitly asks for that mode. Parallel agents should not share a writable workspace. If two agents need to touch the same area, coordinate through branches, PRs, or an explicit merge step.

Good isolation gives you:

easy diff review;
rollback;
reproducible test runs;
safer command execution;
simpler conflict handling;
clean handoff to humans.
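
One way to get a workspace per task is a dedicated git worktree on its own branch. The sketch below only builds the argument lists for `git worktree add` and `git worktree remove`; it does not run them. The branch prefix `agent/`, the directory layout, and the names `WorkspacePlan` and `planWorktree` are assumptions for illustration.

```typescript
// Hypothetical naming scheme: one branch and one worktree directory per task.
interface WorkspacePlan {
  branch: string;
  path: string;
  createArgs: string[]; // arguments to pass to `git` to create the worktree
  removeArgs: string[]; // arguments to pass to `git` to dispose of it
}

function planWorktree(taskId: string, baseRef: string, rootDir: string): WorkspacePlan {
  // Keep only characters that are safe in both branch names and directory names.
  const slug = taskId.toLowerCase().replace(/[^a-z0-9-]+/g, "-");
  const branch = "agent/" + slug;
  const path = rootDir + "/" + slug;
  return {
    branch: branch,
    path: path,
    // Equivalent to: git worktree add -b <branch> <path> <baseRef>
    createArgs: ["worktree", "add", "-b", branch, path, baseRef],
    // Equivalent to: git worktree remove <path>
    removeArgs: ["worktree", "remove", path],
  };
}

const examplePlan = planWorktree("ISSUE 142", "origin/main", "../agent-worktrees");
```

Passing the arguments as an array to a process-spawning API, rather than joining them into a shell string, avoids quoting problems when task identifiers contain unusual characters.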

Branch, Worktree, Session, And CI Lifecycle

A production coding agent should make its lifecycle explicit. The task record, session state, workspace, branch, PR, and CI run are different artifacts with different owners.

Stage	Artifact	Owner	Required Evidence
task intake	issue, prompt, failing test, security finding, or migration request	human or scheduler	acceptance criteria, allowed scope, forbidden scope
session start	agent session record	agent runtime	model, tools, repo instructions, working set, permission profile
workspace allocation	branch, worktree, container, or cloud checkout	workspace manager	clean base ref, isolation ID, dependency cache policy
context gathering	inspected files and commands	agent	files read, symbols searched, assumptions, skipped areas
patching	diff on isolated branch	agent	changed files, rationale, generated files, dependency changes
local verification	test, build, typecheck, lint, or screenshot output	agent and test runner	command, exit code, relevant failure summary
PR handoff	draft PR, patch, or review artifact	agent	summary, verification, risks, open questions, rollback note
CI evaluation	CI run tied to branch or PR	CI system	jobs, logs, failures, artifacts, retry count
review decision	human review, policy gate, or maintainer merge	maintainer	approvals, requested changes, merge or rejection reason
cleanup	archived session and disposed workspace	workspace manager	branch state, worktree removal, retained traces

Do not collapse these artifacts into a chat transcript. A later engineer should be able to answer: what did the agent try, where did it try it, what changed, what verified it, and who accepted it?
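
The "Required Evidence" column can be checked mechanically before a stage is marked complete. This is a sketch under assumptions: only three stages are listed, and the evidence keys (`acceptanceCriteria`, `rollbackNote`, and so on) are hypothetical names chosen to mirror the table.

```typescript
// Required evidence per stage, abbreviated from the lifecycle table.
// Stage and key names are illustrative, not a fixed schema.
const REQUIRED_EVIDENCE: Record<string, string[]> = {
  task_intake: ["acceptanceCriteria", "allowedScope", "forbiddenScope"],
  local_verification: ["command", "exitCode", "failureSummary"],
  pr_handoff: ["summary", "verification", "risks", "openQuestions", "rollbackNote"],
};

// Returns the evidence keys that are absent or empty for the given stage.
function missingEvidence(stage: string, evidence: Record<string, unknown>): string[] {
  const required = REQUIRED_EVIDENCE[stage] || [];
  return required.filter((key) => evidence[key] === undefined || evidence[key] === "");
}
```

A handoff that reports `["verification", "rollbackNote"]` as missing is not ready, however convincing the summary reads.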

Coding Agent Trace Contract

Keep a compact trace for each coding task.

type CodingAgentRun = {
  runId: string;
  repo: string;
  baseRef: string;
  branch: string;
  workspaceId: string;
  task: {
    source: "issue" | "pr_review" | "failing_test" | "security_finding" | "user_request";
    acceptanceCriteria: string[];
    allowedPaths: string[];
    forbiddenPaths: string[];
  };
  session: {
    instructionsLoaded: string[];
    toolsAllowed: string[];
    approvalRequiredFor: string[];
  };
  activity: Array<{
    kind: "read" | "search" | "edit" | "command" | "test" | "ci" | "handoff";
    target: string;
    result: "success" | "failed" | "blocked" | "skipped";
  }>;
  verification: Array<{
    command: string;
    exitCode: number;
    summary: string;
  }>;
  finalStatus: "ready_for_review" | "needs_human" | "blocked" | "abandoned";
  risks: string[];
};

This trace is not extra bureaucracy. It is the minimum record needed to review a coding agent the same way a team reviews any other contributor.
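
A trace is most useful when something reads it. The sketch below checks whether a run's own record supports a `ready_for_review` status. `RunReviewView` is a reduced, hypothetical view of the trace type above, and the rules in `reviewBlockers` are assumptions about what a team might require.

```typescript
// A reduced view of the trace; only the fields this check reads.
interface RunReviewView {
  finalStatus: "ready_for_review" | "needs_human" | "blocked" | "abandoned";
  activity: Array<{ kind: string; target: string; result: "success" | "failed" | "blocked" | "skipped" }>;
  verification: Array<{ command: string; exitCode: number; summary: string }>;
}

// Returns reasons the trace does not support a "ready_for_review" claim.
function reviewBlockers(run: RunReviewView): string[] {
  const blockers: string[] = [];
  if (run.finalStatus !== "ready_for_review") return blockers;
  if (run.verification.length === 0) blockers.push("no verification recorded");
  run.verification.forEach((v) => {
    if (v.exitCode !== 0) blockers.push("check failed: " + v.command);
  });
  run.activity.forEach((a) => {
    if (a.result === "blocked") blockers.push("blocked " + a.kind + ": " + a.target);
  });
  return blockers;
}
```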

Repository Context

Coding agents fail when they see either too little code or too much code.

Use a curated working set:

identify primary files;
search for symbols, imports, references, tests, and docs;
rank secondary files;
load small relevant files;
summarize or excerpt large files;
keep unrelated files out of context.
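
The steps above can be sketched as a selection under a size budget. Everything here is an assumption for illustration: the relevance `score` would come from symbol, import, and test matches, sizes are counted in characters, and the cost of excerpts is not charged against the budget.

```typescript
interface CandidateFile {
  path: string;
  score: number;     // hypothetical relevance score; 0 means unrelated
  sizeChars: number;
}

interface WorkingSet {
  full: string[];      // small relevant files, loaded whole
  excerpted: string[]; // large relevant files, to be summarized or excerpted
}

function selectWorkingSet(files: CandidateFile[], budgetChars: number, maxFullChars: number): WorkingSet {
  // Rank by relevance without mutating the caller's array.
  const ranked = files.slice().sort((a, b) => b.score - a.score);
  const selected: WorkingSet = { full: [], excerpted: [] };
  let usedChars = 0;
  ranked.forEach((file) => {
    if (file.score <= 0) return; // unrelated files stay out of context
    if (file.sizeChars > maxFullChars) {
      selected.excerpted.push(file.path);
      return;
    }
    if (usedChars + file.sizeChars > budgetChars) return; // would exceed the budget
    selected.full.push(file.path);
    usedChars += file.sizeChars;
  });
  return selected;
}
```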

The agent should also load repository instructions:

coding style;
architecture rules;
test commands;
package manager;
branch and commit rules;
security constraints;
generated-file rules;
review expectations.

Repository instructions should be durable. Do not rely on a human repeating the same guidance in every task.

Shell Command Discipline

Shell access is powerful and risky.

Commands should be treated like tools:

validate before execution;
capture stdout, stderr, exit code, and duration;
record the working directory;
limit output size;
redact secrets;
distinguish read-only commands from mutating commands;
require approval for dangerous operations;
prefer project scripts over ad hoc commands.
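
Several of these rules are small pure functions that can wrap any command runner. The sketch below is illustrative only: the read-only list, the secret-key pattern, and the output limit are assumptions, and a real policy would be defined per project. The redaction is a heuristic and does not guarantee that no secret leaks.

```typescript
// Illustrative list of commands treated as read-only.
const READ_ONLY_COMMANDS = ["git status", "git diff", "git log", "ls", "cat", "rg"];

function isReadOnly(command: string): boolean {
  const trimmed = command.trim();
  // Chaining, piping, redirection, or substitution: treat as potentially mutating.
  if (/[;&|<>`$]/.test(trimmed)) return false;
  return READ_ONLY_COMMANDS.some((c) => trimmed === c || trimmed.indexOf(c + " ") === 0);
}

// Masks the value in KEY=value pairs whose key looks sensitive.
function redactSecrets(text: string): string {
  return text.replace(/\b([A-Z0-9_]*(?:TOKEN|SECRET|PASSWORD|API_KEY)[A-Z0-9_]*)=(\S+)/g, "$1=[REDACTED]");
}

// Caps captured output and records how much was dropped.
function limitOutput(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + "\n[truncated " + (text.length - maxChars) + " chars]";
}
```

Commands that fail the `isReadOnly` check are not necessarily dangerous; they are simply the ones that should go through the approval policy and be logged with a stated reason.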

Good command output is structured enough for the agent to act on. A failing test should become a diagnostic: file, line, test name, expected value, actual value, and likely owning module.

Avoid commands whose effects are broad or hard to inspect:

broad cleanup commands;
global installs;
destructive git operations;
shell scripts with unclear side effects;
commands that require interactive input;
commands that mutate external systems.

The agent should explain why it is running a command when the command has side effects.

CI Feedback Loop

CI is one of the best evaluators for coding agents.

A coding agent should treat CI as an external evaluator, not as an afterthought. It should reduce each failure to a reproducible local command or an explicit environmental blocker before continuing.

A useful loop is:

agent creates a branch or worktree;
agent makes the smallest coherent change;
agent runs fast local checks;
agent opens or updates a draft PR;
CI runs broader tests;
CI failures are parsed into structured diagnostics;
agent patches targeted failures within a retry budget;
agent stops when CI is green, the task is blocked, or approval is needed.

Do not let the agent chase CI forever. Define:

maximum attempts;
maximum runtime;
allowed files;
allowed test commands;
when to ask a human;
when to revert its own last change;
when to mark the task blocked.

CI feedback should improve the patch, not produce an endless loop of speculative edits.

The retry budget should be explicit:

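// One parsed CI failure; file and test are absent when the log does not name them.
interface CiFailure {
  file?: string;
  test?: string;
  message: string;
}

// Assumed shapes for the task and the CI result; adjust to the real runtime.
interface CodingTask {
  branch: string;
  feedback?: CiFailure[];
}

interface CiResult {
  status: 'green' | 'red' | 'flaky';
  logs: string;
}

type RepairOutcome =
  | { status: 'ready_for_review'; attempts: number }
  | { status: 'needs_human' | 'blocked'; reason: string };

// Assumed helpers supplied by the agent runtime; not defined here.
declare function applySmallPatch(task: CodingTask): Promise<void>;
declare function runCiChecks(branch: string): Promise<CiResult>;
declare function parseCiFailures(logs: string): CiFailure[];

async function repairWithCi(task: CodingTask, maxAttempts = 3): Promise<RepairOutcome> {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    await applySmallPatch(task);
    const result = await runCiChecks(task.branch);

    if (result.status === 'green') {
      return { status: 'ready_for_review', attempts: attempt };
    }

    // Flaky or unparseable failures give the agent nothing concrete to fix.
    const failures = parseCiFailures(result.logs);
    if (failures.length === 0 || result.status === 'flaky') {
      return { status: 'needs_human', reason: 'unclear_ci_failure' };
    }

    // Feed back only the first few failures to keep the next patch targeted.
    task.feedback = failures.slice(0, 5);
  }

  return { status: 'blocked', reason: 'retry_budget_exhausted' };
}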

The agent is allowed to repair, but not forever.

Background Agents

Background coding agents are useful when work is long-running or naturally asynchronous.

Use them for:

dependency upgrades;
lint migrations;
test failure triage;
mechanical refactors;
documentation updates;
low-risk bug fixes with good tests.

Avoid background autonomy for:

ambiguous product changes;
architecture changes without review;
security-sensitive code without a validated finding;
production deployment;
secrets, credentials, or access control changes.

Background agents should notify humans only at meaningful states:

needs clarification;
needs approval;
CI failed beyond retry budget;
PR ready for review;
blocked by missing permission;
conflict with main branch.
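
One way to enforce this is to notify only when the agent enters one of the listed states, not on every poll. The state names below are hypothetical labels for the list above, plus three quiet states added for contrast.

```typescript
// Hypothetical state names; the last six mirror the list of meaningful states.
type AgentState =
  | "running" | "waiting_for_ci" | "retrying"
  | "needs_clarification" | "needs_approval" | "ci_retry_budget_exhausted"
  | "pr_ready" | "blocked_missing_permission" | "merge_conflict";

const NOTIFY_STATES: AgentState[] = [
  "needs_clarification",
  "needs_approval",
  "ci_retry_budget_exhausted",
  "pr_ready",
  "blocked_missing_permission",
  "merge_conflict",
];

// Notify on entry into a meaningful state; stay quiet while the state is unchanged.
function shouldNotify(previous: AgentState, current: AgentState): boolean {
  return current !== previous && NOTIFY_STATES.indexOf(current) !== -1;
}
```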

The point is not to remove humans. The point is to stop making humans babysit the agent while it waits on builds, tests, and CI.

Resumable State

Long-running coding agents need durable state outside the model context.

Useful state artifacts:

task goal;
acceptance criteria;
files inspected;
files changed;
commands run;
test results;
decisions made;
known risks;
open questions;
retry count;
current blocker.

This can live in a task record, PR description, branch notes, agent state file, issue comment, or durable workflow state. The important part is that a human or a later agent can resume without reconstructing the whole conversation.
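
Wherever the state lives, it helps to give it a versioned shape and to validate it on load. The following is a minimal sketch with a subset of the artifacts listed above; the field names and the `schemaVersion` convention are assumptions.

```typescript
// Hypothetical state record; a subset of the artifacts listed above.
interface AgentStateRecord {
  schemaVersion: 1;
  taskGoal: string;
  acceptanceCriteria: string[];
  filesChanged: string[];
  commandsRun: string[];
  openQuestions: string[];
  retryCount: number;
  currentBlocker: string | null;
}

function saveState(state: AgentStateRecord): string {
  return JSON.stringify(state, null, 2);
}

// Rejects records this code does not understand instead of guessing.
function loadState(json: string): AgentStateRecord {
  const raw = JSON.parse(json);
  if (raw === null || typeof raw !== "object" || raw.schemaVersion !== 1) {
    throw new Error("unsupported or malformed state record");
  }
  if (typeof raw.taskGoal !== "string" || typeof raw.retryCount !== "number" || !Array.isArray(raw.filesChanged)) {
    throw new Error("state record is missing required fields");
  }
  return raw as AgentStateRecord;
}
```

Failing loudly on an unknown version matters here: a later agent that resumes from a half-understood record is worse off than one that starts over.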

PR Review Agents

PR review is one of the best production shapes for coding agents because the boundary is clear.

A review agent can:

inspect changed files;
compare against repository rules;
run targeted checks;
identify missing tests;
flag security risks;
suggest smaller diffs;
write review comments.

It should not merge a PR on the strength of its own approval. It should not block on style preferences unless those preferences are codified. It should cite files, lines, tests, and evidence.

Good review comments are specific:

what is wrong;
why it matters;
where it occurs;
how to verify;
whether it is blocking or advisory.
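
These five elements map directly onto a structured comment. The sketch below assumes one finding per file location; `ReviewFinding`, `isActionable`, and `formatFinding` are hypothetical names, and the output layout is only one possible format.

```typescript
interface ReviewFinding {
  what: string;      // what is wrong
  why: string;       // why it matters
  file: string;      // where it occurs
  line: number;
  verify: string;    // how to verify, e.g. a command or test name
  blocking: boolean; // blocking or advisory
}

// A finding missing any element is not worth posting.
function isActionable(f: ReviewFinding): boolean {
  const texts = [f.what, f.why, f.file, f.verify];
  return f.line > 0 && texts.every((t) => t.trim().length > 0);
}

function formatFinding(f: ReviewFinding): string {
  const label = f.blocking ? "[blocking]" : "[advisory]";
  return [
    label + " " + f.file + ":" + f.line,
    "What: " + f.what,
    "Why: " + f.why,
    "Verify: " + f.verify,
  ].join("\n");
}
```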

The review agent is a second set of eyes, not the final authority.

Use When

The task can be verified with tests, builds, type checks, screenshots, or review.
The desired change can be described in concrete acceptance criteria.
The agent can inspect enough repository context to follow local patterns.
You can isolate work and review the resulting diff.

Avoid When

The repository lacks tests or runnable checks and the change is high risk.
The task is vague, political, or primarily product discovery.
The agent needs broad production credentials.
Multiple agents would edit the same files without coordination.

Evaluation Strategy

Evaluate coding agents through artifacts and trajectories.

Check:

diff correctness;
test behavior;
build and typecheck status;
changed-files scope;
architectural fit;
dependency changes;
generated code quality;
command trajectory;
secret exposure;
review usefulness;
handoff quality.

Use baselines:

no-agent baseline for simple tasks;
single-agent baseline before multi-agent coding workflows;
human review outcomes;
CI pass rate;
revert or follow-up fix rate.
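
The last three baselines reduce to simple rates over completed tasks. This is a sketch assuming each task is recorded with three boolean outcomes; `TaskOutcome` and its field names are hypothetical.

```typescript
interface TaskOutcome {
  ciPassed: boolean;
  approvedWithoutChanges: boolean; // human review outcome
  revertedOrFixedLater: boolean;   // revert or follow-up fix after merge
}

// Share of outcomes matching the predicate; 0 for an empty sample.
function fraction(outcomes: TaskOutcome[], pick: (o: TaskOutcome) => boolean): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(pick).length / outcomes.length;
}

function summarizeOutcomes(outcomes: TaskOutcome[]) {
  return {
    tasks: outcomes.length,
    ciPassRate: fraction(outcomes, (o) => o.ciPassed),
    cleanApprovalRate: fraction(outcomes, (o) => o.approvedWithoutChanges),
    revertOrFollowUpRate: fraction(outcomes, (o) => o.revertedOrFixedLater),
  };
}
```

Computing the same summary for the no-agent and single-agent baselines makes the comparison concrete. Small samples should be read with caution.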

Coding-agent evals should include negative cases:

task is too vague;
tests are missing;
requested change touches forbidden files;
CI failure is flaky;
dependency upgrade requires approval;
security finding is not reproducible;
generated code would solve the symptom but violate architecture.
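
Negative cases can be scored by the stop status the agent reaches, not by the diff it produces. In this hypothetical harness the case identifiers and the expected status for each case are assumptions; a team would set them to match its own policy.

```typescript
// Hypothetical eval harness: each negative case names the stop status the agent should reach.
type StopStatus = "needs_human" | "blocked";

interface NegativeCase {
  id: string;
  expected: StopStatus;
}

const NEGATIVE_CASES: NegativeCase[] = [
  { id: "vague_task", expected: "needs_human" },
  { id: "forbidden_file_requested", expected: "blocked" },
  { id: "flaky_ci_failure", expected: "needs_human" },
  { id: "dependency_upgrade_needs_approval", expected: "needs_human" },
];

// `actual` maps a case id to the final status the agent reported for it.
// A case with no reported status counts as a failure.
function failedNegativeCases(cases: NegativeCase[], actual: Record<string, string>): string[] {
  return cases.filter((c) => actual[c.id] !== c.expected).map((c) => c.id);
}
```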

The correct behavior is sometimes to stop and ask for a human decision.

Operating Patterns

Ask for a plan before large edits.
Make the agent cite files and commands it used.
Prefer small tasks with clear completion criteria.
Use worktrees or branches for parallel agents.
Require tests or type checks before commit.
Review generated code like human code.
Keep durable repo guidance in a project instruction file.
Give background agents their own isolated workspace as well.
Treat shell commands as auditable tool calls.
Keep retry budgets for CI-driven repair loops.
Require explicit handoff artifacts for long-running work.

Failure Modes

Plausible code that compiles but violates architecture.
Broad refactors that mix behavior changes with formatting.
Tests updated to match broken behavior.
Hidden dependency changes.
Shell commands that mutate local state unexpectedly.
Agents fighting over the same files.
Review fatigue when diffs are too large.
CI loops that patch symptoms without understanding failures.
Background agents that continue after the task definition changes.
Context windows filled with unrelated files.
PR comments that sound plausible but cite no evidence.
Human handoff that omits what was tried and why it failed.

Production Checklist

Branch or worktree per task.
Clear allowed and forbidden files.
Tool permission profile for read, edit, shell, network, and git operations.
Repository instruction file loaded by default.
Curated file context, not whole-repo context.
Fast local checks before broad CI.
CI diagnostics parsed into structured feedback.
Retry budget and stop rules.
Draft PR or review artifact for human inspection.
Secret redaction in prompts, command output, traces, and summaries.
Handoff summary with changed files, commands, results, risks, and open questions.

Design Rule

The coding agent should never be the only reviewer of its own code. It can propose, edit, test, and explain. A separate check, reviewer, or policy gate should decide whether the change lands.