Unrolling the Codex Agent Loop Without Losing Your Mind.

A practical, code-first guide to the “Codex loop”: streaming tool calls, state, compaction, caching, and production guardrails—so your agents stop vibe-coding and start shipping.
Table of contents
- Why “agent loop” matters (and why most implementations rot)
- The Codex loop in one sentence (and the 5 moving parts)
- Unrolling the loop: a mental model you can implement today
- Reference architecture: the Driver, State, Tools, Sandbox, Policy
- A minimal loop (TypeScript) that actually works (streaming + tool calls)
- Tool design that doesn’t sabotage you (schemas, safety, determinism)
- Long-running tasks: compaction + caching (without surprises)
- Observability and evals: making the loop measurable
- Comparisons: Codex-style loop vs LangGraph vs AutoGen vs CrewAI vs SWE-agent
- Production checklist + key takeaways
1) Why “agent loop” matters (and why most implementations rot)
We’ve all seen the demo: an agent edits a file, runs tests, fixes bugs, opens a PR, and everyone claps.
Then reality shows up:
- The agent “forgets” the constraint from 3 turns ago.
- It calls tools in a loop like a Roomba stuck in a corner.
- Your logs say “tool_call: true” and absolutely nothing else.
- Your VP asks: “Can we ship this without setting the repo on fire?”
The uncomfortable truth: agents don’t fail because the model “isn’t smart enough.”
They fail because the loop is under-designed.
The OpenAI Codex write-up is valuable precisely because it breaks the magic trick into an implementable system: prompt construction, tool permissions, streaming tool calls, state management, and context control—i.e., the boring parts that decide whether you ship.
2) The Codex loop in one sentence (and the 5 moving parts)
One sentence:
Repeatedly ask the model what to do next, execute the requested tool calls in a constrained environment, feed results back, and stop only when the model produces a final answer.
That sounds obvious—until you implement it and discover there are five separate systems hiding inside:
- Driver: orchestrates “ask → tool → observe → continue”
- State: what we keep, what we compact, what we discard
- Tools: shell, file ops, tests, network, linters, etc.
- Sandbox + permissions: what the agent may do (and what it can’t)
- Policy & prompt assembly: repo instructions, safety rules, task framing
Codex’s loop emphasizes structured prompt assembly (repo instructions + environment + policy), plus an explicit permissions model for actions/commands. Translation: it’s not “prompt magic,” it’s systems.
3) Unrolling the loop: a mental model you can implement today
Here’s the “unrolled” version we use when building real systems:
Step A — Build the input window
- System/developer instructions (your policy)
- Repo/workspace instructions (like
AGENTS.md, coding standards, “how to run tests”) - User request (“Fix bug X”, “Refactor module Y”)
- Current working state (files changed, test output, TODO list)
Step B — Ask the model (streaming)
Streaming isn’t just UX candy. It’s operational:
- show partial progress
- intercept tool calls as they’re produced
- avoid waiting for one huge blob response
Step C — Execute tool calls (with policy)
- Validate tool name + arguments
- Enforce allowlists/denylists (commands, paths, network)
- Run in a sandbox
Step D — Append observations back into state
Tool output becomes new input items.
Step E — Stop condition
Stop when:
- the model has no pending tool calls, and
- you’ve received a completed response (not truncated / incomplete)
If this feels like building a tiny operating system… yeah. Congrats. You’re now a loop engineer.
4) Reference architecture: Driver, State, Tools, Sandbox, Policy
Driver (the “air traffic controller”)
Responsibilities:
- stream model output
- detect tool calls
- execute tools
- feed results back
- enforce max iterations + time budgets
- enforce “done means done” (completion criteria)
State (the “shared brain”)
Keep:
- current goal + constraints
- plan / task list
- tool outputs (summarized)
- diffs / key file excerpts (not entire repos)
Decide early:
- what gets compacted
- what must remain verbatim (e.g., user requirements)
Important correction: statefulness is a strategy, not a vibe. Pick one:
- Stateful-by-ID: send only new items each turn +
previous_response_id - Stateless: send a curated window every turn (and no
previous_response_id)
Mixing both can accidentally double-inject context and inflate costs.
Tools (the “hands”)
Best tools are:
- deterministic
- well-scoped
- have structured inputs/outputs
- return actionable errors
Sandbox + permissions (the “guardrails”)
This is where you prevent:
rm -rf- credential exfiltration
- repo-wide rewrites
- network wandering
- “helpful” data-leaks into logs
A sandbox is not optional. It’s your seatbelt. And yes, it’s annoying—like seatbelts.
Policy & prompt assembly (the “adult supervision”)
Our hot take: your best prompt is the one your repo can own. Put it in version control, not in someone’s Notion page that hasn’t been opened since Q2.
5) A minimal loop (TypeScript) that actually works
Below is a corrected “Codex-style” loop using the OpenAI JS SDK + the Responses API.
Key fixes vs the earlier draft:
- Tool-call arguments are streamed in deltas, so we accumulate them and only parse JSON when they’re done.
- We avoid duplicating history by using stateful-by-ID mode: we send only new items each turn and rely on
previous_response_id.
This is still intentionally minimal. In production you’ll add a real sandbox runner, command allowlists, path policies, timeouts, and structured telemetry.
import OpenAI from "openai";
import { z } from "zod";
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// ---------- Tool schema ----------
const RunShellSchema = z.object({
// In production, don’t accept “any string command”.
// Prefer structured commands (command + args) and validate hard.
command: z.string().min(1),
cwd: z.string().optional(),
});
type ToolResult =
| { ok: true; stdout: string; stderr?: string; exitCode: number }
| { ok: false; error: string; exitCode?: number };
// ---------- Sandbox runner (placeholder) ----------
async function runShellSandboxed(args: z.infer<typeof RunShellSchema>): Promise<ToolResult> {
const { command } = args;
// ✅ Better than a denylist: allowlist *known safe commands*.
// This is intentionally tiny; expand per environment.
const allowed = [
"npm test",
"npm run test",
"npm run lint",
"pnpm test",
"pnpm lint",
"pytest",
"ruff check .",
"git diff",
];
const normalized = command.trim();
if (!allowed.includes(normalized)) {
return {
ok: false,
error: `Command not allowed: "${normalized}". Allowed: ${allowed.join(", ")}`,
exitCode: 126,
};
}
// 🔥 Replace this with a real sandbox execution (Docker/Firecracker/etc).
// Return structured output.
return { ok: true, stdout: `Pretend we ran: ${normalized}\n(ok)`, exitCode: 0 };
}
// ---------- Tool definition (Responses API) ----------
const tools = [
{
type: "function" as const,
name: "run_shell",
description: "Run an allowlisted command in the project sandbox and return stdout/stderr/exitCode.",
parameters: {
type: "object",
properties: {
command: { type: "string", description: "Allowlisted command (exact match)." },
cwd: { type: "string", description: "Optional working directory (restricted in production)." },
},
required: ["command"],
additionalProperties: false,
},
},
];
// ---------- Helper types for streaming ----------
type PendingCall = { name: string; argsJson: string };
export async function codexStyleLoop(userTask: string) {
let previous_response_id: string | undefined;
// Send initial instructions once, then rely on previous_response_id for statefulness.
const bootstrapItems: any[] = [
{
role: "developer",
content: [
{
type: "input_text",
text:
[
"You are a coding agent operating in a sandbox.",
"Follow repo instructions if provided.",
"Be careful: make small changes, run tests when appropriate, and explain decisions briefly.",
"When you need to run commands, call the run_shell tool with an allowlisted command.",
].join("\n"),
},
],
},
{ role: "user", content: [{ type: "input_text", text: userTask }] },
];
const maxTurns = 12;
// We keep “new items” per turn. In stateful mode, we don’t resend the entire history.
let newItems: any[] = bootstrapItems;
for (let turn = 1; turn <= maxTurns; turn++) {
const pending: Record<string, PendingCall> = {};
const toolCalls: Array<{ call_id: string; name: string; arguments: any }> = [];
let finalText = "";
let sawCompleted = false;
const stream = await client.responses.create({
model: "gpt-5.1-codex-max",
input: newItems,
tools,
stream: true,
previous_response_id,
});
for await (const event of stream) {
// Stream text output
if (event.type === "response.output_text.delta") {
finalText += event.delta;
process.stdout.write(event.delta);
}
// When a function_call output item is created, remember it
if (event.type === "response.output_item.added" && event.item.type === "function_call") {
pending[event.item.id] = { name: event.item.name, argsJson: "" };
}
// Arguments arrive in deltas
if (event.type === "response.function_call_arguments.delta") {
const call = pending[event.item_id];
if (call) call.argsJson += event.delta;
}
// Done = safe point to parse JSON args
if (event.type === "response.function_call_arguments.done") {
const call = pending[event.item_id];
if (call) {
let args: any;
try {
args = JSON.parse(call.argsJson);
} catch (e) {
args = { __parse_error: String(e), __raw: call.argsJson };
}
toolCalls.push({ call_id: event.item_id, name: call.name, arguments: args });
}
}
if (event.type === "response.completed") {
previous_response_id = event.response.id;
sawCompleted = true;
}
}
// If we didn’t get a clean completion, treat it as “incomplete” and decide how to proceed.
// (In production, this is where you’d apply compaction / retry policies.)
if (!sawCompleted) {
return { ok: false, error: "Model response did not complete cleanly (stream ended early)." };
}
// Stop if there are no tool calls.
if (toolCalls.length === 0) {
return { ok: true, answer: finalText };
}
// Execute tool calls and build the next turn’s input
const nextItems: any[] = [];
for (const call of toolCalls) {
if (call.name !== "run_shell") {
nextItems.push({
type: "function_call_output",
call_id: call.call_id,
output: JSON.stringify({ ok: false, error: `Unknown tool: ${call.name}` }),
});
continue;
}
const parsed = RunShellSchema.safeParse(call.arguments);
if (!parsed.success) {
nextItems.push({
type: "function_call_output",
call_id: call.call_id,
output: JSON.stringify({ ok: false, error: parsed.error.message }),
});
continue;
}
const result = await runShellSandboxed(parsed.data);
nextItems.push({
type: "function_call_output",
call_id: call.call_id,
output: JSON.stringify(result),
});
}
// Optional “keep going” nudge (small, safe, avoids runaway)
nextItems.push({
role: "user",
content: [
{
type: "input_text",
text:
[
"Continue.",
"If you changed code, run an allowlisted test/lint command (or explain why not).",
"If blocked, explain what you need.",
].join(" "),
},
],
});
newItems = nextItems;
}
return { ok: false, error: `Max turns (${maxTurns}) reached.` };
}What this code demonstrates (for real)
- Streaming: we build output as deltas arrive
- Tool calling: tool calls are output items, arguments are streamed and must be accumulated
- Statefulness without duplication: we rely on
previous_response_idand only send new items per turn - A safer starting posture: allowlisting, structured tool output, bounded turns
6) Tool design that doesn’t sabotage you
Tool rule 1: Make tools boring
Your model should do the thinking; tools should do the doing.
Good tools:
read_file(path)apply_patch(diff)run_tests(target)search_repo(query)
Bad tools:
do_the_thing(task: string)(aka “please hallucinate in production”)
Tool rule 2: Prefer diffs over raw edits
Agents are way more reliable when they produce diffs. You can validate:
- file paths touched
- max lines changed
- forbidden patterns
- formatting checks
Tool rule 3: Return machine-parseable output
Even if the model reads it, your driver needs it too:
- exit codes
- file lists
- test summaries
- error classification
Tool rule 4: Explicit permissions beat “be careful”
Policy must be executable, not aspirational.
If the agent can run arbitrary shell, it will eventually run:
- something destructive
- something leaky
- something that “works on my machine” and fails in CI
7) Long-running tasks: compaction + caching
When your agent does real work, the context window grows like sourdough starter. Eventually:
- latency climbs
- cost climbs
- the model starts “forgetting” early constraints
Compaction: shrink the window without losing requirements
OpenAI’s conversation state guidance describes compaction as a way to keep long-running threads manageable when inputs grow. Practically:
- Keep user requirements verbatim
- Compact tool outputs aggressively (summaries + pointers to artifacts)
- Compact repeated logs (tests, lint) into “last known status”
Corrected wording: We’re not going to name a specific endpoint here unless you wire it up from the docs—because SDK surfaces and endpoints can evolve. The reliable concept is: compaction exists as a supported pattern, and you should design for it.
Prompt caching: stop paying for the same prefix
Provider-side prompt caching (when available) rewards consistency:
- keep your stable “policy + repo instructions” prefix consistent
- append only the changing state (tool results, diffs, progress) at the end
Corrected wording: we’re not claiming a specific “cache bucketing identifier” parameter—just the practical implementation advice: stable prefix good, constantly mutating prefix bad.
8) Observability and evals: making the loop measurable
If you can’t answer these, you don’t have an agent—you have an expensive surprise generator:
- How many tool calls per task?
- Where does time go (model vs tools vs retries)?
- What’s the pass rate on “tests green”?
- What percent of runs hit max-iterations?
- Which repos/projects produce the most failures?
Minimum viable telemetry
Log per turn:
- model used
- tokens in/out
- tool calls (name, args hash, duration, exit code)
- diff stats (#files, #lines)
- outcome label (success / partial / failed)
Evals that actually matter
For coding agents, use task-level evals:
- build passes
- unit tests pass
- lint passes
- no forbidden files touched
- PR description includes reproduction steps
9) Comparisons: Codex-style loop vs popular frameworks
Let’s keep this spicy and fair.
Codex-style “unrolled loop”
Strengths
- You own the loop: policy, state, sandbox, compaction, caching
- Fits production constraints cleanly (great for platform teams)
- Streaming + tool execution is first-class in the Responses model
Trade-off
- You’re building infrastructure (not just prompting)
LangGraph
LangGraph is designed for graph-based workflows—explicit state, branching, loops.
Great when
- you want deterministic control flow
- you want explicit states and transitions
- you’re orchestrating many steps that shouldn’t be “model decides everything”
Trade-off
- you still need sandbox/policies; graphs don’t secure shell access
AutoGen
AutoGen is a multi-agent conversation framework: roles, chats, humans-in-the-loop, tools.
Great when
- you want multi-agent collaboration patterns fast
- you’re prototyping “agent teams”
Trade-off
- multi-agent loops can multiply complexity fast (more turns, more tool calls, more coordination)
CrewAI
CrewAI focuses on orchestrating role-playing agents into a cohesive “crew.”
Great when
- you want a simple delegation mental model
- your team needs quick onboarding
Trade-off
- you’ll still want a serious driver layer for safety, state, and constraints
SWE-agent
SWE-agent is “agent-loop honest”: it lives in real repos, uses tools, and is evaluation-driven.
Great when
- you’re studying coding-agent ergonomics and real-world loop design
- you care about measurable outcomes (tests, diffs, repo changes)
Trade-off
- platform teams often want tighter integration with internal systems than a CLI-first workflow
Practical takeaway
Frameworks help you organize the loop. Codex-style unrolling helps you own the loop.
10) Production checklist
Loop safety
- Max turns + max tool calls
- Command allowlists (not denylists)
- Path allowlists (no touching secrets)
- Network policy (often default-deny)
- Timeouts + output size caps on tools
Reliability
- Deterministic tools
- Diff-based edits
- Always run tests if code changed (or require justification)
Context control
- Compaction strategy (what stays verbatim vs summarized)
- Stable prompt prefix (for efficiency and fewer surprises)
Observability
- Per-turn logs + traces
- Outcome labels and failure taxonomy
- Regression eval suite (tests, lint, policy checks)
Key takeaways
- The model is not the agent. The loop is the agent.
- Unroll the loop into explicit driver/state/tool/policy layers, and you gain control.
- Design tools to be boring and deterministic—the driver validates, the sandbox constrains, the model decides.
- Plan for long horizons: compaction + stable prefixes or you’ll drown in your own logs.
- Measure everything: if you can’t chart failure modes, you can’t improve them.
— Cohorte Team
January 26,2026.