Unrolling the Codex Agent Loop Without Losing Your Mind.

Stop agent loops from rotting: learn the Codex-style driver/state/tool pattern + a working TS loop with streaming tools and guardrails.

A practical, code-first guide to the “Codex loop”: streaming tool calls, state, compaction, caching, and production guardrails—so your agents stop vibe-coding and start shipping.

Why “agent loop” matters (and why most implementations rot)
The Codex loop in one sentence (and the 5 moving parts)
Unrolling the loop: a mental model you can implement today
Reference architecture: the Driver, State, Tools, Sandbox, Policy
A minimal loop (TypeScript) that actually works (streaming + tool calls)
Tool design that doesn’t sabotage you (schemas, safety, determinism)
Long-running tasks: compaction + caching (without surprises)
Observability and evals: making the loop measurable
Comparisons: Codex-style loop vs LangGraph vs AutoGen vs CrewAI vs SWE-agent
Production checklist + key takeaways

1) Why “agent loop” matters (and why most implementations rot)

We’ve all seen the demo: an agent edits a file, runs tests, fixes bugs, opens a PR, and everyone claps.

Then reality shows up:

The agent “forgets” the constraint from 3 turns ago.
It calls tools in a loop like a Roomba stuck in a corner.
Your logs say “tool_call: true” and absolutely nothing else.
Your VP asks: “Can we ship this without setting the repo on fire?”

The uncomfortable truth: agents don’t fail because the model “isn’t smart enough.”
They fail because the loop is under-designed.

The OpenAI Codex write-up is valuable precisely because it breaks the magic trick into an implementable system: prompt construction, tool permissions, streaming tool calls, state management, and context control—i.e., the boring parts that decide whether you ship.

2) The Codex loop in one sentence (and the 5 moving parts)

One sentence:

Repeatedly ask the model what to do next, execute the requested tool calls in a constrained environment, feed results back, and stop only when the model produces a final answer.

That sounds obvious—until you implement it and discover there are five separate systems hiding inside:

Driver: orchestrates “ask → tool → observe → continue”
State: what we keep, what we compact, what we discard
Tools: shell, file ops, tests, network, linters, etc.
Sandbox + permissions: what the agent may do (and what it can’t)
Policy & prompt assembly: repo instructions, safety rules, task framing

Codex’s loop emphasizes structured prompt assembly (repo instructions + environment + policy), plus an explicit permissions model for actions/commands. Translation: it’s not “prompt magic,” it’s systems.

3) Unrolling the loop: a mental model you can implement today

Here’s the “unrolled” version we use when building real systems:

Step A — Build the input window

System/developer instructions (your policy)
Repo/workspace instructions (like AGENTS.md, coding standards, “how to run tests”)
User request (“Fix bug X”, “Refactor module Y”)
Current working state (files changed, test output, TODO list)

Step B — Ask the model (streaming)

Streaming isn’t just UX candy. It’s operational:

show partial progress
intercept tool calls as they’re produced
avoid waiting for one huge blob response

Step C — Execute tool calls (with policy)

Validate tool name + arguments
Enforce allowlists/denylists (commands, paths, network)
Run in a sandbox

Step D — Append observations back into state

Tool output becomes new input items.

Step E — Stop condition

Stop when:

the model has no pending tool calls, and
you’ve received a completed response (not truncated / incomplete)

If this feels like building a tiny operating system… yeah. Congrats. You’re now a loop engineer.

4) Reference architecture: Driver, State, Tools, Sandbox, Policy

Driver (the “air traffic controller”)

Responsibilities:

stream model output
detect tool calls
execute tools
feed results back
enforce max iterations + time budgets
enforce “done means done” (completion criteria)

State (the “shared brain”)

Keep:

current goal + constraints
plan / task list
tool outputs (summarized)
diffs / key file excerpts (not entire repos)

Decide early:

what gets compacted
what must remain verbatim (e.g., user requirements)

Important correction: statefulness is a strategy, not a vibe. Pick one:

Stateful-by-ID: send only new items each turn + previous_response_id
Stateless: send a curated window every turn (and no previous_response_id)

Mixing both can accidentally double-inject context and inflate costs.

Tools (the “hands”)

Best tools are:

deterministic
well-scoped
have structured inputs/outputs
return actionable errors

Sandbox + permissions (the “guardrails”)

This is where you prevent:

rm -rf
credential exfiltration
repo-wide rewrites
network wandering
“helpful” data-leaks into logs

A sandbox is not optional. It’s your seatbelt. And yes, it’s annoying—like seatbelts.

Policy & prompt assembly (the “adult supervision”)

Our hot take: your best prompt is the one your repo can own. Put it in version control, not in someone’s Notion page that hasn’t been opened since Q2.

5) A minimal loop (TypeScript) that actually works

Below is a corrected “Codex-style” loop using the OpenAI JS SDK + the Responses API.

Key fixes vs the earlier draft:

Tool-call arguments are streamed in deltas, so we accumulate them and only parse JSON when they’re done.
We avoid duplicating history by using stateful-by-ID mode: we send only new items each turn and rely on previous_response_id.

This is still intentionally minimal. In production you’ll add a real sandbox runner, command allowlists, path policies, timeouts, and structured telemetry.

import OpenAI from "openai";
import { z } from "zod";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// ---------- Tool schema ----------
const RunShellSchema = z.object({
  // In production, don’t accept “any string command”.
  // Prefer structured commands (command + args) and validate hard.
  command: z.string().min(1),
  cwd: z.string().optional(),
});

type ToolResult =
  | { ok: true; stdout: string; stderr?: string; exitCode: number }
  | { ok: false; error: string; exitCode?: number };

// ---------- Sandbox runner (placeholder) ----------
async function runShellSandboxed(args: z.infer<typeof RunShellSchema>): Promise<ToolResult> {
  const { command } = args;

  // ✅ Better than a denylist: allowlist *known safe commands*.
  // This is intentionally tiny; expand per environment.
  const allowed = [
    "npm test",
    "npm run test",
    "npm run lint",
    "pnpm test",
    "pnpm lint",
    "pytest",
    "ruff check .",
    "git diff",
  ];

  const normalized = command.trim();
  if (!allowed.includes(normalized)) {
    return {
      ok: false,
      error: `Command not allowed: "${normalized}". Allowed: ${allowed.join(", ")}`,
      exitCode: 126,
    };
  }

  // 🔥 Replace this with a real sandbox execution (Docker/Firecracker/etc).
  // Return structured output.
  return { ok: true, stdout: `Pretend we ran: ${normalized}\n(ok)`, exitCode: 0 };
}

// ---------- Tool definition (Responses API) ----------
const tools = [
  {
    type: "function" as const,
    name: "run_shell",
    description: "Run an allowlisted command in the project sandbox and return stdout/stderr/exitCode.",
    parameters: {
      type: "object",
      properties: {
        command: { type: "string", description: "Allowlisted command (exact match)." },
        cwd: { type: "string", description: "Optional working directory (restricted in production)." },
      },
      required: ["command"],
      additionalProperties: false,
    },
  },
];

// ---------- Helper types for streaming ----------
type PendingCall = { name: string; argsJson: string };

export async function codexStyleLoop(userTask: string) {
  let previous_response_id: string | undefined;

  // Send initial instructions once, then rely on previous_response_id for statefulness.
  const bootstrapItems: any[] = [
    {
      role: "developer",
      content: [
        {
          type: "input_text",
          text:
            [
              "You are a coding agent operating in a sandbox.",
              "Follow repo instructions if provided.",
              "Be careful: make small changes, run tests when appropriate, and explain decisions briefly.",
              "When you need to run commands, call the run_shell tool with an allowlisted command.",
            ].join("\n"),
        },
      ],
    },
    { role: "user", content: [{ type: "input_text", text: userTask }] },
  ];

  const maxTurns = 12;

  // We keep “new items” per turn. In stateful mode, we don’t resend the entire history.
  let newItems: any[] = bootstrapItems;

  for (let turn = 1; turn <= maxTurns; turn++) {
    const pending: Record<string, PendingCall> = {};
    const toolCalls: Array<{ call_id: string; name: string; arguments: any }> = [];

    let finalText = "";
    let sawCompleted = false;

    const stream = await client.responses.create({
      model: "gpt-5.1-codex-max",
      input: newItems,
      tools,
      stream: true,
      previous_response_id,
    });

    for await (const event of stream) {
      // Stream text output
      if (event.type === "response.output_text.delta") {
        finalText += event.delta;
        process.stdout.write(event.delta);
      }

      // When a function_call output item is created, remember it
      if (event.type === "response.output_item.added" && event.item.type === "function_call") {
        pending[event.item.id] = { name: event.item.name, argsJson: "" };
      }

      // Arguments arrive in deltas
      if (event.type === "response.function_call_arguments.delta") {
        const call = pending[event.item_id];
        if (call) call.argsJson += event.delta;
      }

      // Done = safe point to parse JSON args
      if (event.type === "response.function_call_arguments.done") {
        const call = pending[event.item_id];
        if (call) {
          let args: any;
          try {
            args = JSON.parse(call.argsJson);
          } catch (e) {
            args = { __parse_error: String(e), __raw: call.argsJson };
          }
          toolCalls.push({ call_id: event.item_id, name: call.name, arguments: args });
        }
      }

      if (event.type === "response.completed") {
        previous_response_id = event.response.id;
        sawCompleted = true;
      }
    }

    // If we didn’t get a clean completion, treat it as “incomplete” and decide how to proceed.
    // (In production, this is where you’d apply compaction / retry policies.)
    if (!sawCompleted) {
      return { ok: false, error: "Model response did not complete cleanly (stream ended early)." };
    }

    // Stop if there are no tool calls.
    if (toolCalls.length === 0) {
      return { ok: true, answer: finalText };
    }

    // Execute tool calls and build the next turn’s input
    const nextItems: any[] = [];

    for (const call of toolCalls) {
      if (call.name !== "run_shell") {
        nextItems.push({
          type: "function_call_output",
          call_id: call.call_id,
          output: JSON.stringify({ ok: false, error: `Unknown tool: ${call.name}` }),
        });
        continue;
      }

      const parsed = RunShellSchema.safeParse(call.arguments);
      if (!parsed.success) {
        nextItems.push({
          type: "function_call_output",
          call_id: call.call_id,
          output: JSON.stringify({ ok: false, error: parsed.error.message }),
        });
        continue;
      }

      const result = await runShellSandboxed(parsed.data);
      nextItems.push({
        type: "function_call_output",
        call_id: call.call_id,
        output: JSON.stringify(result),
      });
    }

    // Optional “keep going” nudge (small, safe, avoids runaway)
    nextItems.push({
      role: "user",
      content: [
        {
          type: "input_text",
          text:
            [
              "Continue.",
              "If you changed code, run an allowlisted test/lint command (or explain why not).",
              "If blocked, explain what you need.",
            ].join(" "),
        },
      ],
    });

    newItems = nextItems;
  }

  return { ok: false, error: `Max turns (${maxTurns}) reached.` };
}

What this code demonstrates (for real)

Streaming: we build output as deltas arrive
Tool calling: tool calls are output items, arguments are streamed and must be accumulated
Statefulness without duplication: we rely on previous_response_id and only send new items per turn
A safer starting posture: allowlisting, structured tool output, bounded turns

6) Tool design that doesn’t sabotage you

Tool rule 1: Make tools boring

Your model should do the thinking; tools should do the doing.

Good tools:

read_file(path)
apply_patch(diff)
run_tests(target)
search_repo(query)

Bad tools:

do_the_thing(task: string) (aka “please hallucinate in production”)

Tool rule 2: Prefer diffs over raw edits

Agents are way more reliable when they produce diffs. You can validate:

file paths touched
max lines changed
forbidden patterns
formatting checks

Tool rule 3: Return machine-parseable output

Even if the model reads it, your driver needs it too:

exit codes
file lists
test summaries
error classification

Tool rule 4: Explicit permissions beat “be careful”

Policy must be executable, not aspirational.

If the agent can run arbitrary shell, it will eventually run:

something destructive
something leaky
something that “works on my machine” and fails in CI

7) Long-running tasks: compaction + caching

When your agent does real work, the context window grows like sourdough starter. Eventually:

latency climbs
cost climbs
the model starts “forgetting” early constraints

Compaction: shrink the window without losing requirements

OpenAI’s conversation state guidance describes compaction as a way to keep long-running threads manageable when inputs grow. Practically:

Keep user requirements verbatim
Compact tool outputs aggressively (summaries + pointers to artifacts)
Compact repeated logs (tests, lint) into “last known status”

Corrected wording: We’re not going to name a specific endpoint here unless you wire it up from the docs—because SDK surfaces and endpoints can evolve. The reliable concept is: compaction exists as a supported pattern, and you should design for it.

Prompt caching: stop paying for the same prefix

Provider-side prompt caching (when available) rewards consistency:

keep your stable “policy + repo instructions” prefix consistent
append only the changing state (tool results, diffs, progress) at the end

Corrected wording: we’re not claiming a specific “cache bucketing identifier” parameter—just the practical implementation advice: stable prefix good, constantly mutating prefix bad.

8) Observability and evals: making the loop measurable

If you can’t answer these, you don’t have an agent—you have an expensive surprise generator:

How many tool calls per task?
Where does time go (model vs tools vs retries)?
What’s the pass rate on “tests green”?
What percent of runs hit max-iterations?
Which repos/projects produce the most failures?

Minimum viable telemetry

Log per turn:

model used
tokens in/out
tool calls (name, args hash, duration, exit code)
diff stats (#files, #lines)
outcome label (success / partial / failed)

Evals that actually matter

For coding agents, use task-level evals:

build passes
unit tests pass
lint passes
no forbidden files touched
PR description includes reproduction steps

9) Comparisons: Codex-style loop vs popular frameworks

Let’s keep this spicy and fair.

Codex-style “unrolled loop”

Strengths

You own the loop: policy, state, sandbox, compaction, caching
Fits production constraints cleanly (great for platform teams)
Streaming + tool execution is first-class in the Responses model

Trade-off

You’re building infrastructure (not just prompting)

LangGraph

LangGraph is designed for graph-based workflows—explicit state, branching, loops.

Great when

you want deterministic control flow
you want explicit states and transitions
you’re orchestrating many steps that shouldn’t be “model decides everything”

Trade-off

you still need sandbox/policies; graphs don’t secure shell access

AutoGen

AutoGen is a multi-agent conversation framework: roles, chats, humans-in-the-loop, tools.

Great when

you want multi-agent collaboration patterns fast
you’re prototyping “agent teams”

Trade-off

multi-agent loops can multiply complexity fast (more turns, more tool calls, more coordination)

CrewAI

CrewAI focuses on orchestrating role-playing agents into a cohesive “crew.”

Great when

you want a simple delegation mental model
your team needs quick onboarding

Trade-off

you’ll still want a serious driver layer for safety, state, and constraints

SWE-agent

SWE-agent is “agent-loop honest”: it lives in real repos, uses tools, and is evaluation-driven.

Great when

you’re studying coding-agent ergonomics and real-world loop design
you care about measurable outcomes (tests, diffs, repo changes)

Trade-off

platform teams often want tighter integration with internal systems than a CLI-first workflow

Practical takeaway
Frameworks help you organize the loop. Codex-style unrolling helps you own the loop.

10) Production checklist

Loop safety

Max turns + max tool calls
Command allowlists (not denylists)
Path allowlists (no touching secrets)
Network policy (often default-deny)
Timeouts + output size caps on tools

Reliability

Deterministic tools
Diff-based edits
Always run tests if code changed (or require justification)

Context control

Compaction strategy (what stays verbatim vs summarized)
Stable prompt prefix (for efficiency and fewer surprises)

Observability

Per-turn logs + traces
Outcome labels and failure taxonomy
Regression eval suite (tests, lint, policy checks)

Key takeaways

The model is not the agent. The loop is the agent.
Unroll the loop into explicit driver/state/tool/policy layers, and you gain control.
Design tools to be boring and deterministic—the driver validates, the sandbox constrains, the model decides.
Plan for long horizons: compaction + stable prefixes or you’ll drown in your own logs.
Measure everything: if you can’t chart failure modes, you can’t improve them.

— Cohorte Team
January 26,2026.

Unrolling the Codex Agent Loop Without Losing Your Mind.

Table of contents

1) Why “agent loop” matters (and why most implementations rot)

2) The Codex loop in one sentence (and the 5 moving parts)

3) Unrolling the loop: a mental model you can implement today

Step A — Build the input window

Step B — Ask the model (streaming)

Step C — Execute tool calls (with policy)

Step D — Append observations back into state

Step E — Stop condition

4) Reference architecture: Driver, State, Tools, Sandbox, Policy

Driver (the “air traffic controller”)

State (the “shared brain”)

Tools (the “hands”)

Sandbox + permissions (the “guardrails”)

Policy & prompt assembly (the “adult supervision”)

5) A minimal loop (TypeScript) that actually works

What this code demonstrates (for real)

6) Tool design that doesn’t sabotage you

Tool rule 1: Make tools boring

Tool rule 2: Prefer diffs over raw edits

Tool rule 3: Return machine-parseable output

Tool rule 4: Explicit permissions beat “be careful”

7) Long-running tasks: compaction + caching

Compaction: shrink the window without losing requirements

Prompt caching: stop paying for the same prefix

8) Observability and evals: making the loop measurable

Minimum viable telemetry

Evals that actually matter

9) Comparisons: Codex-style loop vs popular frameworks

Codex-style “unrolled loop”

LangGraph

AutoGen

CrewAI

SWE-agent

10) Production checklist

Loop safety

Reliability

Context control

Observability

Key takeaways