Unrolling the Codex Agent Loop Without Losing Your Mind

Stop agent loops from rotting: learn the Codex-style driver/state/tool pattern + a working TS loop with streaming tools and guardrails.

A practical, code-first guide to the “Codex loop”: streaming tool calls, state, compaction, caching, and production guardrails—so your agents stop vibe-coding and start shipping.

Table of contents

  1. Why “agent loop” matters (and why most implementations rot)
  2. The Codex loop in one sentence (and the 5 moving parts)
  3. Unrolling the loop: a mental model you can implement today
  4. Reference architecture: the Driver, State, Tools, Sandbox, Policy
  5. A minimal loop (TypeScript) that actually works (streaming + tool calls)
  6. Tool design that doesn’t sabotage you (schemas, safety, determinism)
  7. Long-running tasks: compaction + caching (without surprises)
  8. Observability and evals: making the loop measurable
  9. Comparisons: Codex-style loop vs LangGraph vs AutoGen vs CrewAI vs SWE-agent
  10. Production checklist + key takeaways

1) Why “agent loop” matters (and why most implementations rot)

We’ve all seen the demo: an agent edits a file, runs tests, fixes bugs, opens a PR, and everyone claps.

Then reality shows up:

  • The agent “forgets” the constraint from 3 turns ago.
  • It calls tools in a loop like a Roomba stuck in a corner.
  • Your logs say “tool_call: true” and absolutely nothing else.
  • Your VP asks: “Can we ship this without setting the repo on fire?”

The uncomfortable truth: agents don’t fail because the model “isn’t smart enough.”
They fail because the loop is under-designed.

The OpenAI Codex write-up is valuable precisely because it breaks the magic trick into an implementable system: prompt construction, tool permissions, streaming tool calls, state management, and context control—i.e., the boring parts that decide whether you ship.

2) The Codex loop in one sentence (and the 5 moving parts)

One sentence:

Repeatedly ask the model what to do next, execute the requested tool calls in a constrained environment, feed results back, and stop only when the model produces a final answer.

That sounds obvious—until you implement it and discover there are five separate systems hiding inside:

  1. Driver: orchestrates “ask → tool → observe → continue”
  2. State: what we keep, what we compact, what we discard
  3. Tools: shell, file ops, tests, network, linters, etc.
  4. Sandbox + permissions: what the agent may do (and what it can’t)
  5. Policy & prompt assembly: repo instructions, safety rules, task framing

Codex’s loop emphasizes structured prompt assembly (repo instructions + environment + policy), plus an explicit permissions model for actions/commands. Translation: it’s not “prompt magic,” it’s systems.
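
If it helps to see the shape before the code, here’s a minimal sketch of the five parts as TypeScript interfaces. The names (Tool, Policy, Sandbox, AgentState, Driver) are ours, not an SDK surface, and the shapes are intentionally bare.

import type { ZodTypeAny } from "zod";

// Hypothetical shapes for the five moving parts. Illustrative only, not an SDK contract.
interface Tool {
  name: string;
  description: string;
  schema: ZodTypeAny;                                  // structured, validated inputs
  run(args: unknown): Promise<string>;                 // structured output, serialized for the model
}

interface Policy {
  allowCommand(cmd: string): boolean;                  // executable policy, not aspirational
  allowPath(path: string): boolean;
}

interface Sandbox {
  exec(cmd: string, cwd?: string): Promise<{ stdout: string; stderr: string; exitCode: number }>;
}

interface AgentState {
  goal: string;                                        // kept verbatim, never compacted
  plan: string[];
  observations: string[];                              // summarized/compacted over time
  previousResponseId?: string;
}

interface Driver {
  runTask(task: string): Promise<{ ok: boolean; answer?: string; error?: string }>;
}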

3) Unrolling the loop: a mental model you can implement today

Here’s the “unrolled” version we use when building real systems:

Step A — Build the input window

  • System/developer instructions (your policy)
  • Repo/workspace instructions (like AGENTS.md, coding standards, “how to run tests”)
  • User request (“Fix bug X”, “Refactor module Y”)
  • Current working state (files changed, test output, TODO list)
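
Here’s a sketch of Step A in code, assuming a hypothetical loadRepoInstructions() helper that reads something like AGENTS.md from the workspace root:

import { readFile } from "node:fs/promises";

// Hypothetical helper: pull repo-owned instructions (e.g. AGENTS.md) into the window.
async function loadRepoInstructions(repoRoot: string): Promise<string> {
  try {
    return await readFile(`${repoRoot}/AGENTS.md`, "utf8");
  } catch {
    return "(no repo instructions found)";
  }
}

async function buildInputWindow(repoRoot: string, userTask: string, workingState: string) {
  const policy = "You are a coding agent. Make small changes, run tests, explain briefly.";
  const repoInstructions = await loadRepoInstructions(repoRoot);

  // Order matters: stable policy first, volatile working state last (see the caching section).
  return [
    { role: "developer", content: [{ type: "input_text", text: policy }] },
    { role: "developer", content: [{ type: "input_text", text: repoInstructions }] },
    { role: "user", content: [{ type: "input_text", text: userTask }] },
    { role: "user", content: [{ type: "input_text", text: `Current state:\n${workingState}` }] },
  ];
}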

Step B — Ask the model (streaming)

Streaming isn’t just UX candy. It’s operational:

  • show partial progress
  • intercept tool calls as they’re produced
  • avoid waiting for one huge blob response

Step C — Execute tool calls (with policy)

  • Validate tool name + arguments
  • Enforce allowlists/denylists (commands, paths, network)
  • Run in a sandbox

Step D — Append observations back into state

Tool output becomes new input items.

Step E — Stop condition

Stop when:

  • the model has no pending tool calls, and
  • you’ve received a completed response (not truncated / incomplete)

If this feels like building a tiny operating system… yeah. Congrats. You’re now a loop engineer.

4) Reference architecture: Driver, State, Tools, Sandbox, Policy

Driver (the “air traffic controller”)

Responsibilities:

  • stream model output
  • detect tool calls
  • execute tools
  • feed results back
  • enforce max iterations + time budgets
  • enforce “done means done” (completion criteria)
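
The driver’s outer shape can stay small. Here’s a sketch with turn and wall-clock budgets; askModel and executeTools are placeholders for the streaming and tool-execution code shown in section 5.

// Placeholder signatures: askModel streams one model turn, executeTools runs its tool calls.
type TurnResult = { done: boolean; answer?: string; toolCalls: unknown[] };
declare function askModel(newItems: unknown[]): Promise<TurnResult>;
declare function executeTools(toolCalls: unknown[]): Promise<unknown[]>;

async function drive(initialItems: unknown[], maxTurns = 12, maxMillis = 10 * 60 * 1000) {
  const deadline = Date.now() + maxMillis;
  let newItems = initialItems;

  for (let turn = 1; turn <= maxTurns; turn++) {
    if (Date.now() > deadline) return { ok: false, error: "Time budget exceeded." };

    const result = await askModel(newItems);
    // "Done means done": stop only on a clean completion with no pending tool calls.
    if (result.done) return { ok: true, answer: result.answer };

    newItems = await executeTools(result.toolCalls);
  }
  return { ok: false, error: `Max turns (${maxTurns}) reached.` };
}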

State (the “shared brain”)

Keep:

  • current goal + constraints
  • plan / task list
  • tool outputs (summarized)
  • diffs / key file excerpts (not entire repos)

Decide early:

  • what gets compacted
  • what must remain verbatim (e.g., user requirements)

Important: statefulness is a strategy, not a vibe. Pick one (both modes are sketched below):

  • Stateful-by-ID: send only new items each turn + previous_response_id
  • Stateless: send a curated window every turn (and no previous_response_id)

Mixing both can accidentally double-inject context and inflate costs.
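
To make the difference concrete, here’s a sketch of both modes against the Responses API. curateWindow is a hypothetical function you’d write; the model name is the one used later in this post.

import OpenAI from "openai";

const client = new OpenAI();

// Stateful-by-ID: send only the new items; the API threads prior turns via previous_response_id.
async function statefulTurn(newItems: any[], previousResponseId?: string) {
  return client.responses.create({
    model: "gpt-5.1-codex-max",
    input: newItems,
    previous_response_id: previousResponseId,
  });
}

// Stateless: you own the window. Curate it every turn and do NOT pass previous_response_id.
declare function curateWindow(fullHistory: any[]): any[]; // hypothetical: compaction lives here

async function statelessTurn(fullHistory: any[]) {
  return client.responses.create({
    model: "gpt-5.1-codex-max",
    input: curateWindow(fullHistory),
  });
}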

Tools (the “hands”)

The best tools:

  • are deterministic
  • are well-scoped
  • have structured inputs/outputs
  • return actionable errors

Sandbox + permissions (the “guardrails”)

This is where you prevent:

  • rm -rf
  • credential exfiltration
  • repo-wide rewrites
  • network wandering
  • “helpful” data-leaks into logs

A sandbox is not optional. It’s your seatbelt. And yes, it’s annoying—like seatbelts.
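
Guardrails only count if they’re executable. A small sketch, assuming a trusted repoRoot plus a tiny command allowlist and a denylist of sensitive path prefixes:

import { resolve, relative, isAbsolute } from "node:path";

const ALLOWED_COMMANDS = new Set(["npm test", "npm run lint", "git diff"]);
const FORBIDDEN_PATH_PREFIXES = [".git/", ".env", "secrets/"];

function commandAllowed(command: string): boolean {
  return ALLOWED_COMMANDS.has(command.trim());
}

// Reject paths that escape the repo root or touch sensitive locations.
function pathAllowed(repoRoot: string, requested: string): boolean {
  const rel = relative(resolve(repoRoot), resolve(repoRoot, requested));
  if (rel.startsWith("..") || isAbsolute(rel)) return false;
  return !FORBIDDEN_PATH_PREFIXES.some((p) => rel === p.replace(/\/$/, "") || rel.startsWith(p));
}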

Policy & prompt assembly (the “adult supervision”)

Our hot take: your best prompt is the one your repo can own. Put it in version control, not in someone’s Notion page that hasn’t been opened since Q2.

5) A minimal loop (TypeScript) that actually works

Below is a minimal “Codex-style” loop using the OpenAI JS SDK and the Responses API.

Two details that are easy to get wrong:

  • Tool-call arguments are streamed in deltas, so we accumulate them and only parse JSON when they’re done.
  • We avoid duplicating history by using stateful-by-ID mode: we send only new items each turn and rely on previous_response_id.

This is still intentionally minimal. In production you’ll add a real sandbox runner, command allowlists, path policies, timeouts, and structured telemetry.

import OpenAI from "openai";
import { z } from "zod";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// ---------- Tool schema ----------
const RunShellSchema = z.object({
  // In production, don’t accept “any string command”.
  // Prefer structured commands (command + args) and validate hard.
  command: z.string().min(1),
  cwd: z.string().optional(),
});

type ToolResult =
  | { ok: true; stdout: string; stderr?: string; exitCode: number }
  | { ok: false; error: string; exitCode?: number };

// ---------- Sandbox runner (placeholder) ----------
async function runShellSandboxed(args: z.infer<typeof RunShellSchema>): Promise<ToolResult> {
  const { command } = args;

  // ✅ Better than a denylist: allowlist *known safe commands*.
  // This is intentionally tiny; expand per environment.
  const allowed = [
    "npm test",
    "npm run test",
    "npm run lint",
    "pnpm test",
    "pnpm lint",
    "pytest",
    "ruff check .",
    "git diff",
  ];

  const normalized = command.trim();
  if (!allowed.includes(normalized)) {
    return {
      ok: false,
      error: `Command not allowed: "${normalized}". Allowed: ${allowed.join(", ")}`,
      exitCode: 126,
    };
  }

  // 🔥 Replace this with a real sandbox execution (Docker/Firecracker/etc).
  // Return structured output.
  return { ok: true, stdout: `Pretend we ran: ${normalized}\n(ok)`, exitCode: 0 };
}

// ---------- Tool definition (Responses API) ----------
const tools = [
  {
    type: "function" as const,
    name: "run_shell",
    description: "Run an allowlisted command in the project sandbox and return stdout/stderr/exitCode.",
    parameters: {
      type: "object",
      properties: {
        command: { type: "string", description: "Allowlisted command (exact match)." },
        cwd: { type: "string", description: "Optional working directory (restricted in production)." },
      },
      required: ["command"],
      additionalProperties: false,
    },
  },
];

// ---------- Helper types for streaming ----------
type PendingCall = { name: string; call_id: string; argsJson: string };

export async function codexStyleLoop(userTask: string) {
  let previous_response_id: string | undefined;

  // Send initial instructions once, then rely on previous_response_id for statefulness.
  const bootstrapItems: any[] = [
    {
      role: "developer",
      content: [
        {
          type: "input_text",
          text:
            [
              "You are a coding agent operating in a sandbox.",
              "Follow repo instructions if provided.",
              "Be careful: make small changes, run tests when appropriate, and explain decisions briefly.",
              "When you need to run commands, call the run_shell tool with an allowlisted command.",
            ].join("\n"),
        },
      ],
    },
    { role: "user", content: [{ type: "input_text", text: userTask }] },
  ];

  const maxTurns = 12;

  // We keep “new items” per turn. In stateful mode, we don’t resend the entire history.
  let newItems: any[] = bootstrapItems;

  for (let turn = 1; turn <= maxTurns; turn++) {
    const pending: Record<string, PendingCall> = {};
    const toolCalls: Array<{ call_id: string; name: string; arguments: any }> = [];

    let finalText = "";
    let sawCompleted = false;

    const stream = await client.responses.create({
      model: "gpt-5.1-codex-max",
      input: newItems,
      tools,
      stream: true,
      previous_response_id,
    });

    for await (const event of stream) {
      // Stream text output
      if (event.type === "response.output_text.delta") {
        finalText += event.delta;
        process.stdout.write(event.delta);
      }

      // When a function_call output item is created, remember it
      if (event.type === "response.output_item.added" && event.item.type === "function_call") {
        pending[event.item.id] = { name: event.item.name, argsJson: "" };
      }

      // Arguments arrive in deltas
      if (event.type === "response.function_call_arguments.delta") {
        const call = pending[event.item_id];
        if (call) call.argsJson += event.delta;
      }

      // Done = safe point to parse JSON args
      if (event.type === "response.function_call_arguments.done") {
        const call = pending[event.item_id];
        if (call) {
          let args: any;
          try {
            args = JSON.parse(call.argsJson);
          } catch (e) {
            args = { __parse_error: String(e), __raw: call.argsJson };
          }
          toolCalls.push({ call_id: call.call_id, name: call.name, arguments: args });
        }
      }

      if (event.type === "response.completed") {
        previous_response_id = event.response.id;
        sawCompleted = true;
      }
    }

    // If we didn’t get a clean completion, treat it as “incomplete” and decide how to proceed.
    // (In production, this is where you’d apply compaction / retry policies.)
    if (!sawCompleted) {
      return { ok: false, error: "Model response did not complete cleanly (stream ended early)." };
    }

    // Stop if there are no tool calls.
    if (toolCalls.length === 0) {
      return { ok: true, answer: finalText };
    }

    // Execute tool calls and build the next turn’s input
    const nextItems: any[] = [];

    for (const call of toolCalls) {
      if (call.name !== "run_shell") {
        nextItems.push({
          type: "function_call_output",
          call_id: call.call_id,
          output: JSON.stringify({ ok: false, error: `Unknown tool: ${call.name}` }),
        });
        continue;
      }

      const parsed = RunShellSchema.safeParse(call.arguments);
      if (!parsed.success) {
        nextItems.push({
          type: "function_call_output",
          call_id: call.call_id,
          output: JSON.stringify({ ok: false, error: parsed.error.message }),
        });
        continue;
      }

      const result = await runShellSandboxed(parsed.data);
      nextItems.push({
        type: "function_call_output",
        call_id: call.call_id,
        output: JSON.stringify(result),
      });
    }

    // Optional “keep going” nudge (small, safe, avoids runaway)
    nextItems.push({
      role: "user",
      content: [
        {
          type: "input_text",
          text:
            [
              "Continue.",
              "If you changed code, run an allowlisted test/lint command (or explain why not).",
              "If blocked, explain what you need.",
            ].join(" "),
        },
      ],
    });

    newItems = nextItems;
  }

  return { ok: false, error: `Max turns (${maxTurns}) reached.` };
}

What this code demonstrates (for real)

  • Streaming: we build output as deltas arrive
  • Tool calling: tool calls are output items, arguments are streamed and must be accumulated
  • Statefulness without duplication: we rely on previous_response_id and only send new items per turn
  • A safer starting posture: allowlisting, structured tool output, bounded turns

6) Tool design that doesn’t sabotage you

Tool rule 1: Make tools boring

Your model should do the thinking; tools should do the doing.

Good tools:

  • read_file(path)
  • apply_patch(diff)
  • run_tests(target)
  • search_repo(query)

Bad tools:

  • do_the_thing(task: string) (aka “please hallucinate in production”)

Tool rule 2: Prefer diffs over raw edits

Agents are way more reliable when they produce diffs. You can validate:

  • file paths touched
  • max lines changed
  • forbidden patterns
  • formatting checks
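
Here’s a minimal sketch of that driver-side validation, assuming the tool returns unified-diff text; the limits and forbidden patterns are arbitrary examples:

type DiffCheck = { ok: true } | { ok: false; reason: string };

const MAX_CHANGED_LINES = 400;                                   // arbitrary example budget
const FORBIDDEN_FILES = [/^\.env/, /^secrets\//, /\.pem$/];

function validateUnifiedDiff(diff: string): DiffCheck {
  const files = [...diff.matchAll(/^\+\+\+ b\/(.+)$/gm)].map((m) => m[1]);
  const changedLines = diff
    .split("\n")
    .filter((l) => (l.startsWith("+") || l.startsWith("-")) && !l.startsWith("+++") && !l.startsWith("---"))
    .length;

  const forbidden = files.find((f) => FORBIDDEN_FILES.some((re) => re.test(f)));
  if (forbidden) return { ok: false, reason: `Diff touches forbidden file: ${forbidden}` };
  if (changedLines > MAX_CHANGED_LINES) {
    return { ok: false, reason: `Diff too large: ${changedLines} changed lines (max ${MAX_CHANGED_LINES})` };
  }
  return { ok: true };
}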

Tool rule 3: Return machine-parseable output

Even if the model reads it, your driver needs it too:

  • exit codes
  • file lists
  • test summaries
  • error classification
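
For instance, a tiny classifier that turns raw tool output into something the driver can branch on. The categories are ours; invent a taxonomy that matches your failure modes.

type FailureClass = "ok" | "tests_failed" | "build_error" | "timeout" | "policy_blocked" | "unknown";

function classifyToolResult(exitCode: number, stderr: string): FailureClass {
  if (exitCode === 0) return "ok";
  if (exitCode === 126) return "policy_blocked";                 // matches the allowlist refusal above
  if (/timed? ?out/i.test(stderr)) return "timeout";
  if (/(cannot find module|compile|syntax)/i.test(stderr)) return "build_error";
  if (/\d+ (failed|failing)/i.test(stderr)) return "tests_failed";
  return "unknown";
}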

Tool rule 4: Explicit permissions beat “be careful”

Policy must be executable, not aspirational.

If the agent can run arbitrary shell, it will eventually run:

  • something destructive
  • something leaky
  • something that “works on my machine” and fails in CI

7) Long-running tasks: compaction + caching

When your agent does real work, the context window grows like sourdough starter. Eventually:

  • latency climbs
  • cost climbs
  • the model starts “forgetting” early constraints

Compaction: shrink the window without losing requirements

OpenAI’s conversation state guidance describes compaction as a way to keep long-running threads manageable when inputs grow. Practically:

  • Keep user requirements verbatim
  • Compact tool outputs aggressively (summaries + pointers to artifacts)
  • Compact repeated logs (tests, lint) into “last known status”

A note on specifics: we won’t name a particular endpoint or SDK method here, because those surfaces evolve; wire compaction up from the current docs. The reliable concept is that compaction is a supported pattern, and you should design your state for it from day one.
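
What “compact tool outputs aggressively” can look like in the driver, as a sketch: requirements stay verbatim, everything else gets squeezed, and the thresholds are arbitrary.

type StateItem =
  | { kind: "requirement"; text: string }                        // never compacted
  | { kind: "tool_output"; tool: string; text: string }
  | { kind: "note"; text: string };

const MAX_TOOL_OUTPUT_CHARS = 2000;                              // arbitrary example threshold

function compact(items: StateItem[]): StateItem[] {
  return items.map((item) => {
    if (item.kind !== "tool_output" || item.text.length <= MAX_TOOL_OUTPUT_CHARS) return item;
    // Keep head + tail; point at the full artifact instead of inlining it.
    const head = item.text.slice(0, 800);
    const tail = item.text.slice(-400);
    const elided = item.text.length - 1200;
    return { ...item, text: `${head}\n…[${elided} chars elided; full output stored as an artifact]…\n${tail}` };
  });
}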

Prompt caching: stop paying for the same prefix

Provider-side prompt caching (when available) rewards consistency:

  • keep your stable “policy + repo instructions” prefix consistent
  • append only the changing state (tool results, diffs, progress) at the end

A note on specifics: we’re not pointing at a particular cache-control parameter, just the practical advice: a stable prefix is good, a constantly mutating prefix is bad.
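
In code, the habit looks like this (most relevant in stateless mode, where you resend the window every turn): keep the prefix byte-stable across turns and append only what changed.

// Loaded once per task and never edited mid-run, so the cached prefix stays byte-identical.
const stablePolicy = "You are a coding agent operating in a sandbox. Make small changes and run tests.";
const repoInstructions = "(contents of AGENTS.md, read once at task start)";

function buildTurnInput(userTask: string, changingState: string) {
  return [
    { role: "developer", content: [{ type: "input_text", text: stablePolicy }] },      // stable
    { role: "developer", content: [{ type: "input_text", text: repoInstructions }] },  // stable
    { role: "user", content: [{ type: "input_text", text: userTask }] },               // stable per task
    { role: "user", content: [{ type: "input_text", text: changingState }] },          // volatile, last
  ];
}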

8) Observability and evals: making the loop measurable

If you can’t answer these, you don’t have an agent—you have an expensive surprise generator:

  • How many tool calls per task?
  • Where does time go (model vs tools vs retries)?
  • What’s the pass rate on “tests green”?
  • What percent of runs hit max-iterations?
  • Which repos/projects produce the most failures?

Minimum viable telemetry

Log per turn:

  • model used
  • tokens in/out
  • tool calls (name, args hash, duration, exit code)
  • diff stats (#files, #lines)
  • outcome label (success / partial / failed)
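
A minimal per-turn telemetry record, as a sketch. The field names are ours; swap emit() for whatever logger or tracing backend you already run.

import { createHash } from "node:crypto";

type TurnTelemetry = {
  taskId: string;
  turn: number;
  model: string;
  tokensIn: number;
  tokensOut: number;
  toolCalls: { name: string; argsHash: string; durationMs: number; exitCode?: number }[];
  diffStats?: { files: number; lines: number };
  outcome: "success" | "partial" | "failed" | "in_progress";
};

// Hash args so you can correlate repeated calls without logging raw (possibly sensitive) arguments.
const hashArgs = (args: unknown) =>
  createHash("sha256").update(JSON.stringify(args)).digest("hex").slice(0, 12);

// Placeholder sink: swap for your logger or tracing backend.
function emit(record: TurnTelemetry) {
  console.log(JSON.stringify(record));
}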

Evals that actually matter

For coding agents, use task-level evals:

  • build passes
  • unit tests pass
  • lint passes
  • no forbidden files touched
  • PR description includes reproduction steps
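
These checks are cheap to automate. A sketch of a task-level eval pass, assuming a hypothetical runInSandbox() that wraps the same sandbox runner the agent uses:

type EvalResult = { name: string; pass: boolean; detail?: string };

// Hypothetical wrapper around the same sandbox runner the agent uses.
declare function runInSandbox(command: string): Promise<{ exitCode: number; stdout: string }>;

async function evaluateRun(touchedFiles: string[]): Promise<EvalResult[]> {
  const results: EvalResult[] = [];

  const tests = await runInSandbox("npm test");
  results.push({ name: "unit tests pass", pass: tests.exitCode === 0 });

  const lint = await runInSandbox("npm run lint");
  results.push({ name: "lint passes", pass: lint.exitCode === 0 });

  const forbidden = touchedFiles.filter((f) => f.startsWith(".env") || f.startsWith("secrets/"));
  results.push({
    name: "no forbidden files touched",
    pass: forbidden.length === 0,
    detail: forbidden.join(", ") || undefined,
  });

  return results;
}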

9) Comparisons: Codex-style loop vs popular frameworks

Let’s keep this spicy and fair.

Codex-style “unrolled loop”

Strengths

  • You own the loop: policy, state, sandbox, compaction, caching
  • Fits production constraints cleanly (great for platform teams)
  • Streaming and tool execution are first-class in the Responses API

Trade-off

  • You’re building infrastructure (not just prompting)

LangGraph

LangGraph is designed for graph-based workflows—explicit state, branching, loops.

Great when

  • you want deterministic control flow
  • you want explicit states and transitions
  • you’re orchestrating many steps that shouldn’t be “model decides everything”

Trade-off

  • you still need sandbox/policies; graphs don’t secure shell access

AutoGen

AutoGen is a multi-agent conversation framework: roles, chats, humans-in-the-loop, tools.

Great when

  • you want multi-agent collaboration patterns fast
  • you’re prototyping “agent teams”

Trade-off

  • multi-agent loops can multiply complexity fast (more turns, more tool calls, more coordination)

CrewAI

CrewAI focuses on orchestrating role-playing agents into a cohesive “crew.”

Great when

  • you want a simple delegation mental model
  • your team needs quick onboarding

Trade-off

  • you’ll still want a serious driver layer for safety, state, and constraints

SWE-agent

SWE-agent is “agent-loop honest”: it lives in real repos, uses tools, and is evaluation-driven.

Great when

  • you’re studying coding-agent ergonomics and real-world loop design
  • you care about measurable outcomes (tests, diffs, repo changes)

Trade-off

  • platform teams often want tighter integration with internal systems than a CLI-first workflow

Practical takeaway
Frameworks help you organize the loop. Codex-style unrolling helps you own the loop.

10) Production checklist

Loop safety

  • Max turns + max tool calls
  • Command allowlists (not denylists)
  • Path allowlists (no touching secrets)
  • Network policy (often default-deny)
  • Timeouts + output size caps on tools

Reliability

  • Deterministic tools
  • Diff-based edits
  • Always run tests if code changed (or require justification)

Context control

  • Compaction strategy (what stays verbatim vs summarized)
  • Stable prompt prefix (for efficiency and fewer surprises)

Observability

  • Per-turn logs + traces
  • Outcome labels and failure taxonomy
  • Regression eval suite (tests, lint, policy checks)

Key takeaways

  • The model is not the agent. The loop is the agent.
  • Unroll the loop into explicit driver/state/tool/policy layers, and you gain control.
  • Design tools to be boring and deterministic—the driver validates, the sandbox constrains, the model decides.
  • Plan for long horizons: compaction + stable prefixes or you’ll drown in your own logs.
  • Measure everything: if you can’t chart failure modes, you can’t improve them.

Cohorte Team
January 26, 2026.