ReWOO vs. ReAct: Which Agent Pattern Should Power Your AI Stack in 2025?

AI teams love ReWOO for speed, ReAct for control. See which wins in 2025—plus get production-hardened code and benchmark truths.

A field guide for engineering leaders: architecture trade-offs, production-ready code, and hard-won tips to ship faster (and cheaper)

We’ve all been there: your agent keeps thrashing tools, latency spikes, the token bill looks like a phone number, and the exec ask is still “Can we ship this quarter?” In this guide we go deep on ReAct (reason→act→observe) and ReWOO (plan→parallel work→solve), compare when each pattern shines, and give you drop-in, production-savvy code you can adapt right now. We also fact-check the common claims (token efficiency, accuracy bumps), link the originals, and add the safety rails you’ll wish you’d had last week.

TL;DR (for busy VPs & staff engineers)

  • ReAct interleaves thinking with tool use. It’s simple, adaptable, and great for interactive tasks. But it can balloon tokens and latency because every tool call pauses the model mid-thought.
  • ReWOO plans first, runs tools in parallel, then synthesizes an answer from evidence you control. It commonly reduces tokens and wall-clock on multi-tool tasks—but planning quality and validation matter. Paper reports ~5× token efficiency and ~+4% HotpotQA accuracy vs baselines (benchmark-specific; measure on your data).
  • Rule of thumb: If your task requires multiple tools or long chains, start ReWOO. If it’s highly interactive or step-wise with human feedback, start ReAct. Validate with evals.

1) Concepts in one minute

ReAct (Reason + Act)

The model alternates: Think → Act (tool) → Observe → Think → …, until it decides to finish. It’s intuitive, easy to implement, and works well when the next step depends tightly on the last observation. Downsides: repeated re-prompting and tool latency in the loop.

ReWOO (Reasoning WithOut Observation)

The model first plans the needed steps (and which tools), then you execute steps in parallel, collect evidence, and a final solver composes the answer using only that evidence. Benefits: fewer LLM turns, parallelism, better controllability/citations. Costs: you must validate plans/evidence or risk “confident nonsense.”

2) Production-ready skeletons (drop-in)

Below we show portable patterns. We avoid provider-specific SDK code by injecting a call_llm() adapter (so you can swap OpenAI/Anthropic/Vertex/etc. without rewriting the agent). We enforce JSON I/O, tool allow-lists, timeouts, and budget caps.

Minimal assumptions: each tool has a signature tool(input: str) -> str.
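
For example, a registry might look like the sketch below. Both tools are hypothetical stand-ins; wire in your real search/HTTP/database clients.

# A minimal tool registry matching that signature. Both tools are illustrative
# placeholders, not real integrations.
from typing import Dict, Callable

def search_web(query: str) -> str:
    # e.g., call your search API and return a text summary
    raise NotImplementedError

def calculator(expression: str) -> str:
    import ast
    # literal_eval accepts numeric literals and +/- only; it will not execute
    # arbitrary code (never use eval() on model output)
    return str(ast.literal_eval(expression))

TOOLS: Dict[str, Callable[[str], str]] = {
    "search_web": search_web,
    "calculator": calculator,
}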

2.1 ReAct loop (safe & sturdy)

import json
from typing import Dict, Callable

class ToolError(Exception): pass

def react_agent(
    question: str,
    tools: Dict[str, Callable[[str], str]],
    call_llm: Callable[[str], str],
    max_turns: int = 6,
    max_tokens_budget: int = 120_000,
) -> str:
    """
    Portable ReAct agent.
    - Enforces JSON output: {"thought": "...", "action": {"tool": "<name>|finish", "input": "<text>"}}
    - Uses an allow-list of tools.
    - Guards against budget blowups and unknown tools.
    """
    transcript: list[dict] = []
    spent_tokens = 0

    for _ in range(max_turns):
        prompt = (
            "You are a ReAct agent. OUTPUT STRICT JSON ONLY:\n"
            '{"thought":"...", "action":{"tool":"<name>|finish","input":"<text>"},'
            '"final_answer":""}\n'
            "Tools: " + ", ".join(sorted(tools.keys())) + "\n"
            "Rules: Prefer minimal steps. If answerable now, use tool='finish'.\n"
            f"Transcript: {json.dumps(transcript)[:4000]}\n"
            f"Question: {question}\n"
        )
        step = call_llm(prompt)                          # provider-specific under the hood
        spent_tokens += len(step)                        # rough proxy; replace with provider usage
        if spent_tokens > max_tokens_budget:
            return "Budget exceeded; partial transcript suppressed."

        try:
            data = json.loads(step)
            action = data["action"]
        except Exception as e:
            return f"Malformed model output: {e}"

        if action["tool"] == "finish":
            return data.get("final_answer") or action.get("input", "")

        tool_name = action["tool"]
        tool = tools.get(tool_name)
        if not tool:
            raise ToolError(f"Unknown tool: {tool_name}")

        try:
            observation = tool(action["input"])
        except Exception as e:
            observation = f"TOOL_ERROR: {e}"

        transcript.append(
            {"thought": data.get("thought", ""), "tool": tool_name, "input": action["input"], "observation": observation}
        )

    return "No answer after max_turns."

Why this holds up in prod: JSON schema (less regex sadness), explicit allow-list, simple “token budget” fuse, and resilience to tool failures. Matches the ReAct idea from the paper; you supply better prompts/tools as needed.
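
To smoke-test the loop without a provider, you can wire a scripted call_llm. The canned JSON steps and the echo tool below are illustrative, not a real model trace.

# Each call to the fake LLM pops the next canned JSON step.
canned = iter([
    '{"thought":"Use the tool","action":{"tool":"echo","input":"hello"},"final_answer":""}',
    '{"thought":"Done","action":{"tool":"finish","input":""},"final_answer":"echoed: hello"}',
])

def fake_llm(prompt: str) -> str:
    return next(canned)

print(react_agent("Say hello", tools={"echo": lambda s: f"echoed: {s}"}, call_llm=fake_llm))
# -> echoed: hello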

2.2 ReWOO (plan → parallel work → solve), with validation

from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeout
from typing import Dict, Callable, Any
import json

def plan(question: str, call_llm: Callable[[str], str], allowed_tools: set[str]) -> dict:
    schema = (
        "Return JSON ONLY:\n"
        '{"steps":[{"id":"s1","tool":"<ALLOWED_TOOL>","input":"..."}],'
        '"final":"Describe how to combine step IDs, e.g. [s1][s3]."}'
    )
    raw = call_llm(
        "You are a planner. Create the MINIMAL set of steps to answer.\n"
        f"Allowed tools: {sorted(allowed_tools)}\n{schema}\nQuestion: {question}"
    )
    p = json.loads(raw)
    # Basic validation (explicit raises; assert statements vanish under python -O)
    if not isinstance(p.get("steps"), list) or not p["steps"]:
        raise ValueError("Empty plan")
    for s in p["steps"]:
        if s.get("tool") not in allowed_tools:
            raise ValueError(f"Unknown tool: {s.get('tool')}")
        if not str(s.get("id", "")).startswith("s"):
            raise ValueError("Each step needs an id like 's1'")
    return p

def run_workers(p: dict, tools: Dict[str, Callable[[str], str]], timeout_s: float = 20.0) -> dict[str, Any]:
    results: dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=min(8, len(p["steps"]))) as ex:
        futs = {ex.submit(tools[s["tool"]], s["input"]): s["id"] for s in p["steps"]}
        try:
            for f in as_completed(futs, timeout=timeout_s):
                sid = futs[f]
                try:
                    results[sid] = f.result()  # future is already done here
                except Exception as e:
                    results[sid] = f"ERROR: {e}"
        except FuturesTimeout:
            # as_completed raises once the deadline passes; record the stragglers
            # instead of crashing the run. Note: executor shutdown still waits for
            # threads that already started; use process isolation for hung tools.
            for f, sid in futs.items():
                if sid not in results:
                    f.cancel()
                    results[sid] = "ERROR: timed out"
    return results

def solve(question: str, p: dict, evidence: dict[str, Any], call_llm: Callable[[str], str]) -> str:
    # Enforce evidence-only answering; require step-id citations.
    ctx = {"question": question, "plan": p, "evidence": evidence}
    answer = call_llm(
        "You are a solver. Use ONLY the evidence below. Cite step IDs like [s1]. "
        "If evidence is missing, say so and stop.\n" + json.dumps(ctx)
    )
    # Optional: verify cited IDs exist.
    cited = {sid for sid in evidence.keys() if f"[{sid}]" in answer}
    if not cited and evidence:
        answer = "No valid citations found; refusing."
    return answer

def rewoo_agent(question: str, tools: Dict[str, Callable[[str], str]], call_llm: Callable[[str], str]) -> str:
    allowed = set(tools.keys())
    p = plan(question, call_llm, allowed_tools=allowed)
    ev = run_workers(p, tools)
    return solve(question, p, ev, call_llm)

Why this works in practice: single LLM turn to plan, parallel tool fan-out, one LLM turn to synthesize with explicit citations. This reflects the ReWOO paper’s decoupling and tends to cut both tokens and wall-clock on multi-step tasks. Your mileage varies with planning quality and tool latency.
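
Wiring it up looks like this. The scripted planner/solver responses are illustrative only.

def fake_llm(prompt: str) -> str:
    # Return a canned plan for the planner prompt, a cited answer otherwise.
    if "You are a planner" in prompt:
        return ('{"steps":[{"id":"s1","tool":"echo","input":"alpha"},'
                '{"id":"s2","tool":"echo","input":"beta"}],'
                '"final":"Combine [s1] and [s2]."}')
    return "Combined: echoed alpha [s1] and echoed beta [s2]."

print(rewoo_agent("Combine alpha and beta",
                  tools={"echo": lambda s: f"echoed {s}"},
                  call_llm=fake_llm))
# -> Combined: echoed alpha [s1] and echoed beta [s2]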

3) When to choose which (and why)

When to choose ReAct vs ReWOO (summary)

  • Conversational agents where each step depends on the last observation → ReAct. Fine-grained adaptivity; human-in-the-loop flows work well. (ReAct paper)
  • Multi-tool pipelines, long chains, strict provenance → ReWOO. Parallel tool calls, fewer LLM turns, evidence-bounded answers. (ReWOO paper)
  • Tight latency or token budget → ReWOO (often). Planning compresses tokens; parallel I/O reduces wall-clock time. (Benchmark-dependent.) (ReWOO paper)
  • Messy tasks where the next step is unknown → ReAct. Interleaving reasoning and acting lets the model adapt using the latest observation. (ReAct blog)
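
If you route at runtime, the table above reduces to a small heuristic. The signals and thresholds below are assumptions; tune them against your own evals.

def choose_pattern(interactive: bool, expected_tool_calls: int, needs_citations: bool) -> str:
    # Encodes the summary table as a first-pass router; refine with eval data.
    if interactive:
        return "react"   # next step depends on the latest observation or a human
    if needs_citations or expected_tool_calls >= 3:
        return "rewoo"   # plan once, fan out tools, compose from evidence
    return "react"       # simple tasks: the plain loop is easier to debug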

About the famous numbers: The ReWOO paper reports ~5× token efficiency and ~+4% HotpotQA accuracy vs baselines. Treat these as illustrative, benchmark-specific—you must validate on your workloads.

4) Implementation patterns we actually recommend

4.1 Safety rails (both patterns)

  • Tool allow-list & per-tool validators. Never allow arbitrary function names; validate inputs (URLs, SQL, shell).
  • Timeouts, retries, and budgets. Guard the loop with token/time caps and circuit-breakers; a wrapper sketch follows this list.
  • Evidence-only composing. In ReWOO, require step-ID citations and verify them; refuse if missing.
  • Prompt-injection hardening. Don’t pass raw webpage/tool output into the model; sanitize and/or extract structured fields first.
  • Observability. Log plan, tool calls (with durations), usage, and final citations for every run (PII-safe).
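
Here is a sketch of the first two rails: wrap each registered tool with an input validator, a timeout, and bounded retries. guarded() and its defaults are illustrative; ToolError comes from the ReAct skeleton in section 2.1.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

def guarded(
    tool: Callable[[str], str],
    validate: Optional[Callable[[str], None]] = None,  # raise to reject bad input
    timeout_s: float = 10.0,
    retries: int = 1,
) -> Callable[[str], str]:
    def wrapped(arg: str) -> str:
        if validate:
            validate(arg)  # e.g., URL allow-list, read-only SQL check
        last_err: Optional[Exception] = None
        for _ in range(retries + 1):
            ex = ThreadPoolExecutor(max_workers=1)
            try:
                return ex.submit(tool, arg).result(timeout=timeout_s)  # raises on timeout
            except Exception as e:
                last_err = e
            finally:
                ex.shutdown(wait=False)  # don't block the agent on a hung thread
        raise ToolError(f"tool failed after {retries + 1} attempts: {last_err}")
    return wrapped

Registration then looks like tools = {"fetch": guarded(fetch, validate=check_url, timeout_s=8.0)}, where check_url is your own validator.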

4.2 LangGraph & “plan-and-execute”

If you prefer graph primitives, LangGraph ships a plan-and-execute tutorial and examples that mirror this guide. The ergonomics are good for state machines, retries, and eventing.
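
For orientation, the section 2.2 flow maps onto a LangGraph state machine roughly as below. This is a sketch; check API names against your installed langgraph version. plan/run_workers/solve are the functions above, and TOOLS and call_llm are your own wiring.

from typing import Any, Dict, TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict, total=False):
    question: str
    plan: dict
    evidence: Dict[str, Any]
    answer: str

def plan_node(state: AgentState) -> dict:
    return {"plan": plan(state["question"], call_llm, allowed_tools=set(TOOLS))}

def work_node(state: AgentState) -> dict:
    return {"evidence": run_workers(state["plan"], TOOLS)}

def solve_node(state: AgentState) -> dict:
    return {"answer": solve(state["question"], state["plan"], state["evidence"], call_llm)}

g = StateGraph(AgentState)
g.add_node("plan", plan_node)
g.add_node("work", work_node)
g.add_node("solve", solve_node)
g.set_entry_point("plan")
g.add_edge("plan", "work")
g.add_edge("work", "solve")
g.add_edge("solve", END)
app = g.compile()
# app.invoke({"question": "..."}) returns the final state, including "answer".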

5) Practical use cases (with patterns)

A. Web research + code analysis (3 tools)

  • Tools: web_search(q), fetch(url), python_sandbox(code)
  • Pattern: ReWOO. Plan first, run fetch for the top K URLs in parallel, then summarize with citations like [s2].
  • Why: Parallel I/O dominates; single synth pass is cheaper and more controllable.
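
For concreteness, here is the shape of plan the section 2.2 planner might emit for this use case; the query and URLs are illustrative placeholders.

example_plan = {
    "steps": [
        {"id": "s1", "tool": "web_search", "input": "ReWOO plan-work-solve agents"},
        {"id": "s2", "tool": "fetch", "input": "https://example.com/top-hit-1"},
        {"id": "s3", "tool": "fetch", "input": "https://example.com/top-hit-2"},
    ],
    "final": "Summarize [s2] and [s3]; use [s1] only for context.",
}

Note that this simple schema has no inter-step references, so fetch targets must be known at plan time. If they depend on s1's search results, either run a second mini-plan after the search or add ReWOO-style #E placeholders that workers substitute at runtime, as in the paper.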

B. Interactive data cleaning with a human

  • Tools: df_preview(), apply_transform(expr)
  • Pattern: ReAct. Human feedback changes the next action (“Undo that,” “try regex capture”), so interleaving wins.

C. Ops runbooks (incident management)

  • Tools: pagerduty.search, k8s.logs, k8s.rollout
  • Pattern: Hybrid. Use ReWOO for the fan-out evidence collection step, then hand off to a ReAct loop with a human for remediation.
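
A sketch of that hybrid, reusing react_agent (2.1) plus plan and run_workers (2.2). The read/write tool split and the evidence-seeding prompt are illustrative.

import json

def incident_agent(incident: str, read_tools, write_tools, call_llm) -> str:
    # Phase 1 (ReWOO): one planning turn, then parallel read-only evidence collection.
    p = plan(f"Collect evidence for incident: {incident}", call_llm, allowed_tools=set(read_tools))
    evidence = run_workers(p, read_tools)
    # Phase 2 (ReAct): interactive remediation seeded with the evidence; keep a
    # human approving any write_tools call.
    seeded = (
        f"Incident: {incident}\n"
        f"Evidence: {json.dumps(evidence)[:2000]}\n"
        "Propose a remediation and apply it step by step."
    )
    return react_agent(seeded, tools=write_tools, call_llm=call_llm)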

6) Benchmarks, claims, and reality

  • ReAct is established; great for QA (HotpotQA/FEVER) and decision-making tasks (ALFWorld/WebShop).
  • ReWOO introduced the decoupled plan/work/solve idea and reports efficiency/accuracy gains, plus robustness to tool failure. Treat as promising—but verify end-to-end on your infra and tools.
  • LangGraph “planning agents.” Official guidance suggests plan-and-execute can be faster/cheaper for many tasks; again, data- and tool-dependent.

7) Copy-paste adapters (wire any provider)

These helpers make the above agent code portable:

# Example adapters you can implement once per provider

def call_llm_openai(prompt: str) -> str:
    # from openai import OpenAI
    # client = OpenAI()
    # resp = client.chat.completions.create(
    #     model="gpt-4.1-mini",
    #     messages=[{"role":"user","content":prompt}],
    # )
    # return resp.choices[0].message.content
    raise NotImplementedError

def call_llm_anthropic(prompt: str) -> str:
    # from anthropic import Anthropic
    # client = Anthropic()
    # resp = client.messages.create(model="claude-3-5-sonnet-20241022",
    #     messages=[{"role":"user","content":prompt}],
    #     max_tokens=800)
    # return resp.content[0].text
    raise NotImplementedError

(API method names evolve—keep this adapter layer so agent code stays stable.)

8) Dev checklist

  • Choose pattern (ReAct vs ReWOO) based on tool fan-out and interactivity.
  • Enforce JSON contracts for model output.
  • Create a tool registry (name → function; validators; timeouts).
  • Add token & time budgets; log every step.
  • For ReWOO, require citations and verify them before returning.
  • Ship with evals (golden tasks + ops metrics: p95 latency, cost).
  • Reassess regularly; some workflows migrate from ReAct → ReWOO as they stabilize.

Key takeaways

  • Use ReAct when the next step depends on the last observation (and/or humans are in the loop).
  • Use ReWOO when you can plan, fan-out tools in parallel, and compose from evidence.
  • Demand JSON, allow-lists, timeouts, budgets, and verified citations.
  • Treat benchmark gains as signals, not guarantees. Measure on your stack.

Both ReAct and ReWOO are must-know patterns. ReAct gives you agility; ReWOO gives you scale. Great agent stacks in 2025 will use both, with sensible routing, ruthless observability, and strong guardrails.

— Cohorte Team
October 27, 2025