DSPy, De-Risked: A Practical Guide to LLM System Programming & Auto-Optimization

Build sharper LLM systems with DSPy: tool-using agents, real metrics, and MIPROv2 auto-optimization. Actionable code, faster results, 2025-ready.

Why this post and how to use it

We wrote this as a no-nonsense, developer-first guide to DSPy—the framework for programming LLM systems (not just prompting). You’ll get:

  • A clean mental model for DSPy programs
  • Correct, up-to-date patterns for tools, ReAct, metrics, and optimizers
  • Copy-pasteable examples (RAG, tools, evaluation, auto-optimization)
  • Real-world tips for speed, cost, and safety
  • Where DSPy fits vs. LangChain, LangGraph, and LlamaIndex

Everything here was checked against current DSPy docs and examples. We cite the exact pages beneath each major section so you can jump into the source.

TL;DR (Key Takeaways)

  • DSPy = system programming for LLMs. You write programs (modules with typed signatures), then evaluate and optimize them—no hand-tuned prompt soup.
  • Tools/ReAct require plain functions with type hints. Pass list[Callable] to dspy.ReAct(signature, tools=[...], max_iters=...). Don’t pass arbitrary objects.
  • Evaluate early with real metrics (Evaluate, answer_exact_match, SemanticF1), then auto-optimize with MIPROv2. Save and reload your compiled program.
  • DSPy doesn’t ship a “built-in BM25 retriever.” Use any retriever you like by wrapping it as a tool or composing with your own code/RAG.
  • It plays well with others (you can compose with LangChain/LlamaIndex or orchestrate with LangGraph), but they’re separate abstractions—don’t conflate APIs.

1) The DSPy mental model (10 seconds)

  • Signatures describe inputs/outputs of a task.
  • Modules implement behaviors (e.g., Predict, ChainOfThought, ReAct).
  • Programs are modules wired together.
  • Evaluate runs your program on a dev set with a metric.
  • Optimizers (e.g., MIPROv2) mutate instructions/few-shots to improve the score automatically.
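
If you want the whole map in one glance, here is a minimal sketch (illustrative names; each piece is unpacked in the sections below):

import dspy

class Draft(dspy.Signature):           # Signature: declares the task's inputs/outputs
    topic: str = dspy.InputField()
    outline: str = dspy.OutputField()

draft = dspy.ChainOfThought(Draft)     # Module: a behavior implementing that signature
# Programs are modules composed in plain Python; Evaluate scores them on a dev set,
# and optimizers like MIPROv2 then rewrite instructions/few-shots to raise that score.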

2) Quick start you won’t have to rewrite later

Secure key handling first—avoid hard-coding secrets.

pip install -U dspy
export OPENAI_API_KEY=...  # or your provider key
# dspystart.py
import dspy

# 1) Configure the LM (the provider key is read from the environment)
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)  # sets the global default LM for all modules

This is the current pattern in DSPy quickstarts/tutorials.
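
One habit that pays off: keep the global default cheap and swap in a stronger model only where you need it. dspy.context gives you a scoped override; a minimal sketch, assuming your provider keys are already set as environment variables:

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))      # global default

with dspy.context(lm=dspy.LM("openai/gpt-4o")):       # scoped override, e.g. for one eval run
    ...                                               # calls in this block use gpt-4o

This also makes it easy to point your test suite at a small local model while production uses a larger one.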

3) Hello, program: signatures, modules, and a baseline

import dspy
from dspy import Signature, Predict

class AnswerQuestion(Signature):
    """Short factual QA."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
qa = Predict(AnswerQuestion)

print(qa(question="Who founded Stanford?").answer)
  • Signature declares the I/O; Predict builds a minimal program for the signature.
  • Replace Predict with ChainOfThought if you want a deliberate reasoning step.
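
For instance, swapping in ChainOfThought keeps the same signature but adds an explicit reasoning field to the prediction:

cot = dspy.ChainOfThought(AnswerQuestion)   # same signature, plus a reasoning step

pred = cot(question="Who founded Stanford?")
print(pred.reasoning)   # the model's intermediate reasoning
print(pred.answer)

Either version plugs into the evaluation below unchanged.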

Evaluate it so you have a baseline:

import dspy
from dspy.evaluate import answer_exact_match

devset = [
    dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="2+2?", answer="4").with_inputs("question"),
]

evaluate = dspy.Evaluate(devset=devset, metric=answer_exact_match)
score = evaluate(qa)
print("Baseline score:", score)

Evaluate + off-the-shelf metrics (exact match, semantic F1) are first-class.
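
In practice you will also lean on Evaluate's parallelism and reporting options; a small sketch (the values shown are just examples):

evaluate = dspy.Evaluate(
    devset=devset,
    metric=answer_exact_match,
    num_threads=8,          # evaluate examples in parallel
    display_progress=True,  # show a progress bar
    display_table=5,        # print the first few rows of inputs/predictions/scores
)
print("Baseline score:", evaluate(qa))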

4) Turn it into a tool-using agent (ReAct) the correct way

Rules that save hours:

  • Tools = plain functions with type hints.
  • Pass them as tools=[func1, func2].
  • Cap loops with max_iters to avoid runaway tool calls.
import dspy
from dspy import Signature

# 1) Define your task signature
class SearchAndAnswer(Signature):
    """Answer a question using the available tools."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# 2) Define tools as plain functions with type hints
def search_web(query: str, k: int = 5) -> str:
    """Return concatenated snippets for top-k search results."""
    # call your search service here
    return "snippet 1 ... snippet 2 ..."

def summarize(text: str, max_words: int = 80) -> str:
    """Return the first max_words words of the text."""
    return " ".join(text.split()[:max_words])

# 3) Build a ReAct agent with tools
agent = dspy.ReAct(SearchAndAnswer, tools=[search_web, summarize], max_iters=6)

print(agent(question="What is DSPy and why use it?").answer)

This matches the official ReAct module API: dspy.ReAct(signature, tools: list[Callable], max_iters=...).

Note: DSPy doesn’t ship a built-in BM25 retriever. If you want BM25, wrap your retriever as a tool (like search_web) or build a small RAG stage and expose it via a function.
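
For instance, here is a minimal BM25-as-a-tool sketch using the rank_bm25 package over an in-memory list of passages (corpus is a placeholder for your own data):

from rank_bm25 import BM25Okapi

corpus = ["passage one ...", "passage two ...", "passage three ..."]  # your passages
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def bm25_search(query: str, k: int = 5) -> str:
    """Return the top-k BM25-ranked passages, concatenated for the agent."""
    return "\n\n".join(bm25.get_top_n(query.lower().split(), corpus, n=k))

agent = dspy.ReAct(SearchAndAnswer, tools=[bm25_search, summarize], max_iters=6)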

5) RAG, briefly—DSPy style

You can write a tiny RAG pipeline—split/index with your favorite library, fetch top-k passages, and expose a retrieve(query) tool for ReAct. The official tutorials show RAG patterns that are easy to adapt.

def retrieve(query: str, k: int = 6) -> str:
    """Return top-k passages from your store."""
    # e.g., call into FAISS/Chroma/Elasticsearch/etc.
    return "passage A\n\npassage B\n\n..."

agent = dspy.ReAct(SearchAndAnswer, tools=[retrieve, summarize], max_iters=6)
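
If you don't need tool-calling, the same retrieve function composes into a plain two-stage RAG program; a minimal sketch in the spirit of the official tutorials (the module name is ours):

class SimpleRAG(dspy.Module):
    """Fetch top-k passages, then answer with an explicit reasoning step."""

    def __init__(self):
        super().__init__()
        self.respond = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question: str):
        context = retrieve(question)   # the same function we exposed as a tool above
        return self.respond(context=context, question=question)

rag = SimpleRAG()
print(rag(question="What is DSPy and why use it?").answer)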

6) Make it measurable: metrics that matter

You can swap answer_exact_match for a semantic metric (SemanticF1) when exact strings don’t tell the full story. Note that SemanticF1 uses an LM judge and, following the official RAG tutorial’s convention, reads question/response fields from your examples and predictions, so either name your fields that way or wrap the metric with a thin adapter (sketch below). Custom metrics are just Python callables taking (example, pred, trace=None); DSPy encourages writing your own.

from dspy.evaluate import SemanticF1

# LM-judged precision/recall over question/response fields; we adapt it below
# before handing it to Evaluate.
semantic_f1 = SemanticF1()

7) Auto-optimize with MIPROv2 (instructions + few-shots)

Let DSPy mutate instructions/few-shots to improve your metric—objectively.

from dspy.teleprompt import MIPROv2
from dspy.evaluate import answer_exact_match

# our base "student" program
student = Predict(AnswerQuestion)

# the optimizer
teleprompter = MIPROv2(
    metric=answer_exact_match,   # or your custom callable
    auto="medium"                # search effort
)

optimized = teleprompter.compile(student=student, trainset=devset)

# Persist the tuned prompts/demos (state), then reload them into the same architecture
optimized.save("optimized.json")
better = Predict(AnswerQuestion)
better.load("optimized.json")
# For whole-program persistence, use optimized.save(path, save_program=True) + dspy.load(path).
  • What gets tuned? Both the instructions and the few-shot demonstrations (bootstrapped from your trainset), searched via Bayesian optimization.
  • Why it helps: you stop guessing and let the metric and optimizer do the heavy lifting.
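
To see what the optimizer actually changed, inspect the tuned predictors on the compiled program; a quick sketch:

for name, predictor in optimized.named_predictors():
    print(f"== {name} ==")
    print("instructions:", predictor.signature.instructions)
    print("few-shot demos:", len(predictor.demos))

dspy.inspect_history(n=1)   # or dump the most recent LM call to see the final prompt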

8) Production notes (the opinions we earned the hard way)

  • Secrets: Use env vars/secret stores; never inline keys in notebooks or examples.
  • ReAct safety: Always set max_iters; add guardrails to tools (timeouts, input sanitization).
  • Tracing & debuggability: DSPy integrates with MLflow; great for understanding failures and regression tracking. Sanitize sensitive inputs when tracing.
  • Model portability: Favor provider-qualified names ("openai/gpt-4o-mini") and keep a cheap local model for tests.
  • RAG reality: The framework won’t pick a retriever for you. Treat retrieval like a first-class dependency with its own evals.
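
The MLflow tracing mentioned above is a short switch to flip; a sketch assuming a recent MLflow with the DSPy integration installed (tracking URI and experiment name are placeholders):

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")   # your tracking server
mlflow.set_experiment("dspy-qa-agent")             # placeholder experiment name
mlflow.dspy.autolog()                              # auto-trace DSPy module, LM, and tool calls

# Run your program as usual; traces appear under the experiment.
print(agent(question="What is DSPy and why use it?").answer)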

9) Comparisons (where DSPy fits)

  • DSPy vs. LangChain: LangChain is a “lego box” of integrations/chains. DSPy is system-programming + auto-optimization: signatures, modules, metrics, optimizers. Compose them if you like (e.g., wrap LangChain tools as plain functions for ReAct; see the sketch after this list).
  • DSPy vs. LangGraph: LangGraph gives you explicit state machines/graphs for agents. DSPy focuses on programmatic modules and learning-driven optimization; you can orchestrate DSPy modules inside a LangGraph if your app needs complex control flow. They’re complementary, not interchangeable.
  • DSPy vs. LlamaIndex: LlamaIndex is strong for retrieval plumbing. DSPy is strongest in programming + eval + optimization. Use LlamaIndex to feed DSPy programs, or expose its retriever as a tool.

(We’re careful here: these are different abstractions; “seamless” ≠ “same.”)
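
As an example of that composition, a LangChain tool can be handed to dspy.ReAct as a plain typed function; a hedged sketch assuming langchain_community's DuckDuckGoSearchRun (any tool exposing .invoke() works the same way):

from langchain_community.tools import DuckDuckGoSearchRun

_lc_search = DuckDuckGoSearchRun()

def web_search(query: str) -> str:
    """Run the wrapped LangChain search tool and return its text output."""
    return _lc_search.invoke(query)

agent = dspy.ReAct(SearchAndAnswer, tools=[web_search, summarize], max_iters=6)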

10) Common mistakes & how to avoid them

  1. Passing objects as tools
    • Fix: Use plain functions with type hints; pass them directly to ReAct.
  2. No metric, no dev set
    • You’ll fly blind and overfit anecdotes. Use Evaluate with a simple metric today; refine later.
  3. Believing there’s a built-in retriever
    • There isn’t. Wrap your own retriever or RAG function and test it explicitly.
  4. Inline keys + tracing
    • Don’t log secrets by accident if you enable tracing. Redact before you trace.

Appendix: Full minimal example (baseline → eval → optimize)

import dspy
from dspy import Signature, Predict
from dspy.evaluate import answer_exact_match
from dspy.teleprompt import MIPROv2

# 1) LM config
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# 2) Signature and baseline program
class AnswerQuestion(Signature):
    """Short factual QA."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

program = Predict(AnswerQuestion)

# 3) Dev set + Evaluate
devset = [
    dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="Shakespeare").with_inputs("question"),
]

evaluate = dspy.Evaluate(devset=devset, metric=answer_exact_match)
print("Baseline:", evaluate(program))

# 4) Auto-optimize with MIPROv2
miprov2 = MIPROv2(metric=answer_exact_match, auto="medium")
optimized = miprov2.compile(student=program, trainset=devset)

# 5) Save the tuned state, then load it into a fresh copy of the same program
optimized.save("qa_optimized.json")
reloaded = Predict(AnswerQuestion)
reloaded.load("qa_optimized.json")
print("After optimization:", evaluate(reloaded))

All APIs above reflect current docs for ReAct, Evaluate, metrics, and MIPROv2, and the save/load behavior shown in the optimizer/evaluation pages.

Final checklist so you ship confidently

  • Keys via env vars, not inline
  • Tools are plain functions with type hints
  • ReAct(..., tools=[...], max_iters=N) used
  • Dev set + Evaluate + real metric wired
  • MIPROv2 run with your chosen metric
  • Program persisted with save(...) and reloaded via .load(...) (or save_program=True plus dspy.load(...))
  • Retrieval treated as a first-class dependency with its own evals

— Cohorte Team
November 10, 2025.