Voice Agents in Production: The LangSmith Debugging Playbook (Turns, Traces, Audio)

Trace every STT → LLM → TTS hop (with turn boundaries and optional audio), replay real sessions, and turn “why did it say that?” from a Slack mystery into a link you can inspect.
Voice agents are the most unfair type of AI system.
A chatbot fails quietly in text.
A voice agent fails out loud, in real time, while the user says:
User: “Hello?”
Agent: “Absolutely—here’s a detailed explanation of quantum tunneling.”
User: “…I asked if you ship to Athens.”
So if we’re building voice in production, we need observability that understands voice-native workflows: turns, conversations, audio artifacts, and the full chain from mic → STT → LLM → TTS.
This guide is a practical, corrected, copy/paste-safe walkthrough of debugging voice agents with LangSmith using Pipecat + OpenTelemetry (OTEL) tracing—based on LangSmith’s official Pipecat tracing doc.
Table of Contents
- Why voice agents are uniquely hard to debug
- The debugging stack we recommend
- Correct setup: OTEL → LangSmith (with “EU endpoint” notes)
- Pipecat tracing quickstart (with version-safe guidance)
- Reading a voice trace like a detective
- The 8 failure modes (and how traces prove them)
- Audio-aware debugging (attach recordings safely)
- Shipping-grade tips: sampling, performance, privacy
- Comparisons: LangSmith vs Langfuse vs Phoenix vs Helicone vs generic APM
- Key takeaways
1 Why voice agents are uniquely hard to debug
Voice agents aren’t “an LLM call.” They’re a pipeline:
- VAD decides when speech starts/stops
- STT transcribes (often partial + final)
- LLM reasons over conversation state (streaming, tool calls, etc.)
- TTS synthesizes speech (streaming, interruption/cancel)
- Your transport plays audio back (buffering and timing can lie to you)
When something goes wrong, it’s rarely “the model hallucinated.” It’s usually:
- VAD clipped the last word
- STT misheard the entity (“refund” → “refill”)
- Tool failed and the LLM improvised
- Turn boundaries got corrupted and you answered the previous question
- TTS wasn’t canceled and the agent spoke over the user
So we trace voice like a pipeline, not like a single request.
2 The debugging stack we recommend
The stack:
- OpenTelemetry traces (standard, vendor-agnostic telemetry)
- Pipecat pipeline spans (turns + STT/LLM/TTS steps)
- LangSmith for LLM/agent-native trace visualization (messages, turns, artifacts, evaluation)
LangSmith’s “Trace Pipecat applications” guide uses OTEL plus a custom span processor that maps Pipecat spans into LangSmith’s trace format (including conversation/turn structure and optional audio attachment).
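Under the hood, an OTEL span processor is just an object with on_start/on_end hooks registered on the tracer provider. The sketch below is illustrative only (the real mapping lives in the guide’s langsmith_processor.py, and the attribute name is hypothetical), but it shows the shape of the mechanism:
from opentelemetry.sdk.trace import SpanProcessor, ReadableSpan
class IllustrativeMappingProcessor(SpanProcessor):
    """Illustrative only: the real logic ships in the guide's langsmith_processor.py."""
    def on_end(self, span: ReadableSpan) -> None:
        # Read Pipecat-emitted attributes ("turn" is a hypothetical attribute name)
        turn = (span.attributes or {}).get("turn")
        # ...re-shape into LangSmith's conversation / turn / message structure here...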
3 Correct setup: OpenTelemetry → LangSmith
Install LangSmith OTEL support (important)
LangSmith’s OpenTelemetry tracing docs explicitly reference installing langsmith[otel] (and using recent versions).
pip install "langsmith[otel]" opentelemetry-exporter-otlp python-dotenv
Environment variables (US + EU endpoints)
LangSmith’s OTEL ingestion endpoint ends in /otel, and EU-hosted orgs use a different base URL.
# --- LangSmith OTEL Ingestion (US) ---
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.smith.langchain.com/otel
# --- OR if your LangSmith org is EU-hosted ---
# OTEL_EXPORTER_OTLP_ENDPOINT=https://eu.api.smith.langchain.com/otel
# Headers: x-api-key is the canonical key for OTEL ingestion.
# Project routing uses the Langsmith-Project header (verify against current LangSmith docs).
OTEL_EXPORTER_OTLP_HEADERS=x-api-key=<YOUR_LANGSMITH_API_KEY>,Langsmith-Project=pipecat-voice
# Your model keys, as needed:
OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
Two important guardrails:
- For OTEL ingestion, the header key is x-api-key (not “LANGSMITH_API_KEY” inside OTLP headers).
- If you’re EU-hosted, use the EU endpoint; otherwise you’ll debug “missing spans” for hours.
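A fast way to avoid that fate: before touching Pipecat, fire one hand-made span through the standard OTEL Python SDK and confirm it lands in your LangSmith project. A minimal smoke-test sketch (it reads the env vars above):
# smoke_test.py -- emits a single span over OTLP/HTTP using OTEL_EXPORTER_OTLP_* env vars
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

with trace.get_tracer("smoke-test").start_as_current_span("otel-smoke-test"):
    pass  # the span itself is the payload

provider.force_flush()  # blocks until export finishes; now check the LangSmith UI
If this span doesn’t show up, fix the endpoint/headers before blaming Pipecat.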
4 Pipecat tracing quickstart with version-safe guidance
Pipecat is evolving fast. Import paths can drift between releases. So we’ll do this in a way that doesn’t break your weekend:
- We’ll show the core wiring (stable): pipeline ordering + task flags + OTEL exporter + span processor.
- We’ll avoid pretending there’s exactly one correct import path for every Pipecat version.
- Where Pipecat modules differ in your release, consult Pipecat’s own docs / your installed package structure.
Install Pipecat + required extras
LangSmith’s Pipecat tracing guide references installing Pipecat plus extras depending on the services you use.
pip install langsmith pipecat-ai opentelemetry-exporter-otlp python-dotenv
If you’re using audio recording features, you may also need additional packages (e.g., numpy, scipy). The LangSmith Pipecat guide mentions extra dependencies for recordings.
Add LangSmith’s Pipecat span processor
LangSmith’s guide uses a custom span processor (often provided as langsmith_processor.py) that:
- converts Pipecat spans into LangSmith-meaningful traces,
- supports conversation/turn tracking,
- and can register audio recordings for attachment.
Recommendation: vendor that file into your repo and treat it like production code (version it, test it, review diffs when LangSmith updates it).
Minimal “runs-and-traces” skeleton
Below is a structure-first skeleton (the important pieces are correct). You’ll plug in the specific Pipecat service classes for your STT/LLM/TTS and your chosen transport.
import asyncio
import uuid

from dotenv import load_dotenv

load_dotenv()

# ✅ Import the LangSmith span processor from the official guide implementation
# (typically a local file: langsmith_processor.py)
from langsmith_processor import span_processor  # noqa: F401

# NOTE: Pipecat imports vary by version.
# Use the correct imports for your installed release:
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

# Also import:
# - your Transport (mic/speaker or WebRTC)
# - your STT service
# - your LLM service
# - your TTS service
# - (optional) audio recorder


async def main():
    conversation_id = str(uuid.uuid4())

    # 1) Create your transport + services (placeholder constructors --
    #    replace with the version-specific classes for your release)
    transport = make_transport_somehow()
    stt = make_stt_service()
    llm = make_llm_service()
    tts = make_tts_service()

    # 2) Build the pipeline in the “voice order”: audio in → STT → LLM → TTS → audio out
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,  # or your context aggregator → llm chain
        tts,
        transport.output(),
    ])

    # 3) Create the task with tracing + turn tracking enabled
    task = PipelineTask(
        pipeline,
        params=PipelineParams(enable_metrics=True),
        enable_tracing=True,
        enable_turn_tracking=True,
        conversation_id=conversation_id,
    )

    # 4) Run
    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())
Why we wrote it this way: the wiring (tracing flags, conversation_id, pipeline ordering) is stable and matches LangSmith’s guide; the service/transport imports are the piece most likely to drift across Pipecat versions.
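A cheap habit that supports this: read the installed Pipecat version from package metadata and pin it once the demo works.
from importlib.metadata import version

print(version("pipecat-ai"))  # pin this exact version in requirements.txt / your lockfile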
5 Reading a voice trace like a detective
When a trace shows up in LangSmith, don’t start with the final assistant message. That’s how we get emotionally manipulated by confident audio.
Read it like this:
- Conversation span
  - Is this the right session? Check conversation_id.
- Turn spans
  - Did the system split turns correctly? Are there overlaps?
- STT spans
  - What did we transcribe (partial vs final)? Any missing words?
- LLM spans
  - What context did the LLM actually see (system prompt, previous turns, tool outputs)?
- TTS spans
  - Was synthesis delayed? Was it interrupted/canceled correctly?
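Concretely, a healthy trace reads as a small tree (span names here are illustrative, not LangSmith’s exact labels):
conversation  (conversation_id=…)
├── turn 1
│   ├── stt  → “do you ship to Athens?”
│   ├── llm  → context: system + turn 1 user message
│   └── tts  → “Yes, we ship to Greece.”
└── turn 2
    └── …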
Once we read traces like a pipeline, debugging becomes… boring. And boring is good.
6 The 8 failure modes and how traces prove them
1 “It ignored what I said”
Usually: VAD clipped speech or STT produced a partial transcript that never got corrected.
Trace proof: compare STT output vs expected utterance.
Fix: tune VAD; gate on final transcript; merge partials safely.
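A minimal sketch of “gate on finals” (the event shape is hypothetical; adapt it to your STT service’s callbacks):
# Buffer partials for debugging; only act on final transcripts.
partials: list[str] = []

def on_transcript(event, emit):
    if not event.is_final:  # hypothetical attribute on your STT event
        partials.append(event.text)
        return
    emit(event.text or " ".join(partials))  # prefer the final; fall back to merged partials
    partials.clear()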
2 “It answered the previous question”
Usually: turn tracking split incorrectly or aggregation appended the wrong message.
Trace proof: turn spans + messages shown to LLM.
Fix: keep enable_turn_tracking=True and validate aggregation behavior.
3 “It hallucinated a tool result”
Usually: tool failed or timed out and LLM improvised.
Trace proof: missing tool output in LLM inputs.
Fix: write tool failures into context explicitly (“ToolError: …”), don’t swallow.
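A minimal sketch of surfacing tool failures (the message shape is illustrative; match your LLM service’s context format):
async def run_tool(tool, args, context):
    try:
        result = await tool(**args)
        context.append({"role": "tool", "content": str(result)})
    except Exception as exc:
        # Write the failure into context so both the LLM and the trace see it
        context.append({"role": "tool", "content": f"ToolError: {type(exc).__name__}: {exc}"})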
4 “Latency spikes randomly”
Usually: STT chunking, network jitter, cold starts, or TTS bottlenecks.
Trace proof: per-span timing (STT vs LLM vs TTS).
Fix: caching, prewarm, reduce tool calls in the critical path.
5 “It talks over me”
Usually: TTS cancellation isn’t wired to user speech/VAD events.
Trace proof: overlapping TTS spans and new user turn spans.
Fix: interruption policy: cancel TTS on user speech + mark interruption event.
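In Pipecat this is mostly a params flag plus correct pipeline ordering; a sketch (verify the flag name against your installed release):
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,
        allow_interruptions=True,  # cancel in-flight TTS when the user starts speaking
    ),
    enable_tracing=True,
    enable_turn_tracking=True,
    conversation_id=conversation_id,
)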
6 “Correct in text, wrong in voice”
Usually: STT mishears domain terms.
Trace proof: the STT span’s final transcript vs the entity the user actually said.
Fix: vocabulary biasing, post-STT correction, confirm key entities.
7 “Works locally, not in prod”
Usually: wrong OTEL endpoint/headers, missing exporter, or EU/US mismatch.
Trace proof: no spans arriving; missing exporter config.
Fix: verify endpoint and x-api-key header; set EU endpoint if needed.
8 “We can’t reproduce it”
Usually: no replay (audio/transcript), missing correlation IDs.
Fix: attach audio (with privacy controls), log conversation_id everywhere.
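For the conversation_id half, a stdlib LoggerAdapter is enough (a sketch; add conversation_id to your log formatter to make it visible):
import logging

logger = logging.getLogger("voice-agent")

def conversation_logger(conversation_id: str) -> logging.LoggerAdapter:
    # Every record carries the same ID you'll search for in LangSmith
    return logging.LoggerAdapter(logger, {"conversation_id": conversation_id})

log = conversation_logger(conversation_id)
log.info("tts canceled after user barge-in")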
7 Audio-aware debugging
LangSmith’s Pipecat guide includes patterns for:
- recording the full conversation or per-turn audio,
- registering recordings with the span processor,
- and saving recordings before the conversation span completes.
Here’s the conceptual pattern you should implement (names may vary by your Pipecat version, but the steps matter):
from pathlib import Path

recordings_dir = Path("./recordings")
recordings_dir.mkdir(parents=True, exist_ok=True)
recording_path = recordings_dir / f"{conversation_id}.wav"

audio_recorder = AudioRecorder(str(recording_path))  # Pipecat recorder class (version-specific)

# ✅ Register the recording so LangSmith can attach it to the trace
span_processor.register_recording(conversation_id, str(recording_path))

pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    tts,
    audio_recorder,  # ensure the recorder is in the pipeline
    transport.output(),
])

await runner.run(task)

# ✅ IMPORTANT: save BEFORE the conversation span fully closes (the guide warns
# about this timing). In practice, call this from your end-of-call/disconnect
# handler rather than after run() returns.
audio_recorder.save_recording()
Privacy note: attach audio only when you truly need it—voice logs are extremely sensitive.
8 Shipping-grade tips: sampling, performance, privacy
Sampling: trace smarter, not louder
- 100% tracing in dev/staging
- In prod, sample (a head-sampling sketch follows below):
  - errors
  - latency outliers
  - rollout cohorts (new STT model, new TTS voice)
  - opt-in debug sessions
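A sketch of the SDK-side lever using OTEL’s built-in samplers. Note that “errors and latency outliers” require tail sampling, typically an OTEL Collector between you and LangSmith; the SDK alone only gives you uniform head sampling:
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of conversations; child spans follow their parent's decision
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))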
Performance: don’t let observability become the bottleneck
- keep span enrichment light
- don’t attach huge artifacts by default
- prefer per-turn audio snippets vs full-call recordings when feasible
Security & privacy
- transcripts can contain PII (names, addresses, card details)
- audio is even worse (biometrics + content)
- do not store audio without consent + retention policy
- redact sensitive intents (billing, auth, medical); a baseline redaction sketch follows below
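A baseline redaction sketch (regexes are a floor, not a ceiling; real PII handling needs more than this):
import re

CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text: str) -> str:
    # Apply before transcripts are attached to spans or stored anywhere
    text = CARD.sub("[REDACTED_CARD]", text)
    return EMAIL.sub("[REDACTED_EMAIL]", text)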
9 Comparisons
LangSmith
Best if you want:
- LLM/agent-native traces
- turn-aware visibility
- OTEL ingestion and Pipecat tracing recipe
Langfuse (open source)
Strong choice for open-source LLM observability; supports OpenTelemetry integration.
Phoenix (Arize, open source)
Open-source tracing/evaluation workflows; supports LLM trace patterns and OTEL-based approaches.
Helicone
Great for gateway/proxy-style LLM observability and integrations (including OpenLLMetry paths), but don’t assume it’s identical to a generic OTLP backend.
Generic APM (Datadog, etc.)
Excellent infrastructure visibility and OTEL pipelines—often missing “turns/messages/evals” semantics unless you build them.
10 Key takeaways
- Voice debugging is pipeline debugging. Trace STT/LLM/TTS as one story.
- Turn tracking is non-negotiable. If turns are wrong, everything downstream lies.
- Audio artifacts turn arguments into diagnoses. Attach recordings carefully and intentionally.
- OTEL setup must be exact. Use x-api-key, the correct /otel endpoint, and the EU endpoint when relevant.
- Be honest about version drift. Pin Pipecat versions or avoid brittle imports in copy/paste snippets.
— Cohorte Team
February 16, 2026