Voice Agents in Production: The LangSmith Debugging Playbook (Turns, Traces, Audio)

Trace every STT → LLM → TTS hop (with turn boundaries and optional audio), replay real sessions, and turn “why did it say that?” from a Slack mystery into a link you can inspect.
Voice agents are the most unfair type of AI system.
A chatbot fails quietly in text.
A voice agent fails out loud, in real time, while the user says:
User: “Hello?”
Agent: “Absolutely—here’s a detailed explanation of quantum tunneling.”
User: “…I asked if you ship to Athens.”
So if we’re building voice in production, we need observability that understands voice-native workflows: turns, conversations, audio artifacts, and the full chain from mic → STT → LLM → TTS.
This guide is a practical, corrected, copy/paste-safe walkthrough of debugging voice agents with LangSmith using Pipecat + OpenTelemetry (OTEL) tracing—based on LangSmith’s official Pipecat tracing doc.
Table of Contents
- Why voice agents are uniquely hard to debug
- The debugging stack we recommend
- Correct setup: OTEL → LangSmith (with “EU endpoint” notes)
- Pipecat tracing quickstart (with version-safe guidance)
- Reading a voice trace like a detective
- The 8 failure modes (and how traces prove them)
- Audio-aware debugging (attach recordings safely)
- Shipping-grade tips: sampling, performance, privacy
- Comparisons: LangSmith vs Langfuse vs Phoenix vs Helicone vs generic APM
- Key takeaways
1 Why voice agents are uniquely hard to debug
Voice agents aren’t “an LLM call.” They’re a pipeline:
- VAD decides when speech starts/stops
- STT transcribes (often partial + final)
- LLM reasons over conversation state (streaming, tool calls, etc.)
- TTS synthesizes speech (streaming, interruption/cancel)
- Your transport plays audio back (buffering and timing can lie to you)
When something goes wrong, it’s rarely “the model hallucinated.” It’s usually:
- VAD clipped the last word
- STT misheard the entity (“refund” → “refill”)
- Tool failed and the LLM improvised
- Turn boundaries got corrupted and you answered the previous question
- TTS wasn’t canceled and the agent spoke over the user
So we trace voice like a pipeline, not like a single request.
2 The debugging stack we recommend
The stack:
- OpenTelemetry traces (standard, vendor-agnostic telemetry)
- Pipecat pipeline spans (turns + STT/LLM/TTS steps)
- LangSmith for LLM/agent-native trace visualization (messages, turns, artifacts, evaluation)
LangSmith’s “Trace Pipecat applications” guide uses OTEL plus a custom span processor that maps Pipecat spans into LangSmith’s trace format (including conversation/turn structure and optional audio attachment).
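Under the hood, an OTEL span processor is just an object with on_start/on_end hooks registered on the tracer provider. The sketch below is illustrative only (the real mapping lives in the guide’s langsmith_processor.py, and the attribute name is hypothetical), but it shows the shape of the mechanism:
from opentelemetry.sdk.trace import SpanProcessor, ReadableSpan
class IllustrativeMappingProcessor(SpanProcessor):
    """Illustrative only: the real logic ships in the guide's langsmith_processor.py."""
    def on_end(self, span: ReadableSpan) -> None:
        # Read Pipecat-emitted attributes ("turn" is a hypothetical attribute name)
        turn = (span.attributes or {}).get("turn")
        # ...re-shape into LangSmith's conversation / turn / message structure here...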
3 Correct setup: OpenTelemetry → LangSmith
Install LangSmith OTEL support (important)
LangSmith’s OpenTelemetry tracing docs explicitly reference installing langsmith[otel] (and using recent versions).
pip install "langsmith[otel]" opentelemetry-exporter-otlp python-dotenv
Environment variables (US + EU endpoints)
LangSmith’s OTEL ingestion endpoint ends in /otel, and EU-hosted orgs use a different base URL.
# --- LangSmith OTEL Ingestion (US) ---
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.smith.langchain.com/otel
# --- OR if your LangSmith org is EU-hosted ---
# OTEL_EXPORTER_OTLP_ENDPOINT=https://eu.api.smith.langchain.com/otel
# Headers: x-api-key is the canonical key for OTEL ingestion.
# Project routing uses the Langsmith-Project header (verify against current LangSmith docs).
OTEL_EXPORTER_OTLP_HEADERS=x-api-key=<YOUR_LANGSMITH_API_KEY>,Langsmith-Project=pipecat-voice
# Your model keys, as needed:
OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
Two important guardrails:
- For OTEL ingestion, the header key is x-api-key (not “LANGSMITH_API_KEY” inside OTLP headers).
- If you’re EU-hosted, use the EU endpoint; otherwise you’ll debug “missing spans” for hours.
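A fast way to avoid that fate: before touching Pipecat, fire one hand-made span through the standard OTEL Python SDK and confirm it lands in your LangSmith project. A minimal smoke-test sketch (it reads the env vars above):
# smoke_test.py -- emits a single span over OTLP/HTTP using OTEL_EXPORTER_OTLP_* env vars
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

with trace.get_tracer("smoke-test").start_as_current_span("otel-smoke-test"):
    pass  # the span itself is the payload

provider.force_flush()  # blocks until export finishes; now check the LangSmith UI
If this span doesn’t show up, fix the endpoint/headers before blaming Pipecat.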
4 Pipecat tracing quickstart with version-safe guidance
Pipecat is evolving fast. Import paths can drift between releases. So we’ll do this in a way that doesn’t break your weekend:
- We’ll show the core wiring (stable): pipeline ordering + task flags + OTEL exporter + span processor.
- We’ll avoid pretending there’s exactly one correct import path for every Pipecat version.
- Where Pipecat modules differ in your release, consult Pipecat’s own docs / your installed package structure.
Install Pipecat + required extras
LangSmith’s Pipecat tracing guide references installing Pipecat plus extras depending on the services you use.
pip install langsmith pipecat-ai opentelemetry-exporter-otlp python-dotenv
If you’re using audio recording features, you may also need additional packages (e.g., numpy, scipy). The LangSmith Pipecat guide mentions extra dependencies for recordings.
Add LangSmith’s Pipecat span processor
LangSmith’s guide uses a custom span processor (often provided as langsmith_processor.py) that:
- converts Pipecat spans into LangSmith-meaningful traces,
- supports conversation/turn tracking,
- and can register audio recordings for attachment.
Recommendation: vendor that file into your repo and treat it like production code (version it, test it, review diffs when LangSmith updates it).
Minimal “runs-and-traces” skeleton
Below is a structure-first skeleton (the important pieces are correct). You’ll plug in the specific Pipecat service classes for your STT/LLM/TTS and your chosen transport.
import asyncio
import uuid

from dotenv import load_dotenv

load_dotenv()

# ✅ Import the LangSmith span processor from the official guide implementation
# (typically a local file: langsmith_processor.py)
from langsmith_processor import span_processor  # noqa: F401

# NOTE: Pipecat imports vary by version.
# Use the correct imports for your installed release:
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

# Also import:
# - your Transport (mic/speaker or WebRTC)
# - your STT service
# - your LLM service
# - your TTS service
# - (optional) audio recorder


async def main():
    conversation_id = str(uuid.uuid4())

    # 1) Create your transport + services (placeholder constructors --
    #    replace with the version-specific classes for your release)
    transport = make_transport_somehow()
    stt = make_stt_service()
    llm = make_llm_service()
    tts = make_tts_service()

    # 2) Build the pipeline in the “voice order”: audio in → STT → LLM → TTS → audio out
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,  # or your context aggregator → llm chain
        tts,
        transport.output(),
    ])

    # 3) Create the task with tracing + turn tracking enabled
    task = PipelineTask(
        pipeline,
        params=PipelineParams(enable_metrics=True),
        enable_tracing=True,
        enable_turn_tracking=True,
        conversation_id=conversation_id,
    )

    # 4) Run
    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())
Why we wrote it this way: the wiring (tracing flags, conversation_id, pipeline ordering) is stable and matches LangSmith’s guide; the service/transport imports are the piece most likely to drift across Pipecat versions.
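A cheap habit that supports this: read the installed Pipecat version from package metadata and pin it once the demo works.
from importlib.metadata import version

print(version("pipecat-ai"))  # pin this exact version in requirements.txt / your lockfile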
5 Reading a voice trace like a detective
When a trace shows up in LangSmith, don’t start with the final assistant message. That’s how we get emotionally manipulated by confident audio.
Read it like this:
- Conversation span
  - Is this the right session? Check conversation_id.
- Turn spans
  - Did the system split turns correctly? Are there overlaps?
- STT spans
  - What did we transcribe (partial vs final)? Any missing words?
- LLM spans
  - What context did the LLM actually see (system prompt, previous turns, tool outputs)?
- TTS spans
  - Was synthesis delayed? Was it interrupted/canceled correctly?
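Concretely, a healthy trace reads as a small tree (span names here are illustrative, not LangSmith’s exact labels):
conversation  (conversation_id=…)
├── turn 1
│   ├── stt  → “do you ship to Athens?”
│   ├── llm  → context: system + turn 1 user message
│   └── tts  → “Yes, we ship to Greece.”
└── turn 2
    └── …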
Once we read traces like a pipeline, debugging becomes… boring. And boring is good.
6 The 8 failure modes and how traces prove them
1 “It ignored what I said”
Usually: VAD clipped speech or STT produced a partial transcript that never got corrected.
Trace proof: compare STT output vs expected utterance.
Fix: tune VAD; gate on final transcript; merge partials safely.
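A minimal sketch of “gate on finals” (the event shape is hypothetical; adapt it to your STT service’s callbacks):
# Buffer partials for debugging; only act on final transcripts.
partials: list[str] = []

def on_transcript(event, emit):
    if not event.is_final:  # hypothetical attribute on your STT event
        partials.append(event.text)
        return
    emit(event.text or " ".join(partials))  # prefer the final; fall back to merged partials
    partials.clear()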
2 “It answered the previous question”
Usually: turn tracking split incorrectly or aggregation appended the wrong message.
Trace proof: turn spans + messages shown to LLM.
Fix: keep enable_turn_tracking=True and validate aggregation behavior.
3 “It hallucinated a tool result”
Usually: tool failed or timed out and LLM improvised.
Trace proof: missing tool output in LLM inputs.
Fix: write tool failures into context explicitly (“ToolError: …”), don’t swallow.
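A minimal sketch of surfacing tool failures (the message shape is illustrative; match your LLM service’s context format):
async def run_tool(tool, args, context):
    try:
        result = await tool(**args)
        context.append({"role": "tool", "content": str(result)})
    except Exception as exc:
        # Write the failure into context so both the LLM and the trace see it
        context.append({"role": "tool", "content": f"ToolError: {type(exc).__name__}: {exc}"})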
4 “Latency spikes randomly”
Usually: STT chunking, network jitter, cold starts, or TTS bottlenecks.
Trace proof: per-span timing (STT vs LLM vs TTS).
Fix: caching, prewarm, reduce tool calls in the critical path.
5 “It talks over me”
Usually: TTS cancellation isn’t wired to user speech/VAD events.
Trace proof: overlapping TTS spans and new user turn spans.
Fix: interruption policy: cancel TTS on user speech + mark interruption event.
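In Pipecat this is mostly a params flag plus correct pipeline ordering; a sketch (verify the flag name against your installed release):
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,
        allow_interruptions=True,  # cancel in-flight TTS when the user starts speaking
    ),
    enable_tracing=True,
    enable_turn_tracking=True,
    conversation_id=conversation_id,
)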
6 “Correct in text, wrong in voice”
Usually: STT mishears domain terms.
Trace proof: the STT span’s final transcript vs the entity the user actually said.
Fix: vocabulary biasing, post-STT correction, confirm key entities.
7 “Works locally, not in prod”
Usually: wrong OTEL endpoint/headers, missing exporter, or EU/US mismatch.
Trace proof: no spans arriving; missing exporter config.
Fix: verify endpoint and x-api-key header; set EU endpoint if needed.
8 “We can’t reproduce it”
Usually: no replay (audio/transcript), missing correlation IDs.
Fix: attach audio (with privacy controls), log conversation_id everywhere.
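For the conversation_id half, a stdlib LoggerAdapter is enough (a sketch; add conversation_id to your log formatter to make it visible):
import logging

logger = logging.getLogger("voice-agent")

def conversation_logger(conversation_id: str) -> logging.LoggerAdapter:
    # Every record carries the same ID you'll search for in LangSmith
    return logging.LoggerAdapter(logger, {"conversation_id": conversation_id})

log = conversation_logger(conversation_id)
log.info("tts canceled after user barge-in")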
7 Audio-aware debugging
LangSmith’s Pipecat guide includes patterns for:
- recording the full conversation or per-turn audio,
- registering recordings with the span processor,
- and saving recordings before the conversation span completes.
Here’s the conceptual pattern you should implement (names may vary by your Pipecat version, but the steps matter):
from pathlib import Path

recordings_dir = Path("./recordings")
recordings_dir.mkdir(parents=True, exist_ok=True)
recording_path = recordings_dir / f"{conversation_id}.wav"

audio_recorder = AudioRecorder(str(recording_path))  # Pipecat recorder class (version-specific)

# ✅ Register the recording so LangSmith can attach it to the trace
span_processor.register_recording(conversation_id, str(recording_path))

pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    tts,
    audio_recorder,  # ensure the recorder is in the pipeline
    transport.output(),
])

await runner.run(task)

# ✅ IMPORTANT: save BEFORE the conversation span fully closes (the guide warns
# about this timing). In practice, call this from your end-of-call/disconnect
# handler rather than after run() returns.
audio_recorder.save_recording()
Privacy note: attach audio only when you truly need it—voice logs are extremely sensitive.
8 Shipping-grade tips: sampling, performance, privacy
Sampling: trace smarter, not louder
- 100% tracing in dev/staging
- In prod, sample (a head-sampling sketch follows below):
  - errors
  - latency outliers
  - rollout cohorts (new STT model, new TTS voice)
  - opt-in debug sessions
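A sketch of the SDK-side lever using OTEL’s built-in samplers. Note that “errors and latency outliers” require tail sampling, typically an OTEL Collector between you and LangSmith; the SDK alone only gives you uniform head sampling:
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of conversations; child spans follow their parent's decision
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))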
Performance: don’t let observability become the bottleneck
- keep span enrichment light
- don’t attach huge artifacts by default
- prefer per-turn audio snippets vs full-call recordings when feasible
Security & privacy
- transcripts can contain PII (names, addresses, card details)
- audio is even worse (biometrics + content)
- do not store audio without consent + retention policy
- redact sensitive intents (billing, auth, medical); a baseline redaction sketch follows below
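A baseline redaction sketch (regexes are a floor, not a ceiling; real PII handling needs more than this):
import re

CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text: str) -> str:
    # Apply before transcripts are attached to spans or stored anywhere
    text = CARD.sub("[REDACTED_CARD]", text)
    return EMAIL.sub("[REDACTED_EMAIL]", text)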
9 Comparisons
LangSmith
Best if you want:
- LLM/agent-native traces
- turn-aware visibility
- OTEL ingestion and Pipecat tracing recipe
Langfuse (open source)
Strong choice for open-source LLM observability; supports OpenTelemetry integration.
Phoenix (Arize, open source)
Open-source tracing/evaluation workflows; supports LLM trace patterns and OTEL-based approaches.
Helicone
Great for gateway/proxy-style LLM observability and integrations (including OpenLLMetry paths), but don’t assume it’s identical to a generic OTLP backend.
Generic APM (Datadog, etc.)
Excellent infrastructure visibility and OTEL pipelines—often missing “turns/messages/evals” semantics unless you build them.
10 Key takeaways
- Voice debugging is pipeline debugging. Trace STT/LLM/TTS as one story.
- Turn tracking is non-negotiable. If turns are wrong, everything downstream lies.
- Audio artifacts turn arguments into diagnoses. Attach recordings carefully and intentionally.
- OTEL setup must be exact. Use x-api-key, the correct /otel endpoint, and the EU endpoint when relevant.
- Be honest about version drift. Pin Pipecat versions or avoid brittle imports in copy/paste snippets.
— Cohorte Team
February 16, 2026