Traditional observability gives you three things: logs, metrics, traces. The "three pillars" framework works fine for stateless web services. It works partially for LLM apps. The parts where it falls short are the parts that will hurt you in production — and the parts where teams either over-log to compensate or, more often, under-log and discover it the hard way.
This post is the playbook we use. It's been refined across a dozen production AI builds — fraud triage, regulatory drafting, leadership assessments, content generation. Same shape every time. Different things logged, different alarms set.
What's different about LLM observability
Three things separate LLM apps from traditional services from an observability standpoint:
- Non-determinism — the same input can produce a different output. Your trace can't just be the function call graph; it has to capture the actual content of the model's response.
- Quality is a multi-dimensional metric — "did it work" is no longer just status code 200. It's "was the output correct, was the citation accurate, was the tone right, did it refuse appropriately, did it hallucinate."
- The cost-quality-latency triangle — every prompt change shifts all three. Without telemetry on all three you can't make informed trade-offs.
The minimum viable observability stack
Before you reach for LangSmith or Helicone or any vendor, the things that must be in your own logs:
- Request ID + parent trace ID (linked to the broader transaction)
- Model ID with the provider's snapshot version, not just the logical alias
- System prompt hash (content-addressable reference to the actual prompt)
- Full message trace — every user/assistant/system/tool message
- Tool calls and their responses, in order
- Sampling parameters (temperature, top_p, max_tokens)
- Latency to first token, latency to final token
- Token counts: input, output, cached (if applicable)
- Cost in USD, computed at write time
- Outcome — structured result, plus a "did the downstream system accept this" boolean
Notice what's not there: the user's full session history, every retrieved document in full, every intermediate chain-of-thought step. We'll get to those.
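As a rough illustration, the per-call record listed above could be sketched as a single dataclass. Everything named here is an assumption for the example: the `LLMCallRecord` fields, the `example-model-2024-01-01` snapshot ID, and the prices in `PRICES_PER_MTOK` are placeholders, and `input_tokens` is assumed to exclude cached tokens.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field

# Hypothetical prices in USD per million tokens. Real values come from
# your provider's price sheet and should be versioned alongside the code.
PRICES_PER_MTOK = {
    "example-model-2024-01-01": {"input": 3.00, "output": 15.00, "cached": 0.30},
}


def prompt_hash(system_prompt: str) -> str:
    """Content-addressable reference to the exact system prompt text."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def compute_cost_usd(model_id: str, input_tokens: int, output_tokens: int,
                     cached_tokens: int = 0) -> float:
    """Cost at write time, so later price changes don't rewrite history.

    Assumes input_tokens counts only uncached input tokens.
    """
    price = PRICES_PER_MTOK[model_id]
    total = (input_tokens * price["input"]
             + output_tokens * price["output"]
             + cached_tokens * price["cached"]) / 1_000_000
    return round(total, 6)


@dataclass
class LLMCallRecord:
    request_id: str
    parent_trace_id: str
    model_id: str              # provider snapshot, not the logical alias
    system_prompt_hash: str
    messages: list             # every user/assistant/system/tool message
    tool_calls: list           # calls and responses, in order
    sampling: dict             # temperature, top_p, max_tokens
    ttft_ms: float             # latency to first token
    total_latency_ms: float    # latency to final token
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cost_usd: float
    outcome: dict              # structured result
    downstream_accepted: bool  # did the downstream system accept this?
    written_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```

Calling `to_json()` gives one line per call that can be loaded into whatever queryable store you already use.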
What to skip (or log lazily)
The temptation is to log everything because storage is cheap. The cost isn't storage — it's the signal-to-noise ratio when you're debugging. Things we deliberately don't log eagerly:
- Full document contents in RAG calls — we log document IDs and chunk IDs, plus a content hash. The actual text is fetched from the source-of-truth store if a debugger needs it. Otherwise you're 10x'ing your trace volume for no daily benefit.
- Embedding vectors: never log them inline. Reference the embedding ID. Vectors blow up your logs and aren't useful in raw form anyway.
- Streaming token-by-token output — log the assembled final response, not the stream. Streams matter for UX, not for replay.
- Personally identifiable information that isn't needed for debugging — redact at log-write time. Once it's in your log store it's a compliance liability.
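A minimal sketch of two of the habits above, logging references instead of document text and redacting at log-write time. The function names and the email-only pattern are assumptions for illustration. Real redaction needs to cover whatever identifiers your data actually contains.

```python
import hashlib
import re

# Illustrative pattern only: catches common email shapes, nothing else.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Strip identifiers before the text ever reaches the log store."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def chunk_reference(doc_id: str, chunk_id: str, text: str) -> dict:
    """Log IDs plus a content hash; the text stays in the source-of-truth store."""
    return {
        "doc_id": doc_id,
        "chunk_id": chunk_id,
        "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


def retrieval_log_entry(query: str, chunks: list) -> dict:
    """Build the log entry for one retrieval step.

    Each chunk is assumed to be a dict with doc_id, chunk_id and text keys.
    """
    return {
        "query": redact(query),
        "chunks": [
            chunk_reference(c["doc_id"], c["chunk_id"], c["text"]) for c in chunks
        ],
    }
```

The content hash lets a debugger confirm that the text fetched later is the same text the model saw, even if the source document has since been edited.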
Traces, not logs
The single most important shift is from logs (timestamped strings) to traces (structured spans with parent-child relationships). An LLM app worth observing has at minimum these spans per request:
request
├── input_validation
├── retrieval
│   ├── embedding
│   ├── vector_search
│   └── rerank
├── llm_call (× N if multi-step)
│   ├── prompt_assembly
│   ├── model_inference
│   └── response_parsing
├── tool_calls (× M if any)
│   ├── tool_lookup
│   └── tool_execute
├── post_processing
└── downstream_dispatch
OpenTelemetry handles this well. The instrumentation overhead is small if spans are exported in batches off the request path, and the payoff is immediate when something goes wrong.
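To make the parent-child structure concrete, here is a dependency-free sketch of the span graph above. It is a stand-in, not the OpenTelemetry API: `span`, `FINISHED_SPANS`, and `handle_request` are hypothetical names, the stage bodies are stubs, and tool calls are left out for brevity. In a real build, OpenTelemetry's `tracer.start_as_current_span(...)` would play the role of `span(...)` and an exporter would replace the in-memory list.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

# Tracks the span that is currently open in this execution context.
_current_span = ContextVar("current_span", default=None)
# Stand-in for an exporter: finished spans are appended here.
FINISHED_SPANS = []


@contextmanager
def span(name, **attributes):
    """Open a span whose parent is whatever span is currently active."""
    parent = _current_span.get()
    record = {
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "attributes": dict(attributes),
        "start": time.perf_counter(),
    }
    token = _current_span.set(record)
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - record["start"]) * 1000
        _current_span.reset(token)  # restore the parent as the active span
        FINISHED_SPANS.append(record)


def handle_request(question: str) -> str:
    """Stubbed request handler that emits the span tree for one request."""
    with span("request"):
        with span("input_validation"):
            pass
        with span("retrieval"):
            with span("embedding"):
                pass
            with span("vector_search", top_k=5):
                pass
            with span("rerank"):
                pass
        with span("llm_call") as llm:
            with span("prompt_assembly"):
                pass
            with span("model_inference"):
                answer = f"stub answer to: {question}"
            with span("response_parsing"):
                pass
            llm["attributes"]["output_chars"] = len(answer)
        with span("post_processing"):
            pass
        with span("downstream_dispatch"):
            pass
    return answer
```

Because each span carries its parent's ID, a slow request can be broken down by stage without guessing from timestamps.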
The eval harness is observability
Most teams treat eval as a development-time concern. It isn't. It's the most important observability surface you have, because it's the only one that tells you whether the AI is actually doing its job.
Your eval harness needs to run in three modes:
- Pre-deploy — every prompt or model change runs against a golden dataset; regressions block deploy. Standard CI gate.
- Continuous on production traffic — sample, say, 1% of production calls; run them through automated scorers; track metrics over time. This is your drift early-warning system.
- Human-in-the-loop spot checks — flagged or uncertain calls go to a review queue; humans label; labels feed back into the golden dataset. Closes the loop.
The scorers themselves can be: rule-based (does the output have the required JSON keys?), classifier-based (is the tone polite?), or LLM-as-judge (is this answer factually consistent with the cited source?). Mix all three.
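As a small sketch of how these pieces might fit together: a rule-based scorer, a deterministic sampler for the continuous mode, and a pre-deploy gate. The function names, the required keys, and the `tolerance` default are all assumptions chosen for the example.

```python
import hashlib
import json


def has_required_keys(output: str, required=("answer", "citations")) -> float:
    """Rule-based scorer: 1.0 if the output is a JSON object with every key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if all(key in parsed for key in required) else 0.0


def in_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministic sampling: the same request ID always gets the same answer.

    Hashing the ID into [0, 1) avoids shared random state across workers.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def gate(golden_outputs, scorer, baseline: float, tolerance: float = 0.02) -> bool:
    """Pre-deploy gate: pass only if the mean score stays near the baseline."""
    mean = sum(scorer(output) for output in golden_outputs) / len(golden_outputs)
    return mean >= baseline - tolerance
```

In CI, a failing `gate(...)` would translate into a non-zero exit code. Classifier-based and LLM-as-judge scorers can share the same signature, taking an output and returning a score between 0 and 1, so they plug into the same gate.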
Drift detection: the unglamorous superpower
Models change. Provider snapshots roll forward. Your retrieval corpus grows. User behaviour shifts. Any of these will gradually degrade your output quality, and you won't notice until a customer complains.
Drift detection means continuously asking: is the distribution of outputs today similar to the distribution last week? The signals that matter:
- Output distribution drift — for classification tasks, has the proportion of each class shifted?
- Length / verbosity drift — average response length creeping up usually means a quality regression
- Refusal-rate drift — sudden spike in "I can't help with that" responses usually means a prompt or model change broke something
- Latency drift: model providers change their inference paths constantly; you should know within an hour, not a week
- Eval-score drift — your continuous eval pipeline should be the canary, not the parrot
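One simple way to put a number on the first two signals is sketched below, assuming you can pull a list of output labels and response lengths for a baseline window and a current window. Total variation distance is our choice for the illustration, not the only option. The function names are hypothetical.

```python
from collections import Counter


def class_proportions(labels) -> dict:
    """Share of each class label in a window of outputs."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(baseline_labels, current_labels) -> float:
    """0.0 means identical class mixes; 1.0 means no overlap at all."""
    p = class_proportions(baseline_labels)
    q = class_proportions(current_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def relative_change(baseline_values, current_values) -> float:
    """Fractional change in the mean, e.g. 0.25 for a 25% increase."""
    base = sum(baseline_values) / len(baseline_values)
    current = sum(current_values) / len(current_values)
    return (current - base) / base
```

For example, if last week's fraud-triage outputs were 10% "fraud" and this week's are 30%, the distance is about 0.2. What threshold counts as drift depends on how noisy the task is week to week, so it is worth calibrating on historical windows before alerting on it.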
Alarms worth waking someone up for
Page-worthy events for an LLM app, in our experience:
- Eval score on production traffic drops by more than 2 standard deviations from baseline
- Refusal rate increases by more than 3× in a 1-hour window
- p95 latency increases by more than 50% in a 1-hour window
- Cost per request increases by more than 30% over the daily average
- The downstream "did the downstream system accept this" boolean fails on more than 1% of calls, against a normal failure rate below 0.1%
Everything else goes in a dashboard, not a pager. Page fatigue kills observability culture faster than any tool can save it.
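The five paging conditions above can be expressed as one small rule function. This is a sketch under assumptions: `WindowStats` and `Baseline` are hypothetical containers for numbers your metrics store would supply, and "increases by more than 3×" is read here as "exceeds three times the baseline rate".

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Metrics for the current window (1 hour for refusals and latency)."""
    eval_score: float
    refusal_rate: float
    p95_latency_ms: float
    cost_per_request: float
    downstream_failure_rate: float


@dataclass
class Baseline:
    eval_mean: float
    eval_std: float
    refusal_rate: float
    p95_latency_ms: float
    daily_avg_cost: float


def page_worthy(current: WindowStats, baseline: Baseline) -> list:
    """Return the names of every paging rule the current window trips."""
    reasons = []
    if current.eval_score < baseline.eval_mean - 2 * baseline.eval_std:
        reasons.append("eval_score_drop")
    # Assumed reading of "more than 3x": current rate above 3 * baseline rate.
    if baseline.refusal_rate > 0 and current.refusal_rate > 3 * baseline.refusal_rate:
        reasons.append("refusal_spike")
    if current.p95_latency_ms > 1.5 * baseline.p95_latency_ms:
        reasons.append("p95_latency")
    if current.cost_per_request > 1.3 * baseline.daily_avg_cost:
        reasons.append("cost_per_request")
    if current.downstream_failure_rate > 0.01:
        reasons.append("downstream_failures")
    return reasons
```

An empty list means nothing pages. Returning rule names rather than a bare boolean makes it easy to route each one to its own runbook.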
What we ship in week one
On every new AI build, before any user sees the system, we ship:
- OpenTelemetry instrumentation with the span graph above
- Structured per-call records in a queryable store (BigQuery, ClickHouse, or similar)
- A pre-deploy eval gate in CI
- A continuous eval pipeline running 1% of production traffic against scorers
- A dashboard with the seven metrics that matter: eval score, latency p50/p95, cost per request, refusal rate, error rate, output-length distribution, and tool-call success rate
- Alarms wired to PagerDuty or equivalent, with documented runbooks for each
This isn't optional. It's the bare minimum for running an LLM in production responsibly. Skip it and your "we shipped AI" announcement becomes a "we broke AI in production and didn't notice for two weeks" Slack thread.
The vendor question
Do you need LangSmith, Helicone, Langfuse, Braintrust, or one of the other LLM-observability vendors? Maybe. They're useful. They're not magic.
What they buy you: pre-built ingestion for OpenAI/Anthropic SDKs, decent UI for trace exploration, eval tooling, and dataset management. What they don't buy you: discipline. If you don't know what to log and why, a vendor dashboard will not save you.
Our recommendation: start with first-party OpenTelemetry and your existing data warehouse. Layer on a vendor when you have enough volume that their UI saves your team real hours. For most teams that crossover happens around 100k requests/day.
Don't fly blind in production.
We build observability into every AI engagement from day one. Book a 30-minute call and we'll review your instrumentation gaps.