Observability for LLM apps: what to log, what to skip

Traditional observability gives you three things: logs, metrics, traces. The "three pillars" framework works fine for stateless web services. It works partially for LLM apps. The parts where it falls short are the parts that will hurt you in production — and the parts where teams either over-log to compensate or, more often, under-log and discover it the hard way.

This post is the playbook we use. It's been refined across a dozen production AI builds — fraud triage, regulatory drafting, leadership assessments, content generation. Same shape every time. Different things logged, different alarms set.

What's different about LLM observability

Three things separate LLM apps from traditional services from an observability standpoint:

Non-determinism — the same input can produce a different output. Your trace can't just be the function call graph; it has to capture the actual content of the model's response.
Quality is a multi-dimensional metric — "did it work" is no longer just status code 200. It's "was the output correct, was the citation accurate, was the tone right, did it refuse appropriately, did it hallucinate."
The cost-quality-latency triangle — every prompt change shifts all three. Without telemetry on all three you can't make informed trade-offs.

The minimum viable observability stack

Before you reach for LangSmith or Helicone or any other vendor, these are the things that must be in your own logs:

Per-call data we always capture

Request ID + parent trace ID (linked to the broader transaction)
Model ID with the provider's snapshot version, not just the logical alias
System prompt hash (content-addressable reference to the actual prompt)
Full message trace — every user/assistant/system/tool message
Tool calls and their responses, in order
Sampling parameters (temperature, top_p, max_tokens)
Latency to first token, latency to final token
Token counts: input, output, cached (if applicable)
Cost in USD, computed at write time
Outcome — structured result, plus a "did the downstream system accept this" boolean

Notice what's not there: the user's full session history, every retrieved document in full, every step of the model's internal chain of thought. We'll get to those.
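The per-call fields listed above can be sketched as a single structured record. This is a minimal illustration rather than a schema we prescribe: the field names, the `PRICES_PER_MTOK` table, and the model ID `example-model-2024-01-01` are all assumptions made up for the example, and the prices are placeholders you would replace with your provider's real rates.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Hypothetical prices in USD per million tokens; replace with real provider rates.
PRICES_PER_MTOK = {
    "example-model-2024-01-01": {"input": 3.00, "cached": 0.30, "output": 15.00},
}


def prompt_hash(system_prompt: str) -> str:
    """Content-addressable reference to the exact system prompt text."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def cost_usd(model_id: str, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Compute cost at write time, so later price changes don't rewrite history.

    Assumes input_tokens counts only uncached input and cached tokens are billed separately.
    """
    price = PRICES_PER_MTOK[model_id]
    total = (
        input_tokens * price["input"]
        + cached_tokens * price["cached"]
        + output_tokens * price["output"]
    )
    return round(total / 1_000_000, 6)


@dataclass
class LLMCallRecord:
    request_id: str
    parent_trace_id: str
    model_id: str              # provider snapshot, not the logical alias
    system_prompt_hash: str
    messages: list             # full user/assistant/system/tool trace, in order
    tool_calls: list           # tool calls and their responses, in order
    temperature: float
    top_p: float
    max_tokens: int
    first_token_ms: float
    final_token_ms: float
    input_tokens: int
    cached_tokens: int
    output_tokens: int
    cost_usd: float
    outcome: dict              # structured result
    downstream_accepted: bool  # did the downstream system accept this?

    def to_json(self) -> str:
        """Serialise to one JSON line for a queryable store."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the hash rather than the prompt text keeps each record small, on the assumption that the prompt itself lives in a versioned store keyed by that hash.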

What to skip (or log lazily)

The temptation is to log everything because storage is cheap. The cost isn't storage — it's the signal-to-noise ratio when you're debugging. Things we deliberately don't log eagerly:

Full document contents in RAG calls: we log document IDs and chunk IDs, plus a content hash. The actual text is fetched from the source-of-truth store if a debugger needs it. Otherwise you multiply your trace volume roughly tenfold for no day-to-day benefit.
Embedding vectors: never log them inline. Reference the embedding ID. Vectors bloat your logs and aren't useful in raw form anyway.
Streaming token-by-token output: log the assembled final response, not the stream. Streams matter for UX, not for replay.
Personally identifiable information that isn't needed for debugging: redact at log-write time. Once it's in your log store it's a compliance liability.
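Two of the rules above, logging references instead of document text and redacting at write time, can be sketched in a few lines. The helper names `chunk_reference` and `redact` are hypothetical, and the two regular expressions are deliberately simple illustrations; a real redaction pass would need patterns tuned to your own data and jurisdiction.

```python
import hashlib
import re

# Illustrative patterns only; they will miss many real-world PII formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers before the text reaches the log store."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def chunk_reference(doc_id: str, chunk_id: str, text: str) -> dict:
    """Log where a retrieved chunk came from and a hash of its content, not the content."""
    return {
        "doc_id": doc_id,
        "chunk_id": chunk_id,
        "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
```

The content hash lets a debugger confirm that the text fetched later from the source-of-truth store is the same text the model saw.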

Traces, not logs

The single most important shift is from logs (timestamped strings) to traces (structured spans with parent-child relationships). An LLM app worth observing has at minimum these spans per request:

request
├── input_validation
├── retrieval
│   ├── embedding
│   ├── vector_search
│   └── rerank
├── llm_call (× N if multi-step)
│   ├── prompt_assembly
│   ├── model_inference
│   └── response_parsing
├── tool_calls (× M if any)
│   ├── tool_lookup
│   └── tool_execute
├── post_processing
└── downstream_dispatch

OpenTelemetry handles this well. The instrumentation overhead is small if you do it right, and the payoff is immediate when something goes wrong.

The eval harness is observability

Most teams treat eval as a development-time concern. It isn't. It's the most important observability surface you have, because it's the only one that tells you whether the AI is actually doing its job.

Your eval harness needs to run in three modes:

Pre-deploy — every prompt or model change runs against a golden dataset; regressions block deploy. Standard CI gate.
Continuous on production traffic — sample, say, 1% of production calls; run them through automated scorers; track metrics over time. This is your drift early-warning system.
Human-in-the-loop spot checks — flagged or uncertain calls go to a review queue; humans label; labels feed back into the golden dataset. Closes the loop.
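The first two modes above reduce to two small decisions: which production calls get scored, and whether a candidate change regressed against the golden dataset. A minimal sketch follows, assuming scores are floats between 0 and 1 keyed by case ID. The function names `in_sample` and `gate` and the `tolerance` parameter are hypothetical, and hashing the request ID is just one way to get a stable sample.

```python
import hashlib


def in_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the request ID.

    The same request ID always gets the same answer, so retries and replays
    land in the same bucket.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def gate(baseline_scores: dict, candidate_scores: dict, tolerance: float = 0.02) -> list:
    """Return the golden-dataset cases whose score dropped by more than `tolerance`."""
    regressions = []
    for case_id, baseline in baseline_scores.items():
        candidate = candidate_scores.get(case_id, 0.0)  # a missing case counts as a failure
        if candidate < baseline - tolerance:
            regressions.append(case_id)
    return sorted(regressions)
```

In CI, a non-empty list from `gate` would fail the build; the tolerance exists because scores from non-deterministic models jitter between runs.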

The scorers themselves can be: rule-based (does the output have the required JSON keys?), classifier-based (is the tone polite?), or LLM-as-judge (is this answer factually consistent with the cited source?). Mix all three.
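One way to mix the three kinds of scorer is to give them all the same shape: a function from output text to a score. The sketch below implements only the rule-based JSON-keys check mentioned above; `required_keys_scorer` and `run_scorers` are hypothetical names, and a classifier or an LLM judge would plug in as another callable with the same signature.

```python
import json
from typing import Callable

# Every scorer maps a model output to a score in [0, 1].
Scorer = Callable[[str], float]


def required_keys_scorer(required: set) -> Scorer:
    """Rule-based scorer: 1.0 if the output is a JSON object with all required keys."""
    def score(output: str) -> float:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return 0.0
        if not isinstance(parsed, dict):
            return 0.0
        return 1.0 if required.issubset(parsed.keys()) else 0.0
    return score


def run_scorers(output: str, scorers: dict) -> dict:
    """Apply every named scorer to one output and collect the scores."""
    return {name: scorer(output) for name, scorer in scorers.items()}
```

Keeping the scores separate per scorer, rather than averaging them, preserves the multi-dimensional view of quality when the results land on a dashboard.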

Drift detection: the unglamorous superpower

Models change. Provider snapshots roll forward. Your retrieval corpus grows. User behaviour shifts. Any of these will gradually degrade your output quality, and you won't notice until a customer complains.

Drift detection means continuously asking: is the distribution of outputs today similar to the distribution last week? The signals that matter:

Output distribution drift: for classification tasks, has the proportion of each class shifted?
Length / verbosity drift: average response length creeping up usually means a quality regression
Refusal-rate drift: a sudden spike in "I can't help with that" responses usually means a prompt or model change broke something
Latency drift: model providers change their inference paths constantly; you should know within an hour, not a week
Eval-score drift: your continuous eval pipeline should be the canary that warns you early, not the parrot that repeats what customers have already told you
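Two of these signals are simple enough to sketch directly. The example below compares class proportions between two windows using total variation distance, which is our choice for illustration rather than the only reasonable metric, and estimates refusal rate by substring matching. The function names and the refusal marker phrases are assumptions for the example.

```python
from collections import Counter


def class_proportions(labels: list) -> dict:
    """Share of each class label in a window of outputs."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two distributions: 0 is identical, 1 is disjoint."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


def refusal_rate(outputs: list, markers=("i can't help with that", "i cannot help with that")) -> float:
    """Fraction of outputs containing a known refusal phrase (a crude proxy)."""
    if not outputs:
        return 0.0
    refused = sum(any(marker in output.lower() for marker in markers) for output in outputs)
    return refused / len(outputs)
```

For example, a week where 10% of calls were labelled `fraud` followed by a day where 30% were gives a distance of 0.2; what counts as an alarming distance depends on how noisy your own traffic is.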

Alarms worth waking someone up for

Page-worthy events for an LLM app, in our experience:

Eval score on production traffic drops by more than 2 standard deviations from baseline
Refusal rate increases by more than 3× in a 1-hour window
p95 latency increases by more than 50% in a 1-hour window
Cost per request increases by more than 30% over the daily average
Downstream "did this work" boolean fails on more than 1% of calls (against a normal baseline below 0.1%)

Everything else goes in a dashboard, not a pager. Page fatigue kills observability culture faster than any tool can save it.
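The five page-worthy conditions can be written as one function over a current window and a baseline. This is a sketch under assumptions: the `Window` fields and alert names are hypothetical, "increases by more than 3×" is read as "current is more than three times the baseline", and the baseline cost stands in for the daily average.

```python
from dataclasses import dataclass


@dataclass
class Window:
    eval_score: float               # mean eval score on sampled production traffic
    refusal_rate: float             # fraction of calls that refused
    p95_latency_ms: float
    cost_per_request_usd: float
    downstream_failure_rate: float  # fraction of calls the downstream system rejected


def page_worthy(current: Window, baseline: Window, eval_score_stdev: float) -> list:
    """Return the names of the alarms that should page someone."""
    alerts = []
    if current.eval_score < baseline.eval_score - 2 * eval_score_stdev:
        alerts.append("eval_score_drop")
    if current.refusal_rate > 3 * baseline.refusal_rate:
        alerts.append("refusal_rate_spike")
    if current.p95_latency_ms > 1.5 * baseline.p95_latency_ms:
        alerts.append("p95_latency_increase")
    if current.cost_per_request_usd > 1.3 * baseline.cost_per_request_usd:
        alerts.append("cost_increase")
    if current.downstream_failure_rate > 0.01:
        alerts.append("downstream_failures")
    return alerts
```

Anything that does not appear in the returned list belongs on the dashboard, which keeps the paging rules short enough to attach a runbook to each one.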

What we ship in week one

On every new AI build, before any user sees the system, we ship:

OpenTelemetry instrumentation with the span graph above
Structured per-call records in a queryable store (BigQuery, ClickHouse, or similar)
A pre-deploy eval gate in CI
A continuous eval pipeline running 1% of production traffic against scorers
A dashboard with the seven metrics that matter: eval score, latency p50/p95, cost per request, refusal rate, error rate, output-length distribution, and tool-call success rate
Alarms wired to PagerDuty or equivalent, with documented runbooks for each

This isn't optional. It's the bare minimum for running an LLM in production responsibly. Skip it and your "we shipped AI" announcement becomes a "we broke AI in production and didn't notice for two weeks" Slack thread.

The vendor question

Do you need LangSmith, Helicone, Langfuse, Braintrust, or one of the other LLM-observability vendors? Maybe. They're useful. They're not magic.

What they buy you: pre-built ingestion for OpenAI/Anthropic SDKs, decent UI for trace exploration, eval tooling, and dataset management. What they don't buy you: discipline. If you don't know what to log and why, a vendor dashboard will not save you.

Our recommendation: start with first-party OpenTelemetry and your existing data warehouse. Layer on a vendor when you have enough volume that their UI saves your team real hours. For most teams that crossover happens around 100k requests/day.

