Engineering · Observability · 14 min read · February 2026

Observability for LLM apps: what to log, what to skip

A pragmatic guide to traces, evals and drift detection for LLM-powered features in production. The instrumentation we ship before any AI feature goes live — and the over-engineered logging we deliberately avoid.

OlloSoft Engineering
Published February 19, 2026

Traditional observability gives you three things: logs, metrics, traces. The "three pillars" framework works fine for stateless web services. It works partially for LLM apps. The parts where it falls short are the parts that will hurt you in production — and the parts where teams either over-log to compensate or, more often, under-log and discover it the hard way.

This post is the playbook we use. It's been refined across a dozen production AI builds — fraud triage, regulatory drafting, leadership assessments, content generation. Same shape every time. Different things logged, different alarms set.

What's different about LLM observability

Three things separate LLM apps from traditional services from an observability standpoint:

  1. Non-determinism — the same input can produce a different output. Your trace can't just be the function call graph; it has to capture the actual content of the model's response.
  2. Quality is a multi-dimensional metric — "did it work" is no longer just status code 200. It's "was the output correct, was the citation accurate, was the tone right, did it refuse appropriately, did it hallucinate."
  3. The cost-quality-latency triangle — every prompt change shifts all three. Without telemetry on all three you can't make informed trade-offs.

The minimum viable observability stack

Before you reach for LangSmith or Helicone or any vendor, the things that must be in your own logs:

Per-call data we always capture
  • Request ID + parent trace ID (linked to the broader transaction)
  • Model ID with the provider's snapshot version, not just the logical alias
  • System prompt hash (content-addressable reference to the actual prompt)
  • Full message trace — every user/assistant/system/tool message
  • Tool calls and their responses, in order
  • Sampling parameters (temperature, top_p, max_tokens)
  • Latency to first token, latency to final token
  • Token counts: input, output, cached (if applicable)
  • Cost in USD, computed at write time
  • Outcome — structured result, plus a "did the downstream system accept this" boolean

Notice what's not there: the user's full session history, every retrieved document in full, every internal chain-of-thought step. We'll get to those.
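To make the per-call list concrete, here is a minimal sketch of one such record in Python. The field names, the `PRICE_PER_MTOK` table, its prices and the `example-model-2026-01-15` identifier are all illustrative assumptions, and the sketch assumes `input_tokens` excludes cached tokens. Treat it as one possible shape, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Hypothetical prices in USD per million tokens. Real prices depend on the
# provider and model snapshot, so load them from your own config.
PRICE_PER_MTOK = {
    "example-model-2026-01-15": {"input": 3.00, "output": 15.00, "cached": 0.30},
}


def prompt_hash(system_prompt: str) -> str:
    """Content-addressable reference: identical prompt text yields an identical hash."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def cost_usd(model_id: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Compute cost at write time so later price changes do not rewrite history."""
    price = PRICE_PER_MTOK[model_id]
    total = (
        input_tokens * price["input"]
        + output_tokens * price["output"]
        + cached_tokens * price["cached"]
    )
    return round(total / 1_000_000, 6)


@dataclass
class LLMCallRecord:
    request_id: str
    parent_trace_id: str
    model_id: str              # provider snapshot, not a logical alias
    system_prompt_hash: str
    messages: list             # every user/assistant/system/tool message
    tool_calls: list           # calls and responses, in order
    temperature: float
    top_p: float
    max_tokens: int
    latency_first_token_ms: float
    latency_final_token_ms: float
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cost_usd: float
    outcome: dict              # structured result
    downstream_accepted: bool  # did the downstream system accept this?

    def to_json(self) -> str:
        """Serialize to one JSON line for a queryable store."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the prompt hash rather than the prompt text keeps each record small, as long as the hash can be resolved back to the full prompt somewhere else.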

What to skip (or log lazily)

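The temptation is to log everything because storage is cheap. The cost isn't storage; it's the signal-to-noise ratio when you're debugging. The things we deliberately don't log eagerly include the ones flagged above: the user's full session history, every retrieved document in full, and every internal chain-of-thought step.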

Traces, not logs

The single most important shift is from logs (timestamped strings) to traces (structured spans with parent-child relationships). An LLM app worth observing has at minimum these spans per request:

request
├── input_validation
├── retrieval
│   ├── embedding
│   ├── vector_search
│   └── rerank
├── llm_call (× N if multi-step)
│   ├── prompt_assembly
│   ├── model_inference
│   └── response_parsing
├── tool_calls (× M if any)
│   ├── tool_lookup
│   └── tool_execute
├── post_processing
└── downstream_dispatch

OpenTelemetry handles this well. The instrumentation overhead is small if you do it right, and the payoff is immediate when something goes wrong.
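The parent-child structure of the span tree can be illustrated without any dependencies. The sketch below is a standard-library stand-in, not the OpenTelemetry API: the `span` helper, the `FINISHED` list and `handle_request` are hypothetical names. In a real build, a tracer context manager such as OpenTelemetry's `start_as_current_span` would play the role of `span`, and an exporter would replace the list.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0


# Tracks the span that is currently open, so new spans can find their parent.
_current_span = ContextVar("current_span", default=None)
FINISHED = []  # stand-in for an exporter; spans land here when they close


@contextmanager
def span(name: str, **attributes):
    """Open a child of whatever span is currently active."""
    parent = _current_span.get()
    s = Span(
        name=name,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        attributes=dict(attributes),
        start=time.perf_counter(),
    )
    token = _current_span.set(s)
    try:
        yield s
    finally:
        s.end = time.perf_counter()
        _current_span.reset(token)  # restore the parent as the active span
        FINISHED.append(s)


def handle_request(question: str) -> str:
    """Walk part of the span tree from the article with stubbed-out work."""
    with span("request", question_chars=len(question)):
        with span("input_validation"):
            pass
        with span("retrieval"):
            for step in ("embedding", "vector_search", "rerank"):
                with span(step):
                    pass
        with span("llm_call", step=1) as call:
            with span("prompt_assembly"):
                pass
            with span("model_inference"):
                answer = "stub answer"  # placeholder for the real model call
            with span("response_parsing"):
                pass
            call.attributes["output_chars"] = len(answer)
        with span("post_processing"):
            pass
        with span("downstream_dispatch"):
            pass
    return answer
```

Children close before their parents, so the `request` span is always the last one to reach `FINISHED`.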

The eval harness is observability

Most teams treat eval as a development-time concern. It isn't. It's the most important observability surface you have, because it's the only one that tells you whether the AI is actually doing its job.

Your eval harness needs to run in three modes:

  1. Pre-deploy — every prompt or model change runs against a golden dataset; regressions block deploy. Standard CI gate.
  2. Continuous on production traffic — sample, say, 1% of production calls; run them through automated scorers; track metrics over time. This is your drift early-warning system.
  3. Human-in-the-loop spot checks — flagged or uncertain calls go to a review queue; humans label; labels feed back into the golden dataset. Closes the loop.
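For the continuous mode, one simple way to pick the sampled calls is to hash the request ID, so the decision is deterministic and every service that sees the same ID agrees on it. This is a sketch under that assumption. `in_eval_sample` is a hypothetical helper, and the 1% default mirrors the example rate above rather than a recommended constant.

```python
import hashlib


def in_eval_sample(request_id: str, rate: float = 0.01) -> bool:
    """Return True for roughly `rate` of request IDs, always the same ones."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    sampled = sum(in_eval_sample(f"req-{i}") for i in range(100_000))
    print(f"sampled {sampled} of 100000 requests")
```

Hash-based sampling also makes a sampled call reproducible later: given the request ID, anyone can tell whether the scorers should have seen it.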

The scorers themselves can be: rule-based (does the output have the required JSON keys?), classifier-based (is the tone polite?), or LLM-as-judge (is this answer factually consistent with the cited source?). Mix all three.
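A sketch of how scorers and the pre-deploy gate might fit together. Only rule-based scorers are shown. A classifier or an LLM judge could sit behind the same `(output, example) -> float` signature. The names `has_required_keys`, `within_length_budget`, `score_dataset` and `gate`, the `required_keys` and `max_chars` fields, and the 0.02 tolerance are all assumptions made for illustration.

```python
import json
from statistics import mean
from typing import Callable, Dict, List

# A scorer takes the model output and its golden example, and returns 0.0 to 1.0.
Scorer = Callable[[str, dict], float]


def has_required_keys(output: str, example: dict) -> float:
    """Rule-based: is the output a JSON object containing the required keys?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if set(example["required_keys"]).issubset(parsed) else 0.0


def within_length_budget(output: str, example: dict) -> float:
    """Rule-based: does the output stay under the example's character budget?"""
    return 1.0 if len(output) <= example.get("max_chars", 2000) else 0.0


def score_dataset(outputs: List[str], golden: List[dict], scorers: Dict[str, Scorer]) -> Dict[str, float]:
    """Mean score per scorer across the golden dataset."""
    return {
        name: mean(fn(out, ex) for out, ex in zip(outputs, golden))
        for name, fn in scorers.items()
    }


def gate(scores: Dict[str, float], baseline: Dict[str, float], tolerance: float = 0.02) -> List[str]:
    """Return the metrics that regressed beyond tolerance; an empty list means pass."""
    return [
        name for name, base in baseline.items()
        if scores.get(name, 0.0) < base - tolerance
    ]
```

In CI, a non-empty list from `gate` would fail the build, and the same `score_dataset` call can run over sampled production traffic for the continuous mode.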

Drift detection: the unglamorous superpower

Models change. Provider snapshots roll forward. Your retrieval corpus grows. User behaviour shifts. Any of these will gradually degrade your output quality, and you won't notice until a customer complains.

Drift detection means continuously asking: is the distribution of outputs today similar to the distribution last week? The signals that matter:
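As one illustration of comparing this week's distribution with last week's, the sketch below computes a population stability index (PSI) over output lengths. PSI is a technique swapped in here for illustration, not something this post prescribes, and the bin edges and sample data are assumptions. An often-quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but thresholds should be tuned on your own traffic.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float], edges: List[float], eps: float = 1e-6) -> float:
    """Population stability index between two samples, using shared bin edges.

    `edges` must be sorted ascending. Empty bins are floored at `eps` so the
    logarithm stays defined.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1  # index of the bin containing v
        return [max(c / len(values), eps) for c in counts]

    base = proportions(baseline)
    cur = proportions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))


if __name__ == "__main__":
    # Hypothetical output lengths in characters.
    last_week = [100] * 50 + [300] * 50
    this_week = [300] * 20 + [500] * 80
    print(f"PSI = {psi(last_week, this_week, edges=[200, 400]):.2f}")
```

The same function works for any numeric signal logged per call, such as latency or token counts, as long as both windows use the same bin edges.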

Alarms worth waking someone up for

Page-worthy events for an LLM app, in our experience:

Everything else goes in a dashboard, not a pager. Page fatigue kills observability culture faster than any tool can save it.

What we ship in week one

On every new AI build, before any user sees the system, we ship:

  1. OpenTelemetry instrumentation with the span graph above
  2. Structured per-call records in a queryable store (BigQuery, ClickHouse, or similar)
  3. A pre-deploy eval gate in CI
  4. A continuous eval pipeline running 1% of production traffic against scorers
  5. A dashboard with the seven metrics that matter: eval score, latency p50/p95, cost per request, refusal rate, error rate, output-length distribution, and tool-call success rate
  6. Alarms wired to PagerDuty or equivalent, with documented runbooks for each

This isn't optional. It's the bare minimum for running an LLM in production responsibly. Skip it and your "we shipped AI" announcement becomes a "we broke AI in production and didn't notice for two weeks" Slack thread.
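Several of the dashboard metrics fall straight out of the structured per-call records. A minimal sketch, assuming each record is a dict with hypothetical `latency_final_token_ms`, `cost_usd`, `refused` and `error` fields. A real build would more likely run the equivalent query in the warehouse than in application code.

```python
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p * n / 100
    return ordered[rank - 1]


def summarize(records: List[dict]) -> Dict[str, float]:
    """Roll per-call records up into dashboard numbers."""
    n = len(records)
    latencies = [r["latency_final_token_ms"] for r in records]
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "cost_per_request_usd": sum(r["cost_usd"] for r in records) / n,
        "refusal_rate": sum(1 for r in records if r["refused"]) / n,
        "error_rate": sum(1 for r in records if r["error"]) / n,
    }
```

Eval score, output-length distribution and tool-call success rate would come from the scorers and tool spans rather than from this rollup.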

The vendor question

Do you need LangSmith, Helicone, Langfuse, Braintrust, or one of the other LLM-observability vendors? Maybe. They're useful. They're not magic.

What they buy you: pre-built ingestion for OpenAI/Anthropic SDKs, decent UI for trace exploration, eval tooling, and dataset management. What they don't buy you: discipline. If you don't know what to log and why, a vendor dashboard will not save you.

Our recommendation: start with first-party OpenTelemetry and your existing data warehouse. Layer on a vendor when you have enough volume that their UI saves your team real hours. For most teams that crossover happens around 100k requests/day.
