AI Agent Observability

The discipline of capturing, reconstructing, and evaluating AI agent behavior in production — making dark code answerable and agent decisions auditable.

Last updated: 2026-04-16

Why It Matters

Sarah Guo’s dark code thesis ends with a specific accountability test: “Can you say what your system did with your data on a specific Tuesday in March?” Observability is the engineering infrastructure that makes that question answerable.

Traditional observability (distributed tracing, log aggregation, error monitoring) was built for systems whose complexity was deliberate — you could enumerate what the software does. Agent systems generate execution paths at runtime, selected by LLMs from natural language instructions. That’s a different problem requiring different primitives.

The core new primitive: the trace — a complete reconstruction of a single agent interaction: every LLM call, tool invocation, retrieval step, and intermediate decision, with timing, token counts, context, and outcome. Think of it as a call stack for an agent.


The OpenTelemetry Standard

The industry is converging on OpenTelemetry (OTel) as the vendor-neutral standard for agent telemetry. The GenAI SIG has defined semantic conventions specifically for LLM and agent spans. Status as of 2026: Development (emerging, not fully stable, but widely adopted).

Key gen_ai.* Attributes

| Attribute | Purpose |
| --- | --- |
| gen_ai.operation.name | Operation type: chat, invoke_agent, create_agent, execute_tool, retrieval |
| gen_ai.provider.name | Provider: anthropic, openai, aws.bedrock, gcp.gemini, etc. |
| gen_ai.agent.id | Unique agent identifier |
| gen_ai.agent.name | Human-readable agent name |
| gen_ai.agent.version | Agent version |
| gen_ai.conversation.id | Session/thread identifier — critical for multi-turn attribution |
| gen_ai.request.model | Model requested |
| gen_ai.response.model | Model that actually responded |
| gen_ai.usage.input_tokens | Input tokens consumed |
| gen_ai.usage.output_tokens | Output tokens produced |
| gen_ai.usage.cache_read.input_tokens | Tokens served from cache |
| gen_ai.response.finish_reasons | Why generation stopped |
| error.type | Error class if operation failed |
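As an illustration, the attribute set for a single chat span might be assembled like this. This is a minimal sketch: the helper name, parameter names, and defaults are illustrative, not part of the spec; only the gen_ai.* attribute keys come from the conventions above.

```python
# Sketch: assemble OTel GenAI semantic-convention attributes for one chat span.
# Helper name and defaults are illustrative; the gen_ai.* keys follow the spec.
def chat_span_attributes(request_model, response_model,
                         input_tokens, output_tokens,
                         finish_reasons, provider="anthropic",
                         conversation_id=None):
    attrs = {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,  # may differ from the request
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": finish_reasons,
    }
    if conversation_id:
        # Ties this span to its session/thread for multi-turn attribution.
        attrs["gen_ai.conversation.id"] = conversation_id
    return attrs
```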

Span Types

  • create_agent — span for instantiating an agent. Span kind: CLIENT.
  • invoke_agent — span for executing an agent. Span kind: CLIENT (remote) or INTERNAL (in-process).

A typical trace tree:

invoke_agent: user request
  ├── chat: LLM call #1 (model, prompt, tokens, finish_reason)
  ├── execute_tool: search_web (input, output, latency)
  ├── chat: LLM call #2 (with tool result in context)
  └── execute_tool: write_file (path, outcome)
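A trace viewer renders a tree like this by grouping flat span records on their parent span ID. A minimal stdlib sketch of that grouping (the record shape here is assumed for illustration, not taken from the spec):

```python
# Sketch: rebuild a trace tree from flat span records, as a trace viewer does.
# Each record carries its own span_id and its parent's span_id (None = root).
def build_tree(spans):
    children = {}
    root = None
    for span in spans:
        if span["parent_id"] is None:
            root = span
        else:
            children.setdefault(span["parent_id"], []).append(span)

    def render(span, depth=0):
        lines = ["  " * depth + span["name"]]
        for child in children.get(span["span_id"], []):
            lines.extend(render(child, depth + 1))
        return lines

    return "\n".join(render(root))

spans = [
    {"span_id": "a1", "parent_id": None, "name": "invoke_agent: user request"},
    {"span_id": "b1", "parent_id": "a1", "name": "chat: LLM call #1"},
    {"span_id": "b2", "parent_id": "a1", "name": "execute_tool: search_web"},
]
print(build_tree(spans))
```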

Trace Propagation (Multi-Agent)

For multi-agent systems, W3C TraceContext propagates a unique trace ID across service boundaries via HTTP headers:

traceparent: 00-{trace_id}-{span_id}-{sampled}

The extract-inject pattern: each service extracts parent context from incoming headers and injects current context into outbound calls. This is how you answer “which root request caused this downstream agent action?”
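A stdlib-only sketch of the header format and the extract-inject pattern; a real service would use an OTel propagator rather than hand-rolling this, but the mechanics are the same:

```python
import re
import secrets

# W3C TraceContext header: version-trace_id-span_id-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def extract(headers):
    """Pull (trace_id, parent_span_id, flags) from incoming headers, or None."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    return match.groups() if match else None

def inject(trace_id, headers, sampled=True):
    """Write a traceparent for an outbound call; returns the new span_id."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return span_id
```

Because the trace_id survives every hop while the span_id changes, the root request stays attributable across the whole agent chain.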

Known gap: As of 2026, context doesn’t automatically propagate to MCP servers during tool calls in some frameworks (e.g., Llama Stack) — requires manual header injection.


What to Capture (vs. Traditional Observability)

| Traditional | Agent Observability adds |
| --- | --- |
| Latency, error rate, throughput | Decision path — why this tool was chosen |
| Request/response | Context at decision time — what was in the window |
| Error traces | Chain of reasoning — intermediate steps |
| Service identity | Agent attribution — which human/workflow initiated this |
| | Drift detection — is behavior changing on the same inputs? |

Sensitive data caveat: prompts often contain PII and confidential data. Recording gen_ai.input.messages and gen_ai.output.messages is opt-in in the spec — not captured by default. Sanitization before export to observability backends is an open engineering problem for regulated industries.
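One common mitigation is to strip message-content attributes in a span processor before export, so token counts, models, and timings still reach the backend while prompt text never leaves the process. A stdlib sketch of the filtering step (the attribute names follow the spec; the function itself is illustrative):

```python
# Sketch: drop message-content attributes before spans are exported.
# Metadata (tokens, models, timings) still reaches the observability backend.
SENSITIVE_ATTRS = {"gen_ai.input.messages", "gen_ai.output.messages"}

def sanitize(attributes):
    return {k: v for k, v in attributes.items() if k not in SENSITIVE_ATTRS}
```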


Instrumentation Approaches

| Approach | Setup | Depth | Trade-off |
| --- | --- | --- | --- |
| Proxy (e.g., Helicone) | URL change only | API-level only | Fastest setup, shallowest insight |
| SDK (e.g., Langfuse, Braintrust) | Import + init | Comprehensive | Moderate complexity |
| OpenTelemetry (e.g., OpenLLMetry, Phoenix) | OTel collector | Full span tree, integrates with existing infra | Most portable, slightly more setup |

Auto-instrumentation exists for OpenAI, Anthropic, LangChain, LlamaIndex — wraps API calls in spans automatically with <1ms overhead.


Tool Landscape (2026)

| Tool | Strength | Self-host | Notes |
| --- | --- | --- | --- |
| Langfuse | Open-source, trace viewer, prompt versioning, cost tracking | Yes (ClickHouse + Postgres) | Best for full data control |
| Braintrust | Eval + observability unified; “Loop” auto-generates scorers; converts traces to test cases in 1 click | Yes / VPC / cloud | SOC 2 Type II; most eval-native |
| Arize Phoenix | Open-source, strong eval integration | Yes | OTel-native |
| LangSmith | LangChain-native, deep trace + eval | Cloud | Tight LangChain integration |
| Helicone | Fastest setup (proxy), semantic caching, cost reduction | Yes | Shallowest insight |
| Galileo | Luna-2 evaluators, cost-effective quality scoring | No | Specialized for eval |
| Opik | SDK + OTel, open-source | Yes | Good for existing OTel infra |
| OpenLLMetry | OTel-native, MCP server, works with Honeycomb/Datadog/any OTel backend | Yes | Most infra-agnostic |
| Fiddler | Bias detection, explainability, compliance | VPC | Best for regulated industries |

Braintrust is already referenced in wiki/concepts/agent-evaluation for the Fintool case — its strength is straddling eval and observability, which is intentional: in agents, they’re the same problem.


The Dark Code Accountability Map

Guo’s dark code incidents mapped to observability primitives:

| Incident | Missing telemetry | OTel primitive that fixes it |
| --- | --- | --- |
| Cross-tenant data leak via agent cache | No trace linking agent step → cache write → other tenant read | gen_ai.conversation.id + span parent chain |
| Agent bypassing human review step | No audit log of skipped steps | gen_ai.operation.name + span status on required gates |
| Prompt injection via web form | No trace showing external content influencing decisions | gen_ai.input.messages (opt-in) + tool call spans |
| “Who did this?” unanswerable | No agent attribution | gen_ai.agent.id + W3C TraceContext propagation |

The SOC 2 problem: a compliance report describes controls. Observability proves the controls actually fired — for specific interactions, at specific times.


What’s Still Unsolved

  1. PII in traces — prompts contain personal data; sanitization before export is hard and framework-specific
  2. Quality vs. completion — a trace tells you the agent ran; it doesn’t tell you if the answer was good (eval layer required on top)
  3. Multi-agent attribution at scale — W3C TraceContext handles propagation theoretically; agent frameworks don’t all support it consistently yet
  4. Natural language control planes — if behavior is determined by a prompt interpreted in context, tracing the prompt isn’t enough; you need to capture what the model understood from it (intermediate reasoning)
  5. Sampling strategy — agents are prolific (100–1000x human request volume); tracing everything is expensive; sampling without losing the incidents that matter is unsolved
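On the sampling point, one approach teams reach for is error-biased, trace-ID-deterministic sampling: keep every failed trace, and a fixed fraction of the rest, with the keep/drop decision computed the same way on every service so a trace is sampled or dropped as a whole. A sketch under those assumptions (the ratio and bucketing scheme are illustrative, and this still misses "successful but wrong" traces, which is why the problem remains open):

```python
def should_keep(trace_id: str, had_error: bool, keep_ratio: float = 0.05) -> bool:
    # Always keep traces with errors: those are incidents that matter.
    if had_error:
        return True
    # Deterministic decision: every service derives the same bucket from the
    # shared trace_id, so the whole trace is kept or dropped as a unit.
    bucket = int(trace_id[-8:], 16) / 0xFFFFFFFF
    return bucket < keep_ratio
```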

Connections

  • dark-code — observability is the direct engineering response; the accountability test Guo poses requires traces to answer
  • agent-evaluation — eval and observability are the same problem in agent systems; Braintrust bridges both; domain-specific metrics require trace data
  • agentic-engineering — Guardrails layer (Layer 4) relies on observability to detect and respond to violations
  • agentic-system-failure-modes — mistake #6 (no evals) and drift detection both require observability infrastructure
  • context-engineering — traces capture context at decision time; replay capability requires context snapshots
  • agent-first-software — with 100–1000x agent volume, observability infrastructure must scale accordingly; cost tracking becomes critical
  • model-context-protocol — MCP server calls are a key span type; context propagation to MCP tools is an open problem

Sources