AI Agent Observability

The discipline of capturing, reconstructing, and evaluating AI agent behavior in production — making dark code answerable and agent decisions auditable.

Last updated: 2026-04-16

Why It Matters

Sarah Guo’s dark code thesis ends with a specific accountability test: “Can you say what your system did with your data on a specific Tuesday in March?” Observability is the engineering infrastructure that makes that question answerable.

Traditional observability (distributed tracing, log aggregation, error monitoring) was built for systems whose complexity was deliberate — you could enumerate what the software does. Agent systems generate execution paths at runtime, selected by LLMs from natural language instructions. That’s a different problem requiring different primitives.

The core new primitive: the trace — a complete reconstruction of a single agent interaction: every LLM call, tool invocation, retrieval step, and intermediate decision, with timing, token counts, context, and outcome. Think of it as a call stack for an agent.


The OpenTelemetry Standard

The industry is converging on OpenTelemetry (OTel) as the vendor-neutral standard for agent telemetry. The GenAI SIG has defined semantic conventions specifically for LLM and agent spans. Status as of 2026: Development (emerging, not fully stable, but widely adopted).

Key gen_ai.* Attributes

| Attribute | Purpose |
| --- | --- |
| gen_ai.operation.name | Operation type: chat, invoke_agent, create_agent, execute_tool, retrieval |
| gen_ai.provider.name | Provider: anthropic, openai, aws.bedrock, gcp.gemini, etc. |
| gen_ai.agent.id | Unique agent identifier |
| gen_ai.agent.name | Human-readable agent name |
| gen_ai.agent.version | Agent version |
| gen_ai.conversation.id | Session/thread identifier — critical for multi-turn attribution |
| gen_ai.request.model | Model requested |
| gen_ai.response.model | Model that actually responded |
| gen_ai.usage.input_tokens | Input tokens consumed |
| gen_ai.usage.output_tokens | Output tokens produced |
| gen_ai.usage.cache_read.input_tokens | Tokens served from cache |
| gen_ai.response.finish_reasons | Why generation stopped |
| error.type | Error class if operation failed |
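As an illustration, the attribute set for a single chat span might be assembled like this. This is a minimal sketch: the helper name, parameter names, and defaults are illustrative, not part of the spec; only the gen_ai.* attribute keys come from the conventions above.

```python
# Sketch: assemble OTel GenAI semantic-convention attributes for one chat span.
# Helper name and defaults are illustrative; the gen_ai.* keys follow the spec.
def chat_span_attributes(request_model, response_model,
                         input_tokens, output_tokens,
                         finish_reasons, provider="anthropic",
                         conversation_id=None):
    attrs = {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,  # may differ from the request
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": finish_reasons,
    }
    if conversation_id:
        # Ties this span to its session/thread for multi-turn attribution.
        attrs["gen_ai.conversation.id"] = conversation_id
    return attrs
```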

Span Types

  • create_agent — span for instantiating an agent. Span kind: CLIENT.
  • invoke_agent — span for executing an agent. Span kind: CLIENT (remote) or INTERNAL (in-process).

A typical trace tree:

invoke_agent: user request
  ├── chat: LLM call #1 (model, prompt, tokens, finish_reason)
  ├── execute_tool: search_web (input, output, latency)
  ├── chat: LLM call #2 (with tool result in context)
  └── execute_tool: write_file (path, outcome)
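A trace viewer renders a tree like this by grouping flat span records on their parent span ID. A minimal stdlib sketch of that grouping (the record shape here is assumed for illustration, not taken from the spec):

```python
# Sketch: rebuild a trace tree from flat span records, as a trace viewer does.
# Each record carries its own span_id and its parent's span_id (None = root).
def build_tree(spans):
    children = {}
    root = None
    for span in spans:
        if span["parent_id"] is None:
            root = span
        else:
            children.setdefault(span["parent_id"], []).append(span)

    def render(span, depth=0):
        lines = ["  " * depth + span["name"]]
        for child in children.get(span["span_id"], []):
            lines.extend(render(child, depth + 1))
        return lines

    return "\n".join(render(root))

spans = [
    {"span_id": "a1", "parent_id": None, "name": "invoke_agent: user request"},
    {"span_id": "b1", "parent_id": "a1", "name": "chat: LLM call #1"},
    {"span_id": "b2", "parent_id": "a1", "name": "execute_tool: search_web"},
]
print(build_tree(spans))
```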

Trace Propagation (Multi-Agent)

For multi-agent systems, W3C TraceContext propagates a unique trace ID across service boundaries via HTTP headers:

traceparent: 00-{trace_id}-{span_id}-{sampled}

The extract-inject pattern: each service extracts parent context from incoming headers and injects current context into outbound calls. This is how you answer “which root request caused this downstream agent action?”
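A stdlib-only sketch of the header format and the extract-inject pattern; a real service would use an OTel propagator rather than hand-rolling this, but the mechanics are the same:

```python
import re
import secrets

# W3C TraceContext header: version-trace_id-span_id-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def extract(headers):
    """Pull (trace_id, parent_span_id, flags) from incoming headers, or None."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    return match.groups() if match else None

def inject(trace_id, headers, sampled=True):
    """Write a traceparent for an outbound call; returns the new span_id."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return span_id
```

Because the trace_id survives every hop while the span_id changes, the root request stays attributable across the whole agent chain.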

Known gap: As of 2026, context doesn’t automatically propagate to MCP servers during tool calls in some frameworks (e.g., Llama Stack) — requires manual header injection.


What to Capture (vs. Traditional Observability)

| Traditional | Agent Observability adds |
| --- | --- |
| Latency, error rate, throughput | Decision path — why this tool was chosen |
| Request/response | Context at decision time — what was in the window |
| Error traces | Chain of reasoning — intermediate steps |
| Service identity | Agent attribution — which human/workflow initiated this |
| | Drift detection — is behavior changing on the same inputs? |

Sensitive data caveat: prompts often contain PII and confidential data. Recording gen_ai.input.messages and gen_ai.output.messages is opt-in in the spec — not captured by default. Sanitization before export to observability backends is an open engineering problem for regulated industries.
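One common mitigation is to strip message-content attributes in a span processor before export, so token counts, models, and timings still reach the backend while prompt text never leaves the process. A stdlib sketch of the filtering step (the attribute names follow the spec; the function itself is illustrative):

```python
# Sketch: drop message-content attributes before spans are exported.
# Metadata (tokens, models, timings) still reaches the observability backend.
SENSITIVE_ATTRS = {"gen_ai.input.messages", "gen_ai.output.messages"}

def sanitize(attributes):
    return {k: v for k, v in attributes.items() if k not in SENSITIVE_ATTRS}
```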


Instrumentation Approaches

| Approach | Setup | Depth | Trade-off |
| --- | --- | --- | --- |
| Proxy (e.g., Helicone) | URL change only | API-level only | Fastest setup, shallowest insight |
| SDK (e.g., Langfuse, Braintrust) | Import + init | Comprehensive | Moderate complexity |
| OpenTelemetry (e.g., OpenLLMetry, Phoenix) | OTel collector | Full span tree, integrates with existing infra | Most portable, slightly more setup |

Auto-instrumentation exists for OpenAI, Anthropic, LangChain, LlamaIndex — wraps API calls in spans automatically with <1ms overhead.


Tool Landscape (2026)

| Tool | Strength | Self-host | Notes |
| --- | --- | --- | --- |
| Langfuse | Open-source, trace viewer, prompt versioning, cost tracking | Yes (ClickHouse + Postgres) | Best for full data control |
| Braintrust | Eval + observability unified; “Loop” auto-generates scorers; converts traces to test cases in 1 click | Yes / VPC / cloud | SOC 2 Type II; most eval-native |
| Arize Phoenix | Open-source, strong eval integration | Yes | OTel-native |
| LangSmith | LangChain-native, deep trace + eval | Cloud | Tight LangChain integration |
| Helicone | Fastest setup (proxy), semantic caching, cost reduction | Yes | Shallowest insight |
| Galileo | Luna-2 evaluators, cost-effective quality scoring | No | Specialized for eval |
| Opik | SDK + OTel, open-source | Yes | Good for existing OTel infra |
| OpenLLMetry | OTel-native, MCP server, works with Honeycomb/Datadog/any OTel backend | Yes | Most infra-agnostic |
| Fiddler | Bias detection, explainability, compliance | VPC | Best for regulated industries |

Braintrust is already referenced in wiki/concepts/agent-evaluation for the Fintool case — its strength is straddling eval and observability, which is intentional: in agents, they’re the same problem.


The Dark Code Accountability Map

Guo’s dark code incidents mapped to observability primitives:

| Incident | Missing telemetry | OTel primitive that fixes it |
| --- | --- | --- |
| Cross-tenant data leak via agent cache | No trace linking agent step → cache write → other tenant read | gen_ai.conversation.id + span parent chain |
| Agent bypassing human review step | No audit log of skipped steps | gen_ai.operation.name + span status on required gates |
| Prompt injection via web form | No trace showing external content influencing decisions | gen_ai.input.messages (opt-in) + tool call spans |
| “Who did this?” unanswerable | No agent attribution | gen_ai.agent.id + W3C TraceContext propagation |

The SOC 2 problem: a compliance report describes controls. Observability proves the controls actually fired — for specific interactions, at specific times.


What’s Still Unsolved

  1. PII in traces — prompts contain personal data; sanitization before export is hard and framework-specific
  2. Quality vs. completion — a trace tells you the agent ran; it doesn’t tell you if the answer was good (eval layer required on top)
  3. Multi-agent attribution at scale — W3C TraceContext handles propagation theoretically; agent frameworks don’t all support it consistently yet
  4. Natural language control planes — if behavior is determined by a prompt interpreted in context, tracing the prompt isn’t enough; you need to capture what the model understood from it (intermediate reasoning)
  5. Sampling strategy — agents are prolific (100–1000x human request volume); tracing everything is expensive; sampling without losing the incidents that matter is unsolved
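On the sampling point, one approach teams reach for is error-biased, trace-ID-deterministic sampling: keep every failed trace, and a fixed fraction of the rest, with the keep/drop decision computed the same way on every service so a trace is sampled or dropped as a whole. A sketch under those assumptions (the ratio and bucketing scheme are illustrative, and this still misses "successful but wrong" traces, which is why the problem remains open):

```python
def should_keep(trace_id: str, had_error: bool, keep_ratio: float = 0.05) -> bool:
    # Always keep traces with errors: those are incidents that matter.
    if had_error:
        return True
    # Deterministic decision: every service derives the same bucket from the
    # shared trace_id, so the whole trace is kept or dropped as a unit.
    bucket = int(trace_id[-8:], 16) / 0xFFFFFFFF
    return bucket < keep_ratio
```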

Connections

  • dark-code — observability is the direct engineering response; the accountability test Guo poses requires traces to answer
  • agent-evaluation — eval and observability are the same problem in agent systems; Braintrust bridges both; domain-specific metrics require trace data
  • agentic-engineering — Guardrails layer (Layer 4) relies on observability to detect and respond to violations
  • agentic-system-failure-modes — mistake #6 (no evals) and drift detection both require observability infrastructure
  • context-engineering — traces capture context at decision time; replay capability requires context snapshots
  • agent-first-software — with 100–1000x agent volume, observability infrastructure must scale accordingly; cost tracking becomes critical
  • model-context-protocol — MCP server calls are a key span type; context propagation to MCP tools is an open problem

Sources