AI Agent Observability
The discipline of capturing, reconstructing, and evaluating AI agent behavior in production — making dark code answerable and agent decisions auditable.
Last updated: 2026-04-16
Why It Matters
Sarah Guo’s dark code thesis ends with a specific accountability test: “Can you say what your system did with your data on a specific Tuesday in March?” Observability is the engineering infrastructure that makes that question answerable.
Traditional observability (distributed tracing, log aggregation, error monitoring) was built for systems whose complexity was deliberate — you could enumerate what the software does. Agent systems generate execution paths at runtime, selected by LLMs from natural language instructions. That’s a different problem requiring different primitives.
The core new primitive: the trace — a complete reconstruction of a single agent interaction: every LLM call, tool invocation, retrieval step, and intermediate decision, with timing, token counts, context, and outcome. Think of it as a call stack for an agent.
The OpenTelemetry Standard
The industry is converging on OpenTelemetry (OTel) as the vendor-neutral standard for agent telemetry. The GenAI SIG has defined semantic conventions specifically for LLM and agent spans. Status as of 2026: Development (emerging, not fully stable, but widely adopted).
Key gen_ai.* Attributes
| Attribute | Purpose |
|---|---|
| gen_ai.operation.name | Operation type: chat, invoke_agent, create_agent, execute_tool, retrieval |
| gen_ai.provider.name | Provider: anthropic, openai, aws.bedrock, gcp.gemini, etc. |
| gen_ai.agent.id | Unique agent identifier |
| gen_ai.agent.name | Human-readable agent name |
| gen_ai.agent.version | Agent version |
| gen_ai.conversation.id | Session/thread identifier — critical for multi-turn attribution |
| gen_ai.request.model | Model requested |
| gen_ai.response.model | Model that actually responded |
| gen_ai.usage.input_tokens | Input tokens consumed |
| gen_ai.usage.output_tokens | Output tokens produced |
| gen_ai.usage.cache_read.input_tokens | Tokens served from cache |
| gen_ai.response.finish_reasons | Why generation stopped |
| error.type | Error class if operation failed |
Span Types
- create_agent — span for instantiating an agent. Span kind: CLIENT.
- invoke_agent — span for executing an agent. Span kind: CLIENT (remote) or INTERNAL (in-process).
- execute_tool — span for a tool call made on the agent's behalf. Span kind: INTERNAL.
A typical trace tree:

```
invoke_agent: user request
├── chat: LLM call #1 (model, prompt, tokens, finish_reason)
├── execute_tool: search_web (input, output, latency)
├── chat: LLM call #2 (with tool result in context)
└── execute_tool: write_file (path, outcome)
```
Trace Propagation (Multi-Agent)
For multi-agent systems, W3C TraceContext propagates a unique trace ID across service boundaries via HTTP headers:
traceparent: 00-{trace_id}-{span_id}-{sampled}
The extract-inject pattern: each service extracts parent context from incoming headers and injects current context into outbound calls. This is how you answer “which root request caused this downstream agent action?”
Known gap: As of 2026, context doesn’t automatically propagate to MCP servers during tool calls in some frameworks (e.g., Llama Stack) — requires manual header injection.
What to Capture (vs. Traditional Observability)
| Traditional | Agent Observability adds |
|---|---|
| Latency, error rate, throughput | Decision path — why this tool was chosen |
| Request/response | Context at decision time — what was in the window |
| Error traces | Chain of reasoning — intermediate steps |
| Service identity | Agent attribution — which human/workflow initiated this |
| — | Drift detection — is behavior changing on the same inputs? |
Sensitive data caveat: prompts often contain PII and confidential data. Recording gen_ai.input.messages and gen_ai.output.messages is opt-in in the spec — not captured by default. Sanitization before export to observability backends is an open engineering problem for regulated industries.
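One common shape for this is a redaction pass over message-bearing attributes before spans leave the process — a minimal sketch, where the pattern set and helper names are ours and any real deployment would need domain-specific rules:

```python
# Sketch: redacting PII from message attributes before export.
# Only gen_ai.input.messages / gen_ai.output.messages are scrubbed here;
# the regexes are illustrative, not a complete PII taxonomy.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

MESSAGE_KEYS = ("gen_ai.input.messages", "gen_ai.output.messages")

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

def sanitize_span_attributes(attributes: dict) -> dict:
    """Scrub message payloads; pass every other attribute through untouched."""
    return {
        k: redact(v) if k in MESSAGE_KEYS and isinstance(v, str) else v
        for k, v in attributes.items()
    }
```

In an OTel pipeline this logic would typically live in a span processor or collector stage, so sanitized data is all the backend ever sees.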
Instrumentation Approaches
| Approach | Setup | Depth | Trade-off |
|---|---|---|---|
| Proxy (e.g., Helicone) | URL change only | API-level only | Fastest setup, shallowest insight |
| SDK (e.g., Langfuse, Braintrust) | Import + init | Comprehensive | Moderate complexity |
| OpenTelemetry (e.g., OpenLLMetry, Phoenix) | OTel collector | Full span tree, integrates with existing infra | Most portable, slightly more setup |
Auto-instrumentation exists for OpenAI, Anthropic, LangChain, LlamaIndex — wraps API calls in spans automatically with <1ms overhead.
Tool Landscape (2026)
| Tool | Strength | Self-host | Notes |
|---|---|---|---|
| Langfuse | Open-source, trace viewer, prompt versioning, cost tracking | Yes (ClickHouse + Postgres) | Best for full data control |
| Braintrust | Eval + observability unified; “Loop” auto-generates scorers; converts traces to test cases in 1 click | Yes / VPC / cloud | SOC 2 Type II; most eval-native |
| Arize Phoenix | Open-source, strong eval integration | Yes | OTel-native |
| LangSmith | LangChain-native, deep trace + eval | Cloud | Tight LangChain integration |
| Helicone | Fastest setup (proxy), semantic caching, cost reduction | Yes | Shallowest insight |
| Galileo | Luna-2 evaluators, cost-effective quality scoring | No | Specialized for eval |
| Opik | SDK + OTel, open-source | Yes | Good for existing OTel infra |
| OpenLLMetry | OTel-native, MCP server, works with Honeycomb/Datadog/any OTel backend | Yes | Most infra-agnostic |
| Fiddler | Bias detection, explainability, compliance | VPC | Best for regulated industries |
Braintrust is already referenced in wiki/concepts/agent-evaluation for the Fintool case — its strength is straddling eval and observability, which is intentional: in agents, they’re the same problem.
The Dark Code Accountability Map
Guo’s dark code incidents mapped to observability primitives:
| Incident | Missing telemetry | OTel primitive that fixes it |
|---|---|---|
| Cross-tenant data leak via agent cache | No trace linking agent step → cache write → other tenant read | gen_ai.conversation.id + span parent chain |
| Agent bypassing human review step | No audit log of skipped steps | gen_ai.operation.name + span status on required gates |
| Prompt injection via web form | No trace showing external content influencing decisions | gen_ai.input.messages (opt-in) + tool call spans |
| ”Who did this?” unanswerable | No agent attribution | gen_ai.agent.id + W3C TraceContext propagation |
The SOC 2 problem: a compliance report describes controls. Observability proves the controls actually fired — for specific interactions, at specific times.
What’s Still Unsolved
- PII in traces — prompts contain personal data; sanitization before export is hard and framework-specific
- Quality vs. completion — a trace tells you the agent ran; it doesn’t tell you if the answer was good (eval layer required on top)
- Multi-agent attribution at scale — W3C TraceContext handles propagation theoretically; agent frameworks don’t all support it consistently yet
- Natural language control planes — if behavior is determined by a prompt interpreted in context, tracing the prompt isn’t enough; you need to capture what the model understood from it (intermediate reasoning)
- Sampling strategy — agents are prolific (100–1000x human request volume); tracing everything is expensive; sampling without losing the incidents that matter is unsolved
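On the sampling point, one pragmatic middle ground is a per-trace keep/drop rule applied once the outcome is known: keep every trace that errored, and a deterministic fraction of the rest. A stdlib sketch — the 5% ratio and the error flag are assumptions, and genuine tail-based sampling on richer signals needs a collector:

```python
# Sketch: keep all error traces, sample the rest deterministically from
# the trace_id (same idea as OTel's TraceIdRatioBased sampler), so every
# span in a trace reaches the same decision without coordination.
def should_sample(trace_id: str, had_error: bool, ratio: float = 0.05) -> bool:
    if had_error:
        return True  # never drop the incidents that matter
    # trace_id is assumed to be 32 hex chars; compare its top 64 bits
    # against the ratio threshold.
    bound = int(ratio * (1 << 64))
    return int(trace_id[:16], 16) < bound
```

The deterministic comparison is the key property: two services that see the same trace_id independently keep or drop the same trace, so sampled traces stay complete.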
Connections
- dark-code — observability is the direct engineering response; the accountability test Guo poses requires traces to answer
- agent-evaluation — eval and observability are the same problem in agent systems; Braintrust bridges both; domain-specific metrics require trace data
- agentic-engineering — Guardrails layer (Layer 4) relies on observability to detect and respond to violations
- agentic-system-failure-modes — mistake #6 (no evals) and drift detection both require observability infrastructure
- context-engineering — traces capture context at decision time; replay capability requires context snapshots
- agent-first-software — with 100–1000x agent volume, observability infrastructure must scale accordingly; cost tracking becomes critical
- model-context-protocol — MCP server calls are a key span type; context propagation to MCP tools is an open problem
Sources
- AI Agent Observability: Evolving Standards and Best Practices — OpenTelemetry Blog — added 2026-04-16
- Semantic Conventions for GenAI Agent and Framework Spans — OpenTelemetry — added 2026-04-16
- AI Observability Tools: A Buyer’s Guide 2026 — Braintrust — added 2026-04-16
- Distributed Tracing for Agentic Workflows with OpenTelemetry — Red Hat Developer — added 2026-04-16