Agentic System Failure Modes
Six production failure patterns that silently destroy AI agents — not model failures, but system design mistakes that compound into disasters.
Last updated: 2026-04-23
Overview
Paul Iusztin’s diagnostic framework, drawn from two years of building and breaking AI agents in production. The key insight: most agents fail silently. They work well in demos, then drift unpredictably in production; costs spike without explanation, and every release feels risky. Teams end up stuck in “PoC purgatory” — unable to ship, debug, or trust their own system.
The root cause is almost never the model. It’s system design.
The Six Mistakes
1. Treating the Context Window as an Afterthought
What happens: When something breaks, engineers add more — more rules, more history, more tools, more examples. The context window becomes a dumping ground. The model starts ignoring instructions (the “lost in the middle” problem), applying constraints inconsistently, hallucinating more, and drifting across runs. Latency spikes and costs compound.
Fix: Treat the context window as a scarce resource. Every LLM call has one clearly scoped job. Curate context aggressively — select, compress, and prune before every call. Move persistence into a memory layer. The context window holds only what matters for the next decision.
Start with a single prompt. If it works, stop. If it fails, don’t jump to agents — introduce a small number of specialized steps and tune. Context engineering is about deliberate selection.
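The select → compress → prune discipline can be sketched as a small curation function. This is a toy sketch: the relevance check, summarizer, and token estimator below are crude stand-ins (assumptions, not a real API) for embeddings, an LLM summarizer, and a proper tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (stand-in for a real tokenizer).
    return max(1, len(text) // 4)

def is_relevant(entry: str, task: str) -> bool:
    # Toy relevance check: any shared word; swap in embeddings in practice.
    return bool(set(entry.lower().split()) & set(task.lower().split()))

def summarize(entry: str, max_chars: int = 200) -> str:
    # Stand-in for an LLM summarization call: hard truncation.
    return entry[:max_chars] + ("…" if len(entry) > max_chars else "")

def build_context(task: str, memory: list[str], token_budget: int = 2_000) -> str:
    """Curate context for one scoped LLM call: select, compress, prune."""
    # 1. Select: keep only entries relevant to this call's single job.
    relevant = [m for m in memory if is_relevant(m, task)]
    # 2. Compress: summarize long entries instead of pasting them verbatim.
    compressed = [summarize(m) if estimate_tokens(m) > 100 else m for m in relevant]
    # 3. Prune: drop oldest entries until the budget holds.
    while compressed and sum(map(estimate_tokens, compressed)) > token_budget:
        compressed.pop(0)
    return "\n".join(compressed)
```

The point is the shape, not the heuristics: curation runs before every call, and the persistent store (`memory`) lives outside the window.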
See context-engineering for the full discipline.
2. Starting with Complicated Solutions
What happens: You immediately reach for multi-agent architectures, heavy frameworks, RAG pipelines, hybrid retrieval, multiple databases, or MCP. Every layer adds a hidden tax: more dependencies, higher latency, higher costs, harder debugging. Teams spend months building infrastructure and shipping nothing.
Real example: ZTRON built a multi-index RAG system — OCR pipelines, separate embedding pipelines, hybrid retrieval, agentic RAG loops. Simple queries took 10–15 seconds. When they finally asked if they actually needed all this: no. Their data fit in modern context windows. They replaced agentic RAG with cache-augmented generation (CAG) for most workflows: fewer LLM calls, lower latency, fewer errors, easier to debug.
Fix: Start with the simplest solution that could work. Prove the core task works first. Add memory, tools, retrieval, or multiple agents only when the problem demands it.
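The CAG swap in the ZTRON example can be sketched in a few lines: when the corpus fits in a modern context window, load it once into a fixed prompt prefix instead of running a retrieval pipeline per query. `call_llm` is a hypothetical stand-in for any chat-completion client; the prefix wording is illustrative.

```python
def load_corpus(paths: list[str]) -> str:
    # Concatenate the whole corpus once, labeled by file.
    parts = []
    for p in paths:
        with open(p, encoding="utf-8") as f:
            parts.append(f"## {p}\n{f.read()}")
    return "\n\n".join(parts)

class CAGAnswerer:
    def __init__(self, corpus: str, call_llm):
        # The corpus is baked into a fixed prefix; providers can cache this
        # prefix so repeated queries don't re-pay for the same tokens.
        self.prefix = f"Answer using only these documents:\n\n{corpus}\n\n"
        self.call_llm = call_llm

    def ask(self, question: str) -> str:
        # One LLM call per query: no OCR, embedding, or retrieval hop.
        return self.call_llm(self.prefix + f"Question: {question}")
```

Compared with agentic RAG, the failure surface shrinks to one call: fewer hops to debug, lower latency, and one place where answers can go wrong.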
3. Building Agents When a Workflow Will Do
What happens: Predictable tasks (data ingestion, summarization, report generation) get treated as if they need agents. Agents handle open-ended tasks (deep research, dynamic decision-making under uncertainty). Using an agent for a structured task pays for autonomy you don’t need: unpredictable behavior, variable latency, higher token usage, inconsistent outputs. The system works 80% of the time and fails when it matters most.
The autonomy slider: Workflows and agents aren’t binary — they’re a spectrum. More autonomy buys flexibility but costs predictability, cost control, and debuggability.
Workflow ←————————————————————→ Agent
- deterministic ↔ autonomous
- cheap ↔ expensive
- debuggable ↔ unpredictable
Fix: Workflow-first approach. Start with prompt chaining, routing, parallelization, or orchestrator-worker patterns. Introduce agents only when the system must autonomously plan, explore unknown paths, or recover from failures dynamically. For vertical AI agents: route known patterns to workflows, open-ended requests to agents.
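The routing half of the fix can be sketched as a dispatch table: known request patterns go to deterministic workflows, and only genuinely open-ended requests fall through to an agent. The patterns and workflow names below are illustrative assumptions.

```python
import re

# Known, predictable request patterns → deterministic workflows.
WORKFLOW_ROUTES = [
    (re.compile(r"\bsummar(y|ize)\b", re.I), "summarization_workflow"),
    (re.compile(r"\breport\b", re.I), "report_workflow"),
    (re.compile(r"\b(ingest|upload)\b", re.I), "ingestion_workflow"),
]

def route(request: str) -> str:
    for pattern, workflow in WORKFLOW_ROUTES:
        if pattern.search(request):
            return workflow  # predictable, cheap, debuggable
    return "agent"           # pay for autonomy only when the task demands it
```

In a vertical product, the routing table grows as patterns stabilize, shrinking the share of traffic that needs (and pays for) agent autonomy.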
4. Fragile Parsing of LLM Outputs
What happens: You parse model output with regex, string splitting, or custom logic. It works in staging. Then one day, a missing comma or different bullet style crashes production. LLMs are non-deterministic — even with identical prompts, output can drift due to context changes, model updates, or tool output variations. Prompting for JSON is better but still not a contract: missing keys, wrong types, drifting nested fields.
Fix: Stop treating LLM outputs like text — treat them like data. Define a schema, enforce it at generation time, validate at runtime, and fail fast when wrong. Use Pydantic as the bridge between probabilistic generation and deterministic code.
Only use structured outputs when structure is required. If you only need a plain string, accept a string. Keep schemas shallow and minimal.
5. Forgetting Agents Need Planning
What happens: You give a model tools, let it pick one, feed the result back, and repeat. This looks agentic but is just a workflow with randomness. The system reacts to the last tool output — it’s not driving toward a goal. Without planning, the loop can’t decompose tasks into meaningful steps, evaluate progress, or choose next actions intentionally. Result: random behavior, unnecessary tool calls, infinite loops, shallow reasoning.
Fix: Embed planning into the loop. Before calling a tool, require a reasoning step: what is the goal, what is the next best action, what evidence is needed. Add progress checks and stop conditions (max steps, token budgets, escalation when stuck).
Make planning use-case specific. Generic ReAct is not a product. Tailor planning to your specific tools, data, constraints, and failure modes.
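The planning loop with stop conditions can be sketched as follows. `plan_next` and `execute_tool` are hypothetical stand-ins for an LLM planner and a tool runner, and the step budget is an assumption to tune per use case:

```python
MAX_STEPS = 10  # stop condition: step budget (token budgets work the same way)

def escalate(goal: str, history: list) -> str:
    # Hand off to a human or a fallback path rather than looping forever.
    return f"ESCALATED: {goal} after {len(history)} steps"

def run_agent(goal: str, plan_next, execute_tool) -> str:
    history = []
    for _ in range(MAX_STEPS):
        # Planning step before any tool call: what is the goal, what is the
        # next best action, what evidence is needed.
        plan = plan_next(goal=goal, history=history)
        if plan["action"] == "finish":
            return plan["answer"]          # progress check passed
        if plan["action"] == "stuck":
            return escalate(goal, history)  # explicit "I'm stuck" exit
        result = execute_tool(plan["action"], plan.get("args", {}))
        history.append({"plan": plan, "result": result})
    return escalate(goal, history)          # step budget exhausted
```

The difference from a bare tool loop is that each iteration produces an explicit plan driving toward the goal, and every exit path (finish, stuck, budget) is deliberate rather than an infinite loop.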
6. Not Starting with AI Evals from Day Zero
What happens: You build features without tracking how well the AI behaves — no tests, no evaluation metrics, no defined success criteria. Every release is a gamble. AI systems don’t fail all at once: they decay. A prompt change, new tool, or model upgrade causes subtle behavior shifts. Without evals, nobody can answer whether a change made things better or worse.
Teams rely on vibe evals — manual, gut-feel testing that doesn’t scale. Generic scores like “helpfulness 3.7/5” tell you nothing about what to fix.
Fix: Use evals as your north star. Define task-specific, binary metrics tied to real system behavior and business requirements — from day one. Integrate evals into your development workflow to catch regressions before users do. Use the eval flywheel: build → eval → improve → repeat.
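A sketch of what task-specific, binary metrics look like in code (the checks and the citation convention are illustrative assumptions). Each check answers pass/fail for one concrete behavior, unlike a generic "helpfulness 3.7/5":

```python
import json

def eval_is_valid_json(output: str) -> bool:
    # Binary: did the system produce parseable JSON?
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def eval_cites_source(output: str) -> bool:
    # Binary: does the answer cite a source? (assumed "[source:" convention)
    return "[source:" in output

def run_evals(cases) -> dict[str, float]:
    """cases: list of (output, [checks]); returns pass rate per check name."""
    results: dict[str, list[bool]] = {}
    for output, checks in cases:
        for check in checks:
            results.setdefault(check.__name__, []).append(check(output))
    return {name: sum(passed) / len(passed) for name, passed in results.items()}
```

Run in CI on a fixed case set, a drop in any per-check pass rate pinpoints what a prompt change or model upgrade broke — closing the build → eval → improve → repeat loop.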
See agent-evaluation for the full eval discipline including domain-specific failure categories, LLM-as-judge patterns, and the eval progression model.
Context Rot and the Review Problem
A distinct failure mode that compounds with long-running agents: context rot. After 30+ minutes of work, the agent’s context hits compaction. When a human asks for a targeted fix (“change clause three”), the agent — now without full prior context — may rewrite the entire document, or introduce inconsistencies across sections.
The resulting review problem: the human can’t tell what changed. In a chat interface, there’s no structural diff. Did only clause three change? Probably not. And there’s no way to verify without reading everything again.
This compounds failures #1 (context window as afterthought) and #5 (no planning in agent loops):
- If context was managed well (short steps writing to persistent artifacts), compaction has less to destroy
- If planning was embedded, the agent’s intent for each section was explicit, making deviations easier to spot
The fix is artifact-centric execution: write results to documents (not memory), keep each agent step short, and use review interfaces that show structural diffs rather than linear chat history. See agent-human-collaboration for the interface design response.
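Artifact-centric execution can be sketched as a step runner that persists each short step's intent and result to a file; the record structure and paths are illustrative assumptions:

```python
import json
from pathlib import Path

def run_step(artifact_dir: Path, step_name: str, intent: str, produce) -> str:
    # Keep each step short and singly scoped; `produce` is the (hypothetical)
    # agent call that does the work.
    result = produce()
    record = {"step": step_name, "intent": intent, "result": result}
    # The artifact, not the context window, is the source of truth: compaction
    # can't destroy it, and a reviewer can diff these files per section
    # instead of re-reading a linear chat history.
    (artifact_dir / f"{step_name}.json").write_text(json.dumps(record, indent=2))
    return result
```

Because each record carries the explicit intent, a reviewer asking "did only clause three change?" can check one small file rather than the whole document.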
The Compounding Effect
Individually, each mistake looks small. In production, they compound:
- Bloated context → unreliable outputs → you add more rules → context bloats further
- Complex architecture → slow debugging → missed regressions → more complexity to compensate
- No evals → can’t detect decay → silently shipping regressions → loss of trust
PoC purgatory is the end state: a system that demos well but can’t be reliably shipped, debugged, or trusted.
The Decision Ladder
1. Can a single prompt handle this? → use it
2. Does it need a few specialized steps? → use a workflow
3. Does it need autonomy / unknown paths? → use an agent
4. Does it need parallel/distributed work? → use multiple agents
Add complexity only when you’ve proven the simpler version is insufficient.
Connections
- context-engineering — mistake #1 (context window) is a context engineering failure; entropy reduction and minimal sufficiency principles are the fix
- agent-evaluation — mistake #6 is the central argument for eval-driven development; domain-specific binary metrics over vibe evals
- agentic-engineering — complementary framework: Hoekstra’s four layers (Configuration, Capability, Orchestration, Guardrails) are the architectural response to these failure modes
- dark-code — mistakes #2, #3, and #5 compound into dark code: complex architectures with agent loops nobody can fully trace
- thin-harness-fat-skills — mistake #2 and the CAG example both validate keeping architectures minimal; complexity is earned, not assumed
- coding-agent — the six-component harness implicitly addresses several of these: context compaction (#1), memory (#1), subagents (#3 autonomy slider)
- agent-human-collaboration — context rot and the review problem; artifact-centric execution as the fix