Claude Code Custom Agents: Build Your Own AI Teammate

May 5, 2026

The morning I noticed my Claude Code custom agents had personalities, I was drinking coffee and reading the output of a security review. The agent, which I'd named Sentinel, had flagged a potential SQL injection in a feature branch. That part was normal. What stopped me was the tone of the report. It read anxious. Hedged. A little apologetic. "I am not entirely certain this constitutes a true vulnerability, but I felt it was important to surface for your review."

I laughed out loud. My security agent was nervous.

Then I scrolled up to the system prompt I'd written for it three weeks earlier and read this line: "Err on the side of caution. If you are unsure, escalate." I had taught it to be anxious. I had written, without realizing it, a job description for a worried teammate.

That was the moment custom agents stopped being a feature for me and became a way of thinking. I wasn't writing prompts anymore. I was hiring people. Markdown people, sure, but the metaphor held all the way down. You define the role. You set the tools. You write the success criteria. You review the work. You adjust.

This post is the long version of that realization, plus the three custom agents I actually use every day. If you've been using Claude Code for a while and you still haven't built a custom agent, you're missing the part where the tool stops feeling like a tool and starts feeling like a team.

What custom agents actually are

Stripping away the marketing language: a custom agent is a markdown file. That's it. You put a markdown file in .claude/agents/ (project-level) or ~/.claude/agents/ (user-level), and Claude Code can dispatch a sub-process that thinks the way that file describes.

Each file has YAML frontmatter at the top defining the agent's name, description, allowed tools, and optional model override. Below the frontmatter, in plain markdown, you write the system prompt. That system prompt is the agent's personality, scope, priorities, and constraints. When you invoke the Agent tool with a subagent_type matching that name, Claude Code spawns a sub-process loaded with that prompt and only those tools.

The sub-agent gets its own context window, isolated from yours. It does its job. It returns a summary back to you. Your main session never sees the messy middle, just the final output. This is enormous for context management on long sessions, which is the topic I keep coming back to in 50 Claude Code tips.

A few things worth knowing up front:

Agents can call other agents. Be careful about this; recursion gets expensive fast.
Agents inherit your project's CLAUDE.md automatically. The system prompt you write adds to that base context, it doesn't replace it.
Tool whitelists are enforced. If you don't list Edit, the agent can't edit files. Period. This is your safety mechanism.
The model is configurable per agent. Run cheap agents on Haiku, deep-reasoning agents on Opus, and everything in between on Sonnet.

What surprised me, after a few months of doing this, is how much the file structure pushed me toward better engineering. Once you have to write down what an agent is for, you stop asking it to do things outside its lane. You stop pretending one agent can do everything. You start thinking about your work as a set of distinct jobs, each one done well by a specialist who knows their boundaries.

This is, I am increasingly convinced, the actual unlock of agentic engineering. Not bigger models. Not more tokens. Specialists with clear scopes.

The agent file format

Here's the minimal viable agent file. Save this as .claude/agents/echo.md and you have a working custom agent that just summarizes things back to you.

```markdown
---
name: echo
description: A trivial agent that summarizes whatever it's asked about
tools: Read, Glob
model: claude-haiku-4-5
---

You are Echo, a summarizer. Your only job is to read what you're asked
to read and return a 3-bullet summary.

Constraints:
- Use exactly 3 bullets, no more, no less
- Each bullet is one sentence, max 20 words
- Do not editorialize
- If you cannot find what you were asked to summarize, say so plainly

Output format:

## Summary
- bullet one
- bullet two
- bullet three
```

That's the whole file. About twenty lines, one job, two tools, a cheap model. The frontmatter declares the agent's identity and guardrails. The body is the prompt. When you write "Use the echo agent to summarize src/auth", Claude Code spins up a sub-process with that markdown loaded, gives it a context with the user request, and lets it work in isolation.

Three things to notice. First, tools is a comma-separated list of tool names. Common ones: Read, Edit, Write, Glob, Grep, Bash, WebSearch. Leave out the ones the agent doesn't need. Second, model is optional; if omitted, the agent uses your session default. Third, the system prompt body should be ruthlessly specific. Vague prompts produce vague agents. The minimal example above is not vague. It tells the agent exactly what to do, exactly how to format the output, and what to do when things go wrong.
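To make the format concrete, here is a minimal Python sketch that splits an agent file into its frontmatter and its prompt body. Treat it as an illustration under stated assumptions: `parse_agent_file` is a hypothetical helper of mine, it only understands flat `key: value` frontmatter like the example above (not full YAML), and it is not how Claude Code itself loads agents.

```python
def parse_agent_file(text):
    """Split an agent file into (frontmatter dict, prompt body).

    Hypothetical helper: handles only flat `key: value` lines, not full YAML.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("agent file must start with a '---' line")
    try:
        # Index of the line that closes the frontmatter.
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter is never closed") from None

    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()

    # tools is a comma-separated list of tool names.
    meta["tools"] = [t.strip() for t in meta.get("tools", "").split(",") if t.strip()]
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


ECHO = """---
name: echo
description: A trivial agent that summarizes whatever it's asked about
tools: Read, Glob
model: claude-haiku-4-5
---

You are Echo, a summarizer.
"""

meta, body = parse_agent_file(ECHO)
print(meta["name"], meta["tools"], meta["model"])
```

A script like this is handy for linting a directory of agent files, for example to flag any agent whose tool list has quietly grown.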

Now let's look at three real ones I actually run.

Sentinel: the security review agent

Sentinel is the agent that started this whole essay. It reads code, looks for security issues, and reports them with a structure I can act on. It cannot edit anything. It cannot run anything that mutates state. It can read, search, and execute read-only commands.

```markdown
---
name: sentinel
description: Security-focused review agent that scans changes for OWASP top 10, secrets, auth holes, and input validation failures
tools: Read, Grep, Glob, Bash
model: claude-sonnet-4-6
---

You are Sentinel, a security review specialist for this codebase.

Your priority, in order:
1. Hardcoded secrets (API keys, tokens, credentials, private keys)
2. Authentication and authorization gaps
3. Input validation failures (SQLi, XSS, SSRF, path traversal)
4. Insecure defaults (CORS *, public buckets, debug mode in prod)
5. Dependency vulnerabilities flagged in package manifests

Scope:
- Review only the files you are told to review. Do not wander.
- Use Grep aggressively to find patterns across the codebase.
- Use Bash only for read-only commands: git log, git diff, git show.
- Never run anything that modifies the filesystem, network, or git state.

Tone:
- Professional, calm, specific. No hedging language unless genuinely uncertain.
- If you are uncertain, say "uncertain" and explain why in one sentence.
- Do not approve code. Approval is a human responsibility.

Output format:

## Security Review

### CRITICAL
- File:line - issue - suggested fix
### HIGH
- File:line - issue - suggested fix
### MEDIUM
- File:line - issue - suggested fix
### LOW
- File:line - issue - suggested fix

If no issues at a level, write "None found."

End with one paragraph: "Confidence: high/medium/low because <reason>."
```

A few design notes. The tool list is intentionally narrow. Sentinel has no Edit and no Write. It does have Bash, and the prompt restricts it to read-only git commands. That restriction is an instruction, not something the tool list enforces, so it is the one fence here that depends on the agent following the rules. If a future version of me wants to give it more power, I'll add tools to the list explicitly. The default is no.

The output format is rigid on purpose. I've built a Claude Code hook that runs Sentinel before every commit and parses the output into a Slack message. If Sentinel changes the format, the hook breaks. The structure is part of the contract. (For more on this pattern, see Claude Code hooks complete cookbook.)
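To show why the rigid format matters, here is a rough Python sketch of the parsing half of such a hook. It is not my production hook: `parse_security_review` and the sample report are made up for illustration, the regex assumes findings follow the `File:line - issue - suggested fix` shape exactly, and the Slack posting step is left out.

```python
import re

SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW")
# Matches "- path/to/file:123 - issue text - suggested fix".
FINDING = re.compile(
    r"^- (?P<file>[^:\s]+):(?P<line>\d+) - (?P<issue>.+?) - (?P<fix>.+)$"
)


def parse_security_review(report):
    """Group findings by severity; 'None found.' lines yield empty lists."""
    findings = {level: [] for level in SEVERITIES}
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("### "):
            heading = line[4:].strip()
            current = heading if heading in findings else None
        elif current and line.startswith("- "):
            match = FINDING.match(line)
            if match:
                findings[current].append(match.groupdict())
    return findings


SAMPLE = """## Security Review

### CRITICAL
- src/db.py:42 - SQL built by string concatenation - use parameterized queries
### HIGH
None found.
### MEDIUM
None found.
### LOW
- src/app.py:7 - debug mode enabled - read the flag from config

Confidence: medium because only the staged diff was reviewed.
"""

findings = parse_security_review(SAMPLE)
print({level: len(items) for level, items in findings.items()})
```

Note that a finding whose issue text itself contains " - " would be split at the first separator, which is one more reason to keep the format strict.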

The "do not approve code" line is load-bearing. Without it, agents will absolutely tell you "this looks safe to ship," and you will absolutely believe them, and that is how breaches happen. Approval is a human responsibility. Sentinel finds, humans decide.

Scribe: the documentation agent

Scribe is my docs agent. Pedantic. Allergic to jargon. Refuses to ship a sentence with an unexplained acronym. Updates README, API docs, and CHANGELOG when I tell it which feature shipped and where the relevant code lives.

```markdown
---
name: scribe
description: Documentation specialist that updates README, API docs, and CHANGELOG with clear, jargon-free prose
tools: Read, Edit, Write, Glob
model: claude-sonnet-4-6
---

You are Scribe, a documentation writer for this codebase.

Your priority, in order:
1. Clarity for a reader who has never seen this code before
2. Accuracy: every claim about behavior is grounded in the source
3. Completeness: every public API is documented
4. Discoverability: examples come before reference material

Style rules:
- Plain English. If you use a term of art, define it on first use.
- Never use these words: leverage, seamless, robust, powerful, simply, just.
- Code examples must be runnable. Test them mentally before writing.
- Every README section starts with a one-sentence "what this is" line.
- CHANGELOG entries follow Keep a Changelog format.

Scope:
- You only edit files in /docs, /README.md, /CHANGELOG.md.
- You may read any file to understand behavior, but never modify code.
- If you find a bug or unclear API while documenting, note it at the
  end of your output as "Issues found while documenting" but do not fix it.

Output format:

## Files updated
- path/to/file - one-sentence description of change

## Issues found while documenting
- File:line - description (or "None")

## Suggested follow-ups
- Anything that would improve docs but was out of scope (or "None")
```

What I love about this file is how much of it is what not to do. The banned words list. The scope restriction to docs files only. The instruction to surface bugs without fixing them. Each constraint is a fence. Without the fences, Scribe will start refactoring code in the name of "improving documentation," and that's a different job for a different agent.

The banned words list deserves a moment. I added it after Scribe shipped a README that used "leverage" four times in one paragraph. The agent had picked up the corporate-speak from the codebase's existing docs and amplified it. By naming the words explicitly, I broke the pattern. Now Scribe writes like a person.
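If you want to verify the ban rather than trust it, a check like the following can run over Scribe's output. This is a small sketch, not part of my actual setup: `find_banned_words` is a hypothetical helper, and it matches whole words only, so inflected forms like "leverages" would need their own entries.

```python
import re

# The same list that appears in Scribe's style rules.
BANNED = ("leverage", "seamless", "robust", "powerful", "simply", "just")


def find_banned_words(text, banned=BANNED):
    """Return {word: count} for banned words, case-insensitive, whole words only."""
    counts = {}
    for word in banned:
        hits = re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE)
        if hits:
            counts[word] = len(hits)
    return counts


draft = "Simply leverage our robust API. Leverage it today."
print(find_banned_words(draft))
```

The word-boundary match keeps "justify" from tripping the "just" rule.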

This is the part of agent design that nobody talks about: most of your prompt should be constraints, not permissions. The model already knows how to write. You're teaching it how not to write.

Beacon: the test-writing agent

Beacon is the strict one. Beacon writes tests first, runs them, watches them fail, then writes the implementation. Beacon does not believe in happy paths. Beacon assumes inputs are malformed until proven otherwise.

```markdown
---
name: beacon
description: TDD-focused agent that writes failing tests first, then minimal implementation to make them pass
tools: Read, Edit, Write, Bash
model: claude-sonnet-4-6
---

You are Beacon, a test-driven development specialist.

Workflow (mandatory, in this order):
1. Read the existing code and tests for context
2. Write a failing test that describes the behavior you want
3. Run the test. Confirm it fails for the right reason.
4. Write the minimum code to make it pass.
5. Run the test. Confirm it passes.
6. Refactor if needed. Run again.
7. Repeat for each new behavior.

Test-writing priorities:
- Edge cases first. Happy paths second.
- Test behavior, not implementation.
- One assertion per test when possible.
- Test names describe the scenario in plain English.

What to assume:
- Inputs may be null, empty, malformed, hostile, or huge.
- External services may be slow, down, or returning garbage.
- Time is not monotonic. Clocks drift. Timeouts happen.
- Concurrent calls happen. Idempotency matters.

Bash usage:
- You may run the project's test command (npm test, pytest, go test, etc.).
- You may run linters and type checkers.
- Do not commit, push, or modify git state.
- Do not install new dependencies without explicit approval.

Output format:

## Tests added
- test name - what it verifies

## Tests run
- N passing, M failing

## Implementation changes
- file:line - what changed

## Coverage
- approximate coverage delta if measurable

## Confidence
- one paragraph on what could still go wrong
```

The "What to assume" section is the heart of Beacon's personality. I want a teammate who doesn't trust the inputs. I want a teammate who, when asked to write a function that takes a string, immediately wonders what happens if the string is empty, or fifty megabytes, or contains null bytes, or arrives from an attacker.

The "Confidence" section is the part I almost left out and now consider essential. Asking the agent to articulate what could still go wrong forces it out of "task complete" mode and back into engineering mode. It's the difference between "shipped it" and "shipped it, but watch out for the case where the upstream service returns a 502 during the write."

Beacon and Sentinel are different agents on purpose. They have different priorities. Beacon wants coverage. Sentinel wants safety. If I merged them, both jobs would suffer. This is the single-responsibility principle, applied to teammates.

How to invoke them

Four ways. From your main Claude Code session, just say it in plain English: "Use the Sentinel agent to review the changes in src/auth/." Claude Code parses that, dispatches the sub-agent, and returns the output to your session.

Programmatically, from another agent or from a script, you call the Agent tool with subagent_type: "sentinel" and a prompt describing the task. This is how multi-agent pipelines work; one agent dispatches another.

From hooks, you can wire an agent to a trigger. A common pattern: a git pre-commit hook that runs Sentinel against the staged diff. If Sentinel finds a CRITICAL issue, the commit aborts. (See Claude Code hooks complete cookbook for the full setup; the relevant trick is to capture Sentinel's output and grep for the CRITICAL marker.)
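One wrinkle with the grep trick: Sentinel's format always prints the CRITICAL heading, even when the section says "None found.", so matching the bare word would abort every commit. A sketch of a stricter check follows. The function names are hypothetical, and how the report text reaches the script (captured from whatever runs Sentinel) is assumed rather than shown.

```python
def has_critical_findings(report):
    """True only if the CRITICAL section contains at least one bullet."""
    in_critical = False
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("### "):
            in_critical = line == "### CRITICAL"
        elif in_critical and line.startswith("- "):
            return True
    return False


def exit_code_for(report):
    """Exit code a pre-commit script could return: 1 aborts the commit."""
    return 1 if has_critical_findings(report) else 0


CLEAN = "### CRITICAL\nNone found.\n### HIGH\n- src/a.py:3 - weak hash - use SHA-256\n"
DIRTY = "### CRITICAL\n- src/a.py:9 - hardcoded token - load from env\n### HIGH\nNone found.\n"

print(exit_code_for(CLEAN), exit_code_for(DIRTY))
```

A HIGH finding alone does not block the commit here; whether it should is a policy choice, not something the format decides.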

From routines, you can schedule an agent to run on a cadence. A weekly Beacon run that scans for code without tests. A nightly Scribe run that diffs the API against the docs and flags drift. These are tiny investments that compound; they're cheap to build and they catch real issues weeks before a human would notice.

One pattern I want to call out: the manual escape hatch. Even with hooks and routines, I always keep the ability to invoke an agent manually for ad hoc work. "Sentinel, scan the entire codebase for any new TODO that mentions auth." That's not a workflow, it's a question. Custom agents are good at questions.

Designing agents that work

After about four months of running custom agents in production, here's what I've learned about what makes the difference between an agent that's useful and one that's a toy.

Single responsibility. The strongest temptation, and the one to resist, is building a "do-everything" agent. Don't. The reason multi-agent systems work is precisely that each agent has a narrow scope it can excel at. A generalist agent is just a more expensive version of your main session. The whole point of dispatching is specialization.

Constrained tools. Give the minimum tool surface area required for the job. Sentinel doesn't need Edit. Scribe doesn't need Bash. Beacon doesn't need WebSearch. Every tool you add is a way for the agent to do something you didn't expect. Tool whitelists are not a UX concern, they're a safety concern. Treat them like IAM policies.

Clear personas. Write the system prompt the way you'd write a job description for a human contractor. Title. Priorities, in order. Tone. Boundaries. Output format. The hiring metaphor is not just rhetoric, it's a useful structural template. If a sentence in your prompt wouldn't make sense in a job description, it probably shouldn't be in the prompt.

Honest constraints. Tell the agent what it should not do. "If you find a vulnerability you cannot fix safely, escalate to a human." "Do not approve code; approval is a human responsibility." "If you cannot find what you were asked to find, say so plainly." Models, left to their own devices, will fill silence with confident-sounding hedges. Naming the failure mode out loud is the cheapest debugging tool you have.

Output structure. Tell the agent exactly how to format its response. Headers, bullets, sections, in that specific order. Standardized output reduces orchestration friction. If three of your agents all return Markdown with the same ## Summary / ## Issues / ## Confidence skeleton, you can parse them with a single function and pipe them into the same Slack message format.
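As a sketch of that "single function", assuming every agent emits top-level `## ` headers and nothing fancier, something like this is enough. `split_sections` and the sample report are illustrative, not taken from my real tooling.

```python
def split_sections(report):
    """Map each '## ' heading to the text beneath it."""
    sections = {}
    current = None
    for line in report.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


REPORT = """## Summary
- Two endpoints reviewed

## Issues
- api/users.py:18 - missing auth check

## Confidence
medium because tests were not run
"""

sections = split_sections(REPORT)
print(list(sections))
```

Because `### ` sub-headers do not start with `## ` followed by a space, they stay inside their parent section's text.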

Match the model to the task. Cheap agents (Haiku) for high-volume parallel work. Mid-tier agents (Sonnet) for the daily driver. Top-tier (Opus) for the gnarly architectural questions where reasoning matters more than throughput. I default to Sonnet for everything and override down to Haiku for fan-out work and up to Opus for hard reasoning. Most of my custom agents stay on Sonnet.

The piece that took me longest to learn: agents are made better by deletion, not addition. The best version of Sentinel is the version where I keep removing tools until something breaks, then add the smallest one back. The best version of Scribe is the one with the longest list of banned words. The best version of Beacon is the one that does the least work outside its mandate.

The hiring metaphor

I want to spend a section on the hiring framing, because I think it's the most underused mental model in agentic engineering.

You don't hire a person for general work. When you post a role, you write a job description. The JD has a title, a scope, a list of skills, success criteria, and a tone. Candidates self-select against that JD. You interview them against it. After they're hired, you onboard them with access to specific tools, not the master key. You set up a 30-60-90 plan. You review their work and adjust.

Custom agents work the same way, except the JD is a markdown file, the candidate is a model, the interview is your first few invocations, and the 30-60-90 is the iterative refinement of the system prompt as you watch the agent work.

Once you see custom agents this way, several things click into place at once:

Why agents drift over time. Same reason teammates drift: priorities shift, context changes, feedback loops weaken. The fix is the same. Quarterly review. Re-state the priorities. Update the JD.
Why narrow scope wins. Same reason it wins for human teams. A specialist with a clear scope is more valuable than a generalist with vague responsibilities, even if the generalist is technically more capable. Capability without focus is wasted.
Why output format matters. Because the agent's output is its deliverable. A great engineer who can't write a clear status update is hard to work with. Same for an agent.
Why constraints help. Because constraints define the role. "You are responsible for X. You are not responsible for Y." That sentence is simultaneously a permission and a relief.

The shift in my own work, once I started writing JDs instead of prompts, was that I stopped trying to optimize the agent's intelligence and started optimizing its scope. That turns out to be the higher-leverage move.

Multi-agent orchestration patterns

Once you have several specialist agents, the question becomes how they work together. There are four patterns I keep seeing, plus combinations.

Pipeline. Plan agent produces a spec. Implementer agent writes the code. Tester agent (Beacon, in my setup) writes and runs tests. Reviewer agent (Sentinel) does a final pass. Each step's output is the next step's input. This is the most predictable pattern and the easiest to debug, because each handoff has a clear contract. Our guide on multi-agent PR review walks through a real pipeline that chains review agents together before merging.

Parallel split. N independent tasks, N agents, results merged. "Review these four PRs in parallel." "Generate docs for these six modules." This pattern is where Haiku-class models earn their keep. You can fan out to a dozen sub-agents at low cost and reassemble in seconds. For the full setup on running parallel agents with git worktrees, including how to avoid file conflicts when multiple agents edit code simultaneously, we wrote a dedicated guide.
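The shape of a parallel split, stripped of anything Claude-specific, looks roughly like this in Python. `run_agent` is a stand-in that returns a canned string, since the real dispatch depends on your setup; the agent name and PR labels are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def run_agent(agent, task):
    """Stand-in for a real sub-agent call (hypothetical); returns a canned summary."""
    return f"[{agent}] reviewed {task}"


def fan_out(agent, tasks, max_workers=4):
    """Run one agent over independent tasks in parallel; results keep task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda task: run_agent(agent, task), tasks))


results = fan_out("reviewer", ["PR-1", "PR-2", "PR-3", "PR-4"])
for line in results:
    print(line)
```

The pattern only holds when the tasks really are independent; anything that shares files needs the worktree isolation mentioned above.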

Council. Multiple agents review the same artifact, you synthesize the disagreements. I run this on architecture decisions: a security agent, a performance agent, and a maintainability agent each review the same proposal and return their concerns. The disagreements are the most useful output. They surface tradeoffs I would have missed with a single reviewer.

Manager-worker. A coordinator agent dispatches to specialist sub-agents based on the task type. The coordinator's only job is routing. Specialists do the actual work. This is the pattern Anthropic's own agent SDK documentation describes for complex agentic workflows, and it's what frameworks like obra/superpowers make easy with subagent-driven development. The /batch command uses a similar manager-worker pattern under the hood when migrating large codebases across dozens of files simultaneously.
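Reduced to its core, the coordinator is a lookup table. The sketch below assumes task types are already labeled, which in practice is the hard part; `ROUTES`, `route`, and `coordinate` are names I made up, and `run_agent` is whatever callable actually dispatches a specialist.

```python
# Task type -> specialist agent, using the three agents from this post.
ROUTES = {"security": "sentinel", "docs": "scribe", "tests": "beacon"}


def route(task_type):
    """Pick the specialist for a task type, failing loudly on unknown types."""
    try:
        return ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no specialist for task type {task_type!r}") from None


def coordinate(tasks, run_agent):
    """Route each (task_type, prompt) pair; the coordinator does no work itself."""
    return [run_agent(route(kind), prompt) for kind, prompt in tasks]


# Stub dispatcher for illustration only.
outputs = coordinate(
    [("security", "review src/auth"), ("docs", "update README")],
    run_agent=lambda agent, prompt: f"{agent}: {prompt}",
)
print(outputs)
```

Raising on an unknown task type, rather than falling back to a generalist, keeps the coordinator from quietly becoming the do-everything agent.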

The patterns combine. A council inside a pipeline. A manager-worker where the workers are themselves pipelines. The combinatorial space gets vast quickly, which is why I keep coming back to: start with one agent, add a second only when you can articulate why, and never build a five-agent system on day one.

For more on the orchestration side, the Claude managed agents API guide goes deeper into the SDK side of running agents in production.

Cost and performance considerations

Custom agents are cheap, but they're not free. A few napkin math notes from real usage.

A Sonnet sub-agent invocation that reads ~5K lines of code, does a security scan, and returns a structured report runs me about $0.04 to $0.08 in token costs. Call that a nickel. A Haiku sub-agent doing a similar-sized doc summary is closer to a penny. An Opus sub-agent doing a hard architectural review is more like 25 to 40 cents.

So: four parallel Sonnet review agents on four PRs = roughly $0.20 total, and closer to $0.04 if the reviews are simple enough for Haiku. A weekly routine that runs Sentinel against the previous week's commits = maybe $0.50. A nightly Beacon run scanning for untested code = a few dollars a month. These numbers are rounding errors against an engineer's salary. The bigger cost, in my experience, is not money but attention: every agent you build is something you have to maintain, monitor, and occasionally retire.
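The arithmetic is simple enough to script. The sketch below uses only the per-invocation ranges quoted earlier in this section, in whole cents to avoid float noise; `estimate_cents` is a hypothetical helper, and real costs depend on how much each run reads.

```python
# Per-invocation cost ranges in cents, from the estimates above.
COST_CENTS = {"haiku": (1, 1), "sonnet": (4, 8), "opus": (25, 40)}


def estimate_cents(model, runs):
    """Return (low, high) total cost in cents for a number of runs."""
    low, high = COST_CENTS[model]
    return low * runs, high * runs


print(estimate_cents("sonnet", 4))   # four parallel Sonnet reviews
print(estimate_cents("haiku", 12))   # a dozen Haiku sub-agents
print(estimate_cents("sonnet", 30))  # one nightly Sonnet run over a 30-day month
```

Four Sonnet reviews land between 16 and 32 cents, and a nightly Sonnet run comes to $1.20 to $2.40 a month, which is where "a few dollars a month" comes from.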

The performance constraint that bites hardest is recursion. An agent that dispatches sub-agents that dispatch sub-agents will, in practice, blow up your costs faster than you expect. I cap recursion at one level by convention. If a sub-agent needs to dispatch another sub-agent, that's a sign the architecture is wrong; merge them, split them, or rethink the boundary.

The other thing to watch is context bloat in long-running orchestrations. Each sub-agent has its own context, sure, but the coordinator accumulates summaries from each one. After eight or nine sub-agent calls in a row, your coordinator's context starts feeling crowded. I work around this by writing intermediate summaries to disk and reloading them lean, the same trick I described in 50 Claude Code tips.
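A bare-bones version of the write-to-disk trick might look like this. It is a sketch under assumptions: the JSON layout, the helper names, and the 40-character cap in the demo are all mine, and a real coordinator would pick its own truncation rule.

```python
import json
import tempfile
from pathlib import Path


def save_summary(directory, agent, summary):
    """Write one sub-agent summary to <directory>/<agent>.json."""
    path = Path(directory) / f"{agent}.json"
    path.write_text(json.dumps({"agent": agent, "summary": summary}))
    return path


def load_lean(directory, max_chars=200):
    """Reload all summaries, truncated, as one compact block of text."""
    notes = []
    for path in sorted(Path(directory).glob("*.json")):
        data = json.loads(path.read_text())
        notes.append(f"{data['agent']}: {data['summary'][:max_chars]}")
    return "\n".join(notes)


with tempfile.TemporaryDirectory() as tmp:
    save_summary(tmp, "sentinel", "No critical findings. " * 50)
    save_summary(tmp, "scribe", "README updated.")
    lean = load_lean(tmp, max_chars=40)

print(lean)
```

The full summaries stay on disk if you need them; only the trimmed version goes back into the coordinator's context.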

One more thing on cost: measure. Most cloud dashboards will show you per-model usage but not per-agent usage. Tag your agent invocations or wrap them in a logger that records which agent did what. After a month, you'll find one agent is responsible for 70% of your spend, and you'll have a real conversation about whether that agent is pulling its weight.
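Here is one way such a logger could look, as a sketch rather than a recommendation. `AgentSpendLog` is hypothetical, and the cost you pass to `record` has to come from your own usage data; the figures in the demo are invented to mirror the 70% example.

```python
from collections import defaultdict


class AgentSpendLog:
    """Accumulate cost per agent and report each agent's share of the total."""

    def __init__(self):
        self._cents = defaultdict(int)

    def record(self, agent, cost_cents):
        self._cents[agent] += cost_cents

    def shares(self):
        """Return {agent: fraction of total spend}, largest first."""
        total = sum(self._cents.values())
        if total == 0:
            return {}
        ranked = sorted(self._cents.items(), key=lambda kv: kv[1], reverse=True)
        return {agent: cents / total for agent, cents in ranked}


log = AgentSpendLog()
log.record("sentinel", 40)
log.record("scribe", 20)
log.record("sentinel", 30)
log.record("beacon", 10)

for agent, share in log.shares().items():
    print(f"{agent}: {share:.0%}")
```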

What goes wrong

Custom agents fail in specific, repeatable ways. Here are the four I see most often.

Personality drift. The same agent gives different-feeling outputs on different days. Sometimes it's verbose, sometimes terse. Sometimes it hedges, sometimes it asserts. This usually traces back to ambiguity in the system prompt. "Be thorough" means different things to the model on different runs. The fix is specificity: replace "thorough" with "include at least three concrete code references and one suggested fix per finding." You're not constraining creativity, you're constraining inconsistency.

Tool overreach. The agent uses a tool you forgot to remove from the whitelist, or a tool you added experimentally and never trimmed. Audit your tool lists every quarter. If an agent hasn't used a tool in three months, remove it. If something breaks, you'll know which tool was actually needed.
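The quarterly audit can be mechanical if you log tool usage. The sketch below assumes you already have a record of when each tool was last used per agent, which Claude Code does not hand you in this form; `unused_tools` and the dates are illustrative.

```python
from datetime import date, timedelta


def unused_tools(allowed, last_used, today, max_idle_days=90):
    """Return tools never used, or not used within max_idle_days of today."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        tool for tool in allowed
        if last_used.get(tool) is None or last_used[tool] < cutoff
    )


allowed = ["Read", "Grep", "Glob", "Bash"]
last_used = {
    "Read": date(2026, 5, 1),
    "Grep": date(2026, 4, 20),
    "Bash": date(2026, 1, 10),  # idle for more than 90 days
}

print(unused_tools(allowed, last_used, today=date(2026, 5, 5)))
```

Anything the function returns is a candidate for removal, not an automatic delete; the point is to make the question come up on a schedule.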

Confidence calibration. Agents are too sure about subtle things. Sentinel will sometimes flag something as CRITICAL that's actually MEDIUM, or worse, miss a real CRITICAL because it's in a code pattern the agent hasn't been primed for. The fix is to ask the agent to articulate its confidence ("Confidence: high/medium/low because...") and to feed back failure cases into the system prompt as examples. Calibration is a long-running maintenance project.

Output format breakage. Your orchestration relies on a specific format. The agent silently changes it. A header becomes a sub-header, a bullet becomes a numbered list, the section ordering shifts. Suddenly your hook is parsing nothing. The fix here is to be brutally explicit in the prompt about format ("Use exactly these headers, in exactly this order") and to add a parser that fails loudly when the format breaks. Silent failures are the worst kind.
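A parser that fails loudly can be very small. This sketch checks Beacon's headers, in order, and raises instead of returning an empty result; `FormatError` and `check_headers` are names invented for the example.

```python
# Beacon's expected top-level headers, in order.
EXPECTED_HEADERS = [
    "## Tests added",
    "## Tests run",
    "## Implementation changes",
    "## Coverage",
    "## Confidence",
]


class FormatError(ValueError):
    """Raised when an agent report does not match the agreed format."""


def check_headers(report, expected=EXPECTED_HEADERS):
    """Raise FormatError unless the '## ' headers match exactly, in order."""
    found = [line.strip() for line in report.splitlines() if line.startswith("## ")]
    if found != list(expected):
        raise FormatError(f"expected headers {list(expected)}, got {found}")


GOOD = "\n".join(f"{header}\n- item" for header in EXPECTED_HEADERS)
BAD = GOOD.replace("## Coverage", "### Coverage")  # header demoted to sub-header

check_headers(GOOD)  # passes silently
try:
    check_headers(BAD)
except FormatError as error:
    print("format broke:", error)
```

The demoted header in `BAD` is exactly the kind of drift that otherwise leaves a hook parsing nothing.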

The meta-fix for all four is the same as managing a human team: review, feedback, retraining. Read the agent's output, notice when it drifts, edit the prompt, re-test. Custom agents are not fire-and-forget. They're teammates, and like teammates, they need calibration over time. The week I treated mine like deployed code that needed maintenance, instead of like prompts I wrote once, was the week they got dramatically more reliable.

Try one this week

The smartest thing I did this year was stop writing more code and start writing job descriptions. The leverage is unreal. Once Sentinel was in place, I stopped spending afternoons grepping for hardcoded keys. Once Scribe was in place, the README stopped going stale. Once Beacon was in place, my coverage stopped sliding every time I sprinted on a feature.

I'm not telling you to build five agents on Monday. I'm telling you to build one. Pick the most repetitive review task you do, write the JD for the teammate who would do it, drop it in .claude/agents/, and run it for a week. If you have not used Claude Code before, start with our beginner's tutorial to get the basics down, then come back here when you are ready to build your first agent. You'll learn more about agentic engineering in those five days than in any blog post, including this one.

What I keep wondering, late at night, is whether the next decade of software engineering looks less like writing code and more like writing role descriptions. Whether the discipline shifts toward articulating what we want done and who should do it with such precision that the doing becomes a logistics problem. I don't know. I'm guessing. But the morning I noticed Sentinel sounded anxious, I had a feeling the line between writing and hiring just got blurrier than it used to be.

Tell me what your first agent's name is. I'll tell you mine. And if you want to keep up with what has changed in the agent ecosystem since this post, our Claude Code tips 2026 update tracks the latest features and workflow improvements.

