The agent made it work on Tuesday. Wednesday morning the login screen rendered a white rectangle where the form used to be. No console error. No stack trace. A polite, uncooperative void. And I had not written a single line of the code behind that form. Claude Code had. This is what it feels like when you have to debug AI-generated code at 8 AM on a deadline day: the file tree looks like a stranger's handwriting, and the usual instincts fail you.
Vibe coding debugging is its own discipline. The muscle you use to debug your own work, that quiet voice that says I remember writing this weird bit, let me check there first, goes silent when the code is a stranger. You stare at a 400-line file generated last week and it reads like a confession in a language you half-learned in high school. Familiar shapes. No meaning.
Sofia and I have been debugging each other's agent output for the better part of a year, and we learned the hard way that the instinct to read is almost always wrong. The move is forensic, not interpretive. You do not need to understand the code. You need to find the moment it stopped working.
This is the playbook.
## The moment of panic (and why your first instinct is wrong)
Your first instinct is to open the broken file and start reading. Do not do this. Close the file. Stand up. Get water.
A detective walking into a crime scene does not start by interviewing the walls. They secure the scene, note what is out of place, and ask the one question that matters: what changed. Code that worked yesterday and does not work today has exactly one honest answer to that question, and the answer lives in your git history, not in the file itself.
The panic instinct is built for debugging your own code, because your own code has a context you already carry in your head. You remember writing it, which gives you a hypothesis about where the bug probably lives. With agent-generated code you have no such map. Reading 800 lines of unfamiliar code at 2 AM is roughly as productive as not reading them at all. Bisect, which I will show you in a moment, can find the broken commit in about seven steps for a 100-commit history. That is the difference between an hour and a week.
The other reason reading fails: AI-written code is often locally correct but globally strange. Every function looks fine in isolation. The bug is almost never in the function you are looking at. It is in the seam between two functions the agent wrote on different Tuesdays. Snyk's field report, The Highs and Lows of Vibe Coding, describes this failure mode: AI-generated code that passes review and still ships subtle integration bugs nobody catches until production.
## Step 1: Bisect with git (find the broken commit before reading any code)
git bisect is the single most useful debugging tool that almost nobody under 30 knows exists. It is a binary search over your commit history. You give it a good commit and a bad commit, and it walks you through the middle asking does it work here, until it lands on the exact commit that introduced the bug. Logarithmic. Boring. Works every time.
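To see why the walk is logarithmic, here is a minimal sketch of the search bisect performs. This is an illustration, not git internals: the `find_first_bad` helper and the commit numbers are hypothetical.

```python
def find_first_bad(is_bad, n_commits):
    """Binary-search a linear history where commit 0 is good and the last is bad.

    Assumes a single commit flipped the history from working to broken,
    which is exactly the assumption git bisect makes.
    """
    good, bad, tests = 0, n_commits - 1, 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(mid):
            bad = mid    # bug already present here: look earlier
        else:
            good = mid   # still works here: look later
    return bad, tests

# Fifty commits, bug introduced at (hypothetical) commit 31:
first_bad, tests = find_first_bad(lambda i: i >= 31, 50)
print(first_bad, tests)  # -> 31 5
```

Five probes to pin one commit out of fifty. That is the entire trick.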
Here is the full incantation.
```shell
git bisect start
git bisect bad          # current commit is broken
git bisect good HEAD~50 # fifty commits ago it worked
```

Git will check out a commit roughly in the middle. Now you test. Run your app, hit the broken flow, observe. If the bug is present, mark it bad. If the bug is gone, mark it good.
```shell
git bisect bad  # bug is here, go earlier
# or
git bisect good # bug is not here, go later
```

Keep going. In about six rounds for fifty commits, git will print the exact commit hash that introduced the bug. When you are done, unwind:
```shell
git bisect reset
```

If you can script the reproduction as a test or a curl that exits non-zero when the bug is present, you can let git do the entire walk automatically.
```shell
git bisect start HEAD HEAD~50
git bisect run ./scripts/reproduce-bug.sh
```

Go make coffee. It will find the bad commit without you. The official git bisect documentation covers the full flag set, but honestly the five commands above will handle 95% of real-world debugging.
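What does that reproduction script look like inside? It only has to honor bisect's exit-code contract: 0 means good, 1 through 127 mean bad, and 125 means this commit cannot be tested, so skip it. Here is a hypothetical sketch; the `login_page_html` probe is a stand-in for whatever one-second observation proves your bug.

```python
# scripts/reproduce-bug.py -- for: git bisect run python scripts/reproduce-bug.py
# bisect's exit-code contract: 0 = good, 1-127 = bad, except 125 = skip.

def login_page_html():
    # Stand-in for the real probe: a curl, an import-and-call, a single
    # unit test. Stubbed here so the sketch is self-contained.
    return "<div class='login'><form>...</form></div>"

def classify():
    try:
        html = login_page_html()
    except Exception:
        return 125                       # commit does not even run: tell bisect to skip
    return 0 if "<form" in html else 1   # missing form = the white-rectangle bug

# The real script ends with `import sys; sys.exit(classify())` so git can
# read the verdict. Shown as a print here to keep the sketch inspectable:
print(classify())  # -> 0
```

The 125 escape hatch matters with agent-generated history: some intermediate commits will not even build, and skipping them keeps the walk honest.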
One nuance for vibe coders: if you squash merges (`git merge --squash`) or push giant agent-generated commits, bisect still works but lands you on a single commit with thousands of lines. That is when you combine bisect with step 2.
## Step 2: Ask the agent to explain its own code (and verify the explanation)
Once bisect has handed you a specific commit, you have a scoped problem. The diff is in front of you. The agent that wrote the diff is still available. Put them back in the room together.
Open Claude Code in the project and paste something like: "You wrote this commit. Here is the diff. Walk me through exactly what it does, function by function, and tell me every side effect. Do not speculate. Do not paper over anything."
Then, and this is the part people skip, verify the explanation against the code. The agent will confidently explain what it meant to write. That is not always what it actually wrote. Treat the agent's explanation the way a good doctor treats a patient's self-diagnosis: useful input, never the final word. A differential diagnosis starts with what the patient reports and then orders the labs anyway.
You are not trying to understand the code in the abstract. You are looking for the specific discrepancy between what the agent says it did and what the code actually does. That gap is almost always where the bug lives. CodeRabbit's analysis of AI code generation patterns found that AI-written code is statistically more likely than human code to contain off-by-one errors, silent fallbacks, and swallowed exceptions. Those are the exact failure modes the agent will not mention unprompted because, from its perspective, they were intentional.
A useful follow-up prompt: "Where in this diff would you put a print statement to confirm the control flow actually goes where you think it goes?" Then put the print statement there and run it. The agent is surprisingly good at nominating the spot that most needs instrumentation when you ask directly.
## Step 3: Reproduce in isolation (the smallest test that fails)
Bisect told you where. The agent's explanation gave you a hypothesis about why. Now you need to prove or disprove that hypothesis quickly, in a loop, without spinning up the whole app.
This is the smallest failing test step. A mechanic listening to a bad engine does not take the whole thing apart. They isolate: one cylinder at a time, one injector at a time, until the noise localizes. Your job is the same. Strip the broken behavior down until you have the minimum code that reproduces it.
Sometimes that means a unit test. Sometimes a 20-line script in a /scratch folder that imports the one function and calls it with the breaking inputs. Sometimes a single curl against a running endpoint. The form does not matter. What matters is that the reproduction runs in under a second and gives a clear pass/fail signal.
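Here is the shape of that scratch script, sketched around a hypothetical pagination function. In your repo the function would be a single import from the real module; it is inlined here so the sketch runs on its own, and the "breaking input" and expected value are invented for illustration.

```python
# /scratch/repro.py -- minimal reproduction: one function, one captured
# breaking input, one unambiguous pass/fail line, well under a second.

def page_slice(items, cursor, size):
    # Stand-in for `from app.pagination import page_slice` (hypothetical).
    return items[cursor : cursor + size]

items, cursor, size = list(range(10)), 8, 5  # input captured from the failing request
expected = [8, 9]                            # what the contract says should come back

result = page_slice(items, cursor, size)
print("PASS" if result == expected else f"FAIL: got {result}")
```

The whole point is the last line: a binary verdict you can read without clicking through the app.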
Once you have a one-second reproduction, you can ask the agent to fix the bug while you watch, with an objective oracle to tell you whether the fix worked. No more "let me restart the dev server and click through three screens." No more I think it is fixed. You know because the red test went green. This is also the test you feed to git bisect run next time. Our 50 Claude Code tips piece has a whole cluster on isolation workflows if you want to go deeper.
## Step 4: The "delete and re-prompt" decision (when to throw it out)
Here is a move that feels wrong and is usually right: throw the code away and ask the agent to write it again.
Traditional debugging conditions you to view deletion as failure. You invested time in that code. Real engineers fix, they do not delete. Real engineers also did not have an AI that can regenerate 200 lines in 40 seconds. The economics are different now, and the instincts need to catch up.
The rule of thumb Sofia and I have landed on: if the buggy chunk is under 300 lines and the agent wrote it in a single session, and if you have a one-second reproduction test from step 3, delete it and re-prompt. Do not even try to fix it. Multiply the probability of success by the time cost and the math is not close.
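The back-of-envelope version of that math, with numbers you should replace with your own (all four figures below are assumptions, not measurements):

```python
# Hypothetical estimates: fixing in place means ~60 min of reading unfamiliar
# code with a 60% success rate; one regeneration cycle (re-prompt, ~40s of
# generation, run the step-3 test) takes ~5 min with a 50% hit rate.
p_fix, t_fix = 0.60, 60
p_regen, t_regen = 0.50, 5

# Expected minutes per success if you retry until it works (geometric model):
print(round(t_fix / p_fix), round(t_regen / p_regen))  # -> 100 10
```

An order of magnitude, under charitable assumptions for the manual fix. That is what "not close" means.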
What makes this work is the reproduction test. Without it, you are gambling. With it, you have an oracle. Delete the file, write a prompt like "Here is what this module needs to do. Here is the failing test. Make the test pass without touching anything else." If attempt two also fails the test, the problem is in your prompt or your test, not the agent. Fix those before a third regeneration.
When not to delete: if the bug is in code that integrates with many other parts of the system and you cannot cleanly isolate it, regeneration risks a cascade of new bugs. In that case, fix the specific line. This is the same idea we explored in Everyone can code, nobody can ship: the fragile part of vibe coding is not generation, it is the long tail of what you keep.
## Step 5: Capture the lesson in CLAUDE.md so it does not happen again
You found the bug. You fixed it. Do not move on yet. You are missing the step that separates a debugger from a senior engineer.
Every bug you just fixed is a teacher. The bug happened because the agent did not know something it needed to know. That knowledge gap is now your knowledge, which means it lives only in your head and will leave your head by Friday. Move it into the file the agent reads at the start of every session.
Open your CLAUDE.md and add one line. Not a paragraph. One line. Something like:
- Never swallow exceptions in the auth middleware silently. Always rethrow or log with context.
- When creating a new API route, verify the OpenAPI schema is updated in the same commit.
- Do not assume `user.email` is present. The schema allows null. Always check.
The rule should be the smallest sentence that, if the agent had known it before, would have prevented this bug. No examples, no preamble, no theory. Just the rule. Our definitive guide to CLAUDE.md lays out the seven sections we use and why rules files work this way.
The compounding effect is what matters. One rule per bug, over six months, is 30 to 50 rules. Each represents a class of mistakes the agent no longer makes in your repo. By the end of a year, the agent has genuinely learned your codebase, not through training but because every scar became a stitch in the contract. This is also how you avoid the security version of the same problem, covered in the vibe coding security rules.
One warning. Do not let CLAUDE.md become a wiki. Every line competes for attention in the context window. When you add a rule, check whether an existing one should be rewritten or removed instead. The goal is sharper, not longer.
## The five bug categories AI consistently ships (with one-line fixes for each)
After enough bisecting, the same shapes keep turning up. An archaeologist looking at a broken pot can date it by the glaze pattern; after a year of vibe coding, you start recognizing bug lineages the same way. Five categories cover almost everything we see, with a one-line fix for each.
- The silent fallback. The agent wraps something in a `try/except` that catches all exceptions and returns a default. The app never crashes, it just quietly returns wrong answers. Fix: grep for bare `except:` and `catch (e)` with empty bodies, and require every caught exception to be logged or rethrown.
- The hallucinated API. The agent calls a method that does not exist on a library it half-remembers from training data. Your linter may not catch it because the method is syntactically valid, just wrong. Fix: pin library versions in your CLAUDE.md and require the agent to verify method signatures against the real docs before use.
- The off-by-one on the boundary. Dates, array indices, pagination cursors. The agent gets the middle right and the edge wrong. Fix: every function that takes a range gets two tests, one for the inclusive edge and one for the exclusive edge, before any production use.
- The quietly duplicated state. The agent adds a new piece of state to solve the immediate problem without noticing that the same information already lives somewhere else. Now you have two sources of truth and they drift. Fix: before accepting any state addition, ask the agent to list every existing variable in the module and justify why a new one is needed.
- The schema drift. The database, the API types, and the frontend types diverge because the agent updated one and forgot the others. Fix: generate types from a single source of truth (Zod, Prisma, OpenAPI) and add a CI check that fails if generated types are out of date.
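The first category is the only one you can mechanically hunt for, and the AST is more reliable than regex for it. A hypothetical CI guard for Python code might look like this (the script name is invented; JavaScript's empty `catch (e) {}` would need a separate lint rule such as ESLint's `no-empty`):

```python
# ci/check_silent_fallbacks.py -- flag bare `except:` clauses, the
# signature of bug category 1.
import ast

def bare_excepts(source: str):
    """Return line numbers of `except:` handlers that catch everything."""
    return [
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]

sample = """\
try:
    result = fetch_user()
except:
    result = None  # silent fallback: never crashes, quietly wrong
"""
print(bare_excepts(sample))  # -> [3]
```

Wire it to fail CI when the list is non-empty and the agent physically cannot ship category one anymore.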
Those five cover the overwhelming majority of what we see, and none of them will be caught by reading code line by line. They are all integration bugs. They live in the seams. They are exactly why bisect plus a reproduction test plus a deletion mindset works better than interpretation.
The thing we did not expect, a year in, is that debugging vibe coded apps has made us better debuggers of our own code too. The forensic posture, the bias toward bisecting before reading, the discipline of writing the smallest failing test first, all of it transfers. The techniques you need to survive code you did not write are the techniques you should have been using all along.
We are still figuring out the edges. Sofia thinks the "delete and re-prompt" threshold should be higher. I think CLAUDE.md hits a ceiling around 200 rules and we will need a better structure. Neither of us has a clean answer for debugging agent code rewritten three times across sessions.
If you have a workflow that works, tell us. If the five categories match what you see in the wild, or you are seeing a sixth, tell us that too. The playbook is open-source in the oldest sense: it gets better when people share what broke. See you at the crime scene.