Claude Code in GitHub Actions: CI/CD Guide

James Park, Senior Developer Advocate
13 min read

Friday, 6:47 PM. The PR landed. Title: "chore: bump core deps + tidy types." Description: empty. Diff: 14 files, 3 dependency bumps, two of them major. The author's out-of-office was already on. Both seniors had ski emojis in their Slack statuses. I felt the specific dread of a weekend PR that will rot if nobody touches it.

So I did the thing I had promised the team for months: wired up Claude Code GitHub Actions on the repo, pushed the workflow file, then closed and reopened the dependency PR to trigger the new pipeline. Twelve minutes later, the bot had posted nine inline comments, a top-level summary, and a tagged callout: "This bump silently changes the default behavior of compress() from gzip to brotli. If your CDN expects gzip Content-Encoding, this will return 415 in production."

That was the breaking change. The one a tired human would have missed at 7 PM on a Friday. The PR sat there over the weekend with the comments visible, the PR's author woke up Monday, fixed the config, and we merged it before standup. No human review burden over the weekend, no fire on Monday, no apology meeting. That was my conversion moment for treating CI agents as part of the actual delivery pipeline rather than a toy in someone's sandbox.

This is the guide I wish I had on that Friday. We are going to wire up four real workflows you can copy-paste, talk honestly about what they cost, where they break, and what you should never trust them to do alone.

What we are actually building

The piece doing the heavy lifting is Anthropic's official action: anthropics/claude-code-action@v1. It's a GitHub Action you attach to any workflow trigger. It authenticates with an ANTHROPIC_API_KEY secret you set in your repo. Once it's in your workflow, it has access to your checked-out files and can do most of what Claude Code does locally: read code, write comments, open PRs, run shell commands, edit files, post status updates.

It's not magic. It's a subprocess running in a hosted Linux container that happens to be very good at reading code and writing English. Think of it as a contractor you hired for a few minutes per task. They are competent, they are fast, and they are not your senior engineer. Treat them like a junior with infinite patience and questionable judgment about edge cases. The quality of the action's output depends heavily on the project's CLAUDE.md rules file, which gets loaded into the agent's context on every run.

The obvious gotcha is cost. Every invocation is a metered API call. A medium PR review might be a quarter. A weekly dependency audit might be two dollars. If you don't put a budget cap on it, a runaway loop or a 4,000-file diff can absolutely ruin your week. We will get to budget management in the cost management section below.

Setup

Five steps, none of them clever. Do them in order. If you have never used Claude Code locally, start with our beginner's tutorial to get familiar with the basics before wiring it into CI.

  1. Create an API key at the Anthropic Console and add it to your repo as a secret named ANTHROPIC_API_KEY. Settings, Secrets and variables, Actions, New repository secret.
  2. Pick your billing model. Pro accounts work fine for personal projects, but they share the same rate limit pool as your local Claude Code use, so a noisy CI workflow can starve your editor mid-afternoon. For team repos, use a dedicated API key tied to a billing account with a hard monthly cap set in the console.
  3. Create the workflow directory if it does not exist: .github/workflows/. Each YAML file in here is its own workflow.
  4. Pick your trigger. The four big ones are pull_request (for PR review), pull_request_target (for label-driven actions on forks, with caveats), schedule (for cron jobs like the ones we cover in our Claude Code scheduled tasks guide), and issue_comment (for human-on-demand kickoffs like commenting /claude review). There is a small sketch of that last one right after this list.
  5. Pin the action version. Use @v1 for stability or @v1.4.2 for paranoid pinning. Do not use @main in production. I have watched a midnight upstream change break our entire review pipeline because someone trusted floating tags.
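
To make the issue_comment kickoff concrete, here is a minimal sketch of a comment-triggered review. The gating if expression and the job name are my own choices, and the action inputs simply mirror the review workflow below. One wrinkle worth a comment: on issue_comment triggers, checkout gives you the default branch, so reviewing the PR's actual head takes an extra step.

yaml
name: Claude On-Demand Review
on:
  issue_comment:
    types: [created]

jobs:
  on-demand-review:
    # Only run when a human comments "/claude review" on a pull request.
    if: github.event.issue.pull_request && contains(github.event.comment.body, '/claude review')
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      # Note: on issue_comment, checkout defaults to the default branch;
      # reviewing the PR head needs an explicit ref fetched separately.
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          mode: review
          model: claude-haiku-4-5
          max-budget: 0.50
          prompt: |
            A maintainer asked for a review by comment. Review the pull
            request this comment belongs to and post a summary.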

Now let's wire up the four workflows that earn their keep.

Workflow 1: Automated PR review

This is the one that converted me. It runs on every PR open and every push to a PR branch, reads the diff plus relevant context files, and posts inline comments and a top-level summary. Drop this in .github/workflows/claude-pr-review.yml:

yaml
name: Claude PR Review
on:
  pull_request:
    types: [opened, synchronize, reopened]
 
jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          mode: review
          model: claude-haiku-4-5
          max-budget: 0.50
          prompt: |
            Review this PR. Focus on correctness, security, and breaking
            changes in dependency bumps. Flag risk in the summary as
            LOW, MEDIUM, or HIGH. Do not nitpick formatting; we have a
            linter for that.

A few things worth noting about this YAML, because every line is there for a reason. fetch-depth: 0 matters because the action needs the full git history to diff against the base branch, not just the PR commit. The permissions block scopes the GITHUB_TOKEN to the bare minimum: read code, write to PRs. The max-budget: 0.50 is your seatbelt. If the agent burns through fifty cents on this PR, the workflow fails closed instead of running up a bill. The model: claude-haiku-4-5 choice is deliberate: Haiku is plenty smart for diff review and costs about a fifth of Sonnet.

Cost in the wild: across the last 200 PRs on a midsize TypeScript repo, our review workflow averaged $0.11 per PR. The 95th percentile was $0.34 (a 47-file refactor). The single most expensive run was $0.49, one cent under the budget cap, which is exactly what you want a budget cap to do. Treat that fifty cents as a non-negotiable safety rail.

The Friday-night PR from the intro cost twenty-two cents to review. The senior engineer it would have woken up bills out at $185 an hour. Napkin math: the action saved us roughly 800x its cost in opportunity terms, before counting the production incident it prevented. You can pair this with Claude Code hooks locally so that the same review patterns run pre-commit, catching issues before they even reach CI.

Workflow 2: Test generation for new code

The second workflow is more opinionated and you should treat it as a draft generator, not a finished product. It triggers on a label, looks for new functions without test coverage, and opens a follow-up PR with proposed unit tests. Save as .github/workflows/claude-test-gen.yml:

yaml
name: Claude Test Generation
on:
  pull_request_target:
    types: [labeled]
 
jobs:
  generate-tests:
    if: github.event.label.name == 'needs-tests'
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          mode: agent
          model: claude-sonnet-4-6
          max-budget: 2.00
          prompt: |
            Identify new exported functions in this PR that lack tests.
            For each, write a unit test in the matching __tests__/ file
            using the project's existing test framework. Open a new PR
            against this branch with the title "tests: cover new functions
            from #${{ github.event.pull_request.number }}".

Why Sonnet here and not Haiku? Test generation actually requires reasoning about what a function does, what its edge cases might be, and what mocks it needs. Haiku will produce tests that pass; Sonnet will produce tests that catch bugs. The budget jumps to $2.00 because writing eight or ten tests across a sprawling diff is a much heavier workload than a review.

Honest limit: this workflow writes the happy path beautifully and misses the dangerous stuff. It will test that divide(10, 2) returns 5. It will not test that divide(10, 0) throws, unless you specifically prompt for it, and even then it will sometimes generate a test that asserts the wrong error type. We treat the generated PR as a starting point: a junior engineer takes thirty minutes to review and add the edge cases the agent missed. That is still vastly cheaper than writing the boilerplate from scratch, but please do not auto-merge agent-generated tests. They give you false confidence in exactly the cases where you can least afford it.

For a deeper read on getting more reliability out of this kind of pattern, see our piece on Claude Code review with multi-agent PR pipelines.

Workflow 3: Documentation updates

Documentation drift is the silent killer of midsize codebases. You ship a feature, you forget to update the README, six months later a new hire spends a day debugging a config option that no longer exists. This workflow tries to keep that drift bounded. It runs on pushes to main, diffs the change, and opens a PR if any user-facing docs need an update. Save as .github/workflows/claude-docs.yml:

yaml
name: Claude Docs Sync
on:
  push:
    branches: [main]
 
jobs:
  update-docs:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          mode: agent
          model: claude-sonnet-4-6
          max-budget: 1.00
          prompt: |
            Diff HEAD against HEAD~1. If the change affects public
            APIs, CLI flags, environment variables, or config schemas,
            update README.md, CHANGELOG.md, and any matching files in
            docs/. Open a PR titled "docs: sync with ${{ github.sha }}"
            and mark it as a draft. Do nothing if no user-facing
            change is detected.

The "mark it as a draft" instruction is doing a lot of work here. The agent is confident. It is sometimes confident and wrong. I have watched it cheerfully update the README with a CLI flag that the underlying PR removed, because it pattern-matched on the function name and assumed the flag still existed. I have watched it write changelog entries in the wrong section. I have watched it invent a --verbose flag that does not exist because every CLI it has ever read has one.

Drafts give a human the chance to skim, fix the three lines that are wrong, and merge. Treat docs PRs the way you treat a recipe scribbled on a napkin by a friend who was three drinks in: probably right, occasionally inspired, definitely worth a sober second look before you serve it to guests.

If you want a cleaner pattern for this loop, our writeup on Claude Code auto-fixing failing PRs walks through how to keep the bot in a tighter feedback cycle so its mistakes get caught before they ship.

Workflow 4: Weekly dependency audit

The final workflow is the one that has saved us the most in straight monetary terms. It runs every Monday morning, executes whatever audit tool your stack uses, identifies safe bumps and CVE patches, and opens a single rollup PR with the upgrades and a written justification for each one. Save as .github/workflows/claude-deps.yml:

yaml
name: Claude Weekly Dependency Audit
on:
  schedule:
    - cron: '0 9 * * 1'
  workflow_dispatch:
 
jobs:
  audit:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          mode: agent
          model: claude-sonnet-4-6
          max-budget: 3.00
          prompt: |
            Run "npm audit --json" and "npm outdated --json".
            For each vulnerability or out-of-date package:
              1. Read the changelog between current and target version.
              2. Skip any major-version bump that has known breaking changes.
              3. Group all safe patches into one PR titled
                 "deps: weekly audit ${{ github.run_number }}".
              4. In the PR body, list each bump with a one-line risk note.
            Do not bump anything you cannot justify in writing.

The killer feature is the changelog read. Dependabot will happily open a PR bumping axios from 1.7.2 to 1.7.9. Claude will open the same PR but include a sentence: "Patches a regression in proxy handling on Node 20.10+. Low risk for our usage; we do not configure proxies." That sentence is the difference between a PR that gets merged in two minutes and a PR that sits in a queue for a week.

In our experience, this workflow has caught two CVEs before Dependabot even opened a PR for them, because it was reading the npm advisory database directly rather than waiting for GitHub's indexing lag. It cost us $2.40 that week. The senior engineer who would have triaged those advisories manually costs roughly $130 an hour. The math, as is becoming a theme, is favorable.

Cost management

Let's get specific about money, because the moment your CFO sees an Anthropic invoice with no context, you will be defending this whole project in a meeting.

Use Haiku for cheap parallel work like PR review, summary generation, and routing decisions. It's roughly a fifth of Sonnet's price and good enough for diff comprehension. Use Sonnet for code generation, test writing, and any task where being right matters more than being fast. Reserve Opus for hard reasoning tasks like architecture review or security audits where you genuinely need the deepest model. Most teams default to Sonnet for everything and burn 4x the budget they need to.
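
One cheap way to keep that tiering honest across a pile of workflow files, sketched under the assumption that you are comfortable with GitHub's configuration variables: read the model name from a repository variable instead of hard-coding it. The variable name here is mine, not something the action requires.

yaml
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          mode: review
          # Repository variable set under Settings > Secrets and variables >
          # Actions > Variables, e.g. CLAUDE_REVIEW_MODEL=claude-haiku-4-5.
          model: ${{ vars.CLAUDE_REVIEW_MODEL }}
          max-budget: 0.50
          prompt: |
            Review this PR as described in Workflow 1.

Flip the variable once in repo settings and every workflow that reads it follows, which is handy when a new model tier ships and you want to trial it on reviews before trusting it with code generation.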

Set hard budget caps with the max-budget option on every workflow, with no exceptions. The cap fails the workflow when hit, which is exactly the behavior you want. Better a failed CI run than a $400 surprise.

Cache where you can. The agent's API calls do not get cheaper, but a standard actions/cache step for your node_modules or virtualenv stops every run from re-installing dependencies from scratch, which trims runtime and the redundant setup work around each invocation.
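
A minimal sketch for a Node stack, using the stock actions/cache step keyed on the lockfile; swap the path and key for pip or poetry if that is what you run.

yaml
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - uses: actions/cache@v4
        with:
          # Cache npm's download cache, keyed on the lockfile, so repeat runs
          # skip most of the network work before the agent step starts.
          path: ~/.npm
          key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
          restore-keys: ${{ runner.os }}-npm-
      - run: npm ci
      # ...followed by the claude-code-action step from the audit workflow.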

Napkin math for a busy repo: 30 PRs per week at $0.15 average review cost is $4.50 a week, call it $19 a month. The weekly dependency audit at $2.50 a run is $10 a month. Daily docs sync at $0.40 average is $12 a month. Test generation triggered manually maybe 10 times a month at $1.20 each is another $12. Total: roughly $40 to $80 a month for a team shipping at a healthy clip. A junior reviewer at $35 an hour reaches that number in two hours of work. The break-even point is essentially a single triaged PR.

Cross-check these estimates against the pricing page in the Anthropic docs so the per-workflow numbers reflect current rates before you commit.

Security and permissions

This is the section where I would like to put a flashing red banner. Permissions are not an afterthought. Get them wrong and your repo is one prompt injection away from a bad week.

The GITHUB_TOKEN permissions block in each workflow should be the absolute minimum needed. For PR review: contents: read and pull-requests: write, nothing more. For workflows that open PRs: contents: write and pull-requests: write. For workflows that triage issues: add issues: write. Never grant id-token: write unless you specifically know why you need OIDC token issuance and have audited what the agent might do with it. That permission lets the workflow request federated identity tokens for external services like AWS or GCP, and an agent that decides to be creative with that capability can do real damage.
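
A habit that costs two lines: declare an empty permissions block at the top of the workflow so every scope defaults to none, then grant per job. GitHub applies the most specific block, so a job someone adds later without thinking gets nothing instead of whatever your org default happens to be. A sketch against Workflow 1:

yaml
name: Claude PR Review
# Default-deny at the workflow level; jobs opt back in individually.
permissions: {}

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    # ...steps exactly as in Workflow 1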

Branch protection should require human review on any PR opened by a bot account, including the agent. This is one click in GitHub repo settings: require a pull request review before merging, and require approval from someone other than the PR author. Do not give the agent merge permissions on your default branch. I cannot stress this enough: the agent should be allowed to propose, never to commit unilaterally to main.

Secrets are the other landmine. Do not echo secrets in your prompts. Do not include API keys in commit messages the agent might summarize. The action's logs are visible to anyone with read access to the workflow run, and if your prompt contains echo "Stripe key: $STRIPE_KEY", that key just leaked to anyone who can view the workflow logs. GitHub will scrub literal secret values from logs if they're stored as repo secrets, but anything synthesized by the agent itself bypasses that scrubber.

If you need a refresher on permissions semantics, the GitHub Actions documentation is surprisingly readable on this topic and worth a slow Sunday afternoon. For the broader question of how to block AI agents from merging PRs while still letting them propose changes, we wrote a dedicated guide covering branch protection rules, CODEOWNERS, and required reviewer workflows.

Honest limits and gotchas

We have shipped these workflows for ten months now. Here is the scar tissue.

The agent will be wrong about subtle things. It does not reason well about race conditions in multi-threaded code, type narrowing in TypeScript's flow analysis, or floating-point arithmetic edge cases. It will tell you a Promise.all is fine when actually one of the promises mutates state the others depend on. Do not use the agent as your sole reviewer for concurrent or numeric code.

It will hallucinate APIs from the wrong version of a library. This happens most often with packages that had a major rewrite, like react-router v6 to v7 or pydantic v1 to v2. The agent has training data from both versions and will sometimes produce a Frankenstein recommendation that uses v7 syntax against a v6 install. The fix is to include the relevant package.json or requirements.txt in the prompt context, but you will still see this fail occasionally.
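
The cheapest version of that fix is a standing instruction in the prompt itself. A sketch of the kind of clause we append to the review prompt; the wording is ours and nothing about it is special to the action.

yaml
          prompt: |
            Before commenting on any library API, read package.json (or
            requirements.txt) and note the installed major version of the
            package. Only suggest APIs that exist in that major version,
            and name the version you checked against in your comment.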

Long-running workflows hit GitHub's job timeout, which is six hours by default but often configured lower. If your prompt asks the agent to "refactor the entire monorepo," it will try, and it will fail at hour six leaving a half-finished branch. Budget your prompts the way you budget a sprint: small, scoped, completable in under thirty minutes of agent work.
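
You can enforce that budget mechanically with timeout-minutes on the job, which is worth setting on every agent job even if you never expect to hit it; the thirty here is our own ceiling, not a number from Anthropic.

yaml
jobs:
  review:
    runs-on: ubuntu-latest
    # Kill the job long before GitHub's six-hour default ceiling; an agent
    # task that genuinely needs more than 30 minutes got the wrong prompt.
    timeout-minutes: 30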

External contributors and forks are the trickiest case. A regular pull_request trigger does not have access to your secrets when the PR is from a fork, which is correct security behavior. pull_request_target does have secret access, which sounds convenient but means a malicious PR could potentially manipulate the agent into exfiltrating data. Only use pull_request_target with a label gate or some other human approval step, and audit the prompt for any way it could be coerced by the PR contents themselves. Anthropic's Claude Code documentation covers this pattern in detail, and it's worth the read before you turn anything on for an open-source repo.
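
If a label feels too easy to apply by accident, a deployment environment with required reviewers gives you an explicit approve button in the Actions UI. A sketch of that gate, assuming you have created an environment named claude-fork-runs in repo settings and added at least one required reviewer; the environment name is ours.

yaml
jobs:
  generate-tests:
    if: github.event.label.name == 'needs-tests'
    runs-on: ubuntu-latest
    # The job pauses here until a listed reviewer approves the run, because
    # the claude-fork-runs environment is configured with required reviewers.
    environment: claude-fork-runs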

For more day-to-day patterns and shortcuts that compound across all of these workflows, our 50 Claude Code tips collection has practical bits you can fold in.

Closing

The CI/CD system you have been wanting since 2018, the one that reviews your PRs at 3 AM, drafts your tests, keeps your docs from rotting, and patches your dependencies before Dependabot wakes up: it actually exists now. The friction is not technical anymore. The action is well-built, the API is stable, the docs are coherent. The friction is operational. It's about budgets, branch protection, prompt scoping, and trust calibration. Those are people problems, not engineering problems.

Here is my one ask. Do not try to set up all four of these workflows tomorrow. Pick the one that maps to your worst recurring annoyance. If your weekend PRs rot, do PR review. If your docs drift, do docs sync. If your dependency PRs pile up, do the weekly audit. Ship one. Watch it run for two weeks. Tune the prompt. Then add the next one.

The Friday-night-PR moment will come for you too. When it does, you will be glad you have something running that does not need a weekend off.

I am still figuring out what the right shape of agent-driven CI looks like at scale, and I would love to know what you ship. Tell me what worked, what burned a hole in your budget, what your team refused to let through. The interesting part of this story is just starting.
