I spent most of last Tuesday staring at a model selector, wrestling with the Claude Opus 4.6 vs Sonnet 4.6 question for the tenth time that day. Not writing code. Not debugging. Just staring at a dropdown, trying to decide whether a refactor I had been putting off for six weeks was Opus-hard or Sonnet-fast. The task was untangling a legacy billing module written by three different engineers over four years. I eventually picked Opus. It cost me about $14 in tokens and produced a plan I still use. If I had picked Sonnet, I probably would have spent $2 and gotten 80 percent of the insight.
That specific tradeoff, the one I face two dozen times a week, is the entire point of this piece. So let me try to give you the clearest possible breakdown, not as a benchmark fetishist but as someone who pays for both models out of his own research budget and notices, in a physical way, when the monthly bill surprises him.
I want to say upfront: I am skeptical of clean verdicts. Anyone telling you one model is universally better is selling something, usually a newsletter. The honest answer is that these two models are built for different moments in a workday, and choosing well is less about intelligence and more about matching the tool to the cut.
The decision I make twenty times a week
Here is what the decision actually looks like in practice. I open a terminal. I think about the task. I ask myself three questions.
Is this novel or routine? Routine means I have seen the shape of the problem before, probably ten times. Novel means I genuinely do not know where to start.
Will I be iterating quickly? A tight loop, prompt-response-edit-prompt, punishes latency. A slow deliberative session, where I am reading the output carefully and sketching on paper, does not.
Is the cost of being wrong high? If I am writing a throwaway migration script, wrong is cheap. If I am making an architectural call that the rest of the codebase will inherit for two years, wrong is expensive in a compounding way.
If the answers skew novel, slow, and high-stakes, I reach for Opus. If they skew routine, fast, and low-stakes, I reach for Sonnet. That is the whole heuristic. Everything below is just me showing my work.
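The heuristic is mechanical enough to write down. Here is a minimal sketch in Python, encoding only my personal rule of thumb, not any official routing recommendation; the function name and the strict all-three threshold are my own choices:

```python
def pick_model(novel: bool, tight_loop: bool, high_stakes: bool) -> str:
    """Route a task to a model tier using the three-question heuristic.

    novel:       I genuinely do not know where to start.
    tight_loop:  I will be iterating rapidly, so latency dominates.
    high_stakes: being wrong compounds for weeks, not minutes.
    """
    # Escalate to Opus only when the task skews novel, slow, and
    # high-stakes all at once; everything else defaults to Sonnet.
    if novel and high_stakes and not tight_loop:
        return "opus"
    return "sonnet"
```

So pick_model(True, False, True) escalates to Opus, and any other combination stays on Sonnet, which matches how my week actually shakes out.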
What each model is actually good at
Opus 4.6 is the deepest reasoner Anthropic ships. It is slower. It is more expensive. It thinks longer before it types, and when it does type, the output tends to hold together across multi-page arguments in a way that Sonnet sometimes does not. It is the model I reach for when I need the AI to hold twelve constraints in its head at once and still produce coherent architecture.
Sonnet 4.6, on the other hand, is the workhorse. It is the kitchen knife in a chef's roll, the one that gets used for ninety percent of cuts while the fancy ones sit in their sheaths. It is fast. It writes clean code. It follows instructions with surprising precision. Its context window, which reaches one million tokens in extended mode, means I can feed it an entire repository and ask it to audit for consistency. Opus caps at around 200K tokens, which sounds like a lot until you try to paste in a medium-sized Next.js project and realize you are three files short.
This context difference matters more than it gets credit for. The model that can see your whole codebase, even if it reasons slightly less deeply per token, often makes better decisions than the smarter model working from fragments. Knowing the territory beats knowing the map.
There is a line I heard from an old carpenter: the best saw is the one already in your hand. For most Claude Code sessions, that saw is Sonnet. The ergonomics of fast response time compound over a four hour coding session. You do not notice it in any single exchange, but by the end of the day, you have done more work with less cognitive friction. That is worth something the benchmark papers never measure.
Pricing: the napkin math that matters
Let me do this the way I think about it, with real numbers that I want you to treat as approximate rather than precise.
As of February 2026, rough published pricing (check Anthropic's pricing page for current rates, because these drift) sits around these ranges. Opus 4.6: approximately $15 per million input tokens and $75 per million output tokens. Sonnet 4.6: approximately $3 per million input tokens and $15 per million output tokens. So Opus costs roughly five times more per token, input or output. These are list prices. Enterprise contracts and batching can shift them.
Now the napkin math. Say a typical Claude Code session involves 50,000 input tokens (codebase context, prior messages) and 10,000 output tokens (code, explanations). That session on Sonnet costs: 0.05 × $3 + 0.01 × $15 = $0.15 + $0.15 = $0.30. Call it thirty cents.
The same session on Opus: 0.05 × $15 + 0.01 × $75 = $0.75 + $0.75 = $1.50. Call it a buck fifty.
The delta per session is $1.20. If you run ten sessions a day, every working day of the year, the yearly difference is 10 × $1.20 × 250 = $3,000 per engineer. That is not a rounding error. That is a meaningful budget item in a small team, and a line on a CFO's spreadsheet in a larger one.
But the math cuts both ways. A single good architectural decision that saves three days of wrong-direction coding is worth maybe $3,000 of engineer time by itself. So if Opus gives you one of those decisions per month that Sonnet would have missed, you break even. If it gives you two, it pays for itself twice over. The question is not whether Opus is worth it in the abstract. It is whether the specific task in front of you is the kind of task where deeper reasoning changes the outcome.
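If you want to rerun this napkin math with your own numbers, the whole calculation fits in a few lines. The token counts and per-million rates below are the approximate figures assumed above, not canonical prices; check Anthropic's pricing page before trusting them:

```python
def session_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Dollar cost of one session, given $/million-token rates."""
    return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

# Assumed list prices as of this writing: ($/M input, $/M output).
SONNET_RATES = (3.0, 15.0)
OPUS_RATES = (15.0, 75.0)

sonnet = session_cost(50_000, 10_000, *SONNET_RATES)  # roughly $0.30
opus = session_cost(50_000, 10_000, *OPUS_RATES)      # roughly $1.50
yearly_delta = (opus - sonnet) * 10 * 250             # ~$3,000 per engineer
```

Swap in your own average session sizes and cadence; the interesting output is the per-session delta, because the break-even question is whether that delta buys you enough better decisions.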
Benchmark snapshot: what the numbers say
I am going to be careful here because benchmarks are where writing about models goes to die. Numbers vary by evaluation, by prompt engineering, by month, and by who is reporting them. Treat everything below as rough and directionally useful, not gospel.
On SWE-bench Verified, a coding benchmark that tests agentic software engineering tasks, both models are at the frontier. Opus 4.6 holds a modest edge, something in the range of a few percentage points above Sonnet 4.6 in most published runs I have seen. On general reasoning benchmarks like GPQA Diamond, Opus pulls ahead more meaningfully, often by five to ten percentage points. On tasks that reward long-context coherence, Sonnet's million-token window actually flips the usual ordering; Opus is smarter but cannot see as much at once.
Anthropic's own documentation and engineering posts (see the engineering blog and model cards) position Sonnet 4.6 as the recommended default for most coding workflows. That is a meaningful signal from the vendor who presumably has the most honest data about when each model shines. Internal surveys of Claude Code users show something like seventy percent defaulting to Sonnet for daily coding sessions, with Opus reserved for planning, architecture, and research. I have not independently verified this number, and the real figure may be higher or lower, but it matches what I see in my own team.
The benchmark I trust most is the one I run myself. Every few weeks, I pick three representative tasks from my actual work, run them on both models, and compare the outputs. Sonnet wins on speed and consistency. Opus wins on novel problems and deep architectural thinking. Neither wins universally. Your mileage will vary, and I mean that as honest advice rather than a cop-out.
Three real tasks, three different picks
Let me walk through three concrete tasks from the past month, with the model I picked and why.
Task one: writing a set of unit tests for a new payment validation function. The function is small, maybe eighty lines. I know exactly what edge cases matter. I need tests fast because I want to commit before lunch. I picked Sonnet. Total cost: about $0.18. Total time: eleven minutes. Sonnet wrote twelve tests, caught two edge cases I had not explicitly mentioned, and produced code I committed with two minor edits. This is the meat-and-potatoes workflow where Sonnet shines. Speed compounds.
Task two: deciding how to restructure an authentication flow that had accumulated five years of workarounds. I needed the model to read through three related services, understand the historical decisions, and propose a migration path that would not break production. I picked Opus. Total cost: about $9. Total time: forty minutes of back and forth. Opus produced a four-phase migration plan with risk assessments for each phase. The plan correctly identified a subtle race condition that I had not even asked about. A week later, I compared what Sonnet would have produced on the same prompt. Sonnet's plan was competent but missed the race condition and proposed a phase ordering that would have caused a two-hour outage. This is exactly the kind of task where Opus earns its premium.
Task three: reviewing a pull request from a junior engineer. Routine. Medium stakes, because bad reviews compound into bad code. But I know what good code looks like in this codebase. I picked Sonnet. It gave me a tight list of five actionable comments and flagged one potential performance issue. I added my own thoughts on naming and shipped the review in under ten minutes. Opus would have been overkill. Worse, its slower response time would have pulled me out of the reviewing rhythm.
If you notice a pattern: Opus for the rare high-stakes decision, Sonnet for everything with a rhythm. That maps closely to how the 50 Claude Code tips I keep coming back to recommend organizing a workflow.
When Opus justifies its price
Opus is worth the premium in a narrow but important set of situations.
Architectural decisions that will outlive the current sprint. Choosing a database engine, a queue system, a boundary between services. These decisions are sticky. Being wrong costs you weeks later. Paying $5 for a better answer is obvious math.
Debugging across unfamiliar code. When I inherit a codebase and something is broken in a system I do not understand, Opus's deeper reasoning handles the ambiguity better. It holds more hypotheses in parallel before settling.
Research tasks where correctness matters more than speed. I do a lot of comparative analysis writing (hello, this article). For the kind of reasoning that needs to reconcile multiple sources and surface non-obvious connections, Opus tends to produce work I trust with fewer revisions.
Long multi-step planning. The cloud-based /ultraplan flow in Claude Code uses Opus precisely because planning is where reasoning depth pays the most per dollar. I wrote more about this in the ultraplan piece on cloud planning, if you want to see how that plays out in practice.
Rare hard decisions. There is a class of problem where you are not writing code at all, you are deciding whether to write code. Should we build this in-house or buy it? Should we refactor now or defer? Should we adopt this new framework? These are judgment calls, and judgment is where deeper reasoning tends to matter most.
When Sonnet is not just enough but better
This is the part most people get wrong. Sonnet is not a lesser Opus. For many tasks, it is actively the better choice, not merely the cheaper one.
Tight iteration loops. When you are prompting, reading, editing, and re-prompting ten times in a row, latency dominates. Opus's extra seconds per response compound into broken flow. Sonnet keeps the rhythm. You think faster with it because it keeps up with your thinking.
Code generation in a well-understood domain. Writing a REST endpoint, generating a schema migration, creating a React component. These are tasks where the structure is known, and the marginal value of deeper reasoning is small, while the marginal cost of waiting is real.
Test writing. I have watched Sonnet write more robust test suites than Opus for the same code, specifically because it writes them faster and I end up catching edge cases through iteration rather than single-shot reasoning.
Pair programming cadence. The Claude Code tutorial for beginners covers this workflow in more depth, but the short version is that the ideal pair is responsive, not ponderous. Sonnet feels like a teammate. Opus feels like an architect you booked for a consultation.
Anything with a million-token context requirement. Sonnet 4.6's extended context mode simply does things Opus cannot. If I need the model to audit an entire repository for consistency, Opus is not even in the conversation.
There is a reason chefs reach for a paring knife for most cuts, not a scalpel. The scalpel is sharper, but the paring knife is the tool that matches the hand.
My default rule of thumb
Here is the rule I have converged on after running both models in parallel for six months.
Default to Sonnet. Start every session with it unless you have a specific reason not to. The speed, the cost, the long context, the consistency, they all add up to the best daily experience.
Escalate to Opus when three conditions hit at once. First, the task is architectural or cross-cutting, meaning its decisions will ripple through the codebase. Second, you genuinely do not know the answer, which means this is novel territory rather than a variant of something you have done before. Third, the cost of being wrong is measured in days, not minutes.
If any of those three are missing, stay on Sonnet. If all three are present, the extra dollar or two is trivial compared to the cost of a bad decision.
And yes, sometimes I get it wrong. Sometimes I pick Opus and the problem turns out to be routine and I have overpaid by a factor of five. Sometimes I pick Sonnet and three hours later I realize I should have escalated. The art is calibrating your own intuition, and that only comes from running both models on enough tasks to feel the difference.
For what it is worth, the comparison in Cursor vs Claude Code 2026 has some adjacent thoughts on how model selection interacts with tooling choice, which matters more than most people admit. The IDE you use shapes which model feels natural.
One more piece of honest advice. Do not let model selection become procrastination. I have wasted entire afternoons optimizing my dropdown choice when I should have just picked something and started working. The meta-work on tool selection is only valuable up to a point. Beyond that, it is avoidance dressed up as diligence. Pick, work, adjust.
Closing
I keep coming back to that Tuesday afternoon with the billing module. Opus produced a plan I am still using. But the thing I remember most clearly is not the plan itself. It is the moment I decided to pick Opus. I paused. I thought about it. I asked the three questions. And then I committed.
That pause is the practice. Not the model choice, the pause. Everything downstream flows from deciding that this particular task deserved a moment of deliberation rather than a reflex.
So here is where I will leave you. Tomorrow, when you open Claude Code for your first session, spend five seconds on the model selector. Not five minutes. Just five seconds. Ask yourself the three questions. Commit to an answer. See what happens over a week of doing that deliberately. I suspect you will end up, like me, defaulting to Sonnet most of the time and escalating to Opus for the rare hard decisions. But the deeper win is not the specific ratio. It is the practice of matching the tool to the cut.
If you run this experiment for a week and learn something I did not, tell me. I keep rewriting this essay in my head every time I find a new edge case, and I would rather be wrong in good company than right alone. The conversation, as always, continues.
For reference while you experiment, the Claude Code documentation has the canonical commands for switching models mid-session, which I use more often than I expected to.