I started tracking my time for two weeks because I was beginning to suspect myself.
It was a Thursday afternoon. I had just finished what felt like a ferociously productive day of vibe coding, the kind of session where you close the laptop and think I am operating at a different level now. I opened my task tracker to mark off what I had shipped. Two tickets. Both small. The rest of the day was a blur of accepted suggestions, tab key presses, and the warm hum of watching Claude type for me.
That is when the METR study started whispering in the back of my head. The AI productivity paradox, the one where experienced open source developers using AI tools were measurably 19 percent slower while reporting they were 20 percent faster. A 39 point gap between perception and reality. I had dismissed it the first time I read it. Not me. My workflow is different. It was something that happened to other people, to less careful developers.
Then I did the math on my week and went quiet.
The METR Study In Plain English
Here is what METR actually did, because the headline has been mangled in every summary I have read since February. They recruited sixteen experienced open source developers, gave them real tasks from their own repositories, and randomized whether they could use AI tools like Cursor with Claude on each task. Real work, real codebases, real stakes. Not a toy benchmark.
The developers predicted they would be 24 percent faster with AI. After the tasks were done, they reported they felt 20 percent faster. The stopwatch said they were 19 percent slower.
Think about the shape of that number. It is not that AI tools did nothing. The subjective experience pointed in the wrong direction entirely. The developers were not mistaken by a little. They were mistaken about the sign of the effect. The arrow pointed north and they were certain it pointed south.
The DEV post breaking down the paradox does a nice job with the methodology objections. Yes, the sample was small. Yes, the developers were senior and working in codebases they knew intimately. But the measurement was clean. The tasks were not cherry picked. The gap between felt time and measured time was too large to explain away.
I kept waiting for the rebuttal study. It has not come. What came instead was the Pragmatic Engineer 2026 tooling review, which walked around the paradox carefully and mostly declined to dispute it. Arguing with a stopwatch is hard.
Why This Happens Neurologically
I am not a neuroscientist. I want to mark that boundary clearly before I wade in. But I have read enough to know the shape of the answer, and the shape matters more than the citations.
Dopamine does not actually reward you for finishing things. This is the part everyone gets wrong. Dopamine rewards anticipation of reward. It fires when the slot machine is spinning, not when the coins hit the tray. The loudest hit comes between the pull of the lever and the moment you learn the outcome. Finishing is almost an anticlimax.
Vibe coding is a slot machine built for this exact wiring. You type a prompt. The cursor blinks. A response begins to stream in, character by character. You read the first few lines and feel something click. Yes. That is roughly what I meant. You accept the suggestion. The ticker resets. You type another prompt.
Every accepted suggestion is a pull of the lever. The dopamine hits before you have verified the code runs, before you have integrated it with anything, before you know if it solved the actual problem. You are being rewarded for starting, over and over, in a loop so smooth you stop noticing you have not finished anything in hours.
Compare that to the old way. You sat with a problem. You thought. You wrote a line of code. It broke. You fixed it. The reward came at the end, in a single thick chord rather than a drizzle of small chimes. It felt worse in the moment and better at the end of the day.
This is the mechanism under the METR number. The felt speed is real. It is measuring the wrong thing.
My Own Two Weeks Of Timing
So I tracked everything for two weeks. Not in a clever way. I opened a plain text file and wrote down start time, end time, task, and whether I had used Claude Code as the primary driver or as a sidekick. I did not change my workflow. I wanted baseline, not best case.
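If you want to run the same experiment, here is roughly the tally I ran at the end of each week. The log format below is invented for the example; mine was messier, but the shape was the same.

```python
from datetime import datetime

# Hypothetical log format, one task per line (mine was messier):
#   start | end | task | mode        where mode is "driver" or "sidekick"
LOG = """\
2026-01-12 09:10 | 2026-01-12 11:40 | auth refactor | driver
2026-01-12 13:00 | 2026-01-12 14:15 | flaky test fix | sidekick
"""

FMT = "%Y-%m-%d %H:%M"
totals: dict[str, float] = {}    # coding hours per mode

for line in LOG.strip().splitlines():
    start, end, task, mode = (field.strip() for field in line.split("|"))
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    totals[mode] = totals.get(mode, 0.0) + delta.total_seconds() / 3600

for mode, hours in sorted(totals.items()):
    print(f"{mode}: {hours:.1f} h")
```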
Here are the rough numbers. Take them with appropriate skepticism because I am a sample size of one and a biased observer.
Two weeks. Roughly 42 hours of actual coding time (I do not count meetings or thinking walks). I shipped eleven things that got merged. On the tasks I could compare to similar work I had done unassisted in March, I was slightly slower on average. Not 19 percent slower. More like 8 to 12 percent, though the error bars are enormous and I trust those numbers about as far as I can throw them.
But the subjective experience was wild. I felt like I was shipping at twice my normal rate. Twice. I would close the laptop at seven in the evening convinced I had just had the best day of my year, and then I would look at git log and see three commits, two of which were small.
Where did the time go? Roughly a third of the gap was me reviewing AI generated code that turned out to be subtly wrong. Another third was re-prompting because the first output missed a constraint I had not spelled out. The final third was a kind of drift. I would accept something 90 percent right, move on, and come back later because the 10 percent created a weird ripple in a file I had not been looking at. Debt, paid in small installments.
If I extrapolate, 42 measured hours at roughly ten percent slower than baseline hides about four hours of tax every two weeks. Over a year, that is roughly 100 hours I am paying to feel fast. A hundred hours is two and a half workweeks. I am still not sure if the trade is worth it, and I am suspicious of anyone who tells me they are sure.
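For anyone who wants to check that extrapolation, here is the arithmetic spelled out in code. The ten percent slowdown is my soft estimate, not a measured constant.

```python
measured_hours = 42          # actual coding hours per two-week window, with AI
slowdown = 0.10              # my soft estimate: work took ~10% longer with AI

# The 42 hours are the slowed-down hours, so the unassisted baseline
# is 42 / 1.10, not 42 * 0.90. The tax is the difference.
baseline = measured_hours / (1 + slowdown)       # ≈ 38.2 hours
tax_per_window = measured_hours - baseline       # ≈ 3.8 hours
annual_tax = tax_per_window * 26                 # 26 windows per year ≈ 99 hours

print(f"hidden tax: {tax_per_window:.1f} h per two weeks, {annual_tax:.0f} h per year")
```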
I might be wrong about all of this. The numbers are soft. But the direction matches METR, and the direction is the part that unsettled me.
The Scoped Versus End To End Gap
Here is where the story gets more interesting, because the paradox is not the whole picture. The paradox is what you see when you measure the wrong unit.
When I broke my tracking down into scoped tasks (write this function, refactor this component, add this test, translate this snippet from Python to Go), the AI was genuinely, measurably faster. Not 20 percent. More like 30 to 55 percent, which lines up with every internal benchmark I have seen from teams that measure honestly. On a scoped task, vibe coding is a jetpack.
The gains evaporate when you zoom out. The parts that matter at the project level are the parts AI cannot help with yet. Deciding what to build. Choosing which features to cut. Understanding why the previous engineer structured the module the way they did. Integrating three subsystems whose interfaces do not quite line up. Debugging production issues in code you did not write. These bottlenecks dominate real delivery timelines, and they are exactly the parts where AI either shrugs or produces plausible nonsense with confidence.
Think of a restaurant kitchen. Give a home cook a robot that dices onions at ten times human speed and you have not shortened dinner by much. Dicing onions was never the bottleneck. The bottleneck is deciding what to cook, timing the dishes so they all arrive warm, tasting the sauce. The robot makes the prep feel electric but the meal arrives at the same time, or maybe a little later because you stopped watching the stove while the robot worked.
Or think of a marathon. The sprint splits feel faster than ever. You are running the first quarter mile at a pace that would astonish your old coach. But the race is 26.2 miles, and the splits do not compound the way you think. The AI is brilliant at sprinting and almost useless at pacing, and the race is mostly pacing.
This is the honest reading of METR. Not that AI tools make developers slower, which is the misleading headline. The real finding is that AI tools make parts of the work faster while making the whole of the work slower, because the parts that get faster are not the parts that were limiting you. I wrote a version of this idea in Everyone Can Code Now. Nobody Can Ship. and I think about it every time I mistake a dopamine hit for velocity.
The jobs that benefit most from the AI shift are not the jobs of people who write code faster. They are the jobs of people who decide what to build, which I explored in Vibe Coding Is The New Product Management. If coding is no longer the bottleneck at the project level, then taste is.
The Compensating Practice
So what do you do if you want the scoped gains without paying the full price of the paradox? I have been experimenting. None of it is clean. All of it feels a little joyless at first, the way eating vegetables feels after a week of dessert.
The first practice is measuring. Not obsessively, but enough to have a reference point. Track how long a task takes end to end, from the moment you start thinking about it to the moment it is merged and green in CI. Do it for a week without AI and a week with AI. The felt difference and the real difference will not match. You are trying to re-calibrate your internal stopwatch against the real one.
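Here is a sketch of the calibration check I run at the end of a week, assuming you also jot down a felt estimate next to each measured duration. The task names and numbers are placeholders, not my data.

```python
# (task, felt hours, measured hours) -- placeholder values for illustration
tasks = [
    ("pagination bug", 1.0, 2.5),
    ("CSV export",     2.0, 3.0),
    ("schema tweak",   0.5, 0.75),
]

# Per-task gap between the internal stopwatch and the real one
for name, felt, measured in tasks:
    gap = (measured - felt) / felt * 100
    print(f"{name}: felt {felt:.1f} h, took {measured:.1f} h ({gap:+.0f}%)")

felt_total = sum(felt for _, felt, _ in tasks)
measured_total = sum(measured for _, _, measured in tasks)
print(f"internal stopwatch is off by {(measured_total / felt_total - 1) * 100:+.0f}% overall")
```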
The second practice is what I have started calling scope discipline. Use AI aggressively on tasks that fit in a single file or a single function. Back off when the work requires touching five files or understanding a new subsystem. This is not permanent. It is a guardrail until the tools get better. The tooling landscape moves fast, which is partly why I maintain the vibe coding tools comparison.
The third practice is brutal and I have been bad at it. Finish things before starting new things. The dopamine loop rewards starting, so you have to manually weight the scoreboard toward completion. One merged PR is worth three in progress. I keep a sticky note on my monitor that says MERGE BEFORE YOU BRANCH. It helps a little.
The fourth is the strangest. I have started doing one hour a day of coding with the AI turned off entirely. Not because it is more productive. Because it keeps my judgment calibrated. It is like a pianist practicing scales instead of just playing concerts. The scales are not the performance. The scales are what keeps the performance possible. When I go back to the AI, I catch bad suggestions faster.
I am still not sure about all of this. I have been running these experiments for maybe six weeks and my data is messy. The tools are getting better every quarter in ways that could invalidate the whole argument by July. Maybe the paradox is a temporary feature of this specific moment in tooling maturity.
But I think the underlying mechanism is not going anywhere. Dopamine will still reward anticipation more than completion. The bottlenecks at the project level will still be the parts AI cannot touch. The slot machine will still feel like a rocket ship, and the stopwatch will still disagree.
I would love to know what your two weeks of tracking looks like. Not the story you tell yourself at the end of a good day. The actual log, with honest numbers, and the small shock of discovering you were wrong about your own speed. Send me yours and I will send you mine.