June 12, 2026

Writing Code vs. Shipping Code: The AI Productivity Paradox

AI has made coding much faster, but delivering finished software hasn’t sped up nearly as much. In a 2026 NBER study of over 100,000 developers, autonomous agents boosted commit activity by about 180%. Yet completed projects increased by only 50%, and real releases by just 30%. This is the core paradox: most of the gain from AI is lost at the same checkpoints as before (review, testing, and approval), which still depend on people. The real winners are teams that adapt their processes to this new reality, focusing on improving human-dependent stages rather than just generating more code.

One number from a recent NBER study sticks with me: teams using autonomous coding agents saw their commit activity jump by 180%. On paper, you’d expect finished projects and shipped releases to rise by about the same amount.

But they didn’t. Projects rose by roughly 50%, and releases by about 30% (Demirer, Musolff & Yang, 2026).

That gap is the heart of the AI productivity paradox. Once you notice it, the cause is clear: writing code was never the main bottleneck. AI accelerated an already fast part of the process, but the true slowdowns (review, testing, deployment, and approval) hardly changed, so shipping doesn’t speed up proportionally.
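To make that concrete, here is a rough sketch of the arithmetic. The stage durations are numbers I made up for illustration, and treating a 180% rise in commit activity as a 2.8x faster coding stage is my own simplification, not something the study claims.

```python
# Hypothetical hours per stage for one change. These are illustrative
# assumptions, not measurements from any study.
STAGE_HOURS = {
    "coding": 4.0,
    "review": 10.0,
    "testing": 6.0,
    "approval_and_deploy": 8.0,
}


def lead_time(stage_hours, coding_speedup=1.0):
    """Total hours when only the coding stage is divided by coding_speedup."""
    return sum(
        hours / coding_speedup if stage == "coding" else hours
        for stage, hours in stage_hours.items()
    )


baseline = lead_time(STAGE_HOURS)
# Assumption: 180% more commit activity is modeled as a 2.8x faster coding stage.
with_ai = lead_time(STAGE_HOURS, coding_speedup=2.8)
overall_speedup = baseline / with_ai

print(f"baseline: {baseline:.1f} h, with faster coding: {with_ai:.1f} h")
print(f"end-to-end speedup: {overall_speedup:.2f}x")
```

With these made-up durations, nearly tripling coding speed shortens the whole path by only about 10%, because the other 24 hours are untouched. Even an infinitely fast coding stage could not push this example past 28/24, or roughly 1.17x.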

The speedup is real (at the keyboard)

At the level of one developer, the gains aren’t in dispute. In a controlled study, developers using Copilot finished a build-an-HTTP-server task 56% faster than the control group (Peng et al., 2023). In a larger field experiment across Microsoft, Accenture, and a Fortune 100 company, about 4,900 developers completed roughly 26% more tasks with an AI assistant (Cui et al., 2024).

The generation effect also compounds. Across more than 100,000 GitHub developers tracked from 2022 to 2026, the three waves of tooling stack on top of one another: autocomplete alone adds about 40% to commit activity, sync agents bring the cumulative figure to roughly 140%, and async agents push it to 180% (Demirer, Musolff & Yang, 2026).

Pause on that last number, because part of it isn’t the developer moving faster. It’s commits the agents wrote themselves.

It really looks like the tools are getting better, not people getting better at using them. When the authors aligned early Claude Code adopters to calendar time, productivity rose with each new Opus release, rather than creeping upward as a learning curve would. The effect held across the full 30 weeks after adoption. The models were doing more of the work; the prompts weren’t just getting cleverer.

So, the keyboard is fast. But what happens after it?

Where the gains go to die

Follow the 180% downstream, and it bleeds out at every checkpoint until only 30% reaches a release.

If you pull out a single tool generation and look at it by itself, the funnel is even more brutal. Sync agents lifted lines of code by 741% and pull requests by 65%, but all that translated into just a 20% bump in releases. Autocomplete runs the same shape one floor down: 228% more code, 10% more releases (Demirer, Musolff & Yang, 2026).
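One crude way to read those funnels is to divide the release gain by the gain at the top of the pipeline. The sketch below does only that, using the percentages quoted above. The "pass-through" ratio is my own shorthand rather than a metric from the paper, and it compares gains measured in different units (commits, lines, releases), so treat it as a rough indicator only.

```python
# Percentage gains as quoted in the text (Demirer, Musolff & Yang, 2026).
GAINS = {
    "all three generations (cumulative)": {"commits": 180, "projects": 50, "releases": 30},
    "sync agents": {"lines of code": 741, "pull requests": 65, "releases": 20},
    "autocomplete": {"lines of code": 228, "releases": 10},
}


def pass_through(stages):
    """Release gain divided by the gain at the first (most upstream) stage."""
    first_stage = next(iter(stages))  # dicts keep insertion order
    return stages["releases"] / stages[first_stage]


for tool, stages in GAINS.items():
    print(f"{tool}: {pass_through(stages):.1%} of the upstream gain reaches releases")
```

By this measure, about a sixth of the cumulative commit gain shows up as releases, and for sync agents or autocomplete taken alone the share is in the low single digits.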

The authors call this a weak-link problem. The pipeline moves at its slowest, most human-dependent stage. They put the elasticity of substitution between AI and human effort at about 0.25 — low enough that the two act as near-complements rather than substitutes, which is why pouring in more machine power doesn’t clear a human bottleneck. The gains, in their words, get “attenuated by human bottlenecks in the production chain.”
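To get a feel for what an elasticity of 0.25 implies, here is a minimal sketch using a textbook CES (constant elasticity of substitution) production function. The functional form, the equal weights, and the unit inputs are my assumptions for illustration. The article does not give the paper's exact specification, so this is not the authors' model.

```python
def ces_output(ai, human, sigma=0.25, ai_share=0.5):
    """Textbook CES aggregate of AI and human effort.

    sigma is the elasticity of substitution; ai_share is an assumed weight.
    """
    rho = (sigma - 1.0) / sigma  # sigma = 0.25 gives rho = -3
    return (ai_share * ai**rho + (1.0 - ai_share) * human**rho) ** (1.0 / rho)


def ces_ceiling(human, sigma=0.25, ai_share=0.5):
    """Limit of ces_output as AI effort grows without bound (valid for sigma < 1)."""
    rho = (sigma - 1.0) / sigma
    return (1.0 - ai_share) ** (1.0 / rho) * human


baseline = ces_output(ai=1.0, human=1.0)
for multiplier in (2, 10, 100):
    gain = ces_output(ai=multiplier, human=1.0) / baseline - 1.0
    print(f"{multiplier:>3}x AI effort, same human effort: output +{gain:.1%}")

print(f"ceiling with human effort fixed: +{ces_ceiling(1.0) / baseline - 1.0:.1%}")
```

Under these assumptions, doubling AI effort lifts output by roughly 21%, and no amount of extra AI effort gets past about 26% while human effort stays fixed. That is the weak-link behavior in miniature: the gain saturates at whatever the human input allows.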

And it’s not just this one study. Google’s DORA team — the people behind the delivery metrics half the industry benchmarks on — keep landing on the same thing. Back in 2024, their data showed that the more AI a company adopted, the worse its delivery performance became: every 25% increase in adoption was associated with roughly 1.5% lower throughput and 7.2% lower stability. A year later, the picture had shifted. Teams had learned to use the tools, and throughput climbed back into positive territory — but the stability problem persisted. More AI still meant more breaking after release, not less. DORA’s own way of putting it is hard to argue with: AI makes you faster, and moving faster mostly shows you where you were already fragile. And their take on where the saved time goes will sound familiar by now — you write the code quicker, then spend the difference checking and verifying it.

The human stages AI can’t rush

Those bottlenecks aren’t abstract.

The most immediate is review. Spinning up a pull request takes seconds, but reading one takes a human who actually knows the codebase and is ready to think hard about it. Production speeds up, review doesn’t, and the queue just grows. The bottleneck shifts from typing to reviewing. Nobody really chooses that; it just happens.
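The queue effect is easy to sketch. The team below is hypothetical: reviewers clear 10 pull requests a day in both scenarios, and the only change is that pull requests opened per day rise from 10 to 16, loosely echoing the 65% rise in pull requests reported for sync agents.

```python
def review_backlog(opened_per_day, reviewed_per_day, days):
    """Open PRs waiting at the end of each day, starting from an empty queue."""
    backlog = 0
    history = []
    for _ in range(days):
        # The backlog cannot go negative: reviewers idle when nothing is waiting.
        backlog = max(0, backlog + opened_per_day - reviewed_per_day)
        history.append(backlog)
    return history


# Hypothetical numbers for illustration only.
before = review_backlog(opened_per_day=10, reviewed_per_day=10, days=10)
after = review_backlog(opened_per_day=16, reviewed_per_day=10, days=10)

print("backlog after 10 days, before AI:", before[-1])
print("backlog after 10 days, with AI: ", after[-1])
```

In this toy model the backlog goes from zero to 60 waiting pull requests in ten working days, and it keeps growing by six a day for as long as review capacity stays where it was.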

Quieter and slower to surface is what Storey (2026) calls cognitive debt: the team loses its shared understanding as people start approving code nobody truly gets. It shows up later as churn — code rewritten or ripped out not long after it lands. It looked like shipping, but a lot of it was just activity, not real progress.

Validation is the one teams least expect to pay for, and it never goes away. Almost no one ships model output without checking it, so the time you saved writing comes right back: double-checking edge cases, re-prompting, chasing down the weird bugs the model missed. The faster you get a draft, the more of this work lands on you.

Under all of it, in regulated work, sits a floor you can’t automate away. Security review, licensing, compliance: human by design, and a hard ceiling on how fast anything reaches production, regardless of how quickly it was written. That’s most of what I see.

What AI is actually good for

None of this means the tools are oversold. They’re good at a specific kind of work, and the teams getting value are those aiming them there.

Mostly, it’s the expensive busywork: config files, service scaffolding, schemas, or the first version of a test suite. What once took twenty minutes digging through docs now takes just twenty seconds.

It’s also flow. Looking up an API pattern without leaving the editor sounds trivial until you factor in context switching, one of the highest hidden costs in a developer’s day.

And it’s ramp-up: a huge share of engineering is reading code you didn’t write, and an explanation on demand compresses the slowest, least skippable part of the job.

The thread is the same through all three. Effort shifts from mechanical work to judgment, which is the version of these tools worth having.

Juniors get the speed, seniors get the leverage

The distribution of gains surprised me most (Cui et al., 2024).

Juniors get the biggest task-level speedups. Tools fill knowledge gaps, working code arrives faster, and the day feels more productive. But they’re more likely to miss architectural mistakes or integration problems, since spotting those requires experience.

Seniors get less raw speed and far more of the project-level payoff, not because they use the tools more, but because they use them better. A strong mental model lets them bin a bad suggestion on sight, keep the good parts, and shape the rest into something that holds.

The tools amplify expertise but don’t supply it.

So what do you actually do

If you point AI solely at code generation, your pipeline will quietly eat those gains. The teams that convert coding speed into shipping speed do a few things differently, and none of them is technical.

They push automation beyond the editor, extending it into testing, docs, security analysis, and operational glue, rather than stopping at code alone. They hire and coach for evaluation, not for generation, because the scarce skills now are review and architectural judgment, not raw output.

The move that matters most also gets skipped most: treating the PR process as the bottleneck it’s become. That means real investment in automated testing and triage so human review doesn’t harden into a permanent traffic jam.

Then there’s measurement. Track the far end of the line (DORA metrics, change-failure rate, rework, churn) and stop rewarding commit counts and line totals that AI inflates by design.
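As a rough sketch of what tracking the far end can look like, the snippet below computes two of those numbers, change-failure rate and deployment frequency, from a deployment log. The `Deployment` record, its fields, and the sample log are hypothetical. A real team would pull this from its CI/CD and incident tooling, and may define a "failure" differently.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    day: date
    caused_incident: bool  # True if the release needed a rollback or hotfix


def change_failure_rate(deployments):
    """Share of deployments that led to a failure in production."""
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d.caused_incident)
    return failures / len(deployments)


def deployments_per_week(deployments):
    """Average deployments per 7 days over the observed span (inclusive)."""
    if not deployments:
        return 0.0
    days = [d.day for d in deployments]
    span_days = (max(days) - min(days)).days + 1
    return len(deployments) / (span_days / 7)


# Hypothetical deployment log for illustration only.
log = [
    Deployment(date(2026, 6, 1), False),
    Deployment(date(2026, 6, 3), True),
    Deployment(date(2026, 6, 8), False),
    Deployment(date(2026, 6, 10), False),
    Deployment(date(2026, 6, 14), False),
]

print(f"change-failure rate: {change_failure_rate(log):.0%}")
print(f"deployments per week: {deployments_per_week(log):.1f}")
```

Neither number moves when an agent writes more lines or opens more pull requests. Both move only when something actually ships and holds up.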

Here’s the part I’d underline. The biggest productivity gains go to teams that redesign their workflow around the reality that human understanding, review, and accountability remain the final gate for shipping software, no matter how much AI accelerates code generation.

Shipping isn’t even the finish line. When the same researchers checked four app marketplaces — the Chrome Web Store, Google Play, and SourceForge among them — they found more apps but no growth in total usage. The extra supply just competed for the same finite attention.

AI has made code generation cheap, but turning that increased output into shipped, meaningful products is fundamentally an organizational challenge. Creating value still depends on processes built around human oversight, not just technical improvements.

FAQ

Does AI actually make software teams ship faster?

Somewhat, but far less than the coding numbers suggest. Commit activity climbs about 180% with autonomous agents, while actual releases rise roughly 30%. The headline productivity figure measures the wrong end of the pipeline.

Why do AI coding tools not translate into faster shipping?

Because shipping isn’t bottlenecked on typing. It is bottlenecked by review, testing, integration, and security and compliance sign-off — all still human-paced. Speeding up one stage of a chain barely moves the total when the stages after it don’t budge.

Who benefits most from AI coding tools, juniors or seniors?

They benefit differently. Juniors get a larger raw speed-up because the tools fill knowledge gaps, but they’re more likely to miss flaws in the model’s output. Seniors get more of the project-level payoff because they can tell a good suggestion from a bad one and shape the output into something that holds.

What should engineering leaders measure instead of lines of code?

Outcomes, not output. DORA metrics, change-failure rate, and rework or churn track whether software actually ships and holds up. Commit counts and lines of code mostly reward inputs that AI inflates.

Final takeaway

The teams seeing the highest returns aren’t the ones generating the most lines of code. They’re the ones redesigning their workflow around effective human-AI collaboration. AI accelerates code creation for free; converting that into faster releases is an organizational problem, not a technical one.

References

  • Cui, Z. (Kevin), Demirer, M., Jaffe, S., Musolff, L., Peng, S., & Salz, T. (2024). The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers. SSRN 4945566 (later in Management Science).
  • Demirer, M., Musolff, L., & Yang, L. (2026). Writing Code vs. Shipping Code: Productivity Effects Across Generations of AI Coding Tools. NBER Working Paper No. 35275. https://doi.org/10.3386/w35275
  • Peng, S., Kalliamvakou, E., Cihon, P., & Demirer, M. (2023). The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv:2302.06590.
  • Shaw, S. D., & Nave, G. (2026). Thinking — Fast, Slow, and Artificial: How AI Is Reshaping Human Reasoning and the Rise of Cognitive Surrender. Wharton Research Paper, SSRN.
  • Storey, M.-A. (2026). From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI. arXiv:2603.22106.