
Broken Metrics: Measuring AI Productivity in Web3

Lines of code, velocity, and commit frequency were already poor proxies for developer output. AI-assisted coding in Web3 makes them actively misleading.


TL;DR:

  • METR's 2025 study found experienced open-source developers worked 19% slower when using AI tools, despite self-reporting an expected 20-25% speed improvement before the study began
  • Research covering 121,000 developers across 450+ companies found productivity gains have plateaued at roughly 10%, even as 92.6% of developers now use AI coding assistants at least monthly
  • Lines of code, the most commonly cited productivity proxy, becomes meaningless when AI can generate hundreds of lines in seconds, and actively misleading when those lines require hours of validation
  • Web3 development compounds the measurement problem because smart contract code carries irreversible financial consequences, making validation time, security review cycles, and audit overhead invisible in standard metrics
  • The SPACE framework offers a more complete model for measuring developer productivity but requires significant adaptation before it maps accurately onto AI-augmented blockchain development workflows
  • Developers with more than 50 hours of practice with AI tools show meaningfully different productivity profiles than those with less experience, which means aggregate team metrics obscure the actual distribution of outcomes
  • The hidden costs of AI coding agents, including token consumption, context management, and the cognitive overhead of reviewing generated code, rarely appear in standard productivity calculations

The result: Measuring AI coding productivity in Web3 requires a framework built around outcomes rather than output, and the industry is only beginning to build it.

The Measurement Problem Nobody Wants to Admit

There is a particular kind of organizational discomfort that sets in when a widely adopted tool fails to produce the results that justified its adoption. That discomfort is now settling across engineering teams everywhere, and it is most acute in Web3, where the stakes attached to every line of deployed code are higher than in almost any other domain. The problem is not that AI coding tools are failing developers. The problem is that the metrics being used to evaluate them were never designed for this kind of work, and applying them to AI-augmented development in blockchain environments produces numbers that are simultaneously misleading and politically inconvenient.

The numbers that do exist are striking in their ambiguity. Research presented by Laura Tacho, CTO at DX and a recognized authority on developer productivity measurement, drew on data from 121,000 developers across more than 450 companies. The findings showed that 92.6% of those developers use an AI coding assistant at least once a month, and roughly 75% use one weekly. That is not a niche adoption pattern. That is a fundamental shift in how software gets written. And yet the productivity gains from that shift have plateaued at approximately 10%, with developers reporting time savings of around 4 hours per week, a figure that has remained essentially flat since the second quarter of 2025. The gap between adoption rate and productivity impact is not a sign that AI tools are broken. It is a sign that the measurement framework is.

What makes this particularly relevant for Web3 teams is that blockchain development has always operated under a different set of constraints than traditional software engineering. Deployment is irreversible. Bugs are not patched with a hotfix pushed to a server. They are exploited, and the consequences are measured in dollars lost from protocol treasuries, not in support tickets and incident reports. That context changes everything about how productivity should be defined, and it exposes just how poorly the standard metrics translate.

Lines of Code Was Already a Bad Metric

The critique of lines of code as a productivity proxy is not new. Engineers have been making this argument for decades, and the counterarguments are well-worn enough that most engineering leaders have at least nominally moved past it. But nominally is doing a lot of work in that sentence. In practice, lines of code, commit frequency, pull request volume, and story points remain the dominant signals in most engineering dashboards, because they are easy to collect, easy to visualize, and easy to present to stakeholders who want a number they can track over time. The fact that these numbers measure activity rather than value has always been a known limitation. AI-assisted coding does not just preserve that limitation. It inverts the signal entirely.

When a developer using GitHub Copilot or a similar tool accepts a suggested block of code, the lines of code metric goes up. When that same developer spends the next two hours reading through the generated output, cross-referencing it against the protocol specification, checking for edge cases in the business logic, and running it through a static analysis tool, none of that time appears in any standard productivity metric. The output number increases. The effort number stays invisible. In traditional software development, this distortion is annoying. In smart contract development, where a single unchecked edge case in a token transfer function can drain a liquidity pool, the distortion is dangerous, because it creates organizational pressure to ship faster based on metrics that are measuring the wrong thing.

Nicole Forsgren, whose work on the DORA metrics and the SPACE framework has shaped how the industry thinks about developer productivity, has been direct about this problem. When developers are pair programming with AI assistants, the question of what lines of code even means becomes genuinely difficult to answer. The authorship model breaks down. The effort model breaks down. What remains is the need to measure value creation, which requires a fundamentally different set of instruments than the ones most teams currently have in place.

What the METR Study Actually Tells Us

The study published by METR in mid-2025 generated significant attention, and most of the coverage focused on the headline finding: experienced open-source developers worked 19% slower when using early-2025 AI tools compared to working without them. That number was widely cited as evidence that AI coding tools are overhyped, and the Hacker News thread discussing it accumulated nearly 500 comments from engineers sharing their own experiences on both sides of the debate. But the more interesting finding in the study is not the 19% slowdown. It is the gap between expectation and reality.

Before the study began, participants predicted they would be approximately 20 to 25% faster with AI assistance. They ended up 19% slower. That is not a small miscalibration. That is a near-complete inversion of the expected outcome, and it points to something important about how developers currently understand their own workflows. The tasks where AI tools genuinely accelerate work (boilerplate generation, documentation, test scaffolding for well-understood patterns) tend to feel significant in the moment. The tasks where AI tools create friction (context management, output validation, debugging generated code that fails in non-obvious ways) tend to feel like normal development overhead rather than AI-specific costs. The result is a systematic bias toward overestimating AI's contribution.

For Web3 developers specifically, this bias has a compounding effect. Smart contract development involves a higher proportion of tasks that fall into the second category. Writing a Solidity function that correctly implements a complex DeFi invariant is not a task where accepting the first AI suggestion and moving on is a reasonable workflow. The validation work is the work. And when that validation work is invisible in the productivity metrics, teams end up with dashboards showing strong output numbers while the actual quality and security of the codebase quietly degrades.

The 10% Plateau and Why It Persists

The finding that productivity gains from AI tools have plateaued at roughly 10% is worth sitting with for a moment, because it is both more and less encouraging than it appears. On one hand, a 10% productivity improvement sustained across a team of ten engineers is equivalent to adding a full engineer's worth of output without adding headcount. That is not nothing. On the other hand, the plateau itself is the problem. If AI adoption has reached 92.6% of developers and the productivity gain has stopped growing, something structural is limiting further improvement, and that structure is almost certainly the measurement and workflow layer rather than the tools themselves.

The data from Laura Tacho's research shows that the amount of AI-authored code, meaning code generated by AI that actually gets merged into production, has continued to increase even as the productivity plateau has held. That divergence is significant. More AI-generated code is shipping, but it is not translating into proportionally more developer output as measured by the metrics teams are using. One explanation is that the validation and review overhead for AI-generated code is growing at roughly the same rate as the generation speed, keeping the net productivity gain flat. Another explanation is that the metrics themselves are not capturing the right outputs, and the actual productivity improvement is larger than 10% but is showing up in dimensions that standard dashboards do not measure, such as code quality, reduced defect rates, or faster onboarding for new team members.

In Web3, the plateau problem is further complicated by the audit cycle. Most production smart contract deployments require at least one external security audit before going live, and many require two. Audit timelines typically run four to eight weeks for a moderately complex protocol. If AI tools are accelerating the initial development phase but the audit cycle remains constant, the overall time-to-deployment metric barely moves. Teams that measure productivity by deployment frequency will see minimal improvement even if their development velocity has genuinely increased, because the bottleneck has shifted rather than been eliminated.
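The bottleneck effect is easy to see with a minimal sketch. The durations and speedup below are hypothetical, chosen only to illustrate the shape of the problem, not drawn from real project data:

```python
# Illustrative sketch: why a fixed audit cycle caps time-to-mainnet gains.
# All durations are hypothetical examples, measured in weeks.

def time_to_mainnet(dev_weeks: float, audit_weeks: float, speedup: float) -> float:
    """Total delivery time when AI accelerates only the development phase."""
    return dev_weeks / speedup + audit_weeks

baseline = time_to_mainnet(dev_weeks=6, audit_weeks=6, speedup=1.0)  # 12 weeks
with_ai = time_to_mainnet(dev_weeks=6, audit_weeks=6, speedup=2.0)   # 9 weeks

# A 2x development speedup cuts total delivery time by only 25%,
# because the audit phase is untouched.
improvement = 1 - with_ai / baseline
```

Doubling development velocity moves overall delivery by a quarter at best in this scenario, which is why deployment-frequency dashboards can look flat even when the development phase has genuinely sped up.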

Why Web3 Makes Everything Harder to Measure

Traditional software development has a relatively forgiving error model. A bug in a web application gets reported, triaged, fixed, and deployed in a patch. The cost is measured in engineering hours and user frustration. The system recovers. Smart contract development operates under a fundamentally different error model, one where the cost of a bug is not the time to fix it but the value that can be extracted from it before anyone notices. The Moonwell DeFi protocol lost $1.78 million to an exploit traced to vulnerable code. The Euler Finance exploit in 2023 resulted in approximately $197 million in losses before a negotiated return. These are not edge cases. They are the expected outcome when security gaps reach production in a system where transactions are irreversible and value is directly accessible.

This error model changes what productivity actually means in a Web3 context. A developer who ships 500 lines of Solidity in a day and introduces a reentrancy vulnerability has not been productive. They have created a liability. A developer who ships 200 lines of Solidity in a day, runs it through Slither and Echidna, writes comprehensive invariant tests, and documents the security assumptions clearly has been genuinely productive, even though every standard metric would rank the first developer higher. The measurement framework needs to account for the cost of defects, not just the rate of output, and in Web3 the cost of defects is denominated in a currency that makes the accounting very concrete.
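The contrast between the two developers can be made concrete with a defect-cost-adjusted sketch. The per-line value and per-defect cost below are illustrative assumptions, not calibrated figures:

```python
# Hypothetical defect-cost-adjusted view of the two developers described above.
# value_per_line and cost_per_defect are illustrative assumptions.

def net_value(lines_shipped: int, value_per_line: float,
              escaped_defects: int, cost_per_defect: float) -> float:
    """Output credit minus the expected cost of defects that reach mainnet."""
    return lines_shipped * value_per_line - escaped_defects * cost_per_defect

# Developer A: 500 lines, one reentrancy vulnerability reaches production.
dev_a = net_value(500, value_per_line=10.0,
                  escaped_defects=1, cost_per_defect=100_000.0)

# Developer B: 200 lines, fully validated, nothing escapes.
dev_b = net_value(200, value_per_line=10.0,
                  escaped_defects=0, cost_per_defect=100_000.0)

# dev_a is deeply negative; dev_b is positive, inverting the lines-of-code ranking.
```

Any plausible choice of constants produces the same inversion, because in Web3 the cost of an escaped defect is orders of magnitude larger than the value of the raw output.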

There is also the question of what counts as done in a smart contract context. In traditional software, a feature is done when it passes tests and gets merged. In Web3, a feature is done when it has been tested, audited, deployed to a testnet, monitored for a period, and finally deployed to mainnet with appropriate access controls and upgrade mechanisms in place. The gap between merge and mainnet can span weeks or months, and all of the work that happens in that gap is invisible to metrics that measure at the repository level. AI tools that accelerate the merge step without addressing the post-merge pipeline do not actually compress the delivery timeline in any meaningful way.

The Hidden Costs That Never Show Up in Dashboards

Cyfrin's analysis of AI coding agents identified a cost structure that most productivity discussions ignore entirely. Running an AI coding agent on a complex task is not free. Token consumption for a multi-step agentic workflow on a non-trivial smart contract can run into the tens of thousands of tokens per session, and at current API pricing that translates to real costs that accumulate quickly across a team. More importantly, the cognitive overhead of managing an AI agent (providing context, reviewing intermediate outputs, correcting course when the agent goes in the wrong direction, and validating the final result) is itself a form of work that does not appear in any standard productivity metric.
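A back-of-envelope cost model makes the accumulation visible. The token counts and per-million-token prices below are placeholder assumptions, not any vendor's actual pricing:

```python
# Back-of-envelope session cost for an agentic workflow.
# Token counts and per-million-token prices are placeholder assumptions,
# not real vendor pricing.

def session_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """USD cost of one agent session at assumed per-million-token rates."""
    return ((input_tokens / 1e6) * usd_per_m_input
            + (output_tokens / 1e6) * usd_per_m_output)

# One multi-step agent session on a non-trivial contract (assumed figures).
per_session = session_cost(60_000, 20_000,
                           usd_per_m_input=3.0, usd_per_m_output=15.0)

# Two developers, five sessions a day, twenty working days a month.
monthly = per_session * 2 * 5 * 20
```

For a two-person team the monthly figure is no longer a rounding error, which is exactly the budgeting decision the surrounding paragraph describes.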

This hidden cost problem is particularly acute for smaller Web3 teams, which represent the majority of the ecosystem. A two-person team building a DeFi protocol does not have the luxury of treating AI agent costs as a rounding error in the infrastructure budget. They need to make deliberate decisions about when to use agentic workflows versus simpler autocomplete-style assistance, and those decisions require a level of tooling awareness that most productivity frameworks do not support. The question is not just whether AI tools make developers faster. It is whether the specific configuration of AI tools being used is appropriate for the specific task at hand, and whether the cost of using those tools is being accounted for in the productivity calculation.

Context management is another hidden cost that compounds in blockchain development specifically. Smart contracts exist within ecosystems of other contracts, oracles, governance mechanisms, and off-chain infrastructure. An AI tool that lacks context about the broader protocol architecture will generate code that is locally correct but globally broken, and the developer will spend significant time discovering and correcting that mismatch. The time spent managing context, providing relevant background to the AI, and verifying that generated code integrates correctly with the existing system is real work, and it scales with protocol complexity in ways that simple task-completion metrics do not capture.

The Learning Curve Is Real and It Matters

One of the most practically useful findings in the available research is the relationship between AI tool experience and productivity outcomes. The developer in the METR-adjacent study who had accumulated more than 50 hours of practice with AI coding tools showed a 20% productivity improvement, while less experienced users showed the opposite pattern. That gap is not a minor statistical artifact. It represents a fundamentally different workflow, one where the developer has internalized which tasks to delegate to AI, how to prompt effectively for the specific domain they are working in, and how to validate AI output efficiently rather than exhaustively.

For Web3 teams, this learning curve has a domain-specific dimension that general-purpose AI tool training does not address. Learning to use GitHub Copilot effectively for React development does not automatically transfer to using it effectively for Solidity development. The patterns are different, the failure modes are different, and the validation requirements are different. A developer who is highly proficient with AI assistance in a traditional software context may still be in the early, slower phase of the learning curve when they start applying those tools to smart contract work. Aggregate team productivity metrics that do not account for this domain-specific learning curve will systematically underestimate the potential upside and misattribute current underperformance.

This also means that team-level productivity metrics can be deeply misleading when the team has a mixed distribution of AI tool experience. If three developers on a six-person team are past the 50-hour threshold and three are not, the aggregate productivity number will look flat or slightly negative, even though the experienced half of the team may be showing strong individual gains. Engineering leaders who see that flat aggregate number and conclude that AI tools are not working for their team may pull back on investment at exactly the moment when the rest of the team is approaching the inflection point where gains start to compound.
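The masking effect is easy to reproduce. The team composition below is hypothetical, with per-developer changes mirroring the study figures cited above (+20% past the 50-hour threshold, -19% before it):

```python
# How a mixed experience distribution hides real gains in the aggregate.
# Team composition is hypothetical; the percentages mirror the cited findings.
from statistics import mean

# Three developers past the 50-hour threshold, three still early on the curve.
team_changes = [0.20, 0.20, 0.20, -0.19, -0.19, -0.19]

aggregate = mean(team_changes)        # ~0.005: looks flat at the team level
experienced = mean(team_changes[:3])  # 0.20: the real signal, hidden by averaging
```

The aggregate rounds to roughly half a percent, the kind of number that gets an AI tooling budget cut at precisely the wrong moment.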

What a Better Framework Actually Looks Like

The SPACE framework, developed by Nicole Forsgren and colleagues, offers a more complete model for developer productivity by measuring across five dimensions: Satisfaction and wellbeing, Performance, Activity, Communication and collaboration, and Efficiency and flow. Even before AI tools became ubiquitous, SPACE represented a significant improvement over single-metric approaches because it acknowledged that developer productivity is multidimensional and that optimizing for one dimension at the expense of others produces poor outcomes. In an AI-augmented development environment, the framework needs further adaptation, but it provides a useful starting point.

For Web3 specifically, the Performance dimension of SPACE needs to be anchored to outcomes that reflect the actual risk profile of blockchain development. Code quality metrics should include static analysis pass rates from tools like Slither, Mythril, and Echidna. Security review cycle time should be tracked as a first-class metric, not as an afterthought. Defect escape rate, meaning the proportion of bugs that make it past internal review to external audit or, worse, to mainnet, should be weighted heavily because the cost of a defect that escapes to mainnet is orders of magnitude higher than one caught in development. These are not exotic metrics. They are the natural outputs of a development process that takes the irreversibility of smart contract deployment seriously.

The Efficiency and flow dimension of SPACE is where AI tools have the most direct impact, and also where the measurement is most nuanced. Flow state, the condition where a developer is working at full cognitive engagement without interruption, is genuinely valuable and genuinely difficult to quantify. AI tools that reduce context-switching, surface relevant documentation without requiring a browser tab, and handle boilerplate generation without breaking the developer's train of thought contribute to flow in ways that do not show up in commit counts or story point velocity. Measuring this requires developer experience surveys, time-in-flow estimates, and qualitative feedback, none of which are as clean as a number on a dashboard, but all of which are more honest about what is actually happening.

Adapting SPACE for Blockchain Development Workflows

Translating SPACE into a practical measurement system for a Web3 engineering team requires making some concrete choices about what to instrument and how. On the Activity dimension, the relevant signals are not raw commit frequency but rather the ratio of AI-assisted to manually written code, the acceptance rate for AI suggestions, and the time between first draft and final review for AI-generated contract functions. These signals, taken together, give a picture of how effectively the team is integrating AI into the development workflow rather than just whether they are using it at all.
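A minimal sketch of how those Activity signals might be structured follows. The field names and sample numbers are assumptions; real instrumentation would pull them from IDE telemetry and code review tooling:

```python
# Sketch of the Activity signals described above, with assumed field names.
from dataclasses import dataclass

@dataclass
class ActivitySignals:
    ai_assisted_lines: int
    manual_lines: int
    suggestions_shown: int
    suggestions_accepted: int
    draft_to_review_hours: float  # first AI draft -> final approved review

    @property
    def ai_code_ratio(self) -> float:
        """Share of merged code that was AI-assisted."""
        total = self.ai_assisted_lines + self.manual_lines
        return self.ai_assisted_lines / total if total else 0.0

    @property
    def acceptance_rate(self) -> float:
        """Fraction of AI suggestions the developer actually kept."""
        return (self.suggestions_accepted / self.suggestions_shown
                if self.suggestions_shown else 0.0)

# One developer-week of hypothetical telemetry.
week = ActivitySignals(ai_assisted_lines=1200, manual_lines=800,
                       suggestions_shown=400, suggestions_accepted=180,
                       draft_to_review_hours=30.0)
```

Tracked together over time, a rising `ai_code_ratio` alongside a stable acceptance rate and a shrinking draft-to-review interval suggests genuine integration rather than indiscriminate acceptance.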

On the Communication and collaboration dimension, Web3 teams have a unique signal available to them that traditional software teams do not: the audit finding rate. External auditors produce structured reports that categorize findings by severity, and tracking how that finding rate changes over time as AI tools are adopted gives a direct measure of whether AI assistance is improving or degrading the security quality of the code being produced. A team that is shipping more code faster but accumulating more high-severity audit findings is not more productive in any meaningful sense. A team that maintains or improves its audit finding rate while increasing throughput is demonstrating genuine productivity improvement.
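One way to operationalize that signal is a severity-weighted finding rate normalized per thousand lines audited. The severity weights and audit data below are illustrative assumptions, not an industry standard:

```python
# Severity-weighted audit finding rate per thousand lines audited.
# Weights and sample audit data are illustrative assumptions.

SEVERITY_WEIGHTS = {"critical": 10, "high": 5, "medium": 2, "low": 1}

def finding_rate(findings: dict, lines_audited: int) -> float:
    """Weighted findings per kLOC from a structured audit report."""
    weighted = sum(SEVERITY_WEIGHTS[sev] * n for sev, n in findings.items())
    return weighted / (lines_audited / 1000)

# Two hypothetical audits, before and after AI tool adoption.
before = finding_rate({"critical": 1, "high": 2, "medium": 4, "low": 6},
                      lines_audited=4000)
after = finding_rate({"critical": 0, "high": 1, "medium": 5, "low": 7},
                     lines_audited=6000)

# Here the rate falls even though more code was audited: throughput rose
# without degrading security quality, which is the outcome worth rewarding.
```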

The Satisfaction dimension matters more in Web3 than it might initially appear, because developer retention is a significant constraint in the ecosystem. Experienced Solidity developers are scarce, and the cognitive load of working in an environment where mistakes have irreversible financial consequences is genuinely high. AI tools that reduce that cognitive load, by surfacing relevant security patterns, flagging potential vulnerabilities before they reach review, and handling the mechanical parts of contract development, contribute to developer wellbeing in ways that have downstream effects on retention and team stability. Those effects are real productivity contributions even if they do not show up in any sprint velocity chart.

The Compounding Problem of Invisible Validation Work

One of the most consistent patterns in the research on AI coding productivity is that the time savings from AI generation are visible and immediate, while the costs of validation are diffuse and easy to misattribute. A developer who uses an AI tool to generate a token vesting contract in ten minutes instead of two hours will remember the ten-minute generation. They are less likely to accurately account for the forty-five minutes they subsequently spent reading through the output, the twenty minutes they spent running it against a test suite, and the hour they spent in a review session where a colleague flagged a subtle issue with the cliff calculation logic. The generation time is salient. The validation time blends into the background of normal development work.
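The vesting-contract anecdote can be restated as an explicit time ledger, using the durations from the paragraph above:

```python
# The vesting-contract anecdote as an explicit time ledger, in minutes.
# Durations are the illustrative figures from the surrounding text.

manual_estimate = 120           # writing the contract by hand (two hours)

ai_generation = 10              # the salient, remembered part
ai_validation = 45 + 20 + 60    # reading output + test suite + review session

ai_total = ai_generation + ai_validation

# The generation step alone looks like a 12x win; the full ledger shows
# a small net loss once validation is counted.
net_saving = manual_estimate - ai_total
```

The ledger is the point: the ten-minute generation is what gets reported upward, while the 125 minutes of validation dissolve into the background of normal development work.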

This attribution problem is not unique to Web3, but it is more consequential there. In a traditional software context, the cost of shipping code that has not been thoroughly validated is a bug report and a patch cycle. In a smart contract context, the cost is potentially the entire value locked in the protocol. Teams that are measuring productivity by generation speed and not accounting for validation overhead are building a false picture of their actual throughput, and they are creating incentive structures that push developers toward accepting AI output with less scrutiny than the risk profile of the work demands.

Building validation time into the productivity framework requires treating it as a first-class activity rather than overhead. This means tracking time spent in security review, time spent running formal verification tools, and time spent writing invariant tests as productive work, not as friction that slows down delivery. It means setting expectations with stakeholders that a 10x increase in code generation speed does not translate to a 10x increase in deployment frequency, because the validation pipeline has its own throughput constraints. And it means designing AI tooling that supports the validation workflow, not just the generation workflow, which is a design choice that most general-purpose AI coding tools have not yet fully made.

Where Cheetah AI Fits Into This Picture

The measurement problem described throughout this piece is not going to be solved by better spreadsheets or more sophisticated dashboards alone. It requires tooling that is built with the specific constraints of Web3 development in mind, tooling that understands the difference between code that compiles and code that is safe to deploy, and that surfaces the right signals at the right moments in the development workflow rather than leaving developers to instrument everything themselves.

Cheetah AI is built for exactly this context. As a crypto-native AI IDE, it is designed around the reality that smart contract development is not just software development with a different syntax. It is a discipline with its own risk model, its own validation requirements, and its own definition of done. The productivity gains that matter in this context are not measured in lines generated per hour. They are measured in audit findings avoided, in security review cycles shortened, in the confidence a developer has when they push a contract to testnet knowing that the tooling has already surfaced the issues that would otherwise surface in an external audit six weeks later.

If your team is currently trying to evaluate AI coding tools using metrics that were designed for a different kind of work, the most useful thing you can do is start by questioning the framework before you question the tools. The 10% productivity plateau is not a ceiling. It is a measurement artifact. Teams that build evaluation frameworks around outcomes rather than output, and that use tooling designed for the specific demands of blockchain development, are finding that the actual improvement potential is considerably larger. Cheetah AI exists to help Web3 developers get there.


That means tracking the things that actually matter in a crypto-native context: how quickly a developer can move from a protocol specification to a reviewed, tested contract; how many security issues are caught before they reach an external auditor; how much cognitive overhead is removed from the validation workflow so developers can stay in flow on the hard problems rather than the mechanical ones. These are the metrics that reflect real productivity in Web3, and they are the metrics that Cheetah AI is built to move.

If you are building on-chain and you are still measuring your team's output in commit frequency and story points, it is worth spending some time with the research cited here and asking whether the picture those numbers are painting is actually accurate. The tools have improved. The workflows are maturing. The measurement frameworks just need to catch up.
