
AI Coding Frequency: Smart Contract Vulnerability Rates Revealed

New research shows that high-frequency AI coding use correlates with rising smart contract vulnerability rates, particularly for complex business logic flaws. Here is what the data means for Web3 development teams.


What the Data Actually Shows About AI-Assisted Contract Development

TL;DR:

  • High-frequency AI coding use correlates with increased smart contract vulnerability rates, particularly for complex business logic flaws that pattern-matching models are structurally ill-equipped to detect
  • Anthropic's red team evaluation found that AI agents identified $4.6 million worth of exploitable vulnerabilities across 405 real-world contracts exploited between 2020 and 2025, establishing a concrete lower bound for the economic harm these capabilities could enable
  • The SCONE-bench evaluation tested Claude Opus 4.5, Claude Sonnet 4.5, and GPT-5 against contracts exploited after their knowledge cutoffs, and both Sonnet 4.5 and GPT-5 uncovered two novel zero-day vulnerabilities across 2,849 recently deployed contracts with no known vulnerabilities
  • Veracode's 2025 GenAI Code Security Report found that AI models chose insecure coding patterns in 45% of cases across more than 100 LLMs tested, a number that compounds when those models are integrated into pipelines without static analysis gates
  • The shift from coding errors to business logic flaws as the dominant vulnerability class means that traditional static analysis tools like Slither and Mythril catch a shrinking share of what actually gets exploited
  • AI-powered exploit generation is now economically viable at scale, with GPT-5 producing exploits at an API cost of $3,476 against contracts worth substantially more, meaning the attacker economics have fundamentally changed
  • Closing the gap requires purpose-built tooling that keeps developers in the comprehension loop, not just faster code generation

The result: High-frequency AI coding without embedded security feedback is not a productivity strategy; it is a liability accumulation strategy.

The conversation around AI-assisted development in Web3 has spent most of its energy on the upside. Teams ship faster, boilerplate disappears, and junior developers can produce Solidity that would previously have required a senior engineer. That narrative is real, and the productivity gains are measurable. But the data that has emerged over the past eighteen months tells a more complicated story, one where the same acceleration that compresses development timelines also compresses the window in which vulnerabilities can be caught before they become permanent fixtures on a public ledger.

The Harness data, combined with findings from Anthropic's red team research and Veracode's 2025 GenAI Code Security Report, paints a picture that every Web3 engineering lead needs to sit with. High-frequency AI coding use does not just introduce more code. It introduces more code at a rate that outpaces the comprehension and review capacity of the teams shipping it. When that dynamic plays out in traditional software, the consequence is technical debt and the occasional production incident. When it plays out in smart contract development, the consequence is an immutable vulnerability sitting on-chain, waiting for someone with the right tools and enough time to find it.

The Velocity Trap and How Teams Fall Into It

There is a pattern that emerges when teams adopt AI coding tools aggressively without adjusting their review and testing workflows to match. In the first few weeks, productivity metrics look excellent. Pull request volume increases, sprint velocity goes up, and the backlog shrinks. The problem is that the metrics being tracked are output metrics, not quality metrics. Lines of code shipped per day is not the same as secure lines of code shipped per day, and in smart contract development, that distinction carries financial consequences that do not show up until much later.

Veracode's research is instructive here. Across more than 100 large language models tested on 80 curated coding tasks, AI models chose insecure coding patterns in 45% of cases. That is not a fringe result from a handful of poorly configured models. It is a median outcome across the current generation of production-grade AI coding tools. When a team is using those tools to generate Solidity at high frequency, and the review process has not been redesigned to account for that baseline insecurity rate, the math becomes straightforward and uncomfortable. More code generated per day, multiplied by a 45% insecure pattern rate, multiplied by a review process that was designed for human-paced output, equals a growing backlog of unreviewed risk.
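That arithmetic can be made concrete. The sketch below uses Veracode's 45% figure; the daily throughput and review-capacity numbers are illustrative assumptions, not measurements from any study.

```python
# Sketch of the unreviewed-risk arithmetic described above. The 45%
# insecure-pattern rate comes from Veracode's 2025 report; the throughput
# and review-capacity figures are illustrative assumptions.
INSECURE_RATE = 0.45  # share of AI-generated snippets with insecure patterns

def unreviewed_risk_backlog(snippets_per_day: int,
                            review_capacity_per_day: int,
                            days: int) -> float:
    """Insecure snippets that accumulate beyond what review can absorb."""
    backlog = 0.0
    for _ in range(days):
        generated = snippets_per_day * INSECURE_RATE
        reviewed = min(review_capacity_per_day, snippets_per_day) * INSECURE_RATE
        backlog += generated - reviewed  # review calibrated for human pace falls behind
    return backlog

# A team that 5x'd output without scaling review accumulates risk steadily:
print(unreviewed_risk_backlog(snippets_per_day=50, review_capacity_per_day=10, days=30))
# → 540.0
```

The model is crude by design: the point is that the backlog term grows linearly with the gap between generation rate and review capacity, and never shrinks on its own.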

The velocity trap is not a failure of the AI tools themselves. It is a failure of the workflow assumptions that teams carry over from pre-AI development. The assumption that a developer who writes code also understands it deeply enough to catch its security implications does not hold when that developer is generating fifty functions a day instead of five. The comprehension that comes from writing code line by line, from wrestling with the logic and making deliberate choices about control flow and state management, does not transfer automatically when the code arrives fully formed from a model prompt. Developers who use AI tools at high frequency often report a subtle but significant shift in their relationship to the codebase. They understand what the code is supposed to do, but they have a shallower grasp of what it actually does under edge conditions, under adversarial inputs, or when composed with other contracts in ways the original prompt did not anticipate.

The Shift From Coding Errors to Business Logic Vulnerabilities

For most of the history of smart contract security, the dominant vulnerability classes were well-understood and relatively mechanical. Reentrancy, integer overflow, unchecked external calls, and improper access control accounted for the majority of exploits, and the tooling ecosystem evolved to catch them. Slither, Mythril, and Echidna are genuinely effective at surfacing these patterns, and a team that runs them consistently in CI/CD will catch a meaningful share of the vulnerabilities that plagued early DeFi protocols. The problem is that the threat landscape has moved faster than the tooling assumptions.

The data from recent exploit post-mortems tells a consistent story. The share of vulnerabilities attributable to straightforward coding errors has declined, while the share attributable to flawed business logic has grown substantially. Business logic vulnerabilities are not about writing syntactically incorrect Solidity or using a deprecated pattern. They are about the protocol doing exactly what the code says, but the code saying something subtly wrong about how value should flow, how permissions should compose, or how state should update across multiple transactions. These flaws are invisible to static analysis tools because the tools are checking for known bad patterns, not for whether the protocol's economic model is internally consistent.

AI-generated code accelerates this problem in a specific way. When a developer prompts a model to implement a lending protocol's liquidation logic, the model draws on patterns from its training data. If those patterns are correct for the common case, the generated code will look reasonable and will pass most automated checks. But the edge cases, the scenarios where a liquidator can manipulate oracle prices across multiple blocks, or where a flash loan can temporarily satisfy a collateral check before the state is fully settled, require a depth of protocol-specific reasoning that current models do not reliably provide. The code looks right. The tests pass. The audit catches the obvious issues. And then six months after mainnet deployment, someone finds the edge case that the model did not anticipate and the developer did not think to test.
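The flash-loan scenario can be reduced to a toy model. Everything below is a deliberately simplified illustration in Python, not real protocol code: the pool, the collateral ratio, and the attack flow are assumptions made for clarity.

```python
# Toy model of the flash-loan window described above: a collateral check
# that only reads the *current* balance can be satisfied with borrowed
# funds that are returned within the same transaction. All names and
# numbers are illustrative assumptions.
class ToyLendingPool:
    def __init__(self):
        self.collateral = {}

    def deposit(self, user, amount):
        self.collateral[user] = self.collateral.get(user, 0) + amount

    def withdraw(self, user, amount):
        self.collateral[user] -= amount

    def can_borrow(self, user, amount, ratio=2):
        # Vulnerable pattern: a point-in-time balance check, with no notion
        # of how long the collateral must remain locked
        return self.collateral.get(user, 0) >= amount * ratio

def flash_loan_attack(pool):
    borrowed = 1_000_000                  # flash-loaned funds, held for one tx
    pool.deposit("attacker", borrowed)    # temporarily satisfy the check
    passed = pool.can_borrow("attacker", 400_000)
    pool.withdraw("attacker", borrowed)   # repay the flash loan in the same tx
    return passed

print(flash_loan_attack(ToyLendingPool()))  # → True: check passed with zero lasting collateral
```

A reviewer who inspects `can_borrow` in isolation sees nothing wrong; the flaw exists only across the sequence of calls, which is exactly why it evades both pattern-matching models and pattern-matching static analyzers.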

What the SCONE-bench Evaluation Actually Found

The research published by Anthropic's red team in December 2025 is worth reading carefully, because it establishes something that the industry has been debating in the abstract for years. The SCONE-bench evaluation used 405 contracts that were actually exploited between 2020 and 2025, and tested AI agents against contracts exploited after the models' knowledge cutoffs, specifically to rule out the possibility that the models had simply memorized known exploits from their training data. Claude Opus 4.5, Claude Sonnet 4.5, and GPT-5 collectively developed exploits worth $4.6 million against those contracts. That number is a lower bound, not a ceiling, and it represents the economic harm that could be enabled by capabilities that are already deployed and accessible.

The more operationally significant finding is what happened when the same agents were turned loose on 2,849 recently deployed contracts with no known vulnerabilities. Both Sonnet 4.5 and GPT-5 uncovered two novel zero-day vulnerabilities and produced working exploits worth $3,694. GPT-5 accomplished this at an API cost of $3,476, which means the margin between attack cost and potential return was thin on these particular contracts, but the proof of concept is what matters. The economic viability of autonomous AI-driven exploitation is no longer theoretical. It is a demonstrated capability, and the contracts that were tested were not exotic edge cases. They were recently deployed production contracts that had presumably passed whatever review process their teams used.

The implications for development teams are direct. If AI agents can find zero-day vulnerabilities in recently deployed contracts at a cost measured in thousands of dollars, then the security bar for what constitutes an adequately reviewed contract has risen significantly. A manual audit that would have been considered thorough two years ago may not be sufficient against an adversary who can run automated exploit generation at scale. The question is not whether teams should be worried about this. The question is what they are doing about it.

The Attacker Economics Have Fundamentally Changed

One of the more underappreciated aspects of the current moment is how dramatically the cost structure of smart contract exploitation has shifted. Historically, finding and exploiting a novel vulnerability in a production DeFi protocol required significant expertise, substantial time investment, and a deep understanding of both the specific protocol and the broader EVM execution environment. That expertise was scarce, which provided a form of security through obscurity that nobody liked to admit was part of the threat model but that functionally existed.

The Anthropic findings put a concrete number on how much that has changed. GPT-5 produced working exploits at an API cost of $3,476. That is not a number that requires a well-funded nation-state actor or a sophisticated criminal organization. It is a number that is accessible to a moderately motivated individual with a credit card and enough patience to iterate on prompts. The barrier to entry for smart contract exploitation has dropped by an order of magnitude, and it will continue to drop as models improve and as the tooling around automated exploit generation matures.

This shift in attacker economics has a direct implication for how development teams should think about their security investment. When exploitation was expensive and required rare expertise, a protocol could accept some residual risk after a thorough audit and reasonably expect that the cost of finding and exploiting remaining vulnerabilities would deter most attackers. That calculus no longer holds. When the cost of finding a vulnerability can be measured in thousands of dollars and the potential return can be measured in millions, the expected value calculation for an attacker is favorable across a much wider range of vulnerability severity levels. Teams that are still operating on the old threat model are underinvesting in security relative to the actual risk they are carrying.
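The expected-value calculation referenced above is simple enough to write down. The $3,476 API cost is the GPT-5 figure cited earlier; the success probability and payout below are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope attacker economics. The $3,476 API cost is the
# GPT-5 figure cited in this post; the success probability and payout
# are illustrative assumptions, not measurements.
def attacker_expected_value(cost_usd, p_success, payout_usd):
    """Expected profit of one automated exploit-search campaign."""
    return p_success * payout_usd - cost_usd

# Even a 1% chance at a $1M payout comfortably clears a ~$3.5k search cost:
print(attacker_expected_value(cost_usd=3_476, p_success=0.01, payout_usd=1_000_000))
# → 6524.0
```

When the cost term was measured in senior-auditor months rather than thousands of dollars of API credits, this expression was negative for all but the largest targets. That is the change the Anthropic numbers quantify.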

Why Advanced Models Still Miss Complex Flaws

There is a tempting assumption that as AI models become more capable, the vulnerability problem will solve itself. If the models generating code become sophisticated enough, they will generate secure code by default, and the security review problem will diminish. The data does not support this assumption, and understanding why requires thinking carefully about what kind of intelligence is needed to write secure smart contracts versus what kind of intelligence current models actually have.

Current large language models are extraordinarily good at pattern recognition and pattern synthesis. They have processed enormous amounts of Solidity code, security research, audit reports, and exploit post-mortems. They can recognize common vulnerability patterns and avoid them in straightforward contexts. What they are not good at is the kind of multi-step, protocol-specific reasoning that catching business logic vulnerabilities requires. Identifying a reentrancy vulnerability is a pattern-matching task. Identifying a vulnerability in a novel AMM design that arises from the interaction between fee accrual logic and liquidity provision mechanics across multiple transactions requires building a mental model of the protocol's economic invariants and reasoning about how they can be violated. That is a different cognitive task, and it is one where current models fail at a rate that should concern any team shipping production contracts.

The SCONE-bench results are consistent with this framing. The AI agents were effective at finding vulnerabilities in contracts that had structural similarities to known exploit patterns. They were less reliable on vulnerabilities that required understanding the protocol's intended behavior well enough to identify where the implementation diverged from it in exploitable ways. This is precisely the category of vulnerability that is growing as a share of total exploits, and it is precisely the category that high-frequency AI coding is most likely to introduce, because the models generating the code have the same blind spots as the models being used to audit it.

The Comprehension Debt Problem at Scale

There is a concept worth naming explicitly here, because it captures something that teams experience but do not always have language for. Comprehension debt is the accumulated gap between code that exists in a codebase and code that the team actually understands well enough to reason about under adversarial conditions. In traditional software development, comprehension debt is a real problem, but it is recoverable. You can refactor, you can add tests, you can bring in a new engineer to do a thorough review, and you can patch vulnerabilities when they are found. The debt is real but it is not permanent.

In smart contract development, comprehension debt has a different character. Once a contract is deployed, the code is immutable. The comprehension debt that existed at deployment time is locked in. If the team did not fully understand the liquidation logic when they shipped it, they cannot go back and understand it retroactively in a way that changes the on-chain reality. They can deploy a new version, but the original contract remains, and any funds that flow through it remain exposed to whatever vulnerabilities the team did not understand well enough to catch. High-frequency AI coding accelerates comprehension debt accumulation because it increases the rate at which code enters the codebase relative to the rate at which developers build genuine understanding of that code.

The teams most at risk are not the ones using AI tools carelessly. They are the ones using AI tools effectively for productivity while maintaining review processes that were designed for a slower pace of development. A review process that was calibrated for a team shipping ten functions per week does not scale linearly to a team shipping fifty functions per week using AI assistance. The review bottleneck becomes the constraint, and when review is the constraint, the temptation is to compress it rather than to redesign the workflow. Compressing review in smart contract development is where comprehension debt turns into on-chain liability.

What Robust Security Looks Like at the Current Threat Level

Given the shift in both the vulnerability landscape and the attacker economics, what does an adequate security posture actually look like for a team shipping AI-assisted smart contracts in 2026? The answer has several components, and none of them are optional if the goal is production-grade security rather than the appearance of it.

The first component is automated static analysis that runs on every commit, not just before deployment. Tools like Slither and Mythril should be integrated into CI/CD pipelines as blocking checks, not advisory ones. They will not catch business logic vulnerabilities, but they will catch the mechanical issues that still account for a meaningful share of exploits, and catching them early is substantially cheaper than catching them in audit. The second component is fuzz testing with tools like Echidna or Foundry's built-in fuzzer, specifically targeting the protocol's economic invariants. This means writing property-based tests that express what should always be true about the protocol's state, and then running the fuzzer against those properties with enough iterations to surface edge cases that unit tests will not find.
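To make the second component concrete, here is a minimal property-based fuzzer in the spirit of Echidna or Foundry's invariant testing, written in Python with only the standard library. The constant-product pool and the non-decreasing-k property are illustrative assumptions; real invariant tests would run against the actual contracts.

```python
# Minimal property-based fuzz test in the spirit of Echidna/Foundry
# invariant testing. The constant-product pool and the chosen property
# are illustrative assumptions, not any real protocol's code.
import random

class Pool:
    def __init__(self, x, y):
        self.x, self.y = x, y  # integer token reserves

    def swap_x_for_y(self, dx):
        fee = dx * 3 // 1000                        # 0.3% fee stays in the pool
        dx_eff = dx - fee
        dy = self.y * dx_eff // (self.x + dx_eff)   # floor division favors the pool
        self.x += dx
        self.y -= dy
        return dy

def fuzz(n_runs=1000, seed=0):
    """Property: a swap must never shrink the product of reserves (k)."""
    rng = random.Random(seed)
    pool = Pool(10**6, 10**6)
    for i in range(n_runs):
        k_before = pool.x * pool.y
        pool.swap_x_for_y(rng.randint(1, 10**4))
        if pool.x * pool.y < k_before:              # economic invariant violated
            return f"k decreased at run {i}"
    return "invariant held"

print(fuzz())  # → invariant held
```

The point is the shape of the test: express what must always be true about the protocol's state, then throw randomized call sequences at it until the property breaks or survives enough iterations to build confidence. Unit tests check cases the author thought of; properties check cases the author did not.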

The third component is the one that most teams underinvest in, which is structured comprehension review. This is not a code review in the traditional sense. It is a process where the developer who generated the code, or a senior engineer reviewing it, explicitly works through the logic under adversarial assumptions. What happens if a caller provides the maximum possible input? What happens if this function is called in the same transaction as that one? What happens if the oracle price moves by 50% between the first and second call? These questions cannot be answered by running a linter. They require a human who understands the protocol well enough to reason about it under conditions that the original prompt did not specify.

The Role of AI-Native Tooling in Closing the Gap

The irony of the current situation is that the same class of tools creating the vulnerability problem is also the most promising path toward solving it. AI agents that can find $4.6 million worth of exploitable vulnerabilities in production contracts are, by definition, capable of finding those vulnerabilities before deployment if they are integrated into the development workflow rather than used adversarially after the fact. The question is whether teams are using AI offensively, to generate code faster, or defensively, to find the vulnerabilities in the code they are generating.

The most effective teams are doing both, and they are doing them in sequence. Code generation happens first, with AI assistance accelerating the implementation of protocol logic. Security analysis happens second, with AI-assisted tools running against the generated code to surface vulnerabilities before they reach testnet. This is not a novel workflow in principle. It is the standard shift-left security model applied to smart contract development. What is novel is that the AI tools available for the security analysis step are now capable enough to find the kinds of complex, protocol-specific vulnerabilities that previously required a senior auditor with deep domain expertise.

The gap that remains is tooling that keeps developers in the comprehension loop throughout this process. Generating code and then running an AI security scanner against it is better than generating code without any security analysis, but it still leaves the developer at a distance from the code's actual behavior. The most valuable tooling is the kind that surfaces security insights inline, during the development process, in a way that builds the developer's understanding of the code rather than just flagging issues for them to address. When a developer understands why a particular pattern is vulnerable, they are less likely to reproduce that pattern in the next function they generate. When they just receive a flag and fix it, the comprehension gap persists.

The Harness Data in Context

The Harness data on high-frequency AI coding and vulnerability rates fits into a broader pattern that is now well-documented across multiple research sources. The finding that increased AI coding frequency correlates with increased vulnerability rates is not surprising given the mechanisms described above, but it is important to have it quantified rather than just theorized. The correlation is not uniform across vulnerability types. For mechanical vulnerabilities, the correlation is weaker, because AI tools are reasonably good at avoiding known bad patterns. For business logic vulnerabilities, the correlation is stronger, because those are precisely the vulnerabilities that emerge from the gap between what a developer intended and what the generated code actually does.

The practical implication is that teams should not interpret the Harness data as an argument against using AI coding tools. The productivity gains are real and the competitive pressure to use them is not going away. The argument is for using them with a security workflow that is calibrated to the actual risk profile of AI-generated code, not the risk profile of human-written code from five years ago. Those are different risk profiles that require different mitigations, and treating them as equivalent is how teams end up with production vulnerabilities that a more thoughtful workflow would have caught.

It is also worth noting that the data on AI-assisted security analysis is equally compelling in the other direction. The same research that documents AI's ability to generate vulnerable code also documents AI's ability to find vulnerable code. Anthropic's red team findings are not just a warning about attacker capabilities. They are a demonstration of what defensive tooling can do when it is applied with the same rigor. The teams that will navigate this environment successfully are the ones that treat AI as a tool for both generation and analysis, and that build workflows where the two are tightly coupled rather than separated by weeks or months.

Building Toward a More Defensible Development Practice

The path forward for Web3 development teams is not to slow down. The competitive dynamics of the space do not permit it, and the productivity gains from AI-assisted development are too significant to leave on the table. The path forward is to build development practices where security analysis keeps pace with code generation, where comprehension is treated as a first-class output alongside functionality, and where the tooling supports both goals simultaneously rather than forcing a tradeoff between them.

This means investing in CI/CD pipelines that include automated security gates as blocking checks, not advisory ones. It means writing property-based tests that express economic invariants before writing the implementation code, so that the fuzzer has something meaningful to test against. It means conducting structured comprehension reviews that go beyond checking whether the code does what the prompt asked, and into whether the code behaves correctly under the full range of conditions it will encounter on a live blockchain. And it means staying current with the evolving threat landscape, because the attacker tooling is improving at the same rate as the development tooling, and a security posture that was adequate six months ago may not be adequate today.
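As one example of a blocking gate, Slither can emit machine-readable output that a small script turns into a pass/fail decision. The JSON shape below (results → detectors → impact) matches recent Slither releases, but the field names should be verified against your installed version; treat them as assumptions.

```python
# Sketch of a blocking CI gate over Slither's JSON report. The JSON field
# names (results -> detectors -> impact) are assumptions based on recent
# Slither releases; verify against your installed version.
# Typical wiring: run Slither with JSON output, then feed the report here.
import json

BLOCKING_IMPACTS = {"High", "Medium"}  # severities that should fail the build

def gate(report_json: str) -> int:
    """Return a CI exit code: 0 to pass, 1 to block the merge."""
    report = json.loads(report_json)
    findings = report.get("results", {}).get("detectors", [])
    blocking = [f for f in findings if f.get("impact") in BLOCKING_IMPACTS]
    for f in blocking:
        print(f"[{f['impact']}] {f.get('check')}: {f.get('description', '').strip()}")
    return 1 if blocking else 0
```

The design choice that matters is the return code: a gate that merely prints warnings is the advisory check this post argues against, while a nonzero exit code makes the pipeline itself enforce the policy.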

The teams that get this right will have a genuine competitive advantage, not just because they will ship fewer vulnerabilities, but because they will ship with more confidence. The ability to move fast without accumulating hidden liability is what separates teams that scale successfully from teams that hit a wall when their first major exploit lands. That confidence comes from process, from tooling, and from a culture that treats security as a development concern rather than a pre-deployment checklist.

Where Cheetah AI Fits Into This Picture

If you are building on-chain and using AI coding tools at any meaningful frequency, the workflows described in this post are not aspirational. They are the baseline for operating responsibly in an environment where AI-assisted exploitation is economically viable and where the vulnerability landscape has shifted toward the kinds of flaws that traditional tooling misses.

Cheetah AI is built specifically for this environment. It is a crypto-native IDE that integrates AI-assisted code generation with inline security analysis, designed to keep developers in the comprehension loop rather than widening the gap between code generation velocity and developer understanding. The goal is not to slow down development. It is to make the security analysis fast enough that it does not create a bottleneck, and contextual enough that it builds developer understanding rather than just flagging issues. If you are thinking about how to close the gap between how fast your team ships and how well your team understands what it ships, Cheetah AI is worth a look.


The broader point is that the data from Anthropic, Veracode, and the SCONE-bench evaluation is not a reason to be pessimistic about AI-assisted Web3 development. It is a reason to be precise about what the tooling needs to do. Faster code generation without embedded security feedback is a liability accumulation engine. Faster code generation with security analysis that keeps pace, and that builds developer understanding rather than just flagging issues, is a genuine force multiplier. Cheetah AI is built around that second model, and if the research covered in this post resonates with what your team is experiencing, it is a good time to see what that looks like in practice.
