
Anthropic Code Review: Catching What AI Generates

Anthropic's Code Review tool brings automated security scanning directly into the AI development loop. Here's what it means for smart contract security and Web3 teams shipping at scale.


The Problem That Shipped With the Solution

TL;DR:

  • Anthropic launched Code Review as part of Claude Code, a multi-agent system designed to automatically analyze AI-generated code for logic errors and security vulnerabilities before they reach production
  • Claude Code's revenue run-rate has exceeded $2.5 billion, and enterprise subscriptions have grown fourfold since the start of the year, creating a volume of AI-generated pull requests that human reviewers cannot keep pace with
  • Anthropic's own red team demonstrated that AI agents using Claude Opus 4.5 and Sonnet 4.5 identified $4.6 million worth of exploitable vulnerabilities in 405 real-world smart contracts, establishing a concrete lower bound for what automated attackers can already do
  • The same models generating code are now being used to review it, creating a closed loop where AI output is validated by AI analysis trained to understand the failure modes of AI generation
  • Token-based pricing puts typical code reviews in the $15 to $25 range, making automated security scanning economically viable at the pull request level rather than reserved for pre-launch audits
  • Traditional static analysis tools like Slither and Mythril were not designed to catch the specific vulnerability patterns that emerge from AI-assisted code generation, creating a gap that purpose-built review tooling is now beginning to fill
  • Smart contracts remain the highest-stakes environment for this class of tooling, given that deployment is irreversible and exploits are immediate and financially quantifiable

The result: Anthropic's Code Review is not just a productivity tool; it is a structural response to a security problem that the AI coding wave created and that only AI-native tooling can realistically address.

When the Code Generator Becomes the Vulnerability Source

There is a particular irony embedded in the current state of AI-assisted development. The tools that have made it possible to generate thousands of lines of Solidity in an afternoon are the same tools that have introduced a new and poorly understood class of vulnerability into production codebases. Research published by Veracode in 2025 found that AI models selected insecure coding patterns in 45 percent of cases across more than 100 large language models tested against 80 curated development tasks. That is not a marginal failure rate. That is a near-coin-flip on whether any given AI-generated function is handling security-sensitive logic in a way that would survive a competent audit.

The Moonwell DeFi protocol made this concrete in a way that numbers alone cannot. A $1.78 million exploit traced back to AI-generated vulnerable code demonstrated that the comprehension gap between code that exists in a codebase and code that developers actually understand is not a theoretical concern. It is a financial liability with a specific dollar figure attached. What makes the Moonwell case instructive is not just the loss itself, but the mechanism: the vulnerability was not exotic. It was the kind of access control gap that a senior Solidity developer reviewing the code manually would likely have caught. The AI generated it, the developer trusted it, and the gap between those two moments is where the exploit lived.

This pattern is repeating across the industry at a scale that is difficult to track precisely because most incidents do not get attributed to AI-generated code in post-mortems. Teams are shipping faster than ever, the code looks syntactically correct, it passes basic tests, and it clears CI pipelines that were designed for a world where humans wrote every line. The security tooling has not caught up to the generation tooling, and that asymmetry is what Anthropic's Code Review is attempting to close.

What Anthropic's Code Review Actually Does

Code Review is a multi-agent system launched as part of Claude Code, Anthropic's terminal-based coding environment. The architecture matters here because it is not a single model reading a diff and returning comments. It is a coordinated set of agents that analyze code from multiple angles simultaneously: one focused on logic correctness, another on security patterns, a third on consistency with the broader codebase context. The output is a structured review that flags issues with enough specificity to be actionable, rather than the kind of vague warnings that developers learn to ignore.

The tool integrates directly into the pull request workflow, which is the right place to put it. Cat Wu, Anthropic's head of product, described the problem it was built to solve in straightforward terms: the volume of pull requests generated by Claude Code had created a bottleneck that was slowing down deployment even as the code generation itself was accelerating. Enterprise teams were generating more code than their review capacity could handle, and the result was either delayed shipping or reduced review quality. Code Review is designed to handle the first pass automatically, surfacing the issues that matter and letting human reviewers focus their attention on the decisions that require judgment rather than pattern recognition.

The token-based pricing model puts a typical review in the $15 to $25 range, which is a meaningful number when you think about it in context. A traditional smart contract audit from a firm like Trail of Bits or OpenZeppelin runs anywhere from $15,000 to $150,000 depending on contract complexity and scope. That audit happens once, usually close to deployment, and covers a snapshot of the code at a specific moment. Code Review at $15 to $25 per pull request means you can run security analysis on every meaningful change throughout the development cycle, not just at the end. The economics of shifting security left have never been more favorable.

The Pull Request Bottleneck Nobody Planned For

The enterprise adoption curve for Claude Code has been steep enough to create operational problems that nobody anticipated when these tools were first deployed. Anthropic reported a fourfold increase in enterprise subscriptions since the beginning of 2026, and Claude Code's revenue run-rate has crossed $2.5 billion, a signal that this is not experimental usage. These are production engineering teams using AI to generate real code that ships to real users. The pull request volume that follows from that kind of adoption is not something that existing review processes were designed to absorb.

Consider what happens at a mid-sized DeFi protocol with a team of 15 engineers, all using Claude Code for the majority of their development work. A team that previously opened 30 pull requests per week might now be opening 120 or more, each containing code that was generated at a speed no human could match and reviewed at a depth that no human has time to provide. The bottleneck is not the generation. The bottleneck is the review, and when review quality degrades under volume pressure, the vulnerabilities that slip through are not random. They tend to be the subtle ones, the reentrancy conditions that only manifest under specific call sequences, the integer overflow paths that require a particular combination of inputs, the access control gaps that look correct in isolation but break down when composed with other contracts.
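The arithmetic behind that bottleneck is easy to make concrete. A minimal sketch under stated assumptions: the 45-minute review depth and the five reviewer-hours per engineer per week are illustrative figures chosen for this example, not numbers from the article.

```python
# Back-of-the-envelope version of the review bottleneck described above.
engineers = 15
prs_per_week = 120                  # post-AI volume from the scenario above
minutes_per_review = 45             # assumed depth for a meaningful security read
reviewer_hours_available = engineers * 5   # assume 5 h/week each spent reviewing

hours_needed = prs_per_week * minutes_per_review / 60
print(hours_needed)                             # 90.0
print(reviewer_hours_available)                 # 75
print(hours_needed > reviewer_hours_available)  # True: demand exceeds capacity
```

Even with generous assumptions, demand outruns capacity, and the gap is what gets absorbed by shallower reviews.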

This is the operational context that makes Code Review more than a convenience feature. It is a response to a structural problem that emerged directly from the success of AI code generation. The faster teams ship, the more they need automated review that can keep pace with automated generation. The two capabilities are not independent. They are coupled, and treating them as separate concerns is how vulnerabilities accumulate.

The Red Team Evidence That Made This Urgent

Anthropic's own red team published findings in December 2025 that should be required reading for anyone building or auditing smart contracts. Using a benchmark called SCONE-bench, which comprises 405 real-world exploited contracts from 2020 through 2025 across Ethereum, Binance Smart Chain, and Base, researchers evaluated whether AI agents could autonomously identify vulnerabilities, write working exploits, and extract value from those contracts in a sandboxed environment. The results were not ambiguous. Claude Opus 4.5, Claude Sonnet 4.5, and GPT-5 collectively developed exploits worth $4.6 million against contracts exploited after the models' knowledge cutoffs, meaning the agents were not simply recalling known exploits from training data. They were reasoning about vulnerability patterns and generating novel attack paths.

The researchers went further. They evaluated both Sonnet 4.5 and GPT-5 against 2,849 recently deployed contracts with no known vulnerabilities. Both agents found two novel zero-day vulnerabilities and produced exploits worth $3,694 in simulated value. GPT-5 accomplished this at an API cost of $3,476, which means the economics of automated smart contract exploitation are already within reach of a moderately funded attacker. The red team was careful to note that all testing was conducted in blockchain simulators with no impact on live assets, but the proof of concept is clear: profitable autonomous exploitation of smart contracts is technically feasible today.

What this research establishes is not just a threat model. It establishes a baseline for what defensive tooling needs to be capable of matching. If an AI agent can autonomously identify a reentrancy vulnerability in a newly deployed contract and construct a working exploit, then the review tooling protecting that contract needs to be operating at a comparable level of sophistication. Traditional static analysis tools that pattern-match against known vulnerability signatures are not sufficient. The attack surface has evolved, and the defense needs to evolve with it.

Why Traditional Static Analysis Falls Short

Slither, Mythril, and Echidna are genuinely useful tools. They have caught real vulnerabilities in production contracts and they belong in any serious Web3 security pipeline. But they were designed for a world where humans wrote the code they were analyzing, and the vulnerability patterns that emerge from AI-generated Solidity are meaningfully different from the patterns those tools were trained to detect. AI-generated code tends to be syntactically clean and structurally plausible, which means it passes the surface-level checks that traditional static analysis performs well. The vulnerabilities it introduces are more often semantic, arising from incorrect assumptions about execution context, state management, or the interaction between functions that individually look correct.

Reentrancy is a useful example. A classic reentrancy vulnerability has a recognizable structure: an external call before a state update, in a function that modifies a balance or a position. Slither catches this pattern reliably. But AI-generated code sometimes introduces reentrancy through more indirect paths, where the external call is several function calls deep, or where the state update happens in a modifier rather than the function body, or where the vulnerability only manifests when a specific sequence of transactions is executed against a contract that has been in a particular state for some time. These are the cases where semantic understanding of the code's intent matters more than pattern matching against known signatures.
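The indirect cases are easier to see with the classic pattern in front of you. The following Python simulation is a toy stand-in for a Solidity contract, not on-chain code: Vault models a contract whose withdraw makes an external call (the recipient callback) before updating state, which is exactly the ordering Slither flags.

```python
# Toy model of the classic reentrancy pattern: external call before state update.
class Vault:
    def __init__(self):
        self.balances = {}
        self.total = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.total += amount

    def withdraw(self, who, callback):
        amount = self.balances.get(who, 0)
        if amount == 0:
            return
        # BUG: the external call happens before the state update, so a
        # re-entrant callback still sees the old, nonzero balance.
        callback(amount)
        self.balances[who] = 0
        self.total -= amount


class Attacker:
    def __init__(self, vault):
        self.vault = vault
        self.stolen = 0
        self.reentered = False

    def receive(self, amount):
        self.stolen += amount
        if not self.reentered:       # re-enter exactly once for the demo
            self.reentered = True
            self.vault.withdraw("attacker", self.receive)


vault = Vault()
vault.deposit("victim", 100)
vault.deposit("attacker", 10)

attacker = Attacker(vault)
vault.withdraw("attacker", attacker.receive)

print(attacker.stolen)           # 20: deposited 10, withdrew it twice
print(vault.total)               # 90, yet the books still owe the victim 100
print(vault.balances["victim"])  # 100: the vault is now insolvent by 10
```

Moving the `self.balances[who] = 0` line before the callback (checks-effects-interactions ordering) closes the hole; the harder AI-generated variants bury the equivalent of that callback several calls deep, where the ordering is no longer visible in one function.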

The false positive problem compounds this. Traditional static analysis tools generate enough false positives that experienced developers learn to tune them aggressively, which means they also tune out some true positives. A tool that cries wolf on 40 percent of its findings trains developers to dismiss warnings, and that learned dismissal is exactly the wrong habit to have when the stakes are irreversible deployment to a live blockchain. Code Review's design goal of reducing false positives is not a minor quality-of-life improvement. It is a prerequisite for the tool being taken seriously in production workflows.

Multi-Agent Architecture and Why It Matters for Security

The multi-agent design of Code Review is worth examining in some detail because it represents a meaningful departure from how most automated review tools work. A single model reviewing a pull request is constrained by the context window, the breadth of its training on security-relevant patterns, and the inherent tension between being helpful and being conservative. A multi-agent system can decompose the review task into specialized subtasks, run them in parallel, and synthesize the results in a way that is more comprehensive than any single pass could achieve.

In practice, this means one agent can focus on the logical correctness of the code relative to its stated intent, checking whether the implementation actually does what the comments and function names suggest it should do. Another agent can focus specifically on security patterns, looking for the classes of vulnerability that are most common in the contract's category, whether that is a lending protocol, an AMM, a bridge, or a governance contract. A third agent can examine the code in the context of the broader codebase, checking for consistency with existing patterns and identifying places where the new code makes assumptions about state or behavior that the rest of the system does not guarantee.
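Anthropic has not published Code Review's internals, so the following Python sketch is purely illustrative: the agent names, the Finding structure, and the review_pr orchestration are assumptions meant to show the shape of the decomposition, with trivial keyword heuristics standing in for actual model inference.

```python
# Hypothetical sketch of a multi-agent review pipeline (not Anthropic's design).
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    agent: str       # which specialist produced the finding
    severity: str    # "critical" | "warn" | "info"
    message: str

def logic_agent(diff: str) -> list:
    """Toy check: does the implementation guard its preconditions?"""
    if "withdraw" in diff and "require(" not in diff:
        return [Finding("logic", "warn",
                        "withdraw path has no precondition check")]
    return []

def security_agent(diff: str) -> list:
    """Toy check: is an external call ordered before the state update?"""
    if ".call" in diff and "balances[" in diff \
            and diff.index(".call") < diff.index("balances["):
        return [Finding("security", "critical",
                        "external call precedes state update")]
    return []

def review_pr(diff, agents):
    """Run every specialist, then synthesize: worst findings first."""
    findings = [f for agent in agents for f in agent(diff)]
    rank = {"critical": 0, "warn": 1, "info": 2}
    return sorted(findings, key=lambda f: rank[f.severity])

diff = ('function withdraw() external { '
        'msg.sender.call{value: amt}(""); balances[msg.sender] = 0; }')
report = review_pr(diff, [logic_agent, security_agent])
for f in report:
    print(f.agent, f.severity, f.message)
# security critical external call precedes state update
# logic warn withdraw path has no precondition check
```

The point of the structure is that each specialist holds a narrow question, and the synthesis step is where the parallel passes become a single prioritized review.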

This decomposition is particularly valuable for smart contract review because the vulnerability surface in DeFi is highly context-dependent. A function that is perfectly safe in isolation can become exploitable when composed with a flash loan, or when called in a specific order relative to an oracle update, or when the contract is interacting with a token that has non-standard transfer behavior. Catching these compositional vulnerabilities requires understanding the contract's environment, not just its internal logic, and that is the kind of reasoning that benefits most from a multi-agent approach where different agents can hold different aspects of the context simultaneously.

Token Economics and the Real Cost of Security

The $15 to $25 per review pricing deserves more analysis than it typically receives in coverage of this tool. The number sounds low in isolation, but the relevant comparison is not the cost of a single review. It is the cost of the security posture that becomes achievable when review is economically viable at the pull request level rather than reserved for milestone audits.

A DeFi protocol that ships a major contract upgrade every two weeks and runs Code Review on every pull request in that cycle might spend $500 to $1,000 per month on automated security analysis. That same protocol, relying on traditional audit firms for security coverage, would spend that amount in the first hour of an engagement. The economics of continuous security coverage have fundamentally changed, and the implication for Web3 teams is that there is no longer a credible argument for treating security as a pre-launch gate rather than an ongoing practice. The cost barrier that made continuous security review impractical for all but the largest protocols has been removed.
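A rough sketch of that comparison, using the per-review and audit ranges quoted above; the 30-PRs-per-month volume is an assumed figure for a mid-sized team, not a number from the article.

```python
# Monthly cost of continuous review vs. a single traditional audit.
prs_per_month = 30                       # assumed mid-size team volume
low, high = 15, 25                       # per-review cost range from above
monthly_low = prs_per_month * low        # 450
monthly_high = prs_per_month * high      # 750
audit_low, audit_high = 15_000, 150_000  # traditional audit range from above

print(monthly_low, monthly_high)         # 450 750
# Months of continuous coverage bought by the *cheapest* audit engagement:
print(audit_low // monthly_high)         # 20
```

Even at the pessimistic end of both ranges, one audit-sized budget funds well over a year of per-pull-request coverage.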

There is a counterargument worth addressing, which is that token-based pricing creates unpredictability for teams reviewing large codebases or complex contracts. A 2,000-line Solidity file with extensive NatSpec documentation and test coverage will consume more tokens than a 200-line utility contract, and the cost difference can be significant. Teams building review workflows around Code Review need to think about how they scope reviews, whether they run full-file analysis on every change or focus token consumption on the functions and modules that carry the most financial risk. This is not a reason to avoid the tool. It is a reason to be deliberate about how it is integrated into the pipeline.
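One way to be deliberate about scoping is to model the cost before choosing it. The tokens-per-line and per-token rates below are assumptions for illustration, not Anthropic's published pricing; the point is the relative cost of full-file versus targeted review, not the absolute numbers.

```python
# Illustrative cost model for scoping reviews (all constants are assumptions).
TOKENS_PER_LINE = 12        # assumed: code + NatSpec + surrounding diff context
USD_PER_1K_TOKENS = 0.60    # assumed blended input/output rate

def review_cost(lines: int, passes: int = 3) -> float:
    """Estimated cost of a multi-agent review (one pass per agent)."""
    tokens = lines * TOKENS_PER_LINE * passes
    return tokens * USD_PER_1K_TOKENS / 1000

full_file = review_cost(2_000)   # the whole 2,000-line Solidity file
targeted = review_cost(300)      # just the value-moving functions
print(round(full_file, 2))       # 43.2
print(round(targeted, 2))        # 6.48
```

Under these assumed rates, focusing token consumption on the highest-risk modules cuts the per-change cost by roughly 6x, which is the kind of lever teams can pull without giving up coverage where it matters.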

The Comprehension Debt Problem in AI-Generated Contracts

There is a concept worth naming precisely here: comprehension debt. It is the accumulated gap between code that exists in a codebase and code that the team actually understands well enough to reason about under pressure. In traditional software development, comprehension debt builds slowly, through years of accumulated patches, refactors, and undocumented decisions. In AI-assisted development, it can build in an afternoon.

A developer who uses Claude Code to generate a 500-line Solidity contract in two hours has not spent two hours understanding that contract. They have spent two hours directing its generation, reviewing its outputs at a surface level, and verifying that it compiles and passes the tests they thought to write. The contract may be correct. It may also contain a subtle assumption about token decimals, or a reentrancy path through a callback, or an access control gap in an admin function, that the developer would have caught if they had written the code themselves but missed because the generation was fast enough that deep reading felt unnecessary. This is not a criticism of AI-assisted development. It is a description of how comprehension debt accumulates when generation velocity outpaces review depth.

Code Review addresses this directly by making the review process independent of the developer's comprehension of the code they generated. The agents analyzing the pull request are not relying on the developer's mental model of what the code does. They are reading the code as it is, reasoning about its behavior from first principles, and flagging the gaps between what the code appears to do and what it should do. This is the right architecture for a world where the person who generated the code and the person who understands it are not always the same person, and sometimes are not the same entity at all.

Where This Fits in the Broader Security Stack

Code Review is not a replacement for formal verification, professional audits, or on-chain monitoring. It is a layer in a security stack that needs all of those things, and understanding where it fits helps teams use it effectively rather than treating it as a silver bullet that obviates the need for other controls.

The right mental model is to think of the security stack as having three distinct time horizons. Pre-deployment security covers everything that happens before code reaches a live blockchain: static analysis, automated review, formal verification for critical functions, and professional audits for high-value contracts. On-deployment security covers the deployment process itself: multi-sig governance, timelocks, upgrade patterns that preserve the ability to respond to discovered vulnerabilities. Post-deployment security covers ongoing monitoring: on-chain event indexing, anomaly detection, circuit breakers, and incident response playbooks. Code Review belongs firmly in the pre-deployment layer, and it is most valuable when it is running continuously throughout development rather than as a final check before an audit engagement begins.

The interaction between Code Review and professional audits is worth thinking through carefully. A common pattern in Web3 security is that audit firms spend a significant portion of their engagement time on issues that automated tooling could have caught, which means teams are paying audit rates for work that should have been handled earlier in the pipeline. Teams that run Code Review throughout development and arrive at an audit engagement with a cleaner codebase get more value from the audit itself, because the auditors can focus their time on the architectural and compositional risks that require human judgment rather than the pattern-level vulnerabilities that automated analysis handles well.

The Enterprise Adoption Signal and What It Means

The fourfold growth in Claude Code enterprise subscriptions since the beginning of 2026 is a signal worth interpreting carefully. Enterprise adoption of developer tooling tends to be conservative and slow-moving, driven by procurement processes, security reviews, and the organizational inertia that comes with large engineering teams. When enterprise adoption accelerates at this rate, it usually means the tool is solving a problem that teams were already experiencing acutely, not a problem that the vendor had to convince them they had.

The problem in this case is the pull request bottleneck that Cat Wu described: more AI-generated code than human reviewers can process at the quality level that production systems require. Enterprise engineering organizations at companies like Uber and Salesforce are not adopting Claude Code because it is interesting. They are adopting it because it is measurably accelerating their development velocity, and they are adopting Code Review because that acceleration creates a review capacity problem that they need to solve. The enterprise signal here is that the AI coding wave has moved past the experimental phase and into the operational phase, where the secondary problems created by AI generation are as important as the primary benefits.

For Web3 teams, this enterprise adoption curve has a specific implication. The security standards that enterprise software organizations apply to AI-generated code are going to become the baseline expectation for production smart contracts as well. Institutional capital flowing into DeFi protocols, tokenized assets, and on-chain financial infrastructure comes with institutional expectations about security practices. Teams that can demonstrate continuous automated security review as part of their development process are going to have an easier time meeting those expectations than teams that rely on point-in-time audits and manual review.

Building Smarter Pipelines for Irreversible Code

The irreversibility of smart contract deployment is the fact that makes everything else in this conversation more consequential than it would be in traditional software development. A web application with a security vulnerability can be patched in minutes. A smart contract with a reentrancy vulnerability is vulnerable until it is replaced, and replacement requires governance processes, migration of user funds, and the kind of operational complexity that creates its own risks. The asymmetry between the cost of deploying vulnerable code and the cost of catching that vulnerability before deployment is larger in Web3 than in any other software domain, and that asymmetry is the fundamental argument for investing in pre-deployment security tooling.

Automated code review at the pull request level is the right response to this asymmetry because it moves the detection point as early as possible in the development cycle. A vulnerability caught during code review costs a developer an hour of remediation work. The same vulnerability caught during a pre-launch audit costs a team a week of rework and a delay to their launch timeline. The same vulnerability caught after deployment costs users their funds and the protocol its reputation. The economics of early detection are not subtle, and the availability of tooling that makes early detection practical at scale removes the last credible excuse for treating security as a late-stage concern.

Cheetah AI and the Future of Crypto-Native Security Tooling

The emergence of Code Review as a production tool reflects a broader shift in how the industry thinks about the relationship between AI code generation and AI code analysis. The two capabilities are not in tension. They are complementary, and the teams that will build the most secure and most productive Web3 applications are the ones that treat them as a unified workflow rather than separate concerns managed by separate tools.

Cheetah AI is built around exactly this principle. As the first crypto-native AI IDE, Cheetah AI is designed for developers who understand that smart contract development requires a different class of tooling than general-purpose software development. The irreversibility of deployment, the financial stakes of every function, and the adversarial environment in which on-chain code operates all demand an IDE that treats security as a first-class concern throughout the development process, not as a gate at the end of it. If you are building on-chain and you want to understand how AI-assisted development and AI-assisted security review can work together in a single environment designed for the specific demands of Web3, Cheetah AI is worth a closer look.


The conversation around AI-generated smart contract vulnerabilities is going to get louder before it gets quieter. More protocols will ship AI-generated code, more of that code will reach production without adequate review, and more exploits will be traced back to the comprehension gap between generation and understanding. The tools to close that gap exist today. Anthropic's Code Review is one of them. Cheetah AI is another, built specifically for the developers who understand that crypto-native development demands crypto-native tooling, and that the IDE is where security culture either gets built or gets skipped.
