On-Chain Data Engineering: Pipelines That Actually Scale
Building production-grade blockchain analytics pipelines requires more than a node and a SQL query. Here is how the modern on-chain data stack works, and where AI dev tools are changing the equation.



TL;DR:
- On-chain data is structurally unlike traditional application data, requiring purpose-built ingestion, decoding, and transformation layers before it becomes analytically useful
- The Graph Protocol indexes over 40 networks and processes billions of queries monthly, making it the de facto standard for structured on-chain data access in production systems
- A production-grade blockchain analytics pipeline spans three distinct layers: a blockchain data layer, a processing and transformation layer, and an analytics or application layer
- AI models applied to on-chain transaction graphs can detect wash trading, Sybil attacks, and MEV extraction patterns with significantly higher precision than rule-based threshold systems
- Tenderly's simulation and monitoring infrastructure processes over 2 billion transactions per month, giving developers real-time visibility into contract behavior without requiring custom node infrastructure
- The tooling gap in on-chain data engineering is significant, with developers spending more time writing boilerplate indexing code and ABI decoding logic than building the actual analytics
- AI-assisted development environments reduce the time to scaffold a working subgraph or event indexer from hours to minutes, shifting developer attention toward data modeling and query design
The result: Building production-grade blockchain analytics pipelines is an infrastructure engineering problem as much as a data problem, and AI dev tools are becoming the fastest path from raw chain data to actionable insight.
The Data Problem Hiding in Plain Sight
There is a version of blockchain analytics that looks deceptively simple from the outside. You have a public ledger, every transaction is recorded, and the data is theoretically accessible to anyone running a node. Compared to the closed data silos of traditional finance, this sounds like a data engineer's dream. The reality is considerably more complicated, and the gap between "the data exists" and "the data is useful" is where most analytics projects stall out or collapse entirely.
The core issue is that blockchain data is optimized for consensus and verification, not for querying. A raw Ethereum block contains transactions, receipts, logs, and state diffs, but none of it arrives in a format that maps cleanly to the kinds of questions product teams, risk analysts, or compliance officers actually need to answer. A DeFi protocol tracking liquidity pool utilization across 50 pools on three chains is not going to get far by polling an RPC endpoint and parsing hex-encoded calldata by hand. The data needs to be decoded, normalized, enriched, and stored in a way that supports fast, flexible querying at scale.
What makes this harder than it sounds is that the volume is genuinely large. Ethereum alone processes roughly 150 to 200 transactions per block, with blocks arriving every 12 seconds, which adds up to over a million transactions per day. Across all EVM-compatible chains, including Arbitrum, Optimism, Base, Polygon, and others, the aggregate transaction throughput is orders of magnitude higher. A serious analytics pipeline needs to handle this volume continuously, without falling behind, while also maintaining the ability to backfill historical data when new indexing requirements emerge. That combination of real-time throughput and historical completeness is what separates a toy pipeline from a production system.
What Raw On-Chain Data Actually Looks Like
Before you can build a pipeline, you need a clear mental model of what you are actually working with. Blockchain data comes in several distinct forms, and each one requires different handling. At the top level, you have blocks, which contain ordered lists of transactions. Each transaction has a sender, a recipient, a value field, and an input data field that encodes the function call and its arguments when the recipient is a smart contract. Transactions also produce receipts, which contain the gas used, the status, and most importantly, the event logs emitted during execution.
Event logs are where most of the analytically interesting data lives for DeFi and NFT applications. When a Uniswap pool executes a swap, it emits a Swap event containing the amounts in and out and the addresses involved. When an ERC-20 token is transferred, it emits a Transfer event. These logs are indexed by topic, where the first topic is the keccak256 hash of the event signature, and subsequent topics are indexed parameters. To decode a log back into human-readable form, you need the contract's ABI, which defines the event signatures and parameter types. Without the ABI, you have a stream of hex-encoded bytes that tells you very little.
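To make this concrete, here is a minimal sketch of decoding an ERC-20 Transfer log by hand, using only the Python standard library. The log dict shape mirrors what a typical JSON-RPC eth_getLogs call returns; the participant addresses in the example are arbitrary illustrative values.

```python
# keccak256("Transfer(address,address,uint256)") -- the well-known topic0
# shared by every compliant ERC-20 Transfer event.
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def decode_transfer(log: dict) -> dict:
    """Decode a Transfer(address,address,uint256) event log."""
    if log["topics"][0] != TRANSFER_TOPIC:
        raise ValueError("not a Transfer event")
    # Indexed address parameters are left-padded to 32 bytes in the topics;
    # the address itself is the last 20 bytes (40 hex characters).
    sender = "0x" + log["topics"][1][-40:]
    recipient = "0x" + log["topics"][2][-40:]
    # The non-indexed uint256 amount lives in the data field.
    amount = int(log["data"], 16)
    return {"from": sender.lower(), "to": recipient.lower(), "value": amount}

log = {
    "topics": [
        TRANSFER_TOPic if False else TRANSFER_TOPIC,
        "0x000000000000000000000000a0b86991c6218b36c1d19d4a2e9eb0ce3606eb48",
        "0x000000000000000000000000dac17f958d2ee523a2206206994597c13d831ec7",
    ],
    "data": "0x0000000000000000000000000000000000000000000000000000000005f5e100",
}
print(decode_transfer(log)["value"])  # 100000000 (raw units, decimals not yet applied)
```

Note that the decoded value is still a raw integer; applying the token's decimal precision is a separate normalization step handled later in the pipeline.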
Beyond logs, there are traces, which represent the internal execution tree of a transaction. A single top-level transaction can trigger dozens of internal calls between contracts, and traces expose all of them. This matters enormously for MEV analysis, liquidation tracking, and any use case where you need to understand not just what happened at the surface level but the full execution path that produced it. Traces are expensive to compute and not always available from standard RPC providers, which is one reason why purpose-built data providers like Alchemy, QuickNode, and Goldsky exist alongside raw node access. The data is public, but accessing it efficiently at scale requires infrastructure investment that most teams are not positioned to build themselves.
The Three-Layer Architecture That Works
The teams building reliable on-chain analytics pipelines in 2025 have largely converged on a three-layer architecture, even if they use different tools at each layer. The first layer is the blockchain data layer, responsible for raw data access and initial extraction. This is where you interact with nodes, RPC providers, or specialized data services to pull blocks, transactions, receipts, logs, and traces. The second layer is the processing and transformation layer, where raw data gets decoded, normalized, enriched with off-chain context, and loaded into a queryable store. The third layer is the analytics and application layer, where dashboards, APIs, alerting systems, and ML models consume the processed data.
The reason this three-layer separation matters is that each layer has fundamentally different scaling characteristics and failure modes. The ingestion layer needs to handle bursty throughput, network interruptions, and chain reorganizations gracefully. A reorg on Ethereum can invalidate several blocks worth of data, and a pipeline that does not handle this correctly will silently corrupt its own state. The transformation layer needs to be idempotent and replayable, because you will inevitably need to reprocess historical data when your schema evolves or when you discover a bug in your decoding logic. The analytics layer needs to support fast ad-hoc queries without putting load back on the ingestion infrastructure.
In practice, most teams use a combination of managed services and custom code to fill these layers. Goldsky and Substreams handle ingestion and streaming for many teams. The Graph Protocol handles structured indexing and query serving. Dune Analytics and Flipside Crypto provide managed SQL environments over pre-indexed chain data. For teams that need more control, a common pattern is to stream raw data into a message queue like Kafka or Redpanda, process it with a streaming framework like Apache Flink or a custom worker pool, and land the results in a columnar store like ClickHouse or BigQuery. The specific tools matter less than the architectural discipline of keeping the layers cleanly separated and independently scalable.
Ingestion: Getting Data Off the Chain
The ingestion layer is where most pipelines encounter their first serious engineering challenges. The naive approach is to poll an RPC endpoint for new blocks and process them sequentially. This works fine for low-volume use cases, but it breaks down quickly when you need to index multiple chains simultaneously, handle high-throughput chains like Solana or Base, or maintain sub-second latency for real-time alerting. The polling model also creates a tight coupling between your pipeline's processing speed and the chain's block production rate, which means any slowdown in your processing logic causes you to fall behind.
The more robust approach is to use a streaming data source that pushes new data to your pipeline rather than requiring you to pull it. Several infrastructure providers now offer WebSocket subscriptions for new blocks and filtered log streams, which eliminates the polling overhead and reduces latency significantly. For teams that need even lower latency or more complex filtering, Substreams, developed by StreamingFast and now part of The Graph ecosystem, provides a Rust-based streaming framework that runs close to the node and can process data at the speed of block production. Substreams modules are composable and cacheable, which means you can build a library of reusable extraction logic rather than rewriting the same decoding patterns for every new project.
Chain reorganizations deserve special attention in any ingestion design. On proof-of-work chains, reorgs of one to two blocks were common. On proof-of-stake Ethereum, deep reorgs are rare but shallow reorgs still occur, and any pipeline that does not account for them will eventually produce incorrect data. The standard approach is to treat blocks below a certain confirmation depth as provisional and only mark them as finalized once they have enough confirmations. For Ethereum, 64 blocks is the commonly used finality threshold for most applications, though the actual finality mechanism is more nuanced. Building reorg handling correctly from the start is significantly easier than retrofitting it into a pipeline that was designed assuming linear block progression.
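The confirmation-depth approach described above can be sketched in a few lines: blocks are held as provisional until they are deep enough to finalize, and a parent-hash mismatch at the tip rolls the buffer back. Block objects here are plain dicts with hypothetical "number", "hash", and "parent_hash" fields standing in for real chain data, and a production system would also need to revert downstream writes derived from orphaned blocks.

```python
CONFIRMATIONS = 64  # common finality depth used for Ethereum applications

class BlockBuffer:
    def __init__(self):
        self.provisional = []  # ordered, oldest first

    def ingest(self, block: dict) -> list:
        """Add a new chain-tip block; return any blocks now safe to finalize."""
        # Detect a reorg: the new block must extend our current tip. Pop any
        # orphaned blocks whose hash no longer matches the new parent.
        while self.provisional and self.provisional[-1]["hash"] != block["parent_hash"]:
            self.provisional.pop()
        self.provisional.append(block)
        finalized = []
        while self.provisional and (
            block["number"] - self.provisional[0]["number"] >= CONFIRMATIONS
        ):
            finalized.append(self.provisional.pop(0))
        return finalized

buf = BlockBuffer()
finalized = []
for n in range(70):
    finalized += buf.ingest({"number": n, "hash": f"h{n}", "parent_hash": f"h{n-1}"})
print([b["number"] for b in finalized])  # [0, 1, 2, 3, 4, 5]
```

The key property is that nothing below the confirmation depth ever reaches the finalized store, so a shallow reorg only ever touches provisional state.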
The Graph Protocol and the Subgraph Model
The Graph Protocol has become the most widely adopted solution for structured on-chain data access, and understanding how it works is essential for anyone building in this space. At its core, The Graph allows developers to define subgraphs, which are declarative specifications of what on-chain data to index and how to transform it. A subgraph definition includes a manifest that specifies which contracts to watch and which events to handle, a GraphQL schema that defines the data model, and a set of AssemblyScript mapping functions that transform raw event data into entities matching the schema.
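As a concrete illustration, a minimal GraphQL schema for indexing swap events might look like the following. The entity and field names here are invented for the example, not taken from any published subgraph.

```graphql
type Swap @entity(immutable: true) {
  id: ID!                # conventionally "<txHash>-<logIndex>" for uniqueness
  pool: Bytes!           # address of the pool contract that emitted the event
  sender: Bytes!
  amountIn: BigDecimal!  # decimal-normalized, not the raw integer
  amountOut: BigDecimal!
  timestamp: BigInt!
}
```

The mapping functions referenced in the manifest are responsible for populating entities of this shape from each decoded event.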
When a subgraph is deployed, The Graph's indexing nodes replay the relevant portion of chain history, applying the mapping functions to each matching event and building up the entity store. Once caught up to the chain tip, the indexer continues processing new blocks in real time. Consumers query the subgraph via a standard GraphQL API, which means the complexity of blockchain data access is entirely abstracted away from the application layer. A frontend developer querying a Uniswap subgraph does not need to know anything about ABI decoding or log topics. They write a GraphQL query and get back structured data.
The tradeoffs of the subgraph model are worth understanding clearly. Subgraphs are excellent for event-driven data, where you know in advance which contracts and events you care about. They are less well-suited for ad-hoc exploratory analysis, where you might want to query across all contracts of a certain type without knowing their addresses in advance. They also have limitations around complex aggregations and time-series queries, which is why many teams use The Graph for application-layer data access while maintaining a separate data warehouse for analytics workloads. The Graph has also retired its centralized hosted service in favor of its fully decentralized network, which introduces considerations around indexer selection, query fees, and data freshness that did not exist in the centralized model.
Transformation: From Raw Bytes to Useful Schema
The transformation layer is where the real data engineering work happens, and it is also where the most technical debt tends to accumulate. Raw blockchain data, even after basic decoding, is not immediately useful for most analytics purposes. Token amounts are stored as raw integers without decimal normalization, addresses are checksummed hex strings that need to be lowercased for consistent joins, timestamps are block numbers that need to be converted to wall-clock time, and prices are absent entirely because blockchains do not have native price oracles.
Normalizing token amounts requires knowing the decimal precision of each token, which means maintaining a registry of token metadata that maps contract addresses to their symbol, name, and decimals fields. This registry needs to be kept current as new tokens are deployed, and it needs to handle edge cases like tokens that change their decimals or tokens with non-standard ERC-20 implementations. Converting block numbers to timestamps requires either storing block timestamps during ingestion or maintaining a block-to-timestamp lookup table. Neither approach is trivial at scale, and the choice has downstream implications for how you handle time-series queries and historical backfills.
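The registry-driven normalization described above reduces to a small lookup plus exact decimal arithmetic. The sketch below uses a hardcoded in-memory registry with well-known mainnet values (USDC at 6 decimals, WETH at 18); a real pipeline would populate this from on-chain decimals() calls or a metadata service.

```python
from decimal import Decimal

TOKEN_REGISTRY = {
    # lowercased address -> (symbol, decimals)
    "0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48": ("USDC", 6),
    "0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2": ("WETH", 18),
}

def normalize_amount(token_address: str, raw_amount: int) -> Decimal:
    """Convert a raw integer token amount to its human-readable value."""
    # Lowercase the checksummed address so registry lookups join consistently.
    symbol, decimals = TOKEN_REGISTRY[token_address.lower()]
    # Decimal avoids the float precision loss that plagues 18-decimal tokens.
    return Decimal(raw_amount) / Decimal(10) ** decimals

print(normalize_amount("0xA0b86991c6218b36c1d19D4a2e9Eb0cE3606eB48", 100_000_000))
# 100 (i.e. 100 USDC)
```

Using Decimal rather than float is not optional here: an 18-decimal token amount can exceed the 53 bits of precision a Python float carries, silently corrupting large balances.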
Enrichment is the process of joining on-chain data with off-chain context to make it more useful. The most common form of enrichment is price data, where you join transaction data with historical price feeds to express token amounts in USD terms. This is essential for any analytics that involves comparing values across different tokens or time periods. Other common enrichment operations include entity labeling, where known addresses like exchange hot wallets, protocol treasuries, and identified whale wallets are tagged with human-readable labels, and protocol classification, where contract addresses are mapped to the protocols they belong to. Nansen has built a significant business around exactly this kind of enriched on-chain data, with a labeled address database covering millions of wallets and contracts across dozens of chains.
Where AI Enters the Analytics Pipeline
The intersection of AI and on-chain data analytics is more substantive than the typical hype cycle suggests, and it operates at several distinct layers of the pipeline. The most mature application is anomaly detection, where machine learning models trained on historical transaction patterns can identify unusual behavior in real time. A rule-based system for detecting wash trading might flag transactions where the same address appears on both sides of a trade within a short time window. An ML-based system can identify wash trading rings that span dozens of addresses, use intermediate hops to obscure the circular flow, and operate across multiple protocols simultaneously, patterns that are essentially invisible to threshold-based rules.
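The structural intuition behind ring detection can be shown with a toy example: treat transfers as a directed graph and search for circular flows that return to their origin within a bounded number of hops. Production systems learn these patterns with ML over temporal graphs rather than enumerating cycles, so this sketch only illustrates why graph structure carries the signal that threshold rules miss.

```python
from collections import defaultdict

def find_cycles(transfers, max_hops=4):
    """Find circular flows (A -> B -> ... -> A) in (sender, recipient) pairs."""
    graph = defaultdict(set)
    for src, dst in transfers:
        graph[src].add(dst)

    cycles = []

    def dfs(origin, node, path):
        if len(path) > max_hops:
            return
        for nxt in graph[node]:
            if nxt == origin and len(path) > 1:
                cycles.append(path + [origin])  # closed the ring
            elif nxt not in path:
                dfs(origin, nxt, path + [nxt])

    for origin in list(graph):
        dfs(origin, origin, [origin])
    return cycles

# A three-hop ring (a -> b -> c -> a) hidden among unrelated transfers:
ring = find_cycles([("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")])
print(ring)  # includes ['a', 'b', 'c', 'a'] plus its rotations
```

Real wash-trading rings add noise edges, time delays, and value splitting across hops, which is exactly why learned models outperform exhaustive structural searches like this one at scale.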
Graph neural networks have shown particular promise for on-chain fraud detection because blockchain transaction data is inherently graph-structured. Each address is a node, each transaction is a directed edge, and the full transaction history of a chain is a massive temporal graph. GNN-based models can learn structural patterns associated with known fraud schemes, Ponzi contracts, rug pulls, and Sybil attack clusters, and apply those patterns to identify new instances with high precision. Research published in 2024 demonstrated that GNN-based classifiers could identify Ponzi scheme contracts on Ethereum with over 97% accuracy using only on-chain behavioral features, without requiring any off-chain signals.
AI is also being applied at the query and exploration layer, where natural language interfaces allow analysts to query on-chain data without writing SQL or GraphQL. Platforms like Dune have begun integrating AI-assisted query generation, where a user can describe what they want to know in plain English and receive a working SQL query against the indexed chain data. This is not a replacement for understanding the underlying data model, but it dramatically lowers the barrier to entry for analysts who are comfortable with data concepts but unfamiliar with the specific schema conventions of blockchain data warehouses. For teams building internal analytics tools, similar capabilities can be embedded directly into dashboards using LLM APIs with appropriate schema context.
Real-Time Monitoring, Fraud Detection, and Compliance
Real-time monitoring is one of the highest-value applications of on-chain analytics pipelines, and it is also one of the most technically demanding. The use case is straightforward: you want to know within seconds when something significant happens on-chain, whether that is a large withdrawal from a protocol, an unusual sequence of transactions that resembles a known attack pattern, or a wallet address associated with a sanctioned entity interacting with your contracts. The challenge is that achieving sub-second latency from chain event to alert requires careful optimization at every layer of the pipeline.
Tenderly has become a widely used platform for this use case, offering a combination of transaction simulation, real-time alerting, and contract monitoring that covers most of what a DeFi protocol needs for operational visibility. Its monitoring infrastructure processes over 2 billion transactions per month across multiple chains, and its alerting system can trigger webhooks within seconds of a matching transaction being included in a block. For teams that need more customization than a managed platform provides, building a custom monitoring stack typically involves a combination of WebSocket subscriptions for real-time log streaming, a stateful stream processing layer for pattern matching, and a notification service for alert delivery.
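The stateful pattern-matching layer of such a custom stack can be as simple as a rule object that consumes decoded events and fires a callback on a match. The event shape and field names below are hypothetical stand-ins for a pipeline's decoded schema, and the notifier would be a webhook poster in production rather than a list append.

```python
LARGE_WITHDRAWAL_THRESHOLD = 1_000_000  # in decimal-normalized USD terms

class WithdrawalMonitor:
    def __init__(self, threshold, notify):
        self.threshold = threshold
        self.notify = notify  # callable; e.g. posts to a webhook in production

    def on_event(self, event: dict):
        """Consume one decoded event; alert if it is a large withdrawal."""
        if event["type"] == "withdrawal" and event["usd_value"] >= self.threshold:
            self.notify(
                f"Large withdrawal: {event['usd_value']:,} USD "
                f"from {event['pool']} in tx {event['tx_hash']}"
            )

alerts = []
monitor = WithdrawalMonitor(LARGE_WITHDRAWAL_THRESHOLD, alerts.append)
monitor.on_event({"type": "withdrawal", "usd_value": 5_000_000,
                  "pool": "0xpool", "tx_hash": "0xabc"})
monitor.on_event({"type": "deposit", "usd_value": 9_000_000,
                  "pool": "0xpool", "tx_hash": "0xdef"})
print(len(alerts))  # 1
```

Keeping rules stateless per-event like this is the easy case; the hard engineering is in rules that must hold state across events, such as detecting an unusual sequence of transactions, which is where a proper stream processing layer earns its place.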
Compliance is an increasingly important driver of on-chain analytics investment, particularly as regulatory frameworks around crypto assets mature in the US, EU, and Asia. The Travel Rule, which requires virtual asset service providers to share sender and recipient information for transactions above certain thresholds, creates a direct need for on-chain data pipelines that can identify transaction counterparties and assess their risk profiles. Chainalysis and Elliptic have built substantial businesses around exactly this use case, offering APIs that accept a wallet address or transaction hash and return a risk score based on the address's transaction history and known associations. Building equivalent capabilities in-house requires a combination of the enriched address labeling discussed earlier and ML models trained on known illicit activity patterns.
The Developer Experience Gap Nobody Talks About
There is a persistent and underappreciated problem with the current state of on-chain data engineering tooling, and it is not about the availability of data or the sophistication of the analytics platforms. The problem is that building and maintaining the code that connects these systems is genuinely tedious, error-prone work that consumes a disproportionate share of developer time. Writing a subgraph from scratch involves generating TypeScript types from an ABI, writing mapping functions that correctly handle every event variant, managing entity relationships in the schema, and debugging indexing failures that often manifest as silent data gaps rather than explicit errors.
The boilerplate problem is particularly acute for teams that need to index multiple protocols or multiple chains. Each new contract requires a new set of ABI-generated types, new mapping functions, and new schema entities. The logic across these mappings is often structurally similar (handling token transfers, updating balances, recording timestamps), but the specific field names and event signatures differ enough that copy-paste reuse is fragile and error-prone. Teams end up maintaining large volumes of nearly identical indexing code, and any change to the underlying schema requires coordinated updates across all of it.
The debugging experience for on-chain data pipelines is also significantly worse than for traditional backend services. When a subgraph produces incorrect data, the typical debugging workflow involves inspecting the raw transaction data on a block explorer, tracing through the mapping logic manually, and comparing the expected output to what actually ended up in the entity store. There are no standard debugging tools that let you step through a mapping function with a specific transaction as input, set breakpoints, and inspect intermediate state. The tooling gap here is real, and it is one of the reasons that experienced on-chain data engineers are scarce and expensive relative to the demand for their skills.
How AI Dev Tools Are Closing the Gap
The emergence of AI-assisted development environments is beginning to address the developer experience problems described above in concrete, measurable ways. The most immediate impact is on the scaffolding and boilerplate problem. Given a contract ABI and a description of the data model you want to produce, an AI dev tool can generate a complete subgraph definition, including the manifest, schema, and mapping functions, in a fraction of the time it would take to write by hand. More importantly, it can do this while applying the conventions and best practices that experienced subgraph developers have accumulated over years of production deployments, things like proper entity ID construction, correct handling of BigDecimal arithmetic for token amounts, and defensive null checking in mapping functions.
Beyond scaffolding, AI dev tools are changing how developers interact with unfamiliar codebases and data schemas. When you are working with a new protocol's subgraph or trying to understand how an existing indexer handles a particular edge case, an AI assistant with context about the codebase can answer questions that would otherwise require reading through hundreds of lines of mapping code. This is particularly valuable in on-chain data engineering because the domain knowledge required to understand what a piece of code is doing is spread across the code itself, the contract ABI, the protocol documentation, and the broader context of how the protocol works. Bringing all of that context together in a single development environment dramatically reduces the cognitive overhead of working in this space.
The impact on debugging is also significant. AI dev tools can analyze a failing mapping function alongside the raw transaction data that triggered the failure and suggest the specific line of code that is likely causing the issue. They can identify when a mapping function is not handling a particular event variant correctly, when an entity relationship is being constructed with the wrong ID format, or when a BigInt conversion is silently truncating a value. These are exactly the kinds of subtle bugs that are hardest to catch through manual code review and most likely to produce silent data corruption rather than explicit errors. Having an AI assistant that can reason about both the code and the on-chain data simultaneously compresses debugging cycles from hours to minutes.
Building the Next Generation of On-Chain Analytics
The trajectory of on-chain data engineering is clear. The infrastructure is maturing, the tooling ecosystem is consolidating around a set of proven patterns, and the demand for production-grade analytics capabilities is growing faster than the supply of engineers who know how to build them. The teams that will build the most valuable analytics products over the next few years are not necessarily the ones with the deepest blockchain expertise in isolation. They are the ones that can move quickly from a data question to a working pipeline, iterate on their data models without accumulating crippling technical debt, and maintain their indexing infrastructure reliably as the protocols they track evolve.
That combination of speed, quality, and maintainability is exactly what AI-assisted development environments are designed to enable. The on-chain data engineering space is a particularly good fit for AI dev tools because so much of the work involves translating between well-defined specifications, ABIs, GraphQL schemas, SQL table definitions, and the code that connects them. These are translation tasks where AI models perform well, and where the cost of errors is high enough that having an AI assistant that can catch mistakes before they reach production is genuinely valuable.
Cheetah AI is built specifically for this kind of work. It understands the conventions of Web3 development, the structure of on-chain data, and the tooling ecosystem that serious blockchain data engineers rely on. If you are building analytics pipelines, indexing protocols, or trying to get more signal out of on-chain data, it is worth seeing what a development environment designed for this domain can do for your team's velocity and the quality of what you ship.
What makes Cheetah AI different from a general-purpose coding assistant in this context is the domain specificity. A generic tool does not reliably understand that a subgraph mapping function needs to handle BigDecimal arithmetic carefully to avoid precision loss, that entity IDs in The Graph should be constructed deterministically from transaction hash and log index to avoid collisions, or that a Solidity event signature can decode incorrectly when the ABI version does not match the deployed contract. These require the kind of accumulated, domain-specific knowledge that takes engineers months to develop through direct experience with production systems.
The on-chain data engineering space is still early enough that the teams investing in better tooling now are building a compounding advantage. Every hour saved on boilerplate is an hour that goes toward data modeling, query optimization, or building the analytics features that actually differentiate a product. Every bug caught before it reaches production is a data quality problem that never has to be investigated and remediated downstream. If your team is serious about building on-chain analytics infrastructure that scales, the development environment you choose is not a peripheral concern. It is part of the foundation.