On-Chain Analytics: dbt as the Transformation Layer
A technical deep-dive into using dbt as the transformation layer for blockchain data pipelines, covering staging models, incremental strategies, data quality testing, and orchestration for production-grade on-chain analytics.



Why Blockchain Data Demands a Dedicated Transformation Layer
TL;DR:
- Blockchain data is append-only, event-driven, and structurally unlike traditional relational data, requiring purpose-built transformation strategies before it becomes analytically useful
- dbt has emerged as the dominant SQL-based transformation layer for on-chain analytics, enabling modular, version-controlled, and testable data models at scale
- Platforms like Dune Analytics adopted dbt in 2020 to power thousands of community-built models, before eventually migrating to SQLMesh as scale exposed dbt's macro-based programming model limitations
- Incremental models in dbt are the primary mechanism for handling blockchain's append-only data structure, but they require careful configuration to avoid reprocessing entire chain histories on every run
- Orchestration tools like Apache Airflow integrate with dbt to schedule and sequence transformation jobs, enabling production-grade pipelines that handle chain reorganizations and late-arriving data
- Data quality testing in dbt, using schema tests and custom SQL assertions, is the primary defense against corrupted or incomplete on-chain data reaching downstream consumers
- AI-assisted development environments are beginning to accelerate the authoring of dbt models for blockchain data, reducing the time to build production-ready transformation layers from weeks to days
The result: dbt is not a perfect fit for every blockchain data problem, but it remains the most practical starting point for teams building scalable, maintainable on-chain analytics infrastructure.
Most teams building on-chain analytics start with the same assumption: that blockchain data is just another data source, and that the same ETL patterns they use for application databases will transfer cleanly. That assumption breaks quickly. Ethereum alone produces a new block roughly every 12 seconds, each carrying a few hundred transactions. Across a full year, that adds up to hundreds of millions of transaction records, each carrying nested event logs, hex-encoded calldata, and references to contract addresses that require separate ABI decoding before they carry any semantic meaning. The raw data is technically complete, but analytically useless without a structured transformation layer sitting between the node and the dashboard.
The transformation layer is where dbt has carved out a dominant position in the on-chain analytics stack. dbt, which stands for data build tool, was originally designed for traditional data warehouse workflows, but its core model, SQL-based transformations organized into a directed acyclic graph with built-in testing and documentation, maps surprisingly well to the problems blockchain data engineers face. The tool does not move data. It transforms data that has already been loaded into a warehouse or query engine, which means it fits naturally into the extract-load-transform pattern that most blockchain data pipelines follow. You extract raw chain data using a node or a provider like Alchemy or QuickNode, load it into a warehouse like BigQuery or Snowflake, and then use dbt to build the clean, queryable models that analysts and applications actually consume.
What makes dbt particularly well-suited to this environment is that it treats SQL as a first-class citizen while adding the software engineering practices that raw SQL scripts lack entirely: dependency management, modular composition, automated testing, and version control integration. For a blockchain data team managing hundreds of models across multiple chains, those properties are not nice-to-haves. They are the difference between a pipeline that survives team turnover and one that collapses the moment the engineer who wrote it leaves.
What Raw Blockchain Data Actually Looks Like
Before you can appreciate what dbt does for on-chain analytics, you need a clear picture of what the raw data looks like when it arrives in your warehouse. Most teams ingest blockchain data through one of three paths: a self-hosted archive node with a custom indexer, a managed data provider like The Graph or Goldsky, or a platform like Dune that exposes decoded chain data through a shared query engine. Regardless of the ingestion path, the raw layer tends to look similar. You have a blocks table with fields like block number, timestamp, gas used, and miner address. You have a transactions table with from and to addresses, value in wei, input data as a hex string, and a receipt status. And you have a logs table, which is where most of the analytical complexity lives.
The logs table is the raw output of every event emitted by every smart contract on the chain. Each row contains a contract address, a topics array where the first element is the keccak256 hash of the event signature, and a data field containing the ABI-encoded event parameters. None of that is human-readable without decoding. A Transfer event on an ERC-20 contract looks like a 32-byte padded hex string until you apply the ABI schema to extract the from address, to address, and amount. At scale, across hundreds of contracts and millions of blocks, this decoding step is one of the most computationally expensive parts of the entire pipeline, and it is typically handled either at ingestion time by a specialized indexer or in the staging layer of your dbt project.
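To make the decoding step concrete, here is a sketch of a staging model that extracts ERC-20 Transfer events from a raw logs table. The source and column names are assumptions, and hex-to-numeric conversion of the amount is warehouse-specific, so it is left as the raw hex string here:

```sql
-- Hypothetical staging model: stg_erc20_transfers.sql
-- Assumes a raw logs source with columns: block_number, transaction_hash,
-- log_index, address, topic0, topic1, topic2, data.
-- topic0 is keccak256("Transfer(address,address,uint256)").

select
    block_number,
    transaction_hash,
    log_index,
    address                               as token_address,
    -- Indexed address params are left-padded to 32 bytes ('0x' + 64 hex chars);
    -- the address is the last 40 hex characters.
    concat('0x', substr(topic1, 27, 40))  as from_address,
    concat('0x', substr(topic2, 27, 40))  as to_address,
    -- Hex-to-numeric conversion is warehouse-specific (e.g. a UDF on BigQuery,
    -- bytearray functions on Trino); kept as raw hex at this layer.
    data                                  as amount_hex
from {{ source('raw', 'logs') }}
where topic0 = '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef'
```

Downstream models then join this against token metadata to apply the correct decimal precision.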
The other structural challenge is that blockchain data is not normalized in the relational sense. A single DeFi transaction might touch five contracts, emit twelve events, and involve three intermediate token transfers, all within a single atomic transaction. Reconstructing the semantic meaning of that transaction, what protocol was used, what the user intended, what the net financial effect was, requires joining across blocks, transactions, logs, and often external reference tables like token metadata or protocol contract registries. That join complexity is exactly where dbt's DAG-based model composition shines, because you can build up the semantic layer incrementally, one model at a time, with each model depending explicitly on the ones below it.
Structuring Your dbt Project for On-Chain Data
The most durable pattern for organizing a blockchain dbt project follows a three-layer architecture that maps closely to the medallion model popularized in the Databricks ecosystem. The first layer is raw sources, which are the tables your ingestion pipeline writes directly. You declare these in dbt using source definitions, which lets you reference them in downstream models without hardcoding warehouse paths and gives you a place to attach freshness checks that alert you when ingestion falls behind. The second layer is staging models, which handle the mechanical transformations: decoding hex values, casting timestamps from Unix epoch integers to proper timestamp types, renaming columns to consistent conventions, and filtering out known bad data like failed transactions that should not appear in financial calculations.
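Declared in YAML, a raw source with freshness checks might look like the following sketch (source, schema, and column names are assumptions; thresholds should match your ingestion SLA):

```yaml
# models/staging/sources.yml (hypothetical)
version: 2

sources:
  - name: raw
    schema: raw_ethereum
    tables:
      - name: blocks
        loaded_at_field: ingested_at
        freshness:
          warn_after: {count: 10, period: minute}
          error_after: {count: 30, period: minute}
      - name: transactions
      - name: logs
```

Running `dbt source freshness` against this definition alerts you when ingestion falls behind, before stale data reaches downstream models.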
The third layer is where the analytical work happens. This is where you build mart models that answer specific business questions: daily active addresses per protocol, cumulative DEX volume by token pair, gas cost distributions by transaction type, or liquidity pool depth over time. These models join across staging tables, apply business logic, and produce the clean, denormalized outputs that BI tools and application APIs consume. The key discipline is keeping business logic out of the staging layer and keeping raw source references out of the mart layer. When those boundaries blur, you end up with the same kind of spaghetti dependency graph that makes traditional SQL warehouses so painful to maintain.
One pattern that works particularly well for multi-chain analytics is using dbt's variable system to parameterize chain-specific logic. Rather than duplicating an entire model for Ethereum and Arbitrum, you define a chain variable and use Jinja templating to swap in the correct source tables and any chain-specific constants, like block time or native token decimals. This approach keeps your model count manageable as you expand to additional chains, though it does introduce the risk of subtle bugs when chain behavior diverges in ways your parameterization did not anticipate. The Dune team, which operates one of the largest open-source dbt model communities in existence with thousands of community-contributed models, ran into exactly this kind of complexity at scale, eventually finding that dbt's macro-based programming model made code reuse difficult and led to what their CTO Mats Olsen described as a naive approach to abstraction.
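A minimal sketch of this parameterization, assuming per-chain source schemas named like `ethereum_raw` and `arbitrum_raw`:

```sql
-- Hypothetical multi-chain staging model, selected at run time with e.g.:
--   dbt run --vars '{chain: arbitrum}'
{% set chain = var('chain', 'ethereum') %}
{% set native_decimals = {'ethereum': 18, 'arbitrum': 18}[chain] %}

select
    '{{ chain }}'  as chain,
    block_number,
    from_address,
    to_address,
    cast(value as numeric) / pow(10, {{ native_decimals }}) as value_native
from {{ source(chain ~ '_raw', 'transactions') }}
```

The chain-specific constants live in one place, but note the caveat from the text: when chain behavior diverges in ways the parameterization did not anticipate, this pattern hides the divergence rather than surfacing it.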
Incremental Models and the Append-Only Problem
The most important dbt concept for blockchain data engineers is the incremental model. Blockchain data is fundamentally append-only: new blocks arrive continuously, old blocks do not change (with the exception of chain reorganizations, which we will address separately), and the total dataset grows without bound. Running a full refresh of a model that covers three years of Ethereum transaction history every time you want updated analytics is not a viable strategy. On a warehouse like BigQuery, a full scan of the raw transactions table for a major chain can process terabytes of data per run, which translates directly into cost and latency that make sub-hourly analytics economically impractical.
Incremental models solve this by persisting the transformed output of previous runs and only processing new source data on subsequent runs. In dbt, you configure this with the incremental materialization strategy and define a filter that selects only rows added since the last run. For blockchain data, the natural filter is block number or block timestamp. You store the maximum block number processed in the previous run, and on the next run you only transform blocks above that watermark. The implementation looks straightforward, but there are several failure modes that catch teams off guard. If your ingestion pipeline delivers blocks out of order, which happens with some provider APIs under load, a simple max block number filter will silently miss blocks that arrive late. The safer pattern is to use a lookback window, reprocessing the last N blocks on every run to catch any gaps, at the cost of some redundant computation.
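The lookback pattern described above can be sketched as follows (source and column names are assumptions; the window size should be tuned to your provider's observed reorder behavior):

```sql
-- Hypothetical incremental model with a lookback window for late-arriving blocks.
{{ config(
    materialized='incremental',
    unique_key='transaction_hash',
    incremental_strategy='merge'
) }}

{% set lookback_blocks = 100 %}

select
    block_number,
    transaction_hash,
    from_address,
    to_address,
    value
from {{ source('raw', 'transactions') }}
{% if is_incremental() %}
-- Reprocess the last N blocks so out-of-order deliveries are not silently
-- skipped; the merge on transaction_hash deduplicates the overlap.
where block_number > (select max(block_number) - {{ lookback_blocks }} from {{ this }})
{% endif %}
```

On the first run, `is_incremental()` is false and the model builds from the full history; on every subsequent run, only the watermark window is scanned.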
The merge strategy for incremental models is also worth thinking through carefully. dbt supports append, merge, and insert-overwrite strategies, and the right choice depends on whether your source data can produce duplicate rows. Most blockchain data providers guarantee exactly-once delivery at the block level, but if you are ingesting from multiple sources or running backfills in parallel, duplicates are possible. Using a merge strategy with a unique key on transaction hash or log index prevents duplicates from accumulating in your mart tables, but merge operations are significantly more expensive than appends on most warehouse engines. For high-volume tables like raw logs, many teams use insert-overwrite on date-partitioned tables as a middle ground, reprocessing a rolling window of recent partitions on each run rather than merging individual rows.
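The insert-overwrite middle ground might look like this sketch, shown with BigQuery-style `partition_by` config (the exact syntax differs by adapter, and the three-day window is an assumption):

```sql
-- Hypothetical high-volume logs model using insert_overwrite on date partitions.
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'block_date', 'data_type': 'date'}
) }}

select
    date(block_timestamp) as block_date,
    *
from {{ source('raw', 'logs') }}
{% if is_incremental() %}
-- Rebuild a rolling window of recent partitions wholesale instead of
-- merging individual rows, which is far cheaper at this volume.
where date(block_timestamp) >= date_sub(current_date(), interval 3 day)
{% endif %}
```

Because whole partitions are replaced, duplicates within the window cannot accumulate, without paying the per-row cost of a merge.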
Handling Chain Reorganizations Without Breaking Your Models
Chain reorganizations, commonly called reorgs, are the blockchain-specific failure mode that has no direct analog in traditional data engineering. A reorg occurs when the canonical chain switches to a different fork, invalidating blocks that were previously considered final. On Ethereum post-merge, reorgs deeper than one or two blocks are rare, but they do happen, and on chains with faster finality assumptions or weaker validator sets, reorgs of ten or more blocks are not uncommon. If your dbt models have already processed and persisted data from a block that gets reorged away, you now have incorrect data in your analytics layer with no automatic mechanism to detect or correct it.
The standard mitigation is to treat recent blocks as provisional and only include them in your incremental models after a configurable confirmation depth. For Ethereum, most production analytics pipelines use a confirmation threshold of somewhere between 12 and 64 blocks, depending on the risk tolerance of the downstream use case. Financial reporting models that feed into accounting systems typically use higher confirmation thresholds than real-time monitoring dashboards. In dbt, you implement this by subtracting the confirmation depth from the maximum block number when defining your incremental filter, which means your models always lag the chain tip by a predictable amount. The tradeoff is latency: a 64-block confirmation depth on Ethereum translates to roughly 13 minutes of lag, which is acceptable for daily analytics but not for applications that need near-real-time data.
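The confirmation-depth filter described above can be sketched as an incremental model like this (source names and the depth of 64 are assumptions to be tuned per chain and use case):

```sql
-- Hypothetical incremental model that lags the chain tip by a confirmation depth.
{{ config(materialized='incremental') }}

{% set confirmation_depth = 64 %}

select *
from {{ source('raw', 'transactions') }}
-- Only include blocks at least `confirmation_depth` behind the highest
-- ingested block, so reorged blocks are never persisted.
where block_number <= (
    select max(block_number) - {{ confirmation_depth }}
    from {{ source('raw', 'blocks') }}
)
{% if is_incremental() %}
  and block_number > (select max(block_number) from {{ this }})
{% endif %}
```

The model then always trails the chain tip by a predictable, documented amount, which is the property the downstream financial reporting depends on.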
For teams that need both safety and low latency, a common pattern is to maintain two separate model layers: a confirmed layer that applies the full confirmation depth and is used for all financial and compliance reporting, and a provisional layer that processes recent unconfirmed blocks for real-time dashboards, with explicit documentation that the provisional data may be revised. dbt's exposure feature, which lets you declare which models are intended for external consumption, is useful here for making the distinction explicit in your project documentation and preventing downstream consumers from accidentally building on provisional data.
Data Quality Testing as a First-Class Concern
One of the most underutilized features of dbt in blockchain analytics contexts is its built-in testing framework. dbt supports two categories of tests: schema tests, which are declarative assertions about column properties like uniqueness, non-null constraints, and referential integrity, and data tests, which are custom SQL queries that return rows when a condition is violated. For blockchain data, both categories are essential, and the failure to invest in them early is one of the most common reasons analytics pipelines lose stakeholder trust over time.
Schema tests on blockchain data catch a surprising number of real problems. A not-null test on transaction hash will surface ingestion gaps where your provider returned incomplete data. A unique test on block number will catch duplicate ingestion runs that doubled up on a range of blocks. An accepted-values test on transaction status will alert you if your ingestion pipeline starts delivering unexpected status codes that your staging models do not handle. These tests run automatically as part of your dbt job and fail loudly, which means problems get caught before they propagate to downstream models and dashboards.
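The schema tests just described are declared in YAML alongside the models (model and column names here are hypothetical):

```yaml
# models/staging/schema.yml (hypothetical)
version: 2

models:
  - name: stg_transactions
    columns:
      - name: transaction_hash
        tests: [not_null, unique]
      - name: status
        tests:
          - accepted_values:
              values: [0, 1]
  - name: stg_blocks
    columns:
      - name: block_number
        tests: [not_null, unique]
```

Each entry compiles to a SQL query that `dbt test` executes, failing the job if any violating rows are found.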
Custom data tests are where you encode blockchain-specific invariants that schema tests cannot express. A well-designed blockchain dbt project will include tests like: the sum of all token transfer amounts for a given contract should never exceed the total supply recorded in the token metadata table; the gas used for any transaction should never exceed the block gas limit for the block it belongs to; the cumulative transaction count for any address should be monotonically increasing when ordered by block number. These tests encode domain knowledge about how blockchains work, and when they fail, they almost always indicate either a bug in your transformation logic or a genuine anomaly in the source data that warrants investigation. Building this test suite takes time upfront, but it pays back that investment every time it catches a silent data corruption before it reaches a production dashboard.
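As one example, the gas-limit invariant from the list above can be written as a singular test, a SQL file in the `tests/` directory that fails if it returns any rows (model and column names are assumptions):

```sql
-- tests/assert_gas_within_block_limit.sql (hypothetical singular test).
-- No transaction should use more gas than the gas limit of the
-- block that contains it; any row returned here is a failure.
select
    t.transaction_hash,
    t.gas_used,
    b.gas_limit
from {{ ref('stg_transactions') }} t
join {{ ref('stg_blocks') }} b
  on t.block_number = b.block_number
where t.gas_used > b.gas_limit
```

When this fires, the cause is almost always an ingestion or decoding bug rather than the chain itself, which is exactly the kind of silent corruption the test suite exists to catch.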
Orchestrating dbt with Airflow for Production Pipelines
Running dbt in development is straightforward: you execute dbt run from the command line and watch your models build. Running dbt in production, on a schedule, with dependency management, retry logic, alerting, and coordination with upstream ingestion jobs, requires an orchestration layer. Apache Airflow is the most common choice for teams building production blockchain data pipelines, though alternatives like Prefect, Dagster, and dbt Cloud's native scheduler are all viable depending on your infrastructure preferences.
The typical Airflow DAG for a blockchain analytics pipeline follows a linear sequence: first, a sensor task waits for new blocks to be available in the raw ingestion layer; then a dbt run task executes the staging models; then a second dbt run task executes the mart models that depend on the staging layer; and finally a dbt test task runs the full test suite against the freshly built models. If any step fails, Airflow's retry and alerting mechanisms handle notification and recovery. The sensor task is particularly important for blockchain pipelines because ingestion latency is variable. Some blocks arrive within seconds of being produced; others, particularly during periods of high network congestion or provider instability, can be delayed by minutes. A time-based schedule that assumes ingestion is always complete by a fixed offset will produce incorrect results during those delay windows.
One pattern that significantly improves the reliability of Airflow-orchestrated dbt pipelines is using dbt's defer feature in combination with a production manifest. When a dbt run fails partway through, defer allows subsequent runs to skip models that already built successfully and only rerun the failed models and their dependents. For a large blockchain dbt project with hundreds of models, this can reduce recovery time from a failed run from hours to minutes. The production manifest approach also enables a useful development workflow where engineers can run only the models they are actively changing against a development warehouse, while deferring all unchanged upstream models to their production counterparts, which avoids the cost and time of rebuilding the entire model graph on every development iteration.
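A recovery run using this pattern might look like the following sketch, assuming production artifacts (including the previous `run_results.json` and `manifest.json`) have been copied to a local path:

```shell
# Hypothetical recovery run: rebuild only the models that failed last time,
# plus their dependents, deferring everything unchanged to production.
dbt run  --select result:error+ --defer --state path/to/prod-artifacts
dbt test --select result:error+ --defer --state path/to/prod-artifacts
```

The `result:error+` selector reads the prior run's results from the state directory, and `--defer` resolves every unselected `ref()` against the production manifest instead of rebuilding it.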
Macros, Jinja, and the Limits of dbt's Abstraction Model
dbt's Jinja templating system is both its most powerful feature and the source of its most significant scaling problems in large blockchain analytics projects. Jinja macros let you define reusable SQL logic that can be called across multiple models, which is essential for blockchain data where the same patterns appear repeatedly: decoding a specific event type, computing a rolling window aggregate, joining against a token metadata reference table. Without macros, you end up copying and pasting the same fifty-line SQL block into dozens of models, which makes maintenance a nightmare when the underlying logic needs to change.
The problem is that Jinja macros are not functions in any conventional programming sense. They are text substitution templates that get expanded at compile time, which means they cannot be unit tested in isolation, they do not have type signatures, and they do not compose cleanly when one macro calls another. In a small project with a handful of macros, this is manageable. In a large project like Dune's, where the community has contributed hundreds of macros covering everything from ERC-20 transfer decoding to cross-chain address normalization, the macro layer becomes a source of subtle bugs and unexpected interactions that are extremely difficult to debug. Mats Olsen's observation that dbt's programming model felt naive at a philosophical level reflects a real tension: the tool was designed for analytics engineering at the scale of a single organization's data warehouse, not for an open-source community building thousands of interdependent models across dozens of chains.
This does not mean dbt is the wrong tool for blockchain analytics. It means that teams need to be deliberate about macro design from the beginning, treating macros as a last resort for genuinely reusable logic rather than a first resort for any SQL that appears more than once. The teams that get the most out of dbt for blockchain data tend to invest heavily in a small, well-documented macro library covering the core decoding and normalization patterns, and then build their mart models using straightforward SQL that references those macros sparingly. The teams that struggle are the ones that reach for macros to solve every abstraction problem and end up with a codebase where understanding any single model requires tracing through four layers of macro expansion.
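In that spirit, a well-scoped macro stays small and single-purpose, like this hypothetical value-normalization helper:

```sql
-- macros/wei_to_native.sql (hypothetical): one narrow, documented job.
-- Converts a raw integer amount to native units; decimals defaults to
-- 18 for ETH-like tokens and can be overridden per token.
{% macro wei_to_native(column_name, decimals=18) %}
    cast({{ column_name }} as numeric) / pow(10, {{ decimals }})
{% endmacro %}
```

A model then calls it as `select {{ wei_to_native('value') }} as value_eth`, and the expansion remains shallow enough to read at a glance, which is the property that four-layer macro towers lose.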
Connecting dbt Models to Downstream Consumers
A transformation layer that produces clean, tested models is only valuable if those models are accessible to the tools and teams that need them. For blockchain analytics, downstream consumers typically fall into three categories: BI tools like Metabase, Tableau, or Hex that analysts use to build dashboards; application APIs that serve on-chain data to frontend interfaces; and data science environments where researchers run ad-hoc queries and build predictive models. Each category has different latency, freshness, and access pattern requirements, and your dbt project structure should reflect those differences.
For BI tools, dbt's exposure feature lets you declare which models are intended for external consumption and document their intended audience and refresh cadence. This is particularly useful in blockchain analytics where the same underlying data might power both a real-time trading dashboard that needs sub-minute freshness and a weekly protocol health report that can tolerate a 24-hour lag. Exposures make those distinctions explicit in your project documentation and help prevent analysts from building dashboards on top of intermediate models that were never intended for direct consumption. Dune's platform, which exposes dbt-modeled blockchain data through a shared query engine, uses a similar concept to distinguish between community-contributed models that are considered stable and those that are still experimental.
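An exposure declaration for the weekly report case might look like this sketch (names, owner, and the referenced model are all hypothetical):

```yaml
# models/exposures.yml (hypothetical)
version: 2

exposures:
  - name: protocol_health_weekly
    type: dashboard
    maturity: high
    owner:
      name: analytics-team
      email: analytics@example.com
    depends_on:
      - ref('fct_daily_protocol_metrics')
    description: >
      Weekly protocol health report; tolerates up to 24 hours of data lag.
      Do not build real-time views on this exposure's upstream models.
```

The exposure appears in the generated documentation site with its full upstream lineage, which is what makes the intended-consumption boundary visible to analysts.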
For application APIs, the key consideration is query performance. dbt models are typically materialized as tables or views in your warehouse, and warehouse query engines are optimized for analytical workloads, not for the low-latency point lookups that application APIs require. Teams that need sub-100-millisecond response times for on-chain data queries typically use dbt to build the transformation layer and then sync the output to a purpose-built serving layer like PostgreSQL, Redis, or a specialized blockchain data API like Sim from Dune. The dbt models define the source of truth; the serving layer is a read-optimized cache of that truth.
Real-Time Pipelines and the Limits of Batch Transformation
The traditional dbt workflow is batch-oriented: you run your models on a schedule, and the freshness of your analytics is bounded by the interval between runs. For many blockchain analytics use cases, hourly or even daily batch runs are perfectly adequate. Protocol revenue reports, governance participation metrics, and long-term holder behavior analysis do not require real-time data. But the blockchain ecosystem increasingly demands analytics that are fresh within seconds, not hours. Liquidation monitoring, MEV detection, real-time portfolio valuation, and on-chain risk management all require data pipelines that can keep pace with the chain itself.
dbt was not designed for streaming or micro-batch workloads, and trying to run dbt on a one-minute schedule to approximate real-time analytics creates more problems than it solves. The overhead of spinning up a dbt run, resolving the model graph, and executing even a small set of incremental models adds latency that makes true real-time analytics impossible. Teams that need genuine real-time on-chain analytics typically use a separate streaming layer, built on tools like Apache Flink, Kafka Streams, or purpose-built blockchain streaming platforms, and reserve dbt for the batch transformation of historical data that feeds longer-horizon analytics. The two layers complement each other: the streaming layer handles recency, and the dbt layer handles depth and correctness.
The emerging pattern in production blockchain data stacks is a lambda architecture where both layers write to the same serving layer, with the batch layer periodically overwriting the streaming layer's output for historical windows where the batch data is more reliable. This is operationally complex, but it is the honest engineering answer to the tension between freshness and correctness in blockchain analytics. dbt's role in this architecture is well-defined: it owns the historical transformation layer, it enforces data quality through its test suite, and it provides the documentation and lineage tracking that makes the overall system auditable.
AI-Assisted Development and the Future of On-Chain Analytics Engineering
The practical challenge of building a production-grade blockchain dbt project is not conceptual. Most experienced data engineers understand the patterns described in this post. The challenge is execution: writing hundreds of SQL models, maintaining a comprehensive test suite, keeping documentation current as protocols evolve, and debugging subtle data quality issues that only manifest at scale. That execution burden is where AI-assisted development tools are beginning to make a measurable difference.
AI code generation in the context of dbt for blockchain data is most useful for the mechanical, pattern-heavy work that consumes a disproportionate share of engineering time. Writing staging models for a new ERC-20 contract follows a predictable template: decode the Transfer and Approval events, cast the token amounts using the correct decimal precision from the token metadata table, join against the blocks table to get timestamps, and apply the standard column naming conventions. An AI assistant that understands your project's conventions and the structure of your existing models can generate a first draft of that staging model in seconds. The engineer's job shifts from writing boilerplate to reviewing, testing, and refining the generated output, which is a much better use of their time.
The more sophisticated application is AI-assisted debugging of data quality failures. When a dbt test fails in a blockchain analytics project, diagnosing the root cause often requires tracing through multiple layers of the model graph, cross-referencing against raw chain data, and understanding the specific behavior of the protocol whose data is failing the assertion. An AI assistant with context about your dbt project structure, your test definitions, and the relevant on-chain data can significantly compress the time from test failure to root cause identification. Rather than spending two hours manually tracing a data lineage issue, an engineer can describe the failing test to their AI assistant, share the relevant model code, and get a focused set of hypotheses to investigate.
Building Smarter Blockchain Data Workflows with Cheetah AI
The patterns covered in this post represent the current state of the art for on-chain analytics engineering, but the tooling landscape is still maturing rapidly. dbt has proven its value as a transformation layer for blockchain data, and the ecosystem of connectors, adapters, and community models continues to grow. At the same time, the limitations of dbt's macro system, its batch-only execution model, and its lack of native blockchain awareness mean that teams building at scale will continue to hit friction points that require custom solutions or migration to more specialized tools.
What changes the calculus significantly is having an AI-powered development environment that understands both the data engineering patterns and the blockchain-specific context that makes on-chain analytics different from conventional warehouse work. Cheetah AI is built for exactly this environment. It understands the structure of dbt projects, the semantics of on-chain data, and the specific patterns that production blockchain analytics teams rely on. Whether you are scaffolding a new dbt project for a chain you have not worked with before, debugging a failing incremental model, or trying to understand why your token transfer volumes diverge from a reference source, having an IDE that brings domain knowledge to the conversation compresses the feedback loop in ways that generic tools simply cannot match. If you are building on-chain analytics infrastructure and want to move faster without accumulating the kind of technical debt that makes pipelines brittle, it is worth seeing what a crypto-native development environment can do for your workflow.
Cheetah AI is designed to be the development environment where those gains compound in your favor. It brings context about your project structure, your chain-specific conventions, and the broader ecosystem of tools you are integrating with, so the assistance you get is grounded in what you are actually building rather than generic suggestions that require significant adaptation. If your team is at the stage of designing a new blockchain analytics stack, or if you are inheriting one that has accumulated years of undocumented macro complexity, Cheetah AI is worth a look.