Nelson Kamga

Indexing onchain data with gRPC streaming

Leveraging gRPC streaming for fast and efficient indexing

indexing · gRPC · tempo

Blockchain indexing is a critical part of developing crypto applications. In the EVM ecosystem, most indexers use JSON-RPC endpoints to fetch, transform and store onchain data in a structured and searchable format. This article explores how gRPC streaming can serve as an efficient and performant alternative for blockchain indexing on the EVM. If you are already familiar with EVM blockchain indexing, you can skip to the gRPC section.

Why index onchain data at all?

Suppose we are building an Ethereum wallet. A functional wallet should display the wallet's balance and a list of transactions linked to that wallet's address. Ethereum nodes expose a JSON-RPC API that allows anyone to interact with the Ethereum blockchain. The JSON-RPC API exposes the eth_getBalance method to get a wallet's ETH balance at a given block (usually the latest block). Getting a wallet's balance is as simple as making the following request:

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
  "jsonrpc": "2.0",
  "method": "eth_getBalance",
  "params": [
    "0xfe3b557e8fb62b89f4916b721be55ceb828dbd73",
    "latest"
  ],
  "id": 1
}' https://eth.llamarpc.com

You can try it here.

On the other hand, fetching a list of transactions for a wallet address is trickier. The closest thing the JSON-RPC API exposes for listing transactions is eth_getBlockByNumber/eth_getBlockByHash with a flag to include each block's full transactions. There is no way to filter transactions by address, because nodes do not index transactions by address. To make matters worse, token activity (e.g. stablecoin and NFT transfers) doesn't show up as wallet transactions at all. On Ethereum, token transfers are performed by calling functions on the token's smart contract. Those function calls emit events that are captured in the transaction's receipt as logs. Logs can be fetched using eth_getLogs, but there's no practical way to query a wallet's full log history: nodes cap the block range per request, so the entire chain would still have to be scanned. Hence, transactions and logs need to be indexed so they can be queried efficiently.

Indexing with JSON-RPC

At a high level, blockchain indexing is a straightforward process. The JSON-RPC API is used to fetch blocks by number with their associated transactions and receipts. That data is then transformed and saved in a database. For example, to index transactions for the wallet application above:

  • Choose a starting block: Starting from the genesis block (block 0) would mean consuming terabytes of data that may never be used. A recent block works as a starting point if the application doesn't handle imported wallets. For historical transaction support, a starting point like block 18,000,000 (August 2023) is a reasonable trade-off.
  • Fetch all blocks from the starting block: Using eth_getBlockByNumber, blocks are fetched one by one with their transactions. To speed things up, blocks can be fetched in parallel batches before transforming and storing them.
  • Transform and store transactions: Once block transactions are fetched, they can be stored in a database with indexes on the from and to fields for querying by address. If only specific addresses matter, transactions can be filtered before storing to save space.
  • Handle chain reorgs: Once the tip of the chain is reached, new blocks are processed as they are produced. Occasionally, two competing blocks are proposed at the same time, resulting in a disagreement on the chain's state and a temporary fork. When this happens, the network's fork-choice rule settles on one branch (on Ethereum, the one with the greatest weight of validator attestations) and drops previously accepted blocks that are not part of it. This is called a chain reorganization, or reorg for short. Reorgs must be handled to avoid storing stale or inaccurate data. Typically this is done by checking that the parent hash of each new block matches the hash of the previously processed block. If it doesn't, a reorg occurred and all data related to the dropped blocks must be discarded.
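
The steps above can be sketched as a small polling loop. This is a minimal illustration, not production code: the in-memory `CHAIN` dict stands in for a node, where a real indexer would call eth_getBlockByNumber over JSON-RPC, and the reorg case simply raises instead of rolling back.

```python
# Minimal sketch of the JSON-RPC polling indexer described above.
WATCHED = {"0xabc"}  # addresses we care about

# Fake chain: block number -> block with hash, parent hash, and transactions.
# A real indexer would fetch these via eth_getBlockByNumber.
CHAIN = {
    100: {"hash": "0xh100", "parent": "0xh099",
          "txs": [{"from": "0xabc", "to": "0xdef"}]},
    101: {"hash": "0xh101", "parent": "0xh100",
          "txs": [{"from": "0xdef", "to": "0xabc"}]},
}

def index_range(start, end, db, last_hash=None):
    """Fetch blocks sequentially, verify parent hashes, store matching txs."""
    for n in range(start, end + 1):
        block = CHAIN[n]  # would be a JSON-RPC call in practice
        # Reorg check: the new block must extend the one we processed last.
        if last_hash is not None and block["parent"] != last_hash:
            raise RuntimeError(f"reorg detected at block {n}")  # roll back here
        for tx in block["txs"]:
            # Filter before storing to save space, as described above.
            if tx["from"] in WATCHED or tx["to"] in WATCHED:
                db.append({"block": n, **tx})
        last_hash = block["hash"]
    return last_hash

db = []
tip = index_range(100, 101, db, last_hash="0xh099")
print(len(db), tip)
```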

Although this works, it's very inefficient. Every block requires an HTTP request to the node, adding roundtrip latency. The node then fetches the relevant data, serializes it to JSON, and sends it over the wire. On the client side, the JSON data is deserialized and stored in a database. Each of these steps carries CPU and network overhead that compounds quickly when processing gigabytes of data. On top of that, most nodes cap both the payload size and the number of concurrent requests, further limiting backfill throughput.

gRPC to the rescue

Blockchain indexing is essentially an ETL pipeline over a stream of immutable temporal events. Through this lens, the polling approach that JSON-RPC forces makes little sense: the data is naturally a stream, and exposing it as one makes indexing far simpler.

This is exactly what gRPC streams enable. Blocks (with their transactions and logs) can be encoded and streamed to clients over a long-lived HTTP/2 connection. This gives several advantages:

  • Reduced bandwidth usage at scale by using Protocol Buffers, which are smaller and faster to serialize than JSON, and their binary format works well with HTTP/2's binary framing.

  • The roundtrip latency of repeated HTTP requests is eliminated as new data is pushed to clients as soon as it's available.

  • We get built-in reorg handling by streaming blocks with a status flag (committed or reorged), so clients can react to reorgs without tracking chain state themselves.

  • HTTP/2 flow control also provides backpressure for free. If the client falls behind during a heavy backfill due to slow database writes, the stream slows down by sending new messages only when the client is ready to receive them instead of buffering without bounds or dropping data.

How does it work?

To test this idea, we built a proof of concept on top of Tempo's Reth node. We chose Tempo because it is an EVM-compatible L1 with sub-second finality (i.e. new blocks are produced and finalized in less than a second). We also leveraged Reth's Execution Extension to get notified when new blocks are committed or reorged.

The gRPC server runs alongside the Tempo node using Reth's spawn_critical_task and it has three RPC methods:

service BlockStream {
  rpc Live(LiveRequest) returns (stream BlockChunk) {}
  rpc Backfill(BackfillRequest) returns (stream BlockChunk) {}
  rpc BackfillToLive(BackfillToLiveRequest) returns (stream BlockChunk) {}
}
 
message LiveRequest {}
 
message BackfillRequest {
  uint64 from = 1;
  uint64 to = 2;
  uint64 size = 3;
}
 
message BackfillToLiveRequest {
  uint64 from = 1;
  uint64 size = 2;
}
 
message BlockChunk {
  repeated Block items = 1;
}
 

Full proto definition file here.

Live

Live streams new blocks to the client as soon as they are executed by the node. Each block carries a status flag (committed or reorged) so clients can handle reorgs without extra bookkeeping.

Under the hood, this is powered by a Reth execution extension. The extension listens for execution notifications from Reth and pushes them into a shared broadcast channel with a capacity of 1. If no consumers are connected, the message is overwritten when the next block arrives. When a client subscribes, it gets its own receiver on the channel and new blocks are forwarded to its stream in real time.
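
The capacity-1 channel semantics can be illustrated with a small latest-only queue. This is a Python stand-in for the Rust broadcast channel described above (the class name and shape are illustrative): an unread block is simply overwritten by the next one, so slow or absent consumers never cause unbounded buffering.

```python
import threading

class LatestOnlyChannel:
    """Holds at most one pending message; a new send overwrites the old one."""

    def __init__(self):
        self._cond = threading.Condition()
        self._item = None
        self._has_item = False

    def send(self, item):
        with self._cond:
            self._item = item  # overwrite any unread message
            self._has_item = True
            self._cond.notify_all()

    def recv(self, timeout=None):
        """Return the latest unread message, or None on timeout."""
        with self._cond:
            if not self._has_item:
                self._cond.wait(timeout)
            if not self._has_item:
                return None
            self._has_item = False
            return self._item

ch = LatestOnlyChannel()
ch.send({"number": 100, "status": "committed"})
ch.send({"number": 101, "status": "committed"})  # block 100 is overwritten
print(ch.recv())  # only the latest block is delivered
```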

Backfill

Backfill streams blocks from a specified range in batches. Block data is read directly from the node's data directory using the Reth provider in parallel batches and streamed to the client. Although Reth execution extensions provide backfill functionality, that path re-executes every transaction in the range, making it significantly slower. Reading from the data directory directly avoids this overhead entirely.

BackfillToLive

BackfillToLive combines both methods: it backfills from a starting block to the chain tip, then transitions to streaming new blocks as they are executed.

It works by listening to execution notifications from the Reth ExEx broadcast channel in a loop. On each notification:

  1. If the notification contains a reverted chain that overlaps with the current cursor, the reorged blocks are pushed to the client and the cursor is rolled back.
  2. If the committed chain starts exactly at the cursor, those blocks are pushed to the client and the stream switches to live mode.
  3. If the chain tip is ahead of the cursor, a backfill is run from the cursor to the tip, and the cursor advances to continue from where it left off.
  4. If the chain tip is still behind the cursor, the notification is skipped.

Benchmarks

To put this to the test, a backfill was run for blocks 9,000,000 to 10,000,000 on the Tempo testnet. The client ran on a 4 vCPU, 16 GB RAM server in the same region as the node, connected over a 10 Gbps link.

| Metric       | Rows      | Size        |
|--------------|-----------|-------------|
| Blocks       | 1,000,000 | 158.22 MiB  |
| Transactions | 36.04M    | 13.70 GiB   |
| Logs         | 96.13M    | 21.28 GiB   |
| Total        |           | ~35.14 GiB  |
| Total time   |           | 293s        |

That works out to roughly 3.4K blocks, 123K transactions, and 328K logs per second, or about 122 MiB/s of uncompressed data on modest hardware.
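
The headline rates are just the benchmark totals divided by the 293 s wall-clock time; a quick sanity check:

```python
# Back out the per-second rates quoted above from the benchmark totals.
blocks, txs, logs = 1_000_000, 36_040_000, 96_130_000
total_gib, seconds = 35.14, 293

print(f"{blocks / seconds:,.0f} blocks/s")        # ~3.4K
print(f"{txs / seconds:,.0f} transactions/s")     # ~123K
print(f"{logs / seconds:,.0f} logs/s")            # ~328K
print(f"{total_gib * 1024 / seconds:.1f} MiB/s")  # ~122 MiB/s
```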

The same backfill was also run from a machine in Europe against the node in Canada to test performance over a cross-region network. The client was an AMD Ryzen PRO 3600 (6C/12T, 3.6 GHz) with 32 GB RAM and a 500 Mbps connection. Despite the transatlantic latency, the backfill completed in ~548s (roughly 1.9x slower). Most of the difference is attributable to the lower bandwidth and added network round trips.

Pushing this further

There are a few ways to improve this setup and some ideas worth exploring:

  • Most EVM indexing workflows boil down to ingesting, filtering, and storing logs. Instead of streaming full blocks with their transactions and logs, a dedicated RPC method for filtering and streaming logs directly would increase throughput (logs are cheaper to read from disk) and reduce bandwidth by skipping data the client doesn't need.

  • On high-throughput payment chains (like Tempo and Arc), streaming balance changes directly could be useful. That said, ERC-20 balance changes are trivial to reconstruct from Transfer event logs on the client side.

gRPC streaming turns blockchain indexing from a slow polling process into a fast, push-based pipeline. Fetching data is only half the equation though. Our next article will explore how ClickHouse can be leveraged for efficient storage and performant querying.

The code for the proof of concept is available on GitHub.