feat(storage): diff-layer state storage with an in-memory LRU cache by MegaRedHand · Pull Request #444 · lambdaclass/ethlambda

MegaRedHand · 2026-06-18T18:53:22Z

Closes #238

Summary

Replaces aggressive state pruning with a diff-layer storage model plus an in-memory cache, so the full state history stays available cheaply and no state is ever pruned. Builds on the block-signature pruning from #453 (merged to main): keeping block headers/bodies forever is what lets historical states be reconstructed.

State storage

Every non-genesis state is stored as a parent-linked StateDiff (StateDiffs table, never pruned), preserving full state history.
Full-state snapshots (States table) are written only at anchors: the bootstrap state plus one per SNAPSHOT_ANCHOR_INTERVAL (1024-slot) window crossing. Also never pruned. There is no separate anchor table; an anchor is just a state whose slot crosses a 1024 boundary relative to its parent.
StateDiff stores slot, both checkpoints, and the justification fields in full, plus the appended historical_block_hashes tail. config/validators come from the nearest ancestor snapshot (they never change); latest_block_header is read back from BlockHeaders (the stored state caches the real state_root there, so it matches byte-for-byte). These assumptions are documented on StateDiff::from_base.
get_state resolves in three tiers: (1) in-memory LRU cache, (2) a full snapshot in States, else (3) reconstruct by walking base_root back to the nearest ancestor snapshot and replaying appended tails forward. Results are memoized in the cache (STATE_CACHE_CAPACITY = 32) so recent/repeated reads stay hot.
DiffBase captures the parent (root, hbh_len, slot) before it is consumed into the post-state; StateDiff/DiffBase live in the storage crate.

Why this over snapshot eviction

The earlier approach pruned old snapshots on a hot window and scanned the whole States table to do it. Anchors are kept forever and spread across all slots, so a slot||root-keyed range scan still touches every anchor: bounding the scan doesn't help. Keeping diffs-only (never pruned) plus an LRU for hot reads removes the periodic scan entirely and keeps full history, at the cost of on-demand reconstruction for cold, non-anchor states.

Tests

StateDiff build/SSZ round-trip; state reconstruction (single-diff and multi-diff) verified via a cold store with an empty cache so the diff walk is actually exercised; snapshot-only-on-boundary-crossing; bootstrap-anchor snapshot.
Storage + blockchain suites green; clippy -D warnings clean.

Validation

Local 4-node devnet, ~52-slot soak: all nodes finalize and stay byte-identical (same head/justified/finalized), aggregated attestations packed, zero reconstruction errors / panics.
A >1024-slot soak is running to also exercise the second-anchor snapshot write and anchor-to-anchor reconstruction (the short soak only crosses the bootstrap anchor).
Migration: existing DBs keep their full states as snapshots; diffs start accruing going forward (no backfill).

… forever (#453) ## Summary Relaxes block pruning: instead of deleting old blocks wholesale, keep block headers and bodies **forever** and prune only the signatures of old finalized blocks. This preserves the full block history (for debugging, re-org safety, and state reconstruction) while still reclaiming the heavy signature data (~3KB+ per block). ## Change - Replaces `prune_old_blocks` (deleted headers, bodies, and signatures beyond a fixed `BLOCKS_TO_KEEP` window) with `prune_old_block_signatures(finalized_slot, tip_slot)`. - Policy, with `cutoff = tip_slot - SIGNATURE_PRUNING_RANGE`: - **healthy finality** (`cutoff <= finalized_slot`): delete signatures for `slot < cutoff` (entirely within finalized history); - **deep non-finality** (non-finalized range exceeds the window): prune nothing, so non-finalized signatures are never touched. - `BlockHeaders` and `BlockBodies` are kept forever; all non-finalized signatures are always retained. - `get_signed_block` returns `None` when a signature is absent, now including a pruned finalized block (deep historical signed-block serving via BlocksByRoot is lost; peers use checkpoint sync). - `prune_old_data` derives the tip slot from the head header and runs signature pruning alongside state pruning. ## Key layout (slot-ordered pruning) - `BlockSignatures` is now keyed by **`slot||root`** (big-endian slot), reusing the shared `encode_slot_root_key` codec also used by `LiveChain`. - Pruning iterates in slot order and **stops at the first entry past the cutoff** (`take_while`), and no longer does a per-row `BlockHeaders` lookup to recover the slot. Every read site already has the slot: `get_signed_block` loads the header first. - `InMemoryBackend::prefix_iterator` now returns keys in sorted order to match the RocksDB backend, which these slot-ordered range scans rely on. ## Migration Changes the `BlockSignatures` key format: existing root-keyed entries are not read after upgrade (old finalized blocks read as pruned). Fine for fresh / checkpoint-synced nodes; no backfill. ## Tests - `prune_signatures_keeps_recent_window_when_finality_healthy`, `prune_signatures_noop_when_non_finalized_range_exceeds_window`, `prune_signatures_noop_when_tip_within_window`. - Storage lib suite green (42 tests); clippy `-D warnings` clean. ## Context Split out of #444 (diff-layer state storage), which depends on this: reconstructing historical states needs block headers retained. #444 stacks on top of this PR. Co-authored-by: Pablo Deymonnaz <pdeymon@fi.uba.ar>

Store every non-genesis state as a parent-linked StateDiff (StateDiffs, never pruned) plus a full snapshot (States) written only at 1024-slot anchors (and the bootstrap). Neither table is ever pruned, so the full state history is preserved cheaply. - get_state returns an anchor snapshot directly, else reconstructs by walking base_root back to the nearest anchor and replaying the appended historical_block_hashes tails; config/validators come from the snapshot and latest_block_header from BlockHeaders. - Reconstructed and freshly imported states are memoized in an in-memory LRU (STATE_CACHE_CAPACITY), keyed by block root. States are immutable per root, so the cache never needs invalidation; it keeps recent reads (e.g. a child block's parent state) hot without reconstruction. - DiffBase captures the parent (root, hbh_len, slot) before it is consumed into the post-state. StateDiff/DiffBase live in the storage crate. - No snapshot eviction and no StateAnchors table: anchors are simply the snapshots in States, so the prune-states scan is gone entirely.

…ktree-feat+state-diff-layers

The migrated test built its DiffBase from the target post-state itself (DiffBase::from_state(a, &head_state)), so base.slot/base.hbh_len came from the target rather than the parent. That made the anchor-boundary check always false (no snapshot written, contradicting the comment) and left the diff self-referential, passing only via the cache memoization. Diff against the genesis anchor already present in the store instead, so the base correctly describes the parent state.

The reconstruction and boundary tests built DiffBase struct literals directly from the crate-internal fields instead of the public DiffBase::from_state constructor. Use the constructor so the tests exercise the real construction path (and would catch a regression in from_state itself).

github-actions · 2026-06-24T20:23:08Z

🤖 Kimi Code Review

Overall this is a well-architected optimization that significantly reduces storage overhead by diffing states against their parents rather than storing full copies. The logic appears correct and the tests are comprehensive.

Critical Observations

Fragile panic in reconstruction (crates/storage/src/state_diff.rs:182)
The reconstruct function panics if diffs is empty. While reconstruct_state currently ensures this is unreachable (it only calls reconstruct after pushing at least one diff), this invariant is fragile. Consider returning Option<State> or asserting with a detailed message linking to the invariant.
N+1 query pattern in reconstruction (crates/storage/src/store.rs:1169-1180)
reconstruct_state performs one DB read per diff in the chain (up to 1024 reads for a full anchor interval). Each get_ssz creates a new read view. For performance, consider opening a single read transaction for the entire walk, though this is acceptable for the current milestone.
Cache lock held during DB operations (crates/storage/src/store.rs:1139, 1146)
The LRU cache is locked separately for get and put operations. While the window is small, between releasing the lock after the miss and acquiring it for the insert, another thread could reconstruct the same state. Consider restructuring to minimize lock contention, though this is not a correctness issue.

Code Quality & Safety

Correctness: The diff logic correctly assumes append-only historical_block_hashes and immutable config/validators. The boundary check for anchor snapshots (slot / INTERVAL > base.slot / INTERVAL) is correct.
SSZ Safety: HistoricalBlockHashesTail properly uses SszList with the same limit as the full list, ensuring encoding consistency.
Memory Safety: The DiffBase capture before state consumption (line 559 in blockchain/src/store.rs) correctly avoids cloning the multi-MB historical hashes list.
Error Handling: The transition from insert_state to insert_state_with_diff in tests properly uses the genesis snapshot as the diff base.

Minor Suggestions

Documentation: Add a comment in reconstruct_state explaining why diffs is never empty when reaching the reconstruct call to prevent future refactors from breaking the invariant.
Metrics: Consider adding a counter for cache hits vs. reconstructions to monitor performance in production.

Consensus Security

The reconstruction logic preserves all consensus-critical fields (latest_justified, latest_finalized, justification fields) verbatim from the diff. The latest_block_header is retrieved from the BlockHeaders table, which preserves the state_root that was computed when the block was imported. This maintains the cryptographic link between the reconstructed state and the block root.

Verdict: Approved with minor defensive programming suggestions.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

…ktree-feat+state-diff-layers

github-actions · 2026-06-24T20:24:12Z

🤖 Codex Code Review

Medium: crates/storage/src/state_diff.rs and crates/storage/src/store.rs make the diff format depend on “parent historical_block_hashes is an exact prefix of child” but only validate len >= base_hbh_len. If that invariant is ever broken by a future STF change or bad DB repair, reconstruction will silently fabricate the wrong state; for one subset of failures it will instead panic on assert! during block import. In consensus code, this should be a fallible check with enough base metadata to validate the prefix/root relationship explicitly.
Medium: crates/storage/src/store.rs, crates/storage/src/store.rs, and crates/storage/src/store.rs add a fairly expensive cold path. reconstruct_state() opens a fresh begin_read() for every diff hop, and get_state() / insert_state_with_diff() deep-clone full States into and out of the LRU. In the worst case that is roughly 1024 diff hops, so one cache miss turns into thousands of DB-read creations plus multiple multi-MB copies. Reusing a single read view for reconstruction and caching Arc<State>-style values would avoid a noticeable restart/RPC regression.

I did not run the test suite successfully here: cargo test hit read-only rustup/cargo home paths in this environment before dependency resolution completed.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

The table dropped the 4 stale rows (LatestKnown/LatestNewAttestations, GossipSignatures, AggregatedPayloads) that no longer exist as persisted tables: attestations and gossip signatures are now in-memory Store buffers. Also fixes the BlockSignatures key (slot||root, not H256).

greptile-apps · 2026-06-24T20:27:11Z

Greptile Summary

This PR replaces aggressive state pruning with a diff-layer storage model: every non-genesis post-state is persisted as a parent-linked StateDiff (never pruned), while full State snapshots are written only at 1024-slot anchor boundaries. An in-memory LRU cache memoises recently accessed states so the hot path (parent state of the next block) avoids any reconstruction.

StateDiff captures slot, checkpoints, justification fields, and the appended historical_block_hashes tail; config/validators are omitted (taken from the nearest ancestor snapshot); latest_block_header is read from BlockHeaders.
get_state resolves via (1) LRU cache, (2) States snapshot, or (3) diff-chain walk + reconstruction, memoising the result on every miss.
prune_old_states is removed; prune_old_data now only prunes finalized block signatures.

Confidence Score: 4/5

The core diff-layer write and reconstruction paths are correct and the tests cover single-diff, multi-diff, anchor-crossing, and bootstrap-snapshot cases with a cold store. No state is silently lost or incorrectly reconstructed under normal operation.

The doc comment on prune_old_data still describes snapshot eviction that the new design abandoned entirely, which could mislead future maintainers. Anchor states also get a redundant StateDiffs row that is never read back. The backward walk in reconstruct_state ignores the LRU cache for intermediate steps and lacks any depth bound, so DB corruption could cause an indefinite hang. None of these are correctness issues under well-formed data, but the documentation inaccuracy and unbounded walk are worth addressing before the second anchor soak completes.

crates/storage/src/store.rs — specifically the prune_old_data doc comment, insert_state_with_diff (redundant anchor diff), and the reconstruct_state backward walk loop.

Important Files Changed

Filename	Overview
crates/storage/src/state_diff.rs	New file implementing StateDiff/DiffBase types, SSZ-encode/decode, and the reconstruct() function for replaying diffs onto a snapshot. Well-documented with explicit invariant contracts.
crates/storage/src/store.rs	Core storage changes: adds LRU state cache, replaces insert_state with insert_state_with_diff (snapshot-on-anchor), adds reconstruct_state for diff-chain replay, removes prune_old_states. Doc comment on prune_old_data misleadingly describes snapshot eviction that no longer happens; anchor states get a redundant StateDiffs entry; the backward walk has no cycle-detection/depth guard.
crates/blockchain/src/store.rs	Captures DiffBase before parent is consumed into post-state, then calls insert_state_with_diff. Test updated to use the new diff API correctly.
crates/storage/src/api/tables.rs	Adds StateDiffs table variant and updates ALL_TABLES, name(), plus updates rocksdb.rs to delegate cf_name to table.name() for single source of truth.
crates/storage/src/lib.rs	Adds state_diff module and re-exports DiffBase for callers.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant Store
    participant LRU as LRU Cache
    participant States as States Table (DB)
    participant Diffs as StateDiffs Table (DB)
    participant Headers as BlockHeaders Table (DB)

    Note over Caller,Headers: get_state(root)
    Caller->>Store: get_state(root)
    Store->>LRU: get(root)
    alt Cache hit
        LRU-->>Store: State
        Store-->>Caller: Some(State)
    else Cache miss
        Store->>States: get(root)
        alt Snapshot found
            States-->>Store: State (full snapshot)
        else Not an anchor
            Store->>Store: reconstruct_state(root)
            loop walk back until snapshot found
                Store->>States: get(cursor)
                alt Snapshot found
                    States-->>Store: snapshot State
                else
                    Store->>Diffs: get(cursor)
                    Diffs-->>Store: StateDiff
                    Note over Store: cursor = diff.base_root
                end
            end
            Store->>Headers: get(root)
            Headers-->>Store: latest_block_header
            Store->>Store: reconstruct(snapshot, diffs[], header)
        end
        Store->>LRU: put(root, state)
        Store-->>Caller: Some(State)
    end

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Caller
    participant Store
    participant LRU as LRU Cache
    participant States as States Table (DB)
    participant Diffs as StateDiffs Table (DB)
    participant Headers as BlockHeaders Table (DB)

    Note over Caller,Headers: get_state(root)
    Caller->>Store: get_state(root)
    Store->>LRU: get(root)
    alt Cache hit
        LRU-->>Store: State
        Store-->>Caller: Some(State)
    else Cache miss
        Store->>States: get(root)
        alt Snapshot found
            States-->>Store: State (full snapshot)
        else Not an anchor
            Store->>Store: reconstruct_state(root)
            loop walk back until snapshot found
                Store->>States: get(cursor)
                alt Snapshot found
                    States-->>Store: snapshot State
                else
                    Store->>Diffs: get(cursor)
                    Diffs-->>Store: StateDiff
                    Note over Store: cursor = diff.base_root
                end
            end
            Store->>Headers: get(root)
            Headers-->>Store: latest_block_header
            Store->>Store: reconstruct(snapshot, diffs[], header)
        end
        Store->>LRU: put(root, state)
        Store-->>Caller: Some(State)
    end

Prompt To Fix All With AI

Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
crates/storage/src/store.rs:848-852
The doc comment says "evicting old state snapshots" but the new design never prunes snapshots — anchors are kept forever. Only block signatures are removed. The second sentence ("only full-state snapshots outside the hot window … are removed") reinforces the misconception. This will mislead future readers who rely on this doc to understand storage bounds.

```suggestion
    /// Prune finalized block signatures to keep signature storage bounded.
    ///
    /// State diffs, block headers, block bodies, and full-state snapshots are
    /// all retained for the full history and are never pruned. Only signatures
    /// of finalized blocks older than the pruning window are removed.
```

### Issue 2 of 4
crates/storage/src/store.rs:1213-1235
**Redundant `StateDiffs` entry written for every anchor state**

`insert_state_with_diff` always writes a `StateDiffs` row, even when `is_anchor` is true and a full `States` snapshot is also written. In `reconstruct_state` the loop checks `States[cursor]` before `StateDiffs[cursor]`, so when reconstruction reaches an anchor it breaks immediately on the snapshot — the diff row is never read. Over a long-running chain each anchor accumulates a useless diff entry (anchor-slot states get two on-disk representations).

### Issue 3 of 4
crates/storage/src/store.rs:1164-1189
**`reconstruct_state` backward walk skips the LRU cache for intermediate states**

The loop only checks `States` (the DB snapshot table) to decide whether to stop. If an intermediate state (e.g. the direct parent, already imported a moment ago) is present in the LRU cache, the walk continues all the way back to the nearest DB snapshot anyway, doing up to `SNAPSHOT_ANCHOR_INTERVAL` extra DB reads. Checking `self.state_cache.lock().unwrap().get(&cursor)` at the top of each iteration would short-circuit the walk and use the cached state as the reconstruction base, which is safe because states are immutable per root.

### Issue 4 of 4
crates/storage/src/store.rs:1164-1175
**Unbounded backward walk with no cycle detection or depth limit**

The `loop` in `reconstruct_state` terminates only when it finds a snapshot in `States` or a missing diff. With DB corruption (e.g. a `StateDiff` whose `base_root` points to itself or forms a multi-hop cycle), the loop runs indefinitely and hangs the calling thread. A simple guard — capping iterations at `SNAPSHOT_ANCHOR_INTERVAL + 1` and returning `None` if exceeded — would bound the worst case and surface corruption as a clean error rather than a hang.

_{Reviews (1): Last reviewed commit: "docs: correct storage tables list to mat..." | Re-trigger Greptile}

greptile-apps · 2026-06-24T20:27:16Z

+    /// Bound storage by evicting old state snapshots and finalized signatures.
+    ///
+    /// State diffs, block headers, and block bodies are retained for the full
+    /// history; only full-state snapshots outside the hot window (diffs remain)
+    /// and signatures of finalized blocks are removed.


The doc comment says "evicting old state snapshots" but the new design never prunes snapshots — anchors are kept forever. Only block signatures are removed. The second sentence ("only full-state snapshots outside the hot window … are removed") reinforces the misconception. This will mislead future readers who rely on this doc to understand storage bounds.

Suggested change

/// Bound storage by evicting old state snapshots and finalized signatures.

///

/// State diffs, block headers, and block bodies are retained for the full

/// history; only full-state snapshots outside the hot window (diffs remain)

/// and signatures of finalized blocks are removed.

/// Prune finalized block signatures to keep signature storage bounded.

///

/// State diffs, block headers, block bodies, and full-state snapshots are

/// all retained for the full history and are never pruned. Only signatures

/// of finalized blocks older than the pruning window are removed.

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/storage/src/store.rs Line: 848-852 Comment: The doc comment says "evicting old state snapshots" but the new design never prunes snapshots — anchors are kept forever. Only block signatures are removed. The second sentence ("only full-state snapshots outside the hot window … are removed") reinforces the misconception. This will mislead future readers who rely on this doc to understand storage bounds. ```suggestion /// Prune finalized block signatures to keep signature storage bounded. /// /// State diffs, block headers, block bodies, and full-state snapshots are /// all retained for the full history and are never pruned. Only signatures /// of finalized blocks older than the pruning window are removed. ``` How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-06-24T20:27:18Z

+    pub fn insert_state_with_diff(&mut self, root: H256, base: DiffBase, state: State) {
+        let slot = state.slot;
+        let is_anchor = slot / SNAPSHOT_ANCHOR_INTERVAL > base.slot / SNAPSHOT_ANCHOR_INTERVAL;
+
+        // Snapshot only at anchors; serialize before `state` is consumed.
+        let snapshot_bytes = is_anchor.then(|| state.to_ssz());
+        // Memoize the post-state for fast reads, then move it into the diff so
+        // its multi-MB justification fields are not cloned again.
+        self.state_cache.lock().unwrap().put(root, state.clone());
+        let diff_bytes = StateDiff::from_base(base.root, base.hbh_len, state).to_ssz();
+
+        let key = root.to_ssz();
        let mut batch = self.backend.begin_write().expect("write batch");
-        let entries = vec![(root.to_ssz(), state.to_ssz())];
-        batch.put_batch(Table::States, entries).expect("put state");
+        batch
+            .put_batch(Table::StateDiffs, vec![(key.clone(), diff_bytes)])
+            .expect("put state diff");
+        if let Some(snapshot_bytes) = snapshot_bytes {
+            batch
+                .put_batch(Table::States, vec![(key, snapshot_bytes)])
+                .expect("put state snapshot");
+        }
        batch.commit().expect("commit");
    }


Redundant StateDiffs entry written for every anchor state

insert_state_with_diff always writes a StateDiffs row, even when is_anchor is true and a full States snapshot is also written. In reconstruct_state the loop checks States[cursor] before StateDiffs[cursor], so when reconstruction reaches an anchor it breaks immediately on the snapshot — the diff row is never read. Over a long-running chain each anchor accumulates a useless diff entry (anchor-slot states get two on-disk representations).

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/storage/src/store.rs Line: 1213-1235 Comment: **Redundant `StateDiffs` entry written for every anchor state** `insert_state_with_diff` always writes a `StateDiffs` row, even when `is_anchor` is true and a full `States` snapshot is also written. In `reconstruct_state` the loop checks `States[cursor]` before `StateDiffs[cursor]`, so when reconstruction reaches an anchor it breaks immediately on the snapshot — the diff row is never read. Over a long-running chain each anchor accumulates a useless diff entry (anchor-slot states get two on-disk representations). How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-24T20:27:18Z

+    fn reconstruct_state(&self, root: &H256) -> Option<State> {
+        // Walk back collecting diffs until we reach a snapshot.
+        let mut diffs: Vec<StateDiff> = Vec::new();
+        let mut cursor = *root;
+        let snapshot = loop {
+            if let Some(snapshot) = self.get_ssz::<State>(Table::States, &cursor) {
+                break snapshot;
+            }
+            let diff = self.get_ssz::<StateDiff>(Table::StateDiffs, &cursor)?;
+            cursor = diff.base_root;
+            diffs.push(diff);
+        };
+
+        // `diffs` runs target -> snapshot child; reverse to snapshot child -> target.
+        diffs.reverse();
+
+        // The latest block header lives in BlockHeaders; the stored state caches
+        // the real state_root there, so it equals the header byte-for-byte.
+        let latest_block_header = self.get_block_header(root)?;
+
+        Some(crate::state_diff::reconstruct(
+            snapshot,
+            &diffs,
+            latest_block_header,
+        ))
+    }


reconstruct_state backward walk skips the LRU cache for intermediate states

The loop only checks States (the DB snapshot table) to decide whether to stop. If an intermediate state (e.g. the direct parent, already imported a moment ago) is present in the LRU cache, the walk continues all the way back to the nearest DB snapshot anyway, doing up to SNAPSHOT_ANCHOR_INTERVAL extra DB reads. Checking self.state_cache.lock().unwrap().get(&cursor) at the top of each iteration would short-circuit the walk and use the cached state as the reconstruction base, which is safe because states are immutable per root.

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/storage/src/store.rs Line: 1164-1189 Comment: **`reconstruct_state` backward walk skips the LRU cache for intermediate states** The loop only checks `States` (the DB snapshot table) to decide whether to stop. If an intermediate state (e.g. the direct parent, already imported a moment ago) is present in the LRU cache, the walk continues all the way back to the nearest DB snapshot anyway, doing up to `SNAPSHOT_ANCHOR_INTERVAL` extra DB reads. Checking `self.state_cache.lock().unwrap().get(&cursor)` at the top of each iteration would short-circuit the walk and use the cached state as the reconstruction base, which is safe because states are immutable per root. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-24T20:27:19Z

+    fn reconstruct_state(&self, root: &H256) -> Option<State> {
+        // Walk back collecting diffs until we reach a snapshot.
+        let mut diffs: Vec<StateDiff> = Vec::new();
+        let mut cursor = *root;
+        let snapshot = loop {
+            if let Some(snapshot) = self.get_ssz::<State>(Table::States, &cursor) {
+                break snapshot;
+            }
+            let diff = self.get_ssz::<StateDiff>(Table::StateDiffs, &cursor)?;
+            cursor = diff.base_root;
+            diffs.push(diff);
+        };


Unbounded backward walk with no cycle detection or depth limit

The loop in reconstruct_state terminates only when it finds a snapshot in States or a missing diff. With DB corruption (e.g. a StateDiff whose base_root points to itself or forms a multi-hop cycle), the loop runs indefinitely and hangs the calling thread. A simple guard — capping iterations at SNAPSHOT_ANCHOR_INTERVAL + 1 and returning None if exceeded — would bound the worst case and surface corruption as a clean error rather than a hang.

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/storage/src/store.rs Line: 1164-1175 Comment: **Unbounded backward walk with no cycle detection or depth limit** The `loop` in `reconstruct_state` terminates only when it finds a snapshot in `States` or a missing diff. With DB corruption (e.g. a `StateDiff` whose `base_root` points to itself or forms a multi-hop cycle), the loop runs indefinitely and hangs the calling thread. A simple guard — capping iterations at `SNAPSHOT_ANCHOR_INTERVAL + 1` and returning `None` if exceeded — would bound the worst case and surface corruption as a clean error rather than a hang. How can I resolve this? If you propose a fix, please make it concise.

github-actions · 2026-06-24T20:41:09Z

🤖 Claude Code Review

Now I have everything to write the final review.

PR Review: feat(storage): diff-layer state storage with an in-memory LRU cache

This PR replaces time-windowed state pruning with a parent-linked diff-layer model: every non-genesis block gets a StateDiff entry (never pruned), full-state snapshots are written only at 1024-slot boundary crossings and the bootstrap anchor, and get_state resolves in three tiers — LRU cache, snapshot, or diff-chain reconstruction. The design is conceptually sound and the test coverage is good, but there are a few issues worth addressing before merge.

Finding 1 — Cache populated before DB commit (liveness hazard on crash)

crates/storage/src/store.rs:1221

insert_state_with_diff writes the state into the LRU cache at line 1221, then commits the StateDiffs write batch at line 1234:

self.state_cache.lock().unwrap().put(root, state.clone());   // ← cache, in-memory only
let diff_bytes = StateDiff::from_base(..., state).to_ssz();
let mut batch = self.backend.begin_write()...;
batch.put_batch(Table::StateDiffs, ...)...;
batch.commit().expect("commit");                              // ← DB commit

If the process is killed (OOM, SIGKILL, panic from .expect() in the batch path) between these two points, the cache is discarded on restart (new_state_cache() is called fresh in every construction path) and no StateDiff row exists in RocksDB. On restart, get_state(root) misses the empty cache, misses Table::States, calls reconstruct_state → get_ssz::<StateDiff> returns None → returns None → any child block fails with MissingParentState. Recovery requires the parent block to be re-gossiped from the network (which re-triggers on_block_core since has_state checks only the DB). The fix is simple: move state_cache.lock().unwrap().put(...) to after batch.commit().

Finding 2 — Undocumented write-order precondition for snapshot/reconstruction consistency

crates/storage/src/store.rs:1213 / crates/blockchain/src/store.rs:571

reconstruct_state reads latest_block_header from BlockHeaders[root] (line 1182), which stores the proposer's block.state_root. This matches post_state.latest_block_header.state_root only because on_block_core:571 explicitly sets it after state transition:

post_state.latest_block_header.state_root = state_root;   // blockchain/src/store.rs:571
// ...
store.insert_state_with_diff(block_root, diff_base, post_state);  // :585

If insert_state_with_diff is ever called with a state where lbh.state_root is still H256::ZERO (as it comes out of process_block_header), then the anchor snapshot serialized from that state would have lbh.state_root = ZERO, while reconstruction via get_block_header would return the block's non-zero state_root. The snapshot path and reconstruction path would then disagree on the reconstructed state — making child-block process_block_header parent-root validation non-deterministic depending on whether the parent was hot in cache or reconstructed cold. The state_diff.rs doc comment mentions this on the reader side ("the persisted post-state caches the real state_root there") but insert_state_with_diff's own doc comment does not list it as a precondition on the writer side.

Finding 3 — No depth cap or cycle guard in `reconstruct_state`

crates/storage/src/store.rs:1172

The diff-walk loop has no visited-set, depth counter, or iteration cap:

let snapshot = loop {
    if let Some(snapshot) = self.get_ssz::<State>(Table::States, &cursor) {
        break snapshot;
    }
    let diff = self.get_ssz::<StateDiff>(Table::StateDiffs, &cursor)?;  // None exits via ?
    cursor = diff.base_root;   // no cycle check
    diffs.push(diff);          // unbounded push
};

If storage corruption produces StateDiff[A].base_root = B and StateDiff[B].base_root = A, neither entry misses the ? and neither is in Table::States, so the loop pushes into diffs indefinitely, eventually OOM-ing the process. This path is not reachable through any correct write sequence (real base_root values come from block hashes, which cannot form cycles), but the loop is one corrupt DB entry away from a hung thread. A simple depth cap of SNAPSHOT_ANCHOR_INTERVAL + 1 would make corruption fail-fast rather than hang.

Finding 4 — Cold reconstruction: up to 1023 DB reads with no intermediate caching

crates/storage/src/store.rs:106

STATE_CACHE_CAPACITY = 32 is appropriate for steady-state block import (each block's parent is hot from the previous write). However, reconstruct_state does one get_ssz call — and thus one begin_read() → view.get() round-trip — per step in the diff chain, without caching intermediate states. On a cold node (after restart, or for any historical API read), requesting a state 1023 diffs past the last anchor triggers 1023 separate DB reads and populates the cache with only the single target state, not the intermediates. A second request for that state's sibling (same ancestor snapshot, 1022 diffs away) re-walks from scratch. For a validator node that restarts mid-window or an API serving historical block requests, this means every non-anchor get_state is O(depth) in DB round-trips rather than O(1). Consider either increasing STATE_CACHE_CAPACITY significantly (e.g., to SNAPSHOT_ANCHOR_INTERVAL / 8 ≈ 128) or memoizing intermediate states during the walk.

Finding 5 — `put_batch` called with inline `vec![]` (CLAUDE.md formatting rule)

crates/storage/src/store.rs:1227–1232

batch
    .put_batch(Table::StateDiffs, vec![(key.clone(), diff_bytes)])
    .expect("put state diff");
if let Some(snapshot_bytes) = snapshot_bytes {
    batch
        .put_batch(Table::States, vec![(key, snapshot_bytes)])
        .expect("put state snapshot");
}

CLAUDE.md states: "Extract long arguments into variables so formatter can join lines", and shows let entries = vec![...] before batch.put_batch(Table::X, entries) as the preferred form. Every other multi-entry batch write in the same file (e.g., init_store around line 662) follows this pattern. The two new calls use the inline form the convention explicitly discourages.

Automated review by Claude (Anthropic) · sonnet · custom prompt

MegaRedHand force-pushed the feat/state-diff-layers branch 4 times, most recently from 2cb4ca3 to a342fa7 Compare June 23, 2026 18:56

MegaRedHand mentioned this pull request Jun 23, 2026

feat(storage): prune only old finalized block signatures, keep blocks forever #453

Merged

MegaRedHand changed the base branch from main to feat/prune-block-signatures June 23, 2026 18:57

MegaRedHand force-pushed the feat/prune-block-signatures branch from 6600cb2 to fb1cf37 Compare June 23, 2026 19:27

MegaRedHand force-pushed the feat/state-diff-layers branch from a342fa7 to 34d6345 Compare June 23, 2026 19:31

Base automatically changed from feat/prune-block-signatures to main June 23, 2026 20:56

MegaRedHand force-pushed the feat/state-diff-layers branch 3 times, most recently from 608e97f to d1a76aa Compare June 24, 2026 17:56

MegaRedHand force-pushed the feat/state-diff-layers branch from d1a76aa to 1a060f2 Compare June 24, 2026 18:31

MegaRedHand added 4 commits June 24, 2026 16:08

Merge branch 'main' into feat/state-diff-layers

d0c7110

tune(storage): set STATE_CACHE_CAPACITY to 32

0ea9936

Merge remote-tracking branch 'origin/feat/state-diff-layers' into wor…

027532f

…ktree-feat+state-diff-layers

MegaRedHand linked an issue Jun 24, 2026 that may be closed by this pull request

Block building optimizations #465

Open

MegaRedHand changed the title ~~feat(storage): diff-layer state storage with bounded pruning~~ feat(storage): diff-layer state storage with an in-memory LRU cache Jun 24, 2026

Merge branch 'main' into feat/state-diff-layers

39a7f50

MegaRedHand marked this pull request as ready for review June 24, 2026 20:21

Merge remote-tracking branch 'origin/feat/state-diff-layers' into wor…

a8cf352

…ktree-feat+state-diff-layers

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(storage): diff-layer state storage with an in-memory LRU cache#444

feat(storage): diff-layer state storage with an in-memory LRU cache#444
MegaRedHand wants to merge 9 commits into
mainfrom
feat/state-diff-layers

MegaRedHand commented Jun 18, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

greptile-apps Bot commented Jun 24, 2026

Important Files Changed

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

MegaRedHand commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

State storage

Why this over snapshot eviction

Tests

Validation

Uh oh!

github-actions Bot commented Jun 24, 2026

🤖 Kimi Code Review

Uh oh!

github-actions Bot commented Jun 24, 2026

🤖 Codex Code Review

Uh oh!

greptile-apps Bot commented Jun 24, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 24, 2026

🤖 Claude Code Review

PR Review: feat(storage): diff-layer state storage with an in-memory LRU cache

Finding 1 — Cache populated before DB commit (liveness hazard on crash)

Finding 2 — Undocumented write-order precondition for snapshot/reconstruction consistency

Finding 3 — No depth cap or cycle guard in reconstruct_state

Finding 4 — Cold reconstruction: up to 1023 DB reads with no intermediate caching

Finding 5 — put_batch called with inline vec![] (CLAUDE.md formatting rule)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

MegaRedHand commented Jun 18, 2026 •

edited

Loading

Finding 3 — No depth cap or cycle guard in `reconstruct_state`

Finding 5 — `put_batch` called with inline `vec![]` (CLAUDE.md formatting rule)