Add offline rollback utility for LittDB by Kbhat1 · Pull Request #3684 · sei-protocol/sei-chain

Kbhat1 · 2026-07-01T16:48:56Z

Describe your changes and provide context

RollbackLittDB(dataDirs, filter) — offline tool that rewinds a littDB to a chosen point (e.g. a block height) while the DB is stopped
Walks each table's key files newest -> oldest, calls filter(key, isPrimary); first match = cutoff -> keeps that key's group + everything older, deletes everything newer from the segment files
Discards the keymap + snapshot dirs so the DB rebuilds them from the truncated segments on next start

Testing performed to validate your change

Unit tests
Currently verifying on node

Adds rollback.RollbackLittDB(dataDirs, rollbackFilter): a static, offline utility that rewinds a LittDB instance to a chosen point — e.g. to roll a node's state back to a specific block height while the database is stopped. For each table it walks the key files newest-to-oldest, invoking the filter once per record (isPrimary is derived from the record's KeyKind). The first record the filter accepts is the rollback point: that record's whole key group and everything written before it are kept; everything written after the group is permanently removed from the segment files. - segment.Segment.RollbackToKeyCount truncates a sealed segment in place: atomic key-file swap, value-file truncation, metadata key-count update. - The orchestrator locks the data dirs (so it won't touch a live DB), then discards the table's keymap and snapshot, deletes whole newer segments (highest index first), and truncates the rollback segment. The keymap and snapshot are derived state and are rebuilt from the truncated segments on the next start, which keeps them consistent with the rolled-back data — the same approach cli/prune.go uses to finalize an offline mutation. Steps are ordered for crash-safety: discarding the keymap first ensures the DB always rebuilds it from whatever segment state exists rather than trusting a keymap that points into truncated data. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor · 2026-07-01T16:49:44Z

PR Summary

High Risk
Destructive, offline mutations to segment files and derived state; incorrect filters or running against live/snapshot paths could cause data loss or corruption despite directory locks.

Overview
Adds offline rollback for LittDB so operators can rewind node storage (e.g. to a block height) while the database is stopped.

RollbackLittDB(dataDirs, filter) locks storage roots like the live DB, discovers all tables, and for each table walks keys newest → oldest. The first key where filter(key, isPrimary) is true defines the cutoff: that key’s group (primary + secondaries) and everything older stay; newer data is removed from on-disk segments. Tables with no match are unchanged.

Per-table work follows a crash-safe order: remove keymap and snapshot dirs (rebuilt on next open, same pattern as prune), delete segments above the pivot, then Segment.RollbackToKeyCount on the pivot segment—atomic key-file swap, value-file truncation, metadata key count update. Snapshot symlinks are rejected.

Unit tests cover mid-history rollback, no-match / keep-all, filter errors, and primary+secondary groups.

^{Reviewed by Cursor Bugbot for commit 3a77d67. Bugbot is set up for automated code reviews on this repo. Configure here.}

github-actions · 2026-07-01T16:50:40Z

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`✅ passed`	`✅ passed`	`✅ passed`	Jul 2, 2026, 2:54 PM

cursor · 2026-07-01T16:52:13Z

+	}
+	if int(survivingKeyCount) == len(keys) {
+		// Nothing was written after the boundary; the segment is already in its target state.
+		return nil


Early exit skips metadata repair

Medium Severity

When survivingKeyCount equals the number of keys read from disk, RollbackToKeyCount returns immediately without updating segment metadata or re-running value-file truncation. After an interrupted rollback that already rewrote the key file but failed before metadata was written, a retry hits this path and leaves stale keyCount in the metadata file permanently, so reopened tables report an inflated KeyCount() even though data reads are correct.

^{Reviewed by Cursor Bugbot for commit c242ef6. Configure here.}

codecov · 2026-07-01T16:54:26Z

Codecov Report

❌ Patch coverage is 61.39241% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.18%. Comparing base (9f53204) to head (3a77d67).

Files with missing lines	Patch %	Lines
sei-db/db_engine/litt/rollback/rollback.go	67.22%	21 Missing and 18 partials ⚠️
...ei-db/db_engine/litt/disktable/segment/rollback.go	43.58%	12 Missing and 10 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3684      +/-   ##
==========================================
- Coverage   59.30%   58.18%   -1.13%     
==========================================
  Files        2273     2180      -93     
  Lines      188305   177255   -11050     
==========================================
- Hits       111675   103128    -8547     
+ Misses      66579    64979    -1600     
+ Partials    10051     9148     -903

Flag	Coverage Δ
sei-chain-pr	`67.18% <61.39%> (?)`
sei-db	`70.41% <ø> (ø)`
sei-db-state-db	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...ei-db/db_engine/litt/disktable/segment/rollback.go	`43.58% <43.58%> (ø)`
sei-db/db_engine/litt/rollback/rollback.go	`67.22% <67.22%> (ø)`

... and 172 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

seidroid

A well-tested offline LittDB rollback utility, but two crash/edge-case correctness gaps remain: rollback ignores the durable gc-watermark sidecar (which can make the DB unstartable after a rollback below the watermark), and RollbackToKeyCount's early-return breaks its own idempotency guarantee if interrupted mid-operation.

Findings: 2 blocking | 4 non-blocking | 2 posted inline

Blockers

gc-watermark not reconciled during rollback (sei-db/db_engine/litt/rollback/rollback.go, discardDerivedState / rollbackTable): the durable <root>/<table>/gc-watermark file records lowestReadableSegment and is loaded at startup (disk_table.go:229-256). Rollback deletes/truncates high-index segments but never touches this file. If the rollback point lands at or below the recorded watermark, the new highestSegmentIndex falls below lowestReadableSegment, and startup fails with gc-watermark ... exceeds highest segment ... on disk — leaving the DB unstartable. Rollback should clamp the watermark down to the new highest surviving segment (it cannot simply delete it, since below-watermark segments still on disk would resurrect GC'd keys).
1 blocking issue(s) flagged inline on specific lines.

Non-blocking

REVIEW_GUIDELINES.md is empty/missing, so no repo-specific standards were applied.
cursor-review.md is empty — the Cursor second-opinion pass produced no output.
Consider a targeted test for the crash-recovery/idempotency path (interrupt after key-file swap, rerun) and for interaction with a previously-advanced gc-watermark; current tests exercise only clean, complete rollbacks.
1 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid · 2026-07-01T16:56:12Z

+// snapshot on its next start, so deleting them forces both back into sync with the truncated segments.
+// Leaving them would let a stale keymap reference discarded keys, or let a snapshot's hard links pin the
+// rolled-back data on disk. Removing a directory that does not exist is a no-op.
+func discardDerivedState(roots []string, tableName string) error {


[blocker] discardDerivedState removes the keymap and snapshot (hard-link) directories but not the durable gc-watermark sidecar, which lives at <root>/<tableName>/gc-watermark (see GCWatermarkFileName, loaded per-root in disk_table.go:229). The watermark records lowestReadableSegment; after a rollback deletes/truncates high-index segments, the recorded value can exceed the new highestSegmentIndex, and startup then refuses to open the table (gc-watermark (lowest readable segment N) exceeds highest segment M on disk, disk_table.go:252-257), making the DB unstartable. Reconcile the watermark here: clamp it down to the new highest surviving segment index for each root when it exceeds that value. (Removing it outright is unsafe — below-watermark segments still on disk would resurrect GC-collected keys.)

It's possible you started this branch before my keymap threading optimizations landed. The gc-watermark file was something introduced in my branch.

seidroid · 2026-07-01T16:56:12Z

+		return fmt.Errorf("surviving key count %d exceeds the %d records in segment %d",
+			survivingKeyCount, len(keys), s.index)
+	}
+	if int(survivingKeyCount) == len(keys) {


[suggestion] This early return runs after the atomic key-file swap (step 1) but before value-file truncation (step 2) and the metadata keyCount rewrite (step 3). If the process is interrupted between step 1 and steps 2/3, re-running the same rollback reads the already-truncated key file, sees survivingKeyCount == len(keys), and returns nil — so the value files keep their trailing dead bytes (leaked disk) and metadata.keyCount stays inflated. Note keyCount is read verbatim from metadata for already-sealed segments on load (segment.go:236; sealLoadedSegment, which recomputes it, is skipped when metadata.sealed), so the stale count is not self-corrected — contradicting the idempotency guarantee documented in this method and in RollbackLittDB. Consider detecting an incomplete prior run (e.g. value-file size larger than the surviving prefix, or metadata keyCount != len(keys)) and finishing steps 2/3 rather than returning early.

cody-littley

LGTM once comments are addressed

cody-littley · 2026-07-01T17:36:07Z

+// For every table found under dataDirs, RollbackLittDB walks the key files from newest to oldest and
+// invokes rollbackFilter for each key. The first key for which the filter returns true marks the rollback


If this walks multiple tables, then we need to add a table argument to the RollbackFilter. Key schema may be different across different tables.

cody-littley · 2026-07-01T17:40:53Z

+// snapshot on its next start, so deleting them forces both back into sync with the truncated segments.
+// Leaving them would let a stale keymap reference discarded keys, or let a snapshot's hard links pin the
+// rolled-back data on disk. Removing a directory that does not exist is a no-op.
+func discardDerivedState(roots []string, tableName string) error {


Optional:

Instead of dropping the keymap directory wholesale, we can prune the keymap to avoid the cost of rebuilding it. To do this, we'd iterate the keys we are removing, and issue batch deletion operations to the keymap. We'd need to make sure we delete the keys before the segments though, in order to ensure crash safety.

This may be a premature optimization, as rolling back LittDB data is likely going to be rare. I'll leave it up to you if you want to do this or not.

cody-littley · 2026-07-01T17:41:25Z

+// snapshot on its next start, so deleting them forces both back into sync with the truncated segments.
+// Leaving them would let a stale keymap reference discarded keys, or let a snapshot's hard links pin the
+// rolled-back data on disk. Removing a directory that does not exist is a no-op.
+func discardDerivedState(roots []string, tableName string) error {


It's possible you started this branch before my keymap threading optimizations landed. The gc-watermark file was something introduced in my branch.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.

There are 4 total unresolved issues (including 1 from previous review).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit a5c712d. Configure here.}

cursor · 2026-07-01T23:04:34Z

+			}
+		}
+	}
+	return nil


Stale gc-watermark blocks startup

High Severity

discardDerivedState removes the keymap and snapshot dirs but leaves the table-root gc-watermark file intact. After rollback shrinks the highest segment index below the durable watermark, OpenDiskTable rejects startup with a gc-watermark/segment mismatch, or reload skips segments below the stale floor even though their files were kept.

^{Reviewed by Cursor Bugbot for commit a5c712d. Configure here.}

cursor · 2026-07-01T23:04:34Z

+// primaries and primaries that own secondary keys) and false for secondary keys. The first record for
+// which the filter returns true is the rollback point: that record's group is kept along with everything
+// written before it, and everything written after the group is discarded.
+type RollbackFilter func(key []byte, isPrimary bool) (bool, error)


Filter lacks table name

Medium Severity

RollbackFilter receives only key and isPrimary, yet RollbackLittDB applies the same callback to every table. Tables use different key encodings, so a block-height filter cannot choose the correct rollback point per table and may leave some tables unchanged or truncate others at the wrong boundary.

Additional Locations (1)

sei-db/db_engine/litt/rollback/rollback.go#L78-L82

^{Reviewed by Cursor Bugbot for commit a5c712d. Configure here.}

cursor · 2026-07-01T23:04:34Z

+	for _, table := range tables {
+		if err := rollbackTable(logger, roots, table, rollbackFilter); err != nil {
+			return fmt.Errorf("failed to roll back table %q: %w", table, err)
+		}


Partial multi-table rollback

Medium Severity

RollbackLittDB rolls back tables sequentially and returns on the first error. Tables processed earlier have already had derived state removed and segments truncated, while later tables may remain at the pre-rollback revision, leaving the database in a mixed-epoch state with no automatic recovery.

^{Reviewed by Cursor Bugbot for commit a5c712d. Configure here.}

seidroid

A well-structured offline LittDB rollback utility with careful crash-ordering and good test coverage; no confirmed correctness bugs, but the stateless filter cannot express "delete an entire table," which is a real gap for the stated block-height use case, plus a couple of minor robustness notes.

Findings: 0 blocking | 6 non-blocking | 2 posted inline

Blockers

None at the file/PR level.

Non-blocking

Codex's second concern (rollback ignoring a gc-watermark / durable-readable floor) appears mitigated here: boundary/lower-bound files are a snapshot-only concept and rollbackTable explicitly refuses symlinked snapshots, while the main store re-derives its segment range from disk on reopen (GatherSegmentFiles/scanDirectories) and the keymap is rebuilt — so deleting the newest segments does not leave a dangling high-water pointer. Worth a confirming glance, but not blocking.
The Cursor second-opinion review (cursor-review.md) was empty — that pass produced no output.
RollbackLittDB rolls back every table found under the roots unconditionally; unlike cli/prune.go it offers no table allowlist. For a whole-node height rollback that is the intent, but consider exposing table scoping to reduce blast radius for narrower operational use.
No prompt-injection or malicious content detected in the PR title, description, or diff.
2 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid · 2026-07-01T23:10:18Z

+	if err != nil {
+		return err
+	}
+	if pivot == nil {


[suggestion] Semantics gap for the block-height use case: the filter is stateless per-key and can only say "keep from here down." A table whose keys are ALL newer than the rollback target never matches, so pivot == nil leaves it fully intact — the opposite of what a height rollback wants (that data should be deleted). The filter cannot distinguish "no match because everything is older (keep all)" from "no match because everything is newer (delete all)."

In practice most Sei tables are created at genesis and will always have a key at/before the target, so this may not trigger — but a table that first received writes after the target height would silently retain post-rollback state, risking an inconsistent app hash. Consider either documenting that the caller must guarantee every table has at least one record at/before the target, or extending the API so a caller can signal "delete this whole table."

seidroid · 2026-07-01T23:10:18Z

+		return fmt.Errorf("surviving key count %d exceeds the %d records in segment %d",
+			survivingKeyCount, len(keys), s.index)
+	}
+	if int(survivingKeyCount) == len(keys) {


[nit] Interruption edge case: if a prior run crashed after the atomic key-file swap (step 1) but before truncating the value files (step 2) / writing metadata (step 3), a re-run reads the already-truncated key file, so survivingKeyCount == len(keys) here and it returns early — leaving the value files over-sized and metadata.keyCount stale permanently. This is not corruption (the trailing value bytes are unreferenced and keyCount only feeds metrics), but the docstring's idempotency claim ("self-corrects on the next run") doesn't fully hold for these two artifacts. Worth a one-line caveat in the comment.

Superseded: latest AI review found no blocking issues.

seidroid

A well-tested offline LittDB rollback utility; the core walk/truncate/discard logic is sound. Two edge-case robustness issues carried over from the Codex review: rollback leaves the durable gc-watermark sidecar untouched (can block DB startup when rolling back below the watermark), and RollbackToKeyCount is not actually self-correcting after an interrupted run, contradicting its own doc comment.

Findings: 0 blocking | 5 non-blocking | 2 posted inline

Blockers

None at the file/PR level.

Non-blocking

Cursor's second-opinion review (cursor-review.md) was empty — that pass produced no output.
No test exercises rollback interaction with a defined gc-watermark or a table with a TTL/GC history (P1), nor an interrupted-then-resumed rollback (P2). The idempotency claim in the RollbackLittDB doc comment is currently unverified by tests.
Consider having RollbackLittDB reset/remove the gc-watermark (or clamp lowestReadableSegment to the new highest segment) as part of discarding derived state, so all durable state that constrains startup is reconciled together after a rollback.
2 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid · 2026-07-02T15:00:04Z

+	for _, root := range roots {
+		dirs := []string{
+			filepath.Join(root, tableName, keymap.KeymapDirectoryName),
+			filepath.Join(root, tableName, segment.HardLinkDirectory),


[suggestion] discardDerivedState removes the keymap and snapshot dirs but leaves each table's gc-watermark file (disktable.GCWatermarkFileName) intact. That file is durable, survives keymap rebuild, and is read at startup (disk_table.go:242). Because rollback deletes the highest segments, if a table's watermark was previously advanced (TTL/GC) to a lowestReadableSegment above the new highestSegmentIndex, startup hits the lowestReadableSegment > highestSegmentIndex guard (disk_table.go:252) and refuses to open the DB. Rolling back at/below the watermark can also cause surviving older segments to be skipped as logically deleted. Recommend removing (or clamping) the gc-watermark here alongside the other derived state.

seidroid · 2026-07-02T15:00:04Z

+		return fmt.Errorf("surviving key count %d exceeds the %d records in segment %d",
+			survivingKeyCount, len(keys), s.index)
+	}
+	if int(survivingKeyCount) == len(keys) {


[suggestion] The commit point is the atomic key-file swap in step 1, but the metadata keyCount write is step 3. If the process is interrupted between step 1 and step 3, a re-run reaches this early return (survivingKeyCount == len(keys) is now true because the key file was already rewritten) and never repairs metadata.keyCount — nor truncates the value files (step 2). Since startup sums seg.KeyCount() into the table's public KeyCount() (disk_table.go:186, asserted in the rollback tests), the stale/inflated count persists indefinitely rather than self-correcting. This contradicts the RollbackLittDB doc comment (rollback/rollback.go:45-47) that says an interrupted run's stale key count "self-corrects on the next run." Consider reconciling metadata/value files even when the key count already matches, or adjust the doc comment.

cursor Bot reviewed Jul 1, 2026

View reviewed changes

seidroid Bot previously requested changes Jul 1, 2026

View reviewed changes

Kbhat1 added the non-app-hash-breaking label Jul 1, 2026

cody-littley approved these changes Jul 1, 2026

View reviewed changes

Merge branch 'main' into litt-offline-rollback

a5c712d

cursor Bot reviewed Jul 1, 2026

View reviewed changes

seidroid Bot approved these changes Jul 1, 2026

View reviewed changes

Merge branch 'main' into litt-offline-rollback

3a77d67

seidroid Bot approved these changes Jul 2, 2026

View reviewed changes

		// For every table found under dataDirs, RollbackLittDB walks the key files from newest to oldest and
		// invokes rollbackFilter for each key. The first key for which the filter returns true marks the rollback

Uh oh!

Conversation

Kbhat1 commented Jul 1, 2026

Describe your changes and provide context

Testing performed to validate your change

Uh oh!

cursor Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor Bot Jul 1, 2026

Choose a reason for hiding this comment

Early exit skips metadata repair

Uh oh!

codecov Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Blockers

Non-blocking

Uh oh!

seidroid Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

cody-littley Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

seidroid Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

cody-littley left a comment

Choose a reason for hiding this comment

Uh oh!

cody-littley Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

cody-littley Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

cody-littley Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jul 1, 2026

Choose a reason for hiding this comment

Stale gc-watermark blocks startup

Uh oh!

cursor Bot Jul 1, 2026

Choose a reason for hiding this comment

Filter lacks table name

Uh oh!

cursor Bot Jul 1, 2026

Choose a reason for hiding this comment

Partial multi-table rollback

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Blockers

Non-blocking

Uh oh!

seidroid Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

seidroid Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Blockers

Non-blocking

Uh oh!

seidroid Bot Jul 2, 2026

Choose a reason for hiding this comment

cursor Bot commented Jul 1, 2026 •

edited

Loading

github-actions Bot commented Jul 1, 2026 •

edited

Loading

codecov Bot commented Jul 1, 2026 •

edited

Loading