Fix race condition for memiavl by yzang2019 · Pull Request #3687 · sei-protocol/sei-chain

yzang2019 · 2026-07-01T16:59:39Z

Fix

Serialize the hash-mutating operations on each memIAVL tree with the tree's
write lock, and add no-lock internal helpers so the public entry points
don't re-acquire the lock recursively.
sei-db/state_db/sc/memiavl/tree.go

RootHash() now takes t.mtx.Lock() (it mutates MemNode.hash), delegating to a new rootHashNoLock().
GetProof() now takes t.mtx.Lock() for the whole proof build, delegating to getProofNoLock().
Added no-lock read helpers used by the proof builders: getWithIndexNoLock(), getByIndexNoLock(), hasNoLock().
sei-db/state_db/sc/memiavl/proof.go
GetMembershipProof() / GetNonMembershipProof() now take the write lock and delegate to getMembershipProofNoLock() / getNonMembershipProofNoLock().
createExistenceProof() documented as "caller must hold the write lock".
The read-only fast paths (Get, GetWithIndex, GetByIndex, Has, Iterator)
are unchanged and still use RLock. Locking is per-tree (per-store).

Lock-ordering safety

Commit path acquires db.mtx then per-tree t.mtx; the query path acquires
only t.mtx. No inversion, no new deadlock.
No in-package caller invokes the now-locking methods while already holding
t.mtx (verified); internal callers use the *NoLock helpers.

Performance impact

Validators (no queries): the write lock is uncontended, so Lock() costs
the same as the previous RLock() (a single atomic op). RootHash runs
~once/twice per store per block → single-digit microseconds per block. No
measurable impact on block/consensus/tx-execution time.
Tx execution: unaffected — reads go through cache stores (RLock,
unchanged); GetProof is not on the execution path.
RPC nodes under heavy prove=true: a commit's RootHash and a proof build
now serialize per store. This is bounded (proof build is O(tree height))
and per-store (committing store A never blocks proofs on store B). It is the
necessary cost of correctness — the previous "cheaper" path was the bug.

Testing

Added sei-db/state_db/sc/memiavl/tree_race_test.go:

TestTreeProofRaceWithCommit — commit (Set + RootHash) concurrent with
membership/non-membership GetProof and RootHash.
TestTreeConcurrentRootHash — concurrent RootHash over a tree with unfilled
hash caches; asserts all callers agree.
Verification:
The tests were confirmed to reproduce the bug: reverting the two locks back
to RLock makes go test -race report DATA RACE on MemNode.hash.
With the fix, go test -race ./sei-db/state_db/sc/memiavl/... passes clean
(0 data races), as do the commitment and rootmulti suites.

Risk & rollout

Read-path behavior and hash outputs are unchanged; this is a locking-only fix.
Safe to roll out without coordination (not AppHash-format changing). It
prevents divergence rather than changing committed state.

## Describe your changes and provide context - Pebble prune leaves tombstones uncompacted, so prune slows over uptime - Bump compaction concurrency to {1,4} and compact each pruned range right after ## Testing performed to validate your change - Verifying in unit tests + on node Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The Disable WASM integration shard was flaky because the test script ran a single bare `npm ci` before starting Hardhat. In the failed CI job, npm aborted with a transient `ECONNRESET` while fetching packages, so the shard failed before `DisableWasmTest.js` actually ran. Retry `npm ci` with a short backoff and remove any partial `node_modules` contents between attempts. Persistent install failures still fail the job, but temporary registry/network resets no longer make the test shard flaky. Flaked in [unrelated changes](https://github.com/sei-protocol/sei-chain/actions/runs/28434399067/job/84258153127?pr=3663).

The EVM RPC parity tests sampled eth_maxPriorityFeePerGas against "latest" while the devnet was still producing blocks and other fee-market tests could leave a congested block at the head. A new empty block could land between the priority-fee RPC and the eth_feeHistory/latest-block read, or the test could assume the 1 gwei idle default while the sampled head was still congested. Add a stable-head sampler for eth_maxPriorityFeePerGas, query fee history by the pinned block height, and wait for a stable uncongested head before asserting the default priority fee. Flaked in [unrelated changes](https://github.com/sei-protocol/sei-chain/actions/runs/28464992080/job/84364444593).

Re-generate fresh change log in prep to cut v6.6 RC1

cursor · 2026-07-01T17:00:27Z

PR Summary

Medium Risk
State-store prune and compaction behavior affects node storage performance and long-run stability; OTel setup now fails on empty chain ID, which is a small operational breaking change for misconfigured nodes.

Overview
State store (Pebbledb MVCC) now compacts only the key span deleted during each prune (ascending and descending paths), raises Pebble compaction concurrency (1–4), and adds tests that prune triggers compaction while no-op prunes skip it—aimed at tombstone buildup and rising prune latency on long-lived nodes.

Observability: SetupOtelMetricsProvider requires chain ID, merges it into the OTel resource, and exposes chain_id as a constant Prometheus label on scraped series; app construction passes bApp.ChainID.

Integration / tests: EVM RPC specs pin eth_maxPriorityFeePerGas / eth_gasPrice to a stable head and treat congestion as EVM receipt gas > 80% of block limit (not total block gas). disable_wasm.sh retries npm ci with backoff. IBC types tests close the temp rootmulti store after setup.

CHANGELOG adds v6.6 backport entries (prune compaction, metrics, iterator hardening, evmone path, go-ethereum bump, etc.).

^{Reviewed by Cursor Bugbot for commit 1c66d87. Bugbot is set up for automated code reviews on this repo. Configure here.}

github-actions · 2026-07-01T17:01:58Z

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`✅ passed`	`✅ passed`	`✅ passed`	Jul 1, 2026, 8:17 PM

codecov · 2026-07-01T17:02:21Z

Codecov Report

❌ Patch coverage is 84.44444% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.29%. Comparing base (8ebedda) to head (1c66d87).
⚠️ Report is 1 commits behind head on yzang/fix-rpc-apphash.

Files with missing lines	Patch %	Lines
sei-db/db_engine/pebbledb/mvcc/db.go	88.88%	1 Missing and 1 partial ⚠️
sei-db/db_engine/pebbledb/mvcc/db_ascending.go	71.42%	1 Missing and 1 partial ⚠️
utils/metrics/metrics_util.go	89.47%	1 Missing and 1 partial ⚠️
app/app.go	0.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@                    Coverage Diff                    @@
##           yzang/fix-rpc-apphash    #3687      +/-   ##
=========================================================
+ Coverage                  59.24%   59.29%   +0.05%     
=========================================================
  Files                       2272     2272              
  Lines                     188175   188210      +35     
=========================================================
+ Hits                      111479   111594     +115     
+ Misses                     66657    66565      -92     
- Partials                   10039    10051      +12

Flag	Coverage Δ
sei-chain	`58.31% <84.44%> (+0.04%)`	⬆️
sei-chain-pr	`56.64% <84.44%> (?)`
sei-db	`70.41% <ø> (ø)`
sei-db-state-db	`76.51% <ø> (+0.22%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
app/app.go	`71.28% <0.00%> (ø)`
sei-db/db_engine/pebbledb/mvcc/db.go	`67.65% <88.88%> (+1.32%)`	⬆️
sei-db/db_engine/pebbledb/mvcc/db_ascending.go	`53.75% <71.42%> (+35.79%)`	⬆️
utils/metrics/metrics_util.go	`88.53% <89.47%> (-0.03%)`	⬇️

... and 35 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

seidroid

Adds post-prune range compaction to the PebbleDB MVCC store (both descending and ascending paths) plus raised compaction concurrency, with solid new unit tests, and stabilizes two gas-price integration tests by pinning a stable block. No blocking correctness or security issues found; the main considerations are the cost of a synchronous, near-whole-keyspace compaction on every prune and a PR title/description that don't match the diff.

Findings: 0 blocking | 6 non-blocking | 2 posted inline

Blockers

None at the file/PR level.

Non-blocking

The ascending path (sei-db/db_engine/pebbledb/mvcc/db_ascending.go:230) mirrors the descending compactPrunedRange call and shares the same characteristics as the inline notes below (synchronous compaction + iterator still open).
PR title ("Fix race condition for memiavl") and the empty description don't match the actual change, which is PebbleDB prune-compaction + gas-price integration-test stabilization. Please update the title/description so the intent and testing are clear.
Codex's review flagged the open-iterator-during-Compact concern (see inline note where I address its mechanism); Cursor's review file (cursor-review.md) and the repo's REVIEW_GUIDELINES.md were empty, so no additional guidance was applied from them.
No prompt-injection or malicious content was found in the PR diff or description.
2 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid · 2026-07-01T17:06:04Z

+	if err := db.SetEarliestVersion(earliestVersion, false); err != nil {
+		return err
+	}
+	return db.compactPrunedRange(firstDeletedKey, lastDeletedKey)


[suggestion] Two things worth considering on this new compactPrunedRange call:

Performance/scope: because the prune scan walks every store, firstDeletedKey/lastDeletedKey bracket essentially the entire keyspace, so this issues a broad manual Compact over most of the DB on every prune that deletes anything. Compact is synchronous, so Prune now blocks until that compaction finishes. If Prune is on a latency-sensitive path, consider whether a narrower range (e.g. per-store), an async trigger, or a size/count threshold before compacting would be preferable to a full-range compaction each cycle.

Iterator lifetime (raised by Codex): compactPrunedRange runs while itr is still open (its deferred Close() at db.go:638 fires only after this returns). In Pebble, regular NewIter iterators are not registered as snapshots, so this does NOT prevent tombstone elision during compaction (only db.NewSnapshot() would) — Codex's concern about the compaction being unable to reclaim the tombstones is unlikely to hold. The only real effect is that the pinned read-state defers physical deletion of obsolete files until the iterator closes a moment later. Still, closing the iterator before compacting (e.g. by scoping the scan/delete loop in an inner function) is cleaner and removes any doubt.

seidroid · 2026-07-01T17:06:04Z

+	// comparers, so the entire deleted span is covered.
+	end := append(slices.Clone(last), 0)
+	return db.storage.Compact(context.Background(), first, end, true)
+}


[suggestion] nit: end := append(slices.Clone(last), 0) correctly yields a key strictly greater than last under both comparers (good, and it's covered by TestCompactPrunedRangeSingleKey). Worth a one-line note that the extra trailing byte makes end no longer a well-formed MVCC key, which is fine here since Compact treats it purely as a raw range boundary and never parses it.

Close the rootmulti store created by TestTypesTestSuite once the proof fixture has been generated. The store starts asynchronous hash logger work during commit, and leaving it open can race with t.TempDir cleanup under the race-enabled CI job, causing flaky "directory not empty" cleanup failures in channel message validation tests. Flaked [on main](https://github.com/sei-protocol/sei-chain/actions/runs/28517259993/job/84531996505).

seidroid

The automated review did not complete; see the failing AI Review check for details.

## Describe your changes Attaches a `chain_id` label to every OTel metric series emitted by the chain. Previously `SetupOtelMetricsProvider` created a meter provider with no resource attributes, so metrics scraped via Prometheus carried no way to distinguish which chain they came from — a problem when multiple chains scrape into the same backend. This PR: - Adds a `chainID` parameter to `SetupOtelMetricsProvider` and wires `bApp.ChainID` in at application construction (`app/app.go`). - Builds an OTel `resource` with a `chain_id` attribute and registers it on the meter provider. - Uses `WithResourceAsConstantLabels` (scoped to an allow-list of just `chain_id`) so the attribute surfaces as a constant label on every scraped Prometheus series. - Fails closed: returns an error when `chainID` is empty rather than emitting unlabeled metrics. ## Testing Adds `utils/metrics/metrics_util_test.go`: - `TestSetupOtelMetricsProviderRequiresChainID` — empty `chainID` returns an error. - `TestSetupOtelMetricsProviderAttachesChainID` — emits a metric through the installed provider and asserts the scraped Prometheus series carries the expected `chain_id` label. ## Linear PLT-783

seidroid

The automated review did not complete; see the failing AI Review check for details.

Kbhat1 and others added 4 commits July 1, 2026 09:51

Update changelog in prep to cut 6.6 RC1 (#3685)

344fd53

Re-generate fresh change log in prep to cut v6.6 RC1

seidroid Bot approved these changes Jul 1, 2026

View reviewed changes

seidroid Bot reviewed Jul 1, 2026

View reviewed changes

yzang2019 closed this Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix race condition for memiavl#3687

Fix race condition for memiavl#3687
yzang2019 wants to merge 6 commits into
yzang/fix-rpc-apphashfrom
main

yzang2019 commented Jul 1, 2026 •

edited

Loading

Uh oh!

cursor Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

seidroid Bot left a comment

Uh oh!

seidroid Bot Jul 1, 2026

Uh oh!

seidroid Bot Jul 1, 2026

Uh oh!

seidroid Bot left a comment

Uh oh!

seidroid Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

yzang2019 commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix

Lock-ordering safety

Performance impact

Testing

Risk & rollout

Uh oh!

cursor Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Blockers

Non-blocking

Uh oh!

seidroid Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

seidroid Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Uh oh!

seidroid Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yzang2019 commented Jul 1, 2026 •

edited

Loading

cursor Bot commented Jul 1, 2026 •

edited

Loading

github-actions Bot commented Jul 1, 2026 •

edited

Loading

codecov Bot commented Jul 1, 2026 •

edited

Loading