Fix race condition for memiavl#3687
Conversation
## Describe your changes and provide context
- Pebble prune leaves tombstones uncompacted, so prune slows over uptime
- Bump compaction concurrency to {1,4} and compact each pruned range
right after
## Testing performed to validate your change
- Verifying in unit tests + on node
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Disable WASM integration shard was flaky because the test script ran a single bare `npm ci` before starting Hardhat. In the failed CI job, npm aborted with a transient `ECONNRESET` while fetching packages, so the shard failed before `DisableWasmTest.js` actually ran. Retry `npm ci` with a short backoff and remove any partial `node_modules` contents between attempts. Persistent install failures still fail the job, but temporary registry/network resets no longer make the test shard flaky. Flaked in [unrelated changes](https://github.com/sei-protocol/sei-chain/actions/runs/28434399067/job/84258153127?pr=3663).
The EVM RPC parity tests sampled eth_maxPriorityFeePerGas against "latest" while the devnet was still producing blocks and other fee-market tests could leave a congested block at the head. A new empty block could land between the priority-fee RPC and the eth_feeHistory/latest-block read, or the test could assume the 1 gwei idle default while the sampled head was still congested. Add a stable-head sampler for eth_maxPriorityFeePerGas, query fee history by the pinned block height, and wait for a stable uncongested head before asserting the default priority fee. Flaked in [unrelated changes](https://github.com/sei-protocol/sei-chain/actions/runs/28464992080/job/84364444593).
Re-generate fresh change log in prep to cut v6.6 RC1
PR SummaryMedium Risk Overview Observability: Integration / tests: EVM RPC specs pin CHANGELOG adds v6.6 backport entries (prune compaction, metrics, iterator hardening, evmone path, go-ethereum bump, etc.). Reviewed by Cursor Bugbot for commit 1c66d87. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## yzang/fix-rpc-apphash #3687 +/- ##
=========================================================
+ Coverage 59.24% 59.29% +0.05%
=========================================================
Files 2272 2272
Lines 188175 188210 +35
=========================================================
+ Hits 111479 111594 +115
+ Misses 66657 66565 -92
- Partials 10039 10051 +12
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Adds post-prune range compaction to the PebbleDB MVCC store (both descending and ascending paths) plus raised compaction concurrency, with solid new unit tests, and stabilizes two gas-price integration tests by pinning a stable block. No blocking correctness or security issues found; the main considerations are the cost of a synchronous, near-whole-keyspace compaction on every prune and a PR title/description that don't match the diff.
Findings: 0 blocking | 6 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- The ascending path (
sei-db/db_engine/pebbledb/mvcc/db_ascending.go:230) mirrors the descendingcompactPrunedRangecall and shares the same characteristics as the inline notes below (synchronous compaction + iterator still open). - PR title ("Fix race condition for memiavl") and the empty description don't match the actual change, which is PebbleDB prune-compaction + gas-price integration-test stabilization. Please update the title/description so the intent and testing are clear.
- Codex's review flagged the open-iterator-during-Compact concern (see inline note where I address its mechanism); Cursor's review file (
cursor-review.md) and the repo'sREVIEW_GUIDELINES.mdwere empty, so no additional guidance was applied from them. - No prompt-injection or malicious content was found in the PR diff or description.
- 2 suggestion(s)/nit(s) flagged inline on specific lines.
| if err := db.SetEarliestVersion(earliestVersion, false); err != nil { | ||
| return err | ||
| } | ||
| return db.compactPrunedRange(firstDeletedKey, lastDeletedKey) |
There was a problem hiding this comment.
[suggestion] Two things worth considering on this new compactPrunedRange call:
-
Performance/scope: because the prune scan walks every store,
firstDeletedKey/lastDeletedKeybracket essentially the entire keyspace, so this issues a broad manualCompactover most of the DB on every prune that deletes anything.Compactis synchronous, soPrunenow blocks until that compaction finishes. IfPruneis on a latency-sensitive path, consider whether a narrower range (e.g. per-store), an async trigger, or a size/count threshold before compacting would be preferable to a full-range compaction each cycle. -
Iterator lifetime (raised by Codex):
compactPrunedRangeruns whileitris still open (its deferredClose()at db.go:638 fires only after this returns). In Pebble, regularNewIteriterators are not registered as snapshots, so this does NOT prevent tombstone elision during compaction (onlydb.NewSnapshot()would) — Codex's concern about the compaction being unable to reclaim the tombstones is unlikely to hold. The only real effect is that the pinned read-state defers physical deletion of obsolete files until the iterator closes a moment later. Still, closing the iterator before compacting (e.g. by scoping the scan/delete loop in an inner function) is cleaner and removes any doubt.
| // comparers, so the entire deleted span is covered. | ||
| end := append(slices.Clone(last), 0) | ||
| return db.storage.Compact(context.Background(), first, end, true) | ||
| } |
There was a problem hiding this comment.
[suggestion] nit: end := append(slices.Clone(last), 0) correctly yields a key strictly greater than last under both comparers (good, and it's covered by TestCompactPrunedRangeSingleKey). Worth a one-line note that the extra trailing byte makes end no longer a well-formed MVCC key, which is fine here since Compact treats it purely as a raw range boundary and never parses it.
Close the rootmulti store created by TestTypesTestSuite once the proof fixture has been generated. The store starts asynchronous hash logger work during commit, and leaving it open can race with t.TempDir cleanup under the race-enabled CI job, causing flaky "directory not empty" cleanup failures in channel message validation tests. Flaked [on main](https://github.com/sei-protocol/sei-chain/actions/runs/28517259993/job/84531996505).
## Describe your changes Attaches a `chain_id` label to every OTel metric series emitted by the chain. Previously `SetupOtelMetricsProvider` created a meter provider with no resource attributes, so metrics scraped via Prometheus carried no way to distinguish which chain they came from — a problem when multiple chains scrape into the same backend. This PR: - Adds a `chainID` parameter to `SetupOtelMetricsProvider` and wires `bApp.ChainID` in at application construction (`app/app.go`). - Builds an OTel `resource` with a `chain_id` attribute and registers it on the meter provider. - Uses `WithResourceAsConstantLabels` (scoped to an allow-list of just `chain_id`) so the attribute surfaces as a constant label on every scraped Prometheus series. - Fails closed: returns an error when `chainID` is empty rather than emitting unlabeled metrics. ## Testing Adds `utils/metrics/metrics_util_test.go`: - `TestSetupOtelMetricsProviderRequiresChainID` — empty `chainID` returns an error. - `TestSetupOtelMetricsProviderAttachesChainID` — emits a metric through the installed provider and asserts the scraped Prometheus series carries the expected `chain_id` label. ## Linear PLT-783
Fix
Serialize the hash-mutating operations on each memIAVL tree with the tree's
write lock, and add no-lock internal helpers so the public entry points
don't re-acquire the lock recursively.
sei-db/state_db/sc/memiavl/tree.goRootHash()now takest.mtx.Lock()(it mutatesMemNode.hash), delegating to a newrootHashNoLock().GetProof()now takest.mtx.Lock()for the whole proof build, delegating togetProofNoLock().getWithIndexNoLock(),getByIndexNoLock(),hasNoLock().sei-db/state_db/sc/memiavl/proof.goGetMembershipProof()/GetNonMembershipProof()now take the write lock and delegate togetMembershipProofNoLock()/getNonMembershipProofNoLock().createExistenceProof()documented as "caller must hold the write lock".The read-only fast paths (
Get,GetWithIndex,GetByIndex,Has,Iterator)are unchanged and still use
RLock. Locking is per-tree (per-store).Lock-ordering safety
db.mtxthen per-treet.mtx; the query path acquiresonly
t.mtx. No inversion, no new deadlock.t.mtx(verified); internal callers use the*NoLockhelpers.Performance impact
Lock()coststhe same as the previous
RLock()(a single atomic op).RootHashruns~once/twice per store per block → single-digit microseconds per block. No
measurable impact on block/consensus/tx-execution time.
RLock,unchanged);
GetProofis not on the execution path.prove=true: a commit'sRootHashand a proof buildnow serialize per store. This is bounded (proof build is O(tree height))
and per-store (committing store A never blocks proofs on store B). It is the
necessary cost of correctness — the previous "cheaper" path was the bug.
Testing
Added
sei-db/state_db/sc/memiavl/tree_race_test.go:TestTreeProofRaceWithCommit— commit (Set+RootHash) concurrent withmembership/non-membership
GetProofandRootHash.TestTreeConcurrentRootHash— concurrentRootHashover a tree with unfilledhash caches; asserts all callers agree.
Verification:
to
RLockmakesgo test -racereportDATA RACEonMemNode.hash.go test -race ./sei-db/state_db/sc/memiavl/...passes clean(0 data races), as do the
commitmentandrootmultisuites.Risk & rollout
prevents divergence rather than changing committed state.