[lake] Fix DV readable snapshot scanning and premature snapshot deletion#3519
[lake] Fix DV readable snapshot scanning and premature snapshot deletion#3519luoyuxia wants to merge 2 commits into
Conversation
…ue to ineffective options getBucketsWithoutL0AndWithL0() passed SCAN_SNAPSHOT_ID and BATCH_SCAN_MODE via table options to store().newScan(), but these options are only consumed by table-level scans (DataTableBatchScan), not by store-level scans (AbstractFileStoreScan). This means the scan always hit the latest snapshot instead of the specified one. Fix by using the direct .withSnapshot(snapshot) API on FileStoreScan, consistent with getBucketsWithFlushedL0() in the same file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For Paimon DV tables, getReadableSnapshotAndOffsets returns an earliestSnapshotIdToKeep telling Fluss which lake snapshots may be deleted. When the latest compacted snapshot had no L0 in any bucket, the code shortcut earliestSnapshotIdToKeep to that compacted snapshot's previous APPEND, assuming nothing earlier was needed. This was unsound: a bucket can be clean (no L0) in the current compacted snapshot yet still be anchored to an older snapshot - it was flushed earlier and has not been flushed since, so its base anchor (the previous APPEND of the latest snapshot that exactly holds its most-recently flushed L0) can be older than the compacted snapshot's previous APPEND. Once such a bucket later receives new L0, recomputation traces back to that older anchor; if it was already deleted, the retrieval returns null and readable offsets can no longer advance. Fix by also computing the base anchor for buckets without L0 and lowering earliestSnapshotIdToKeep to the minimum anchor across all buckets. This is best-effort: if a bucket's flush history has expired and its anchor cannot be determined, keep all previous snapshots rather than risk deleting a needed one; it never blocks the readable-offset advance. The shared anchor computation is extracted into findBaseAnchorAppendSnapshot, reused by both the with-L0 offset traversal and the new no-L0 retention pass. The partitioned test expectation that encoded the old, over-aggressive value is updated accordingly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fresh-borzoni
left a comment
There was a problem hiding this comment.
@luoyuxia Thank you, looks good, left minor comments, PTAL
| allAnchorsResolved = false; | ||
| } | ||
|
|
||
| return allAnchorsResolved ? earliestSnapshotIdToKeep : LakeCommitResult.KEEP_ALL_PREVIOUS; |
There was a problem hiding this comment.
If one bucket's flush history has aged out, this returns KEEP_ALL_PREVIOUS and throws away the bound the other buckets already gave us - so retention never shrinks while that bucket stays cold.
The unresolvable anchor is by definition older than earliestSnapshotId (already gone from Paimon), so keep-all can't recover it anyway, flooring at earliestSnapshotId is just as safe and lets retention track Paimon's expiry.
WDYT?
| continue; | ||
| } | ||
| // The first flush encountered going backwards is the bucket's most recent flush. | ||
| for (PaimonPartitionBucket partitionBucket : getBucketsWithFlushedL0(currentSnapshot)) { |
There was a problem hiding this comment.
This re-walks the same snapshots as the with-L0 loop above, so getBucketsWithFlushedL0 scans manifests twice per compacted snapshot. Could it fold into that loop: one pass, offsets for the with-L0 buckets, anchor-only for the rest?
Purpose
Linked issue: close #xxx
For Paimon deletion-vector (DV) enabled tables,
DvTableReadableSnapshotRetriever.getReadableSnapshotAndOffsetscomputes, for a just-committed APPEND (tiered) snapshot, the readable snapshot + per-bucket readable offsets, and anearliestSnapshotIdToKeeptelling Fluss which older lake snapshots are safe to delete. This PR fixes two correctness bugs in that path.Bug 1 - scanning the wrong snapshot (ineffective options).
getBucketsWithoutL0AndWithL0(snapshot)decided which buckets have L0 by scanning viafileStoreTable.copy(scanOptions).store().newScan(), passingSCAN_SNAPSHOT_IDandBATCH_SCAN_MODEas table options. Those options are only consumed by table-level scans (DataTableBatchScan), not by store-level scans (AbstractFileStoreScan), so the scan always hit the latest snapshot instead of the requested compacted snapshot (ManifestsReaderfalls back tosnapshotManager.latestSnapshot()when no snapshot is specified). Fixed by using the direct.withSnapshot(snapshot)API, consistent withgetBucketsWithFlushedL0/PaimonDvTableUtils.Bug 2 - premature snapshot deletion (over-aggressive retention).
When the latest compacted snapshot had no L0 in any bucket, the code shortcut
earliestSnapshotIdToKeepto that compacted snapshot's previous APPEND, assuming nothing earlier was needed. This is unsound: a bucket can be clean (no L0) in the current compacted snapshot yet still be anchored to an older snapshot - it was flushed earlier and has not been flushed since, so its base anchor (the previous APPEND of the latest snapshot that exactly holds its most-recently flushed L0) can be older than the compacted snapshot's previous APPEND. Once such a bucket later receives new L0, recomputation traces back to that older anchor; if it was already deleted, the retrieval returnsnulland readable offsets can no longer advance.Concrete example (the scenario exercised by
testGetReadableSnapshotAndOffsets, using Paimon snapshot ids; focus onbucket0):bucket0is compacted at snapshot 7, which flushes its L0 into base. Its base anchor is snapshot 6 (the latest snapshot still holding that flushed L0; an APPEND, so it is the anchor directly).earliestSnapshotIdToKeep= snapshot 9's previous APPEND = snapshot 8, so Fluss deleted every lake snapshot< 8, including snapshot 6.bucket0then receives new appends (new L0) and a later compaction produces snapshot 11. Nowbucket0has L0 again, so recomputation traces back to its most recent flush (snapshot 7) and needs its base anchor snapshot 6 to read the offset - but snapshot 6 was already deleted in step 2.getReadableSnapshotAndOffsetsreturnsnull, and readable offsets can no longer advance.Bug 2 was masked by Bug 1: scanning the latest (appended) snapshot made freshly-appended buckets look like "with L0", so they went through the per-bucket flush traversal that retained the old anchors. Fixing the scan exposed the retention bug.
Brief change log
FileStoreScan.withSnapshot(snapshot)ingetBucketsWithoutL0AndWithL0instead of relying on table options that the store-level scan ignores.earliestSnapshotIdToKeepto the minimum anchor across all buckets (a bucket's anchor only moves forward when it is flushed again, so until then it must be retained).KEEP_ALL_PREVIOUS(keep everything) rather than risk deleting a still-needed snapshot. This never blocks the readable-offset advance, since these buckets' offsets are already resolved from the latest tiered snapshot.findBaseAnchorAppendSnapshot, reused by both the with-L0 offset traversal and the new no-L0 retention pass.Tests
DvTableReadableSnapshotRetrieverTest#testGetReadableSnapshotAndOffsets(non-partitioned) and#testGetReadableSnapshotAndOffsetsForPartitionedTable(partitioned) - both run against real Paimon tables and now pass with the corrected scan and retention logic.earliestSnapshotIdToKeepexpectation, which encoded the old over-aggressive value, is updated to the correct minimum-anchor value (with an explanatory comment).API and Format
No public API or storage format change.
Documentation
No documentation change; behavior-only fix.
Generative AI disclosure: Yes - authored with Claude Code (Claude Opus 4.8). Reviewed by a human developer.
🤖 Generated with Claude Code