Core: Fix row lineage last updated sequence inheritance by wgtmac · Pull Request #17039 · apache/iceberg

wgtmac · 2026-07-02T09:29:47Z

Summary

Fix _last_updated_sequence_number inheritance to use a data file's data_sequence_number in scan constants.
Add a regression test for v2 rewrite followed by v3 upgrade row ID assignment.

Problem

The row lineage spec says inherited _last_updated_sequence_number values come from the data file manifest entry sequence number, i.e. the data sequence number. Java was using task.file().fileSequenceNumber() when populating the metadata column constant.

That produces the wrong value when a v2 rewrite/compaction adds a replacement file with an older data sequence number, and the table is later upgraded to v3. After upgrade, assigning first_row_id to the existing metadata tree makes readers inherit _last_updated_sequence_number for old files that do not contain row lineage columns. Using the file sequence number reports the rewrite/compaction snapshot instead of the original row update sequence.

Why this is correct

data_sequence_number represents the sequence number associated with the file content and can be preserved across rewrites. That is the value required by the spec for inherited _last_updated_sequence_number. file_sequence_number represents when the physical file was added, which may be later than the row content update when files are rewritten.

Testing

JAVA_HOME=/Users/gangwu/.sdkman/candidates/java/17.0.18-zulu ./gradlew :iceberg-core:test --tests 'org.apache.iceberg.TestRowLineageAssignment.lastUpdatedAfterUpgrade'
git diff --check

AI Disclosure

Model: OpenAI Codex
Platform/Tool: Codex CLI/Desktop
Human Oversight: partially reviewed
Prompt Summary: Investigate row lineage review feedback, add a regression test, and fix Java inheritance to match the spec.

Update the row lineage spec to say that inherited _last_updated_sequence_number values come from a data file's file_sequence_number, not its data_sequence_number. This matches the Java scan constant implementation and the field's meaning as the sequence number of the commit that last updated the row. Using data_sequence_number would diverge for rewrites where the data sequence can differ from the snapshot that added the file. Generated-by: OpenAI Codex

amogh-jahagirdar

I'm not sure this is really neccessary to change in the spec. We know that for updates and rewrites, existing rows are required to have their row IDs and sequence numbers materialized in the new files based on the rules 2 paragraphs below this one. Those are the sources of truth for subsequent lineage information.

The case where this inheritance applies is where there's new rows (e.g. an append) or an assignment from a data file written in an older table format version being put into a v3 manifest.

In both of those cases the file sequence number equals the data sequence number so I think it works correctly as written in the spec.

If you have an example scenario where this becomes ambiguous would be good to understand that.

I do think we probably want to look at the Java implementation just using the task.file().dataSequenceNumber() because I can see how that's confusing but I don't think there's an issue there given the invariant I mentioned that the case where inheritance applies is also where data sequence number == file sequence number. Let me know if I'm missing something!

Restore the row lineage spec wording and update Java scan constants to inherit _last_updated_sequence_number from a data file's data sequence number. This preserves the original row update sequence when a v2 rewrite carries an older data sequence number and the table is later upgraded to v3. Generated-by: OpenAI Codex

wgtmac · 2026-07-02T14:54:03Z

Thanks @amogh-jahagirdar for the quick review!

I just got this confusion while working on v3 row lineage in iceberg-cpp. Now I agree with you that Spec should not be changed. I checked this against the upgrade path and found one case where inheritance applies even though data/file sequence numbers differ.

The scenario is: v2 append -> v2 rewrite using the original data sequence -> upgrade to v3 -> assign first_row_id to the existing metadata tree. The rewritten file has no row lineage columns, gets a non-null first_row_id after upgrade, and has data_sequence_number=1 while file_sequence_number=2. In that case Java was inheriting _last_updated_sequence_number from fileSequenceNumber(), returning 2. Now I've fixed it and added a regression test.

Let me know what you think.

amogh-jahagirdar · 2026-07-02T16:39:36Z

Thanks for the clarification and update @wgtmac ! I think the issue you pointed out is legitimate; the key is that prior to the upgrade we have a rewrite which also does a re-sequencing of the data files to something older than the file sequence number.

Just reasoning about impact of this issue, I think we can materialize bad sequence numbers in files on subsequent updates of those records that were in those pre-upgrade files which is problematic in itself (it'll say row X last changed in sequence 2 whereas it should be 1) for auditability.

However, from a change detection perspective, I think despite this issue, consumers largely end up being OK just because a change detection consumer can really only use lineage for change detection starting from the sequence number post-upgrade. I think this is the case because even if data sequence number < file sequence number for the v2 file, both data and file sequence numbers would neccessarily be less than the sequence number of any writes that happen post-upgrade; whether it was originally 2 or 1 ends up being inconsequential from a lineage-based change detection.

Anyways, either way I do think we should address this and get this into 1.10.3! Thank you @wgtmac , taking a look at the tests

amogh-jahagirdar · 2026-07-02T16:49:04Z

  }

+  @Test
+  public void lastUpdatedAfterUpgrade(@TempDir File altLocation) throws IOException {


testLastUpdatedDataSequencePreservedAfterUpgrade? I think we should still prefix with test in this class to be consistent.

I thought we agreed we would no longer be restrained by the sins of our past :)

More seriously I did think we decided we were doing updates for new code. Just trying not to blanket modify everything when it doesn't matter

got it, in that case i'm good with how it is!

amogh-jahagirdar · 2026-07-02T16:52:52Z

+      assertThat(task.file().firstRowId()).isNotNull();
+
+      assertThat(
+              PartitionUtil.constantsMap(task)


It is a bit strange in these tests to have inspect the constantsMap but I get it because the sequence number is inherited metadata and not present on any of the in memory representations available. Maybe just an inline comment?

amogh-jahagirdar · 2026-07-02T18:19:19Z

we probably also want to add a spark test as well, just to verify the end to end

Add Spark rewrite-data-files coverage for V2 tables upgraded to V3. The test verifies row lineage reads preserve the data sequence number when rewritten files have a newer file sequence number.

wgtmac · 2026-07-03T09:44:44Z

I've added a Spark test with the help of Codex as I'm not that familiar with it. Let me know if it looks good. @amogh-jahagirdar

amogh-jahagirdar

Overall looks great, thank you @wgtmac , I just had some minor suggestions on how we should write the spark test. Would be good to get @RussellSpitzer @rdblue input on this as well

amogh-jahagirdar · 2026-07-03T15:48:39Z

    assertEquals("Rows must match", expectedRecordsWithLineage, actualRecordsWithLineage);
  }

+  @TestTemplate


I think this looks right but I'd cleanup the empty append and do a more realistic write, I'd also just use an explicit assertion on the records (just a bit more readable/less opaque)

+ @TestTemplate + public void testLastUpdatedSequenceInheritedFromDataSequenceAfterUpgrade() { + // Reproduces the inheritance bug for a v2-origin file carried into v3. + assumeThat(formatVersion).isEqualTo(2); + + Table table = createTable(); + writeRecords(2 /* files */, 4 /* records */); + table.refresh(); + shouldHaveFiles(table, 2); + long originalSequenceNumber = table.currentSnapshot().sequenceNumber(); + // Perform a compaction before upgrade which preserves the original sequence number + basicRewrite(table).execute(); + table.refresh(); + shouldHaveFiles(table, 1); + + DataFile compacted = Iterables.getOnlyElement(currentDataFiles(table)); + assertThat(compacted.dataSequenceNumber()).isEqualTo(originalSequenceNumber); + assertThat(compacted.fileSequenceNumber()) + .as("Compaction must bump the file sequence number above the preserved data sequence") + .isGreaterThan(originalSequenceNumber); + + // Upgrade to v3 so the row lineage metadata columns are exposed on read. + table.updateProperties().set(TableProperties.FORMAT_VERSION, "3").commit(); + table.refresh(); + + writeRecords(1 /* files */, 4 /* records */); + table.refresh(); + long appendSequenceNumber = table.currentSnapshot().sequenceNumber(); + + // currentDataWithLineage() projects (_row_id, _last_updated_sequence_number, *) ordered by + // _row_id; the trailing data columns are irrelevant here, so match them with ANY. + List<Object[]> actualLineage = currentDataWithLineage(); + + List<Object[]> expectedLineage = + Lists.newArrayList( + // The post-upgrade append writes a new file, which is assigned the first row IDs (0-3) + // and carries the append commit's sequence number as its last-updated sequence. + row(0L, appendSequenceNumber, ANY, ANY, ANY), + row(1L, appendSequenceNumber, ANY, ANY, ANY), + row(2L, appendSequenceNumber, ANY, ANY, ANY), + row(3L, appendSequenceNumber, ANY, ANY, ANY), + row(4L, originalSequenceNumber, ANY, ANY, ANY), + row(5L, originalSequenceNumber, ANY, ANY, ANY), + row(6L, originalSequenceNumber, ANY, ANY, ANY), + row(7L, originalSequenceNumber, ANY, ANY, ANY)); + + assertEquals( + "Carried-over rows must inherit the data sequence", + expectedLineage, + actualLineage); + } +

Feel free to keep your test name as is, just what i had locally

actually my test is arguably a bit brittle to how we organize the manifests in the manifest list, today it works but if there's a reordering the test would fail needlessly since the first row ID assignment would be different but still valid. Maybe we should key off a logical row to make it more resilient (but it becomes a bit more complex)

Ok here's a more minimal test I came up with so we don't even have to append any records. We can just do a compact + upgrade + rewrite manifests to move the new file into a single v3 manifest. since there's a single data file/single manifest we can then reliably assert the row IDs regardless of any change in the implementation.

+ @TestTemplate + public void testLastUpdatedSequenceInheritedFromDataSequenceAfterUpgrade() { + assumeThat(formatVersion).isEqualTo(2); + + Table table = createTable(); + + // Two data files in a single v2 commit; both share data sequence number 1. + writeRecords(2 /* files */, 4 /* records */); + table.refresh(); + shouldHaveFiles(table, 2); + long committedDataSequence = table.currentSnapshot().sequenceNumber(); + + basicRewrite(table).execute(); + table.refresh(); + shouldHaveFiles(table, 1); + + DataFile compacted = Iterables.getOnlyElement(currentDataFiles(table)); + long dataSequence = compacted.dataSequenceNumber(); + long fileSequence = compacted.fileSequenceNumber(); + assertThat(dataSequence) + .as("Compaction must preserve the original data sequence number") + .isEqualTo(committedDataSequence); + assertThat(fileSequence) + .as("Compaction must bump the file sequence above the preserved data sequence") + shouldHaveFiles(table, 1); + + DataFile compacted = Iterables.getOnlyElement(currentDataFiles(table)); + long dataSequence = compacted.dataSequenceNumber(); + long fileSequence = compacted.fileSequenceNumber(); + assertThat(dataSequence) + .as("Compaction must preserve the original data sequence number") + .isEqualTo(committedDataSequence); + assertThat(fileSequence) + .as("Compaction must bump the file sequence above the preserved data sequence") + .isGreaterThan(dataSequence); + + // Upgrade to v3, then move the carried-over file into a new v3 manifest + table.updateProperties().set(TableProperties.FORMAT_VERSION, "3").commit(); + table.rewriteManifests().rewriteIf(manifest -> true).commit(); + table.refresh(); + + List<Object[]> expected = + Lists.newArrayList( + row(0L, dataSequence, ANY, ANY, ANY), + row(1L, dataSequence, ANY, ANY, ANY), + row(2L, dataSequence, ANY, ANY, ANY), + row(3L, dataSequence, ANY, ANY, ANY)); + + assertEquals( + "Row IDs must be 0..3 and last-updated must inherit the data sequence", + expected, + currentDataWithLineage()); + }

Use manifest rewrite after upgrading the compacted V2 data file to V3 so row ID assignment is deterministic without an empty append. Assert the explicit Spark row lineage output expected from the preserved data sequence.

github-actions Bot added the Specification Issues that may introduce spec changes. label Jul 2, 2026

amogh-jahagirdar requested changes Jul 2, 2026

View reviewed changes

wgtmac force-pushed the codex/fix-row-lineage-sequence branch from 78e103a to e8a31b9 Compare July 2, 2026 14:45

wgtmac changed the title ~~Spec: Fix row lineage sequence number inheritance~~ Core: Fix row lineage last updated sequence inheritance Jul 2, 2026

github-actions Bot added the core label Jul 2, 2026

wgtmac force-pushed the codex/fix-row-lineage-sequence branch from e8a31b9 to 6a6f812 Compare July 2, 2026 14:48

amogh-jahagirdar requested review from RussellSpitzer, rdblue and stevenzwu July 2, 2026 16:43

amogh-jahagirdar reviewed Jul 2, 2026

View reviewed changes

Spark: Test row lineage sequence after upgrade

a489551

Add Spark rewrite-data-files coverage for V2 tables upgraded to V3. The test verifies row lineage reads preserve the data sequence number when rewritten files have a newer file sequence number.

github-actions Bot added the spark label Jul 3, 2026

amogh-jahagirdar reviewed Jul 3, 2026

View reviewed changes

Spark: Simplify row lineage upgrade test

8fe315a

Use manifest rewrite after upgrading the compacted V2 data file to V3 so row ID assignment is deterministic without an empty append. Assert the explicit Spark row lineage output expected from the preserved data sequence.

Uh oh!

Conversation

wgtmac commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Why this is correct

Testing

Uh oh!

amogh-jahagirdar left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wgtmac commented Jul 2, 2026

Uh oh!

amogh-jahagirdar commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amogh-jahagirdar Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

RussellSpitzer Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

amogh-jahagirdar Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

amogh-jahagirdar Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

amogh-jahagirdar commented Jul 2, 2026

Uh oh!

wgtmac commented Jul 3, 2026

Uh oh!

amogh-jahagirdar left a comment

Choose a reason for hiding this comment

Uh oh!

amogh-jahagirdar Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

amogh-jahagirdar Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

amogh-jahagirdar Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

amogh-jahagirdar Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wgtmac commented Jul 2, 2026 •

edited

Loading

amogh-jahagirdar left a comment •

edited

Loading

amogh-jahagirdar commented Jul 2, 2026 •

edited

Loading

amogh-jahagirdar Jul 3, 2026 •

edited

Loading