Skip to content

Core: Fix row lineage last updated sequence inheritance#17039

Open
wgtmac wants to merge 4 commits into
mainfrom
codex/fix-row-lineage-sequence
Open

Core: Fix row lineage last updated sequence inheritance#17039
wgtmac wants to merge 4 commits into
mainfrom
codex/fix-row-lineage-sequence

Conversation

@wgtmac

@wgtmac wgtmac commented Jul 2, 2026

Copy link
Copy Markdown
Member

Summary

  • Fix _last_updated_sequence_number inheritance to use a data file's data_sequence_number in scan constants.
  • Add a regression test for v2 rewrite followed by v3 upgrade row ID assignment.

Problem

The row lineage spec says inherited _last_updated_sequence_number values come from the data file manifest entry sequence number, i.e. the data sequence number. Java was using task.file().fileSequenceNumber() when populating the metadata column constant.

That produces the wrong value when a v2 rewrite/compaction adds a replacement file with an older data sequence number, and the table is later upgraded to v3. After upgrade, assigning first_row_id to the existing metadata tree makes readers inherit _last_updated_sequence_number for old files that do not contain row lineage columns. Using the file sequence number reports the rewrite/compaction snapshot instead of the original row update sequence.

Why this is correct

data_sequence_number represents the sequence number associated with the file content and can be preserved across rewrites. That is the value required by the spec for inherited _last_updated_sequence_number. file_sequence_number represents when the physical file was added, which may be later than the row content update when files are rewritten.

Testing

  • JAVA_HOME=/Users/gangwu/.sdkman/candidates/java/17.0.18-zulu ./gradlew :iceberg-core:test --tests 'org.apache.iceberg.TestRowLineageAssignment.lastUpdatedAfterUpgrade'
  • git diff --check

AI Disclosure

  • Model: OpenAI Codex
  • Platform/Tool: Codex CLI/Desktop
  • Human Oversight: partially reviewed
  • Prompt Summary: Investigate row lineage review feedback, add a regression test, and fix Java inheritance to match the spec.

Update the row lineage spec to say that inherited _last_updated_sequence_number values come from a data file's file_sequence_number, not its data_sequence_number.

This matches the Java scan constant implementation and the field's meaning as the sequence number of the commit that last updated the row. Using data_sequence_number would diverge for rewrites where the data sequence can differ from the snapshot that added the file.

Generated-by: OpenAI Codex
@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label Jul 2, 2026

@amogh-jahagirdar amogh-jahagirdar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is really neccessary to change in the spec. We know that for updates and rewrites, existing rows are required to have their row IDs and sequence numbers materialized in the new files based on the rules 2 paragraphs below this one. Those are the sources of truth for subsequent lineage information.

The case where this inheritance applies is where there's new rows (e.g. an append) or an assignment from a data file written in an older table format version being put into a v3 manifest.

In both of those cases the file sequence number equals the data sequence number so I think it works correctly as written in the spec.

If you have an example scenario where this becomes ambiguous would be good to understand that.

I do think we probably want to look at the Java implementation just using the task.file().dataSequenceNumber() because I can see how that's confusing but I don't think there's an issue there given the invariant I mentioned that the case where inheritance applies is also where data sequence number == file sequence number. Let me know if I'm missing something!

@wgtmac wgtmac force-pushed the codex/fix-row-lineage-sequence branch from 78e103a to e8a31b9 Compare July 2, 2026 14:45
@wgtmac wgtmac changed the title Spec: Fix row lineage sequence number inheritance Core: Fix row lineage last updated sequence inheritance Jul 2, 2026
@github-actions github-actions Bot added the core label Jul 2, 2026
Restore the row lineage spec wording and update Java scan constants to inherit _last_updated_sequence_number from a data file's data sequence number.

This preserves the original row update sequence when a v2 rewrite carries an older data sequence number and the table is later upgraded to v3.

Generated-by: OpenAI Codex
@wgtmac wgtmac force-pushed the codex/fix-row-lineage-sequence branch from e8a31b9 to 6a6f812 Compare July 2, 2026 14:48
@wgtmac

wgtmac commented Jul 2, 2026

Copy link
Copy Markdown
Member Author

Thanks @amogh-jahagirdar for the quick review!

I just got this confusion while working on v3 row lineage in iceberg-cpp. Now I agree with you that Spec should not be changed. I checked this against the upgrade path and found one case where inheritance applies even though data/file sequence numbers differ.

The scenario is: v2 append -> v2 rewrite using the original data sequence -> upgrade to v3 -> assign first_row_id to the existing metadata tree. The rewritten file has no row lineage columns, gets a non-null first_row_id after upgrade, and has data_sequence_number=1 while file_sequence_number=2. In that case Java was inheriting _last_updated_sequence_number from fileSequenceNumber(), returning 2. Now I've fixed it and added a regression test.

Let me know what you think.

@amogh-jahagirdar

amogh-jahagirdar commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Thanks for the clarification and update @wgtmac ! I think the issue you pointed out is legitimate; the key is that prior to the upgrade we have a rewrite which also does a re-sequencing of the data files to something older than the file sequence number.

Just reasoning about impact of this issue, I think we can materialize bad sequence numbers in files on subsequent updates of those records that were in those pre-upgrade files which is problematic in itself (it'll say row X last changed in sequence 2 whereas it should be 1) for auditability.

However, from a change detection perspective, I think despite this issue, consumers largely end up being OK just because a change detection consumer can really only use lineage for change detection starting from the sequence number post-upgrade. I think this is the case because even if data sequence number < file sequence number for the v2 file, both data and file sequence numbers would neccessarily be less than the sequence number of any writes that happen post-upgrade; whether it was originally 2 or 1 ends up being inconsequential from a lineage-based change detection.

Anyways, either way I do think we should address this and get this into 1.10.3! Thank you @wgtmac , taking a look at the tests

}

@Test
public void lastUpdatedAfterUpgrade(@TempDir File altLocation) throws IOException {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testLastUpdatedDataSequencePreservedAfterUpgrade? I think we should still prefix with test in this class to be consistent.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we agreed we would no longer be restrained by the sins of our past :)

More seriously I did think we decided we were doing updates for new code. Just trying not to blanket modify everything when it doesn't matter

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, in that case i'm good with how it is!

assertThat(task.file().firstRowId()).isNotNull();

assertThat(
PartitionUtil.constantsMap(task)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit strange in these tests to have inspect the constantsMap but I get it because the sequence number is inherited metadata and not present on any of the in memory representations available. Maybe just an inline comment?

@amogh-jahagirdar

Copy link
Copy Markdown
Contributor

we probably also want to add a spark test as well, just to verify the end to end

Add Spark rewrite-data-files coverage for V2 tables upgraded to V3. The test verifies row lineage reads preserve the data sequence number when rewritten files have a newer file sequence number.
@github-actions github-actions Bot added the spark label Jul 3, 2026
@wgtmac

wgtmac commented Jul 3, 2026

Copy link
Copy Markdown
Member Author

I've added a Spark test with the help of Codex as I'm not that familiar with it. Let me know if it looks good. @amogh-jahagirdar

@amogh-jahagirdar amogh-jahagirdar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great, thank you @wgtmac , I just had some minor suggestions on how we should write the spark test. Would be good to get @RussellSpitzer @rdblue input on this as well

assertEquals("Rows must match", expectedRecordsWithLineage, actualRecordsWithLineage);
}

@TestTemplate

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks right but I'd cleanup the empty append and do a more realistic write, I'd also just use an explicit assertion on the records (just a bit more readable/less opaque)

+  @TestTemplate
+  public void testLastUpdatedSequenceInheritedFromDataSequenceAfterUpgrade() {
+    // Reproduces the inheritance bug for a v2-origin file carried into v3.
+    assumeThat(formatVersion).isEqualTo(2);
+
+    Table table = createTable();
+    writeRecords(2 /* files */, 4 /* records */);
+    table.refresh();
+    shouldHaveFiles(table, 2);
+    long originalSequenceNumber = table.currentSnapshot().sequenceNumber();
+    // Perform a compaction before upgrade which preserves the original sequence number
+    basicRewrite(table).execute();
+    table.refresh();
+    shouldHaveFiles(table, 1);
+
+    DataFile compacted = Iterables.getOnlyElement(currentDataFiles(table));
+    assertThat(compacted.dataSequenceNumber()).isEqualTo(originalSequenceNumber);
+    assertThat(compacted.fileSequenceNumber())
+        .as("Compaction must bump the file sequence number above the preserved data sequence")
+        .isGreaterThan(originalSequenceNumber);
+
+    // Upgrade to v3 so the row lineage metadata columns are exposed on read.
+    table.updateProperties().set(TableProperties.FORMAT_VERSION, "3").commit();
+    table.refresh();
+
+    writeRecords(1 /* files */, 4 /* records */);
+    table.refresh();
+    long appendSequenceNumber = table.currentSnapshot().sequenceNumber();
+
+    // currentDataWithLineage() projects (_row_id, _last_updated_sequence_number, *) ordered by
+    // _row_id; the trailing data columns are irrelevant here, so match them with ANY.
+    List<Object[]> actualLineage = currentDataWithLineage();
+
+    List<Object[]> expectedLineage =
+        Lists.newArrayList(
+            // The post-upgrade append writes a new file, which is assigned the first row IDs (0-3)
+            // and carries the append commit's sequence number as its last-updated sequence.
+            row(0L, appendSequenceNumber, ANY, ANY, ANY),
+            row(1L, appendSequenceNumber, ANY, ANY, ANY),
+            row(2L, appendSequenceNumber, ANY, ANY, ANY),
+            row(3L, appendSequenceNumber, ANY, ANY, ANY),
+            row(4L, originalSequenceNumber, ANY, ANY, ANY),
+            row(5L, originalSequenceNumber, ANY, ANY, ANY),
+            row(6L, originalSequenceNumber, ANY, ANY, ANY),
+            row(7L, originalSequenceNumber, ANY, ANY, ANY));
+
+    assertEquals(
+        "Carried-over rows must inherit the data sequence",
+        expectedLineage,
+        actualLineage);
+  }
+

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to keep your test name as is, just what i had locally

@amogh-jahagirdar amogh-jahagirdar Jul 3, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually my test is arguably a bit brittle to how we organize the manifests in the manifest list, today it works but if there's a reordering the test would fail needlessly since the first row ID assignment would be different but still valid. Maybe we should key off a logical row to make it more resilient (but it becomes a bit more complex)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok here's a more minimal test I came up with so we don't even have to append any records. We can just do a compact + upgrade + rewrite manifests to move the new file into a single v3 manifest. since there's a single data file/single manifest we can then reliably assert the row IDs regardless of any change in the implementation.

+  @TestTemplate
+  public void testLastUpdatedSequenceInheritedFromDataSequenceAfterUpgrade() {
+    assumeThat(formatVersion).isEqualTo(2);
+
+    Table table = createTable();
+
+    // Two data files in a single v2 commit; both share data sequence number 1.
+    writeRecords(2 /* files */, 4 /* records */);
+    table.refresh();
+    shouldHaveFiles(table, 2);
+    long committedDataSequence = table.currentSnapshot().sequenceNumber();
+
+    basicRewrite(table).execute();
+    table.refresh();
+    shouldHaveFiles(table, 1);
+
+    DataFile compacted = Iterables.getOnlyElement(currentDataFiles(table));
+    long dataSequence = compacted.dataSequenceNumber();
+    long fileSequence = compacted.fileSequenceNumber();
+    assertThat(dataSequence)
+        .as("Compaction must preserve the original data sequence number")
+        .isEqualTo(committedDataSequence);
+    assertThat(fileSequence)
+        .as("Compaction must bump the file sequence above the preserved data sequence")

+    shouldHaveFiles(table, 1);
+
+    DataFile compacted = Iterables.getOnlyElement(currentDataFiles(table));
+    long dataSequence = compacted.dataSequenceNumber();
+    long fileSequence = compacted.fileSequenceNumber();
+    assertThat(dataSequence)
+        .as("Compaction must preserve the original data sequence number")
+        .isEqualTo(committedDataSequence);
+    assertThat(fileSequence)
+        .as("Compaction must bump the file sequence above the preserved data sequence")
+        .isGreaterThan(dataSequence);
+
+    // Upgrade to v3, then move the carried-over file into a new v3 manifest 
+    table.updateProperties().set(TableProperties.FORMAT_VERSION, "3").commit();
+    table.rewriteManifests().rewriteIf(manifest -> true).commit();
+    table.refresh();
+
+    List<Object[]> expected =
+        Lists.newArrayList(
+            row(0L, dataSequence, ANY, ANY, ANY),
+            row(1L, dataSequence, ANY, ANY, ANY),
+            row(2L, dataSequence, ANY, ANY, ANY),
+            row(3L, dataSequence, ANY, ANY, ANY));
+
+    assertEquals(
+        "Row IDs must be 0..3 and last-updated must inherit the data sequence",
+        expected,
+        currentDataWithLineage());
+  }

Use manifest rewrite after upgrading the compacted V2 data file to V3 so row ID assignment is deterministic without an empty append. Assert the explicit Spark row lineage output expected from the preserved data sequence.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core spark Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants