fix: handle non-contiguous RowRanges when resolving global row IDs by zhf999 · Pull Request #383 · alibaba/paimon-cpp

zhf999 · 2026-06-26T02:55:49Z

Purpose

Fix a correctness bug where upper layers assumed returned batch rows are continuous and derived global row IDs as previous_batch_start + offset.
Introduce unified usage of GetPreviousBatchGlobalRowId(batch_row_id) to resolve the global row ID for a row index inside the current batch.
In PrefetchFileBatchReaderImpl, cache the actual global row IDs for each returned batch and keep row-id mapping aligned when a batch is sliced by read_range.
For Parquet, add target row-range union and per-batch row mapping logic:
- Merge filtered target ranges into a global target row set.
- Build batch_row -> global_row_id mapping in NextBatch().
- Keep GetPreviousBatchGlobalRowId() correct under non-contiguous rows caused by predicate + bitmap filtering.
Update upper-layer consumers to query per-row global IDs directly (deletion vectors, bitmap index filtering, _ROW_ID field conversion, KeyValue iteration positions).

Tests

Updated and adapted reader tests to the new interface and semantics:
- src/paimon/format/avro/avro_file_batch_reader_test.cpp
- src/paimon/format/blob/blob_file_batch_reader_test.cpp
- src/paimon/format/lance/lance_format_reader_writer_test.cpp
- src/paimon/format/orc/orc_file_batch_reader_test.cpp
- src/paimon/format/parquet/parquet_file_batch_reader_test.cpp
- src/paimon/common/reader/prefetch_file_batch_reader_impl_test.cpp
Added/extended coverage in Parquet with TestRowMapping to validate global row mapping across non-contiguous ranges.
Remaining changes are interface migration plus consistency updates of call sites and assertions.

API and Format

Public reader API change: FileBatchReader and implementations now use GetPreviousBatchGlobalRowId(uint64_t batch_row_id).
Semantic change: returns the global row ID for the batch_row_id inside the current batch (instead of deriving by batch start + offset under contiguous assumptions).
Storage format and on-disk protocol are unchanged.

Documentation

No.

Generative AI tooling

gpt-5.3-codex

…ousBatchFirstRowId to GetGlobalRowId

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

lxy-9602 · 2026-06-27T03:12:40Z

-    virtual Result<uint64_t> GetPreviousBatchFirstRowNumber() const = 0;
+    /// Get the global row number of the row in the previously read batch.
+    virtual Result<uint64_t> GetPreviousBatchGlobalRowId(uint64_t batch_row_id) const = 0;



GetPreviousBatchGlobalRowId -> GetPreviousBatchFileRowId

lxy-9602 · 2026-06-27T03:18:32Z

+            PAIMON_ASSIGN_OR_RAISE(uint64_t global_row_id, reader_->GetPreviousBatchGlobalRowId(i));
+            if (bitmap_.Contains(global_row_id)) {
+                is_valid.Add(i);
+            }


This looks inefficient. Contains starts scanning from the beginning on every call, which can be quite costly for large bitmaps. In this case, since the iterator only moves forward, it should be enough to adjust the filter logic to take advantage of that.

lxy-9602 · 2026-06-27T03:19:24Z

-Result<uint64_t> PrefetchFileBatchReaderImpl::GetPreviousBatchFirstRowNumber() const {
-    return previous_batch_first_row_num_;
+Result<uint64_t> PrefetchFileBatchReaderImpl::GetPreviousBatchGlobalRowId(
+    uint64_t batch_row_id) const {


Why can’t we just return previous_batch_first_row_num_ + batch_row_id directly?

lxy-9602 · 2026-06-27T03:21:54Z

-    ASSERT_EQ(reader->GetPreviousBatchFirstRowNumber().value(), 0);
+    uint64_t global_row_id = 0;
+    ASSERT_OK_AND_ASSIGN(global_row_id, reader->GetPreviousBatchGlobalRowId(0));
+    ASSERT_EQ(global_row_id, 0);


Using .value() directly should also work here.

lxy-9602 · 2026-06-27T03:28:30Z

                         ReadResultCollector::CollectResult(
                             reader.get(), /*max simulated data processing time*/ 100));
-    ASSERT_EQ(reader->GetPreviousBatchFirstRowNumber().value(), 10);
    auto expected_array = std::make_shared<arrow::ChunkedArray>(data_array);


If we drop the final validation, this scenario would no longer be exercised by the test. Could we mimic ReadResultCollector::CollectResult inside this test instead? We don’t need to materialize the full result — the goal is just to validate the last row of the final batch.

lxy-9602 · 2026-06-27T03:40:30Z

+            PAIMON_ASSIGN_OR_RAISE(uint64_t global_row_id,
+                                   reader_->GetPreviousBatchGlobalRowId(idx_in_array));
+            return first_row_id_.value() + global_row_id;
        };


The name global row id doesn’t seem quite appropriate in this scenario. It usually implies a row ID at the table level, while here global_row_id actually refers to a row ID within the file.

lxy-9602 · 2026-06-27T03:46:44Z

-    Result<uint64_t> GetPreviousBatchFirstRowNumber() const override {
-        return ToReaderRowNumber(previous_batch_first_row_num_);
+    Result<uint64_t> GetPreviousBatchGlobalRowId(uint64_t batch_row_id) const override {
+        return previous_batch_first_row_num_ + batch_row_id;


If previous_batch_first_row_num_ can be -1, then the current implementation of GetPreviousBatchGlobalRowId would be incorrect, which means the refactoring is not behaviorally equivalent.

lxy-9602 · 2026-06-27T03:47:08Z


+    static Result<std::shared_ptr<arrow::ChunkedArray>> CollectResultOneBatch(
+        BatchReader* batch_reader) {
+        return CollectResultOneBatch(batch_reader, /*max simulated data processing time*/ 0);


/*max_data_processing_time_in_us=*/

lxy-9602 · 2026-06-27T03:50:16Z

+
+    static Result<std::shared_ptr<arrow::ChunkedArray>> CollectResultOneBatch(
+        BatchReader* batch_reader, int64_t max_data_processing_time_in_us) {
+        int64_t seed = DateTimeUtils::GetCurrentUTCTimeUs();


Can CollectResultOneBatch just return an Arrow array directly? It doesn’t look like we need a ChunkedArray here.

lxy-9602 · 2026-06-27T04:03:28Z


+Result<RowRanges> ParquetFileBatchReader::GetAllTargetRowRanges(
+    const std::vector<TargetRowGroup>& target_row_groups) {
+    row_mapping_.clear();


Please rename this to something like UpdateAllTargetRowRanges. Methods prefixed with Get should ideally avoid modifying member variables. Also, update doesn’t need to return anything here — it can just update all_row_ranges_ directly.

zhf999 added 4 commits June 26, 2026 10:36

fix: FileBatchReader returns discontinuous batch, and change GetPrevi…

5d534db

…ousBatchFirstRowId to GetGlobalRowId

style: change interface name

ad03498

update header files

cd7bd44

Merge branch 'main' into fix-rowid

f9721a8

Copilot AI review requested due to automatic review settings June 26, 2026 02:55

Copilot AI reviewed Jun 26, 2026

zhf999 added 7 commits June 26, 2026 14:06

fix: return Status::Invalid intead of returning max value

41f932d

fix: lance and blob return NotImplemented

6bd98d8

fix: add inclusive extend for fully matched rowgroup in SetReadSchema

a694567

fix: calling SetReadSchema many time do not clear row_mapping

eb48e42

test: add test for PrefetchFileBatchReaderImpl

7b00794

Merge branch 'main' into fix-rowid

5b28d56

style: replace auto in assigning macro with explicit type

0d9a174

lxy-9602 reviewed Jun 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: handle non-contiguous RowRanges when resolving global row IDs#383

fix: handle non-contiguous RowRanges when resolving global row IDs#383
zhf999 wants to merge 11 commits into
alibaba:mainfrom
zhf999:fix-rowid

zhf999 commented Jun 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

lxy-9602 Jun 27, 2026

Uh oh!

lxy-9602 Jun 27, 2026

Uh oh!

lxy-9602 Jun 27, 2026

Uh oh!

lxy-9602 Jun 27, 2026

Uh oh!

lxy-9602 Jun 27, 2026

Uh oh!

lxy-9602 Jun 27, 2026

Uh oh!

lxy-9602 Jun 27, 2026

Uh oh!

lxy-9602 Jun 27, 2026

Uh oh!

lxy-9602 Jun 27, 2026

Uh oh!

lxy-9602 Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

zhf999 commented Jun 26, 2026

Purpose

Tests

API and Format

Documentation

Generative AI tooling

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants