[core] Support append writes for MAP shared-shredding#8355
Conversation
| writeCols); | ||
| writeCols, | ||
| sharedShredding | ||
| ? new MapSharedShreddingWritePlanFactory( |
There was a problem hiding this comment.
This enables writing the shared-shredding physical ROW layout, but I do not see a matching logical read/unshredding path. The normal table scan / format reader still requests the logical MAP schema; for Parquet, clipParquetType(MAP) expects a Parquet map group, so it will not be able to read this physical ROW. Please add an end-to-end test that writes through this path and reads the table back with the logical schema, and wire the reader conversion before enabling the write path.
There was a problem hiding this comment.
Thanks for the careful review. I agree this is an important gap. This PR currently focuses on introducing the write-side physical layout, but the normal scan path still reads with the logical MAP schema, so we should not expose this as a usable end-to-end feature before the reader-side unshredding path is in place.
My original thought was to keep this PR relatively small and reviewable by treating it as internal groundwork. To avoid users accidentally enabling an incomplete path, I can add schema validation to reject the shared-shredding option for now, and then enable the option together with the reader conversion, documentation, and end-to-end logical read tests in a follow-up PR.
While, I am also fine with making this PR larger and adding the reader-side conversion plus the end-to-end test here if you think the feature should only land when the full read/write path is included. Which approach would you prefer?
There was a problem hiding this comment.
I’ve added e2e coverage for the append-table read/write paths, including basic shared-shredding MAP read/write, multiple maps, projection, null values including overflow, metadata restore, and schema/data evolution cases. Unsupported paths such as rewrite are guarded to fail fast for now; would appreciate a review when you have time.
| options.dataEvolutionEnabled(), | ||
| blobContext); | ||
| blobContext, | ||
| sharedShreddingContext); |
There was a problem hiding this comment.
Passing the shared-shredding context only to AppendOnlyWriter leaves the rewrite paths inconsistent. compactRewrite/clusterRewrite below still create a plain RowDataRollingFileWriter, so compaction can rewrite shared-shredded append files back into the default MAP layout without shared-shredding metadata. Please either route those rewrite writers through the same write plan or explicitly disable/guard compaction for shared-shredding append tables until it is supported.
There was a problem hiding this comment.
Thanks for catching this! I added an explicit guard for now: compact rewrite and cluster rewrite will fail fast when MAP shared-shredding is configured.
| columnName, | ||
| keyType); | ||
| checkArgument( | ||
| options.mapStorageLayout(columnName) != MapStorageLayout.SHARED_SHREDDING, |
There was a problem hiding this comment.
This validation catches nested map indexes, but shared-shredding can still be configured with formats whose writers do not implement SupportsWriterMetadata. In that case the table is accepted and the write fails only when the file is closing. Please reject unsupported file formats during schema validation, or otherwise fail before any data file write starts.
There was a problem hiding this comment.
Agreed. I added schema validation for MAP shared-shredding to reject unsupported file formats early.
2db0402 to
dc94d68
Compare
Purpose
This is a subtask of MAP shared-shredding support.
This PR adds the append write path for MAP shared-shredding. When a MAP column is configured with
map.storage-layout=shared-shredding, append writers now build a per-file physical write plan, convert logical rows to the shared-shredding physical layout, and write field-level shared-shredding metadata into file metadata before closing the format writer.Main changes:
RowDataFileWritePlan,RowDataFileWritePlanFactory,RowDataTransform, andFileMetadataFinalizerto support per-file physical row conversion and metadata finalization.MapSharedShreddingWritePlanFactoryfor append shared-shredding writes.MapSharedShreddingCoreUtilsto create and restore shared-shredding write context from recent data files.BaseAppendFileStoreWrite,AppendOnlyWriter,RowDataRollingFileWriter,RowDataFileWriter, andSingleFileWriter.paimon.id.This PR currently focuses on append tables. Blob writes, PK tables, and other more complex write paths will be supported in follow-up PRs. The row write plan is kept generic so these scenarios can reuse the same logical-to-physical write flow later.
Tests
mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=AppendOnlyWriterTest,MapSharedShreddingCoreUtilsTest,SchemaValidationTest testmvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=KeyValueFileReadWriteTest testmvn -pl paimon-format -Pfast-build -DfailIfNoTests=false -Dtest=FormatMetadataUtilsTest,OrcFormatReadWriteTest testmvn -pl paimon-arrow -am -Pfast-build -DfailIfNoTests=false -Dtest=ArrowSchemaMetadataCompatibilityTest testmvn -pl paimon-core -DskipTests spotless:checkmvn -pl paimon-format -DskipTests spotless:checkmvn -pl paimon-arrow -DskipTests spotless:checkgit diff --cached --check