[core] Fix dedicated-format bundle write path #7598

Open
QuakeWang wants to merge 1 commit into apache:master from QuakeWang:fix/dedicated-bundle-path

@QuakeWang

Purpose

Fix the dedicated-format writeBundle path so bundle writes remain correct when data is fanned out to main/blob/vector writers.

This change:

  • materializes a bundle in the dedicated fan-out path only when it is not replayable, instead of materializing every bundle;
  • introduces explicit opt-in bundle capabilities in common (ReplayableBundleRecords and ProjectableBundleRecords) without changing the base BundleRecords contract;
  • allows projection to preserve typed bundle implementations such as ArrowBundleRecords;
  • preserves row-level side effects in bundle writes, including sequence number updates and file-index writing.
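
To illustrate the first two points, here is a minimal, self-contained sketch of an opt-in replayable capability and a fan-out that materializes only when needed. It is not the code in this PR: `Bundle`, `ReplayableBundle`, `ListBundle`, `SingleUseBundle`, and `FanOut` are hypothetical stand-ins, rows are simplified to `int[]`, and the child writers are plain lists.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a base bundle: iterable at least once, with a known row count. */
interface Bundle extends Iterable<int[]> {
    long rowCount();
}

/** Hypothetical opt-in capability: iterator() may be called repeatedly and yields the same rows. */
interface ReplayableBundle extends Bundle {}

/** A list-backed bundle is replayable because each iterator() call starts over. */
final class ListBundle implements ReplayableBundle {
    private final List<int[]> rows;

    ListBundle(List<int[]> rows) {
        this.rows = rows;
    }

    @Override
    public Iterator<int[]> iterator() {
        return rows.iterator();
    }

    @Override
    public long rowCount() {
        return rows.size();
    }
}

/** A bundle that can be consumed only once, for example one backed by a stream. */
final class SingleUseBundle implements Bundle {
    private final Iterator<int[]> source;
    private final long count;
    private boolean consumed;

    SingleUseBundle(Iterator<int[]> source, long count) {
        this.source = source;
        this.count = count;
    }

    @Override
    public Iterator<int[]> iterator() {
        if (consumed) {
            throw new IllegalStateException("bundle already consumed");
        }
        consumed = true;
        return source;
    }

    @Override
    public long rowCount() {
        return count;
    }
}

final class FanOut {
    /** Returns the bundle unchanged if it is replayable; otherwise copies its rows exactly once. */
    static ReplayableBundle ensureReplayable(Bundle bundle) {
        if (bundle instanceof ReplayableBundle) {
            return (ReplayableBundle) bundle;
        }
        List<int[]> copy = new ArrayList<>();
        for (int[] row : bundle) {
            copy.add(row);
        }
        return new ListBundle(copy);
    }

    /** Writes the same logical bundle to every child writer (modelled here as lists). */
    static void writeToAll(Bundle bundle, List<List<int[]>> children) {
        ReplayableBundle replayable = ensureReplayable(bundle);
        for (List<int[]> child : children) {
            for (int[] row : replayable) {
                child.add(row);
            }
        }
    }
}
```

With this shape, a `SingleUseBundle` is copied once before being fanned out, while a `ListBundle` is handed to each child without an extra copy.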

Tests

Added or updated tests in paimon-core:

  • ProjectedFileWriterTest
  • BundleAwareRowDataRollingFileWriterTest
  • DedicatedFormatRollingFileWriterTest
  • DedicatedFormatRollingFileWriterVectorTest

These cover:

  • replayable/projectable bundle pass-through and fallback behavior;
  • sequence-number side effects for bundle writes;
  • dedicated-format bundle writes with main-file index side effects;
  • external-storage blob fallback for bundle writes;
  • single-use bundle writes for blob/vector dedicated paths.

@JingsongLi
Contributor

Can you explain what went wrong?

@QuakeWang
Author

> Can you explain what went wrong?

The old dedicated-format writeBundle path was functionally correct as a simple row-by-row fallback, but it discarded bundle semantics entirely: it iterated the bundle and called write(row) for each row.

That becomes problematic in the dedicated fan-out path, where one logical bundle has to be written to projected main/blob/vector writers. BundleRecords itself does not guarantee replayability, so the same bundle cannot be safely reused across multiple child writers. At the same time, if we forward the bundle directly to restore pass-through, we can bypass row-level side effects in RowDataFileWriter, such as sequence-number updates and main-file index writing. We also lose typed / bundle-aware fast paths such as ArrowBundleRecords.

This patch makes those constraints explicit: replayable bundles can be passed through safely, non-replayable bundles are materialized once, and dedicated row-data writers preserve the row-level side effects after bundle writes.
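
As a rough illustration of the last constraint, the sketch below forwards a bundle to a bulk fast path and then replays the rows to apply the per-row side effects. It is a simplified model, not the RowDataFileWriter implementation: `Row`, `FormatWriter`, `SideEffectWriter`, and the `indexedRows` counter are hypothetical, and the bundle is assumed to be replayable (a `List` here).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical row: a sequence number plus a payload. */
final class Row {
    final long sequenceNumber;
    final String payload;

    Row(long sequenceNumber, String payload) {
        this.sequenceNumber = sequenceNumber;
        this.payload = payload;
    }
}

/** Hypothetical format writer with both a per-row path and a bulk path. */
final class FormatWriter {
    final List<Row> written = new ArrayList<>();
    int bulkCalls = 0;

    void addRow(Row row) {
        written.add(row);
    }

    void addBundle(List<Row> bundle) {
        bulkCalls++;
        written.addAll(bundle);
    }
}

/** Wraps a format writer and tracks row-level side effects. */
final class SideEffectWriter {
    private final FormatWriter format;
    long minSeq = Long.MAX_VALUE;
    long maxSeq = Long.MIN_VALUE;
    long indexedRows = 0; // stands in for file-index writing

    SideEffectWriter(FormatWriter format) {
        this.format = format;
    }

    /** Row path: write the row, then apply its side effects. */
    void write(Row row) {
        format.addRow(row);
        afterWrite(row);
    }

    /** Bundle path: keep the bulk write, then replay the rows for side effects only. */
    void writeBundle(List<Row> bundle) {
        format.addBundle(bundle);
        for (Row row : bundle) {
            afterWrite(row);
        }
    }

    private void afterWrite(Row row) {
        minSeq = Math.min(minSeq, row.sequenceNumber);
        maxSeq = Math.max(maxSeq, row.sequenceNumber);
        indexedRows++;
    }
}
```

The second pass over the bundle is why replayability matters here: a single-use bundle would already be exhausted by the bulk write.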

So the issue was not that the old row-by-row fallback always corrupted data. The issue was that the dedicated-format bundle path could not preserve bundle semantics safely and correctly once a single bundle needed to be fanned out to multiple writers.


2 participants