Skip to content

[spark] Support compact_chain_table procedure#7313

Open
juntaozhang wants to merge 3 commits into
apache:masterfrom
juntaozhang:pr-chain-table-compaction
Open

[spark] Support compact_chain_table procedure#7313
juntaozhang wants to merge 3 commits into
apache:masterfrom
juntaozhang:pr-chain-table-compaction

Conversation

@juntaozhang

Copy link
Copy Markdown
Contributor

Purpose

Linked issue: close #7312

Tests

  • CompactChainTableProcedureTest.scala

API and Format

Documentation

  • docs/content/primary-key-table/chain-table.md
  • docs/content/spark/procedures.md

Generative AI tooling

No

@juntaozhang juntaozhang force-pushed the pr-chain-table-compaction branch from 58cfdb0 to d381de3 Compare February 27, 2026 07:47

@JingsongLi JingsongLi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the Spark procedure for chain table compaction.

Comments:

  1. PR description is minimal — the "Purpose" section just links issue #7312 without explaining the approach. Please describe:

    • How does compact_chain_table differ from regular compact?
    • Does it compact both snapshot and delta branches?
    • What's the merge strategy for chain compaction?
  2. File changes: The PR touches ChainGroupReadTable, FallbackReadFileStoreTable, and ChainSplit. What changes were needed to support compaction vs. just read? Are these refactoring prerequisites or functional changes?

  3. 613 additions is significant. A test file (CompactChainTableProcedureTest.scala) + procedure + supporting core changes — consider whether the core changes could be a separate prerequisite PR.

  4. Documentation: Good that both chain-table.md and procedures.md are updated.

  5. ChainSplitTest: What cases does it cover? Chain tables have complex split logic due to cross-branch data dependencies.

Please fill in the PR description with the design approach before requesting final review.

@juntaozhang

Copy link
Copy Markdown
Contributor Author

Thanks for adding the Spark procedure for chain table compaction.

Comments:

  1. PR description is minimal — the "Purpose" section just links issue [Feature] [spark] Support compact_chain_table procedure #7312 without explaining the approach. Please describe:

    • How does compact_chain_table differ from regular compact?
    • Does it compact both snapshot and delta branches?
    • What's the merge strategy for chain compaction?
  2. File changes: The PR touches ChainGroupReadTable, FallbackReadFileStoreTable, and ChainSplit. What changes were needed to support compaction vs. just read? Are these refactoring prerequisites or functional changes?

  3. 613 additions is significant. A test file (CompactChainTableProcedureTest.scala) + procedure + supporting core changes — consider whether the core changes could be a separate prerequisite PR.

  4. Documentation: Good that both chain-table.md and procedures.md are updated.

  5. ChainSplitTest: What cases does it cover? Chain tables have complex split logic due to cross-branch data dependencies.

Please fill in the PR description with the design approach before requesting final review.

Thank you for the thorough review and valuable feedback.

  1. PR description
    Already updated in [Feature] [spark] Support compact_chain_table procedure #7312.
  2. Core file changes
    Extracted to prerequisite PR [core] Minor refactor partition predicates in FallbackReadScan #7950.
  3. ChainSplitTest / Split support
    ScanPlanHelper.createNewScanPlan already supports the Split interface, no additional changes needed.

JingsongLi pushed a commit that referenced this pull request May 27, 2026
This PR is a prerequisite for [`[spark] Support compact_chain_table
procedure`](#7313).

`FallbackReadScan` currently uses a single `partitionPredicate` for both
main and fallback scans. However, in chain table compaction scenarios,
we need to apply different
partition filters to the main (snapshot) branch and the fallback (delta)
branch. For example, when overwriting an existing partition, we need to:
  - Exclude the target partition from the main scan
  - Include only the target partition in the fallback scan

This PR refactors `FallbackReadScan` to support separate partition
predicates for main and fallback scans.
@juntaozhang juntaozhang force-pushed the pr-chain-table-compaction branch from 7ac2113 to 6da3d39 Compare May 27, 2026 06:01
@juntaozhang juntaozhang requested a review from JingsongLi May 27, 2026 09:10
@JingsongLi

Copy link
Copy Markdown
Contributor

Can you rebase master?

@juntaozhang juntaozhang force-pushed the pr-chain-table-compaction branch from c6e5450 to 35672ea Compare June 9, 2026 09:57
@juntaozhang

Copy link
Copy Markdown
Contributor Author

Can you rebase master?
Hi @JingsongLi, thanks for the reminder, done.

@JingsongLi JingsongLi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a couple of issues while reviewing this change.

boolean partitionExists = checkPartitionExists(snapshotTable, partition, relation);
if (partitionExists) {
if (overwrite) {
scan.withPartitionFilter(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overwrite path appears to read too much data. When the target partition already exists, the snapshot-side predicate is changed to NOT (target partition), while the delta side still uses the target partition. However, ChainGroupReadTable.plan() adds all mainScan.plan() splits directly before planning the delta/anchor splits, and this procedure later rewrites every output row's partition columns to the target partition. That means unrelated snapshot partitions can be copied into the compacted target partition during overwrite. The overwrite path should only read the chain-merge result needed for the target partition, not every snapshot partition except the target.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your review. I added a flag of preloadTargetSnapshot to skip executing mainScan.plan() when overwriting the target partition.
Appreciate your reminder, thanks.


you will get the following result:
```text
+---+----+-----+

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the trailing whitespace in this added result block. git diff --check origin/master...HEAD reports trailing whitespace on this block, which will fail whitespace/style checks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks for your reminder.

@juntaozhang juntaozhang requested a review from JingsongLi June 10, 2026 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] [spark] Support compact_chain_table procedure

2 participants