
[VL] shuffle: TypeAwareCompress(tac) for column-wise data compression like (U)INT64 #11894

Open
guowangy wants to merge 2 commits into apache:main from guowangy:type-aware-compress

Conversation


@guowangy guowangy commented Apr 9, 2026

What changes are proposed in this pull request?

Introduces TypeAwareCompress (TAC), a column-wise compression layer for shuffle that selects
a compression algorithm based on each buffer's data type, applied per buffer alongside the
existing LZ4/ZSTD codec path.

For INT64/UINT64 columns the values are often clustered in a small range, making
Frame-of-Reference + Bit-Packing (FFOR) significantly more effective than generic byte-level
compression. TAC exploits this by encoding 8-byte integer buffers with a 4-lane FFOR codec before
the standard codec sees them.
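The FFOR idea can be sketched in a few lines. The snippet below is a simplified single-lane illustration of frame-of-reference plus bit-packing for `uint64_t`; the PR's `ffor.hpp` is a 4-lane variant, and all names here are illustrative rather than the PR's actual API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative single-lane FOR + bit-packing. Subtract the block minimum
// (the "frame of reference"), then pack each delta into the minimum number
// of bits needed for the largest delta.
struct FforBlock {
  uint64_t reference;           // minimum value of the block
  uint32_t bitWidth;            // bits per packed delta
  std::vector<uint64_t> words;  // bit-packed deltas
};

static uint32_t bitsNeeded(uint64_t v) {
  uint32_t n = 0;
  while (v) { ++n; v >>= 1; }
  return n;
}

FforBlock fforEncode(const std::vector<uint64_t>& in) {
  assert(!in.empty());
  FforBlock b;
  b.reference = *std::min_element(in.begin(), in.end());
  uint64_t maxDelta = 0;
  for (auto v : in) maxDelta = std::max(maxDelta, v - b.reference);
  b.bitWidth = bitsNeeded(maxDelta);
  // Pack deltas little-endian within 64-bit words.
  uint64_t acc = 0;
  uint32_t used = 0;
  for (auto v : in) {
    uint64_t d = v - b.reference;
    acc |= d << used;
    used += b.bitWidth;
    if (used >= 64) {
      b.words.push_back(acc);
      used -= 64;
      // Carry the high bits of d that did not fit into the flushed word.
      acc = (b.bitWidth && used) ? d >> (b.bitWidth - used) : 0;
    }
  }
  if (used) b.words.push_back(acc);
  return b;
}

std::vector<uint64_t> fforDecode(const FforBlock& b, size_t n) {
  if (b.bitWidth == 0) return std::vector<uint64_t>(n, b.reference);
  const uint64_t mask = b.bitWidth == 64 ? ~0ULL : ((1ULL << b.bitWidth) - 1);
  std::vector<uint64_t> out(n);
  size_t bitPos = 0;
  for (size_t i = 0; i < n; ++i) {
    size_t w = bitPos / 64, off = bitPos % 64;
    uint64_t d = b.words[w] >> off;
    if (off + b.bitWidth > 64) d |= b.words[w + 1] << (64 - off);
    out[i] = b.reference + (d & mask);
    bitPos += b.bitWidth;
  }
  return out;
}
```

With clustered values such as `{1000, 1003, 1001, 1007}` the deltas fit in 3 bits each, versus 64 bits per raw value, which is the effect TAC relies on for INT64/UINT64 shuffle buffers.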

Here is the performance data on TPCH/TPCDS:

|           | Total Latency | Shuffle Write Size |
|-----------|---------------|--------------------|
| TPCH-6T   | -15%          | -32%               |
| TPCDS-6T  | -6%           | -14%               |

New files

| Path | Purpose |
|------|---------|
| `cpp/core/utils/tac/ffor.hpp` | Header-only 4-lane FFOR codec for `uint64_t` |
| `cpp/core/utils/tac/FForCodec.{h,cc}` | Arrow `Result` wrapper around `ffor.hpp` |
| `cpp/core/utils/tac/TypeAwareCompressCodec.{h,cc}` | Type dispatch; self-describing wire format (codec ID + element width embedded in the header, so decompression needs no type hint) |
| `cpp/velox/shuffle/VeloxTypeAwareCompress.h` | Maps Velox `TypeKind` → TAC data type (`BIGINT` → `kUInt64`) |
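The exact header layout is not spelled out in this description. As a purely hypothetical sketch (field names, sizes, and serialization are illustrative, not the PR's wire format), a self-describing frame carrying a codec ID and element width might look like:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical self-describing header: enough metadata travels with the
// bytes that the reader needs no external type hint to decompress.
struct TacHeader {
  uint8_t codecId;            // which codec produced the payload (e.g. FFOR)
  uint8_t elementWidth;       // bytes per element, e.g. 8 for (U)INT64
  uint64_t uncompressedSize;  // lets the reader size the output buffer
};

// Prepend the header to the compressed body. A real implementation would
// serialize fields explicitly with a fixed endianness rather than memcpy
// the struct (which copies padding); this is only an illustration.
std::vector<uint8_t> frame(const TacHeader& h, const std::vector<uint8_t>& body) {
  std::vector<uint8_t> out(sizeof(TacHeader) + body.size());
  std::memcpy(out.data(), &h, sizeof(TacHeader));
  std::memcpy(out.data() + sizeof(TacHeader), body.data(), body.size());
  return out;
}
```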

Shuffle integration

  • Payload.cc/h: BlockPayload::fromBuffers accepts an optional bufferTypes vector. Per-buffer:
    if TypeAwareCompressCodec::support(type) is true, use TAC; otherwise fall back to LZ4/ZSTD.
    A new wire marker kTypeAwareBuffer = -3 is added; decompression in readCompressedBuffer is
    self-describing. If TAC compressed size ≥ original, falls back to kUncompressedBuffer.
  • Options.h: adds enableTypeAwareCompress (default false) to LocalPartitionWriterOptions.
  • VeloxHashShuffleWriter: populates bufferTypes from the schema when TAC is enabled.
  • GlutenConfig.scala: new config spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled (default false).
  • ColumnarShuffleWriter / LocalPartitionWriterJniWrapper: forward the new option to native.
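The per-buffer selection described above can be sketched as follows. Apart from `kTypeAwareBuffer = -3`, the marker values, type names, and function signatures here are assumptions for illustration, not the PR's actual API:

```cpp
#include <cstdint>
#include <vector>

// kTypeAwareBuffer = -3 is from the PR description; the other two marker
// values are assumed for this sketch.
constexpr int32_t kUncompressedBuffer = -1;
constexpr int32_t kCompressedBuffer = -2;  // generic LZ4/ZSTD path
constexpr int32_t kTypeAwareBuffer = -3;

enum class TacDataType { kUnsupported, kUInt64 };

struct EncodedBuffer {
  int32_t marker;
  std::vector<uint8_t> bytes;
};

// Per-buffer dispatch: try TAC when the buffer's type is supported; if TAC
// does not shrink the buffer, keep it uncompressed; otherwise fall through
// to the generic codec. Codecs are passed as callables to keep this
// self-contained.
template <typename TacFn, typename GenericFn>
EncodedBuffer compressBuffer(const std::vector<uint8_t>& raw,
                             TacDataType type,
                             TacFn tacCompress,
                             GenericFn genericCompress) {
  if (type != TacDataType::kUnsupported) {
    std::vector<uint8_t> tac = tacCompress(raw);
    if (tac.size() < raw.size()) {
      return {kTypeAwareBuffer, std::move(tac)};
    }
    // TAC compressed size >= original: store raw, as the PR describes.
    return {kUncompressedBuffer, raw};
  }
  return {kCompressedBuffer, genericCompress(raw)};
}
```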

Disabled by default — no behaviour changes for existing deployments.
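To opt in, set the new Spark config introduced in GlutenConfig.scala, for example:

```
--conf spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled=true
```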

How was this patch tested?

cpp/core/tests/FForCodecTest.cc covers:

  • Round-trip correctness for random, all-zero, monotonic, and near-max value patterns
  • maxCompressedLength boundary checks
  • Invalid input size rejection

cpp/velox/tests/VeloxShuffleWriterTest.cc: extended to exercise the TAC path end-to-end through
VeloxHashShuffleWriter.

Was this patch authored or co-authored using generative AI tooling?

Co-authored-by: Claude Sonnet 4.6

github-actions bot added the CORE (works for Gluten Core), VELOX, and DOCS labels Apr 9, 2026

github-actions bot commented Apr 9, 2026

Run Gluten Clickhouse CI on x86

guowangy changed the title from "shuffle: TypeAwareCompress(tac) for column-wise data compression like (U)INT64" to "[VL] shuffle: TypeAwareCompress(tac) for column-wise data compression like (U)INT64" Apr 9, 2026
guowangy force-pushed the type-aware-compress branch from d4db9f6 to 6d5f57f on April 9, 2026 02:27

github-actions bot commented Apr 9, 2026

Run Gluten Clickhouse CI on x86


FelixYBW commented Apr 9, 2026

We have enabled this in Gazelle. @marin-ma do you still remember why we didn't introduce it in Gluten?

@github-actions

Run Gluten Clickhouse CI on x86

@marin-ma

@FelixYBW In Gazelle we used the compression and Arrow IPC payload API and added FastPFOR (https://github.com/fast-pack/FastPFOR) compression for integer column types; it was also the default compression method for integers in Gazelle.

Because adding a new compression algorithm to the Arrow API requires extra patches, we removed it, mainly due to maintenance concerns. Using the default LZ4 algorithm did not cause a significant performance regression.
