
[VL] shuffle: TypeAwareCompress(tac) for column-wise data compression like (U)INT64 #11894

Open
guowangy wants to merge 2 commits into apache:main from guowangy:type-aware-compress

Conversation


@guowangy guowangy commented Apr 9, 2026

What changes are proposed in this pull request?

Introduces TypeAwareCompress (TAC), a column-wise compression layer for shuffle that selects
a compression algorithm based on each buffer's data type, applied per buffer alongside the
existing LZ4/ZSTD codec path.

For INT64/UINT64 columns the values are often clustered in a small range, making
Frame-of-Reference + Bit-Packing (FFOR) significantly more effective than generic byte-level
compression. TAC exploits this by encoding 8-byte integer buffers with a 4-lane FFOR codec before
the standard codec sees them.
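The FFOR idea can be sketched in a few lines. The snippet below is a simplified single-lane illustration of frame-of-reference plus bit-packing for `uint64_t`; the PR's `ffor.hpp` is a 4-lane variant, and all names here are illustrative rather than the PR's actual API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative single-lane FOR + bit-packing. Subtract the block minimum
// (the "frame of reference"), then pack each delta into the minimum number
// of bits needed for the largest delta.
struct FforBlock {
  uint64_t reference;           // minimum value of the block
  uint32_t bitWidth;            // bits per packed delta
  std::vector<uint64_t> words;  // bit-packed deltas
};

static uint32_t bitsNeeded(uint64_t v) {
  uint32_t n = 0;
  while (v) { ++n; v >>= 1; }
  return n;
}

FforBlock fforEncode(const std::vector<uint64_t>& in) {
  assert(!in.empty());
  FforBlock b;
  b.reference = *std::min_element(in.begin(), in.end());
  uint64_t maxDelta = 0;
  for (auto v : in) maxDelta = std::max(maxDelta, v - b.reference);
  b.bitWidth = bitsNeeded(maxDelta);
  // Pack deltas little-endian within 64-bit words.
  uint64_t acc = 0;
  uint32_t used = 0;
  for (auto v : in) {
    uint64_t d = v - b.reference;
    acc |= d << used;
    used += b.bitWidth;
    if (used >= 64) {
      b.words.push_back(acc);
      used -= 64;
      // Carry the high bits of d that did not fit into the flushed word.
      acc = (b.bitWidth && used) ? d >> (b.bitWidth - used) : 0;
    }
  }
  if (used) b.words.push_back(acc);
  return b;
}

std::vector<uint64_t> fforDecode(const FforBlock& b, size_t n) {
  if (b.bitWidth == 0) return std::vector<uint64_t>(n, b.reference);
  const uint64_t mask = b.bitWidth == 64 ? ~0ULL : ((1ULL << b.bitWidth) - 1);
  std::vector<uint64_t> out(n);
  size_t bitPos = 0;
  for (size_t i = 0; i < n; ++i) {
    size_t w = bitPos / 64, off = bitPos % 64;
    uint64_t d = b.words[w] >> off;
    if (off + b.bitWidth > 64) d |= b.words[w + 1] << (64 - off);
    out[i] = b.reference + (d & mask);
    bitPos += b.bitWidth;
  }
  return out;
}
```

With clustered values such as `{1000, 1003, 1001, 1007}` the deltas fit in 3 bits each, versus 64 bits per raw value, which is the effect TAC relies on for INT64/UINT64 shuffle buffers.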

Here is the performance data on TPCH/TPCDS:

|           | Total Latency | Shuffle Write Size |
|-----------|---------------|--------------------|
| TPCH-6T   | -15%          | -32%               |
| TPCDS-6T  | -6%           | -14%               |

New files

| Path | Purpose |
|------|---------|
| `cpp/core/utils/tac/ffor.hpp` | Header-only 4-lane FFOR codec for `uint64_t` |
| `cpp/core/utils/tac/FForCodec.{h,cc}` | Arrow `Result` wrapper around `ffor.hpp` |
| `cpp/core/utils/tac/TypeAwareCompressCodec.{h,cc}` | Type dispatch; self-describing wire format (codec ID + element width embedded in the header, so decompression needs no type hint) |
| `cpp/velox/shuffle/VeloxTypeAwareCompress.h` | Maps Velox `TypeKind` → TAC data type (`BIGINT` → `kUInt64`) |
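The exact header layout is not spelled out in this description. As a purely hypothetical sketch (field names, sizes, and serialization are illustrative, not the PR's wire format), a self-describing frame carrying a codec ID and element width might look like:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical self-describing header: enough metadata travels with the
// bytes that the reader needs no external type hint to decompress.
struct TacHeader {
  uint8_t codecId;            // which codec produced the payload (e.g. FFOR)
  uint8_t elementWidth;       // bytes per element, e.g. 8 for (U)INT64
  uint64_t uncompressedSize;  // lets the reader size the output buffer
};

// Prepend the header to the compressed body. A real implementation would
// serialize fields explicitly with a fixed endianness rather than memcpy
// the struct (which copies padding); this is only an illustration.
std::vector<uint8_t> frame(const TacHeader& h, const std::vector<uint8_t>& body) {
  std::vector<uint8_t> out(sizeof(TacHeader) + body.size());
  std::memcpy(out.data(), &h, sizeof(TacHeader));
  std::memcpy(out.data() + sizeof(TacHeader), body.data(), body.size());
  return out;
}
```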

Shuffle integration

  • Payload.cc/h: BlockPayload::fromBuffers accepts an optional bufferTypes vector. Per-buffer:
    if TypeAwareCompressCodec::support(type) is true, use TAC; otherwise fall back to LZ4/ZSTD.
    A new wire marker kTypeAwareBuffer = -3 is added; decompression in readCompressedBuffer is
    self-describing. If TAC compressed size ≥ original, falls back to kUncompressedBuffer.
  • Options.h: adds enableTypeAwareCompress (default false) to LocalPartitionWriterOptions.
  • VeloxHashShuffleWriter: populates bufferTypes from the schema when TAC is enabled.
  • GlutenConfig.scala: new config spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled (default false).
  • ColumnarShuffleWriter / LocalPartitionWriterJniWrapper: forward the new option to native.
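The per-buffer selection described above can be sketched as follows. Apart from `kTypeAwareBuffer = -3`, the marker values, type names, and function signatures here are assumptions for illustration, not the PR's actual API:

```cpp
#include <cstdint>
#include <vector>

// kTypeAwareBuffer = -3 is from the PR description; the other two marker
// values are assumed for this sketch.
constexpr int32_t kUncompressedBuffer = -1;
constexpr int32_t kCompressedBuffer = -2;  // generic LZ4/ZSTD path
constexpr int32_t kTypeAwareBuffer = -3;

enum class TacDataType { kUnsupported, kUInt64 };

struct EncodedBuffer {
  int32_t marker;
  std::vector<uint8_t> bytes;
};

// Per-buffer dispatch: try TAC when the buffer's type is supported; if TAC
// does not shrink the buffer, keep it uncompressed; otherwise fall through
// to the generic codec. Codecs are passed as callables to keep this
// self-contained.
template <typename TacFn, typename GenericFn>
EncodedBuffer compressBuffer(const std::vector<uint8_t>& raw,
                             TacDataType type,
                             TacFn tacCompress,
                             GenericFn genericCompress) {
  if (type != TacDataType::kUnsupported) {
    std::vector<uint8_t> tac = tacCompress(raw);
    if (tac.size() < raw.size()) {
      return {kTypeAwareBuffer, std::move(tac)};
    }
    // TAC compressed size >= original: store raw, as the PR describes.
    return {kUncompressedBuffer, raw};
  }
  return {kCompressedBuffer, genericCompress(raw)};
}
```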

Disabled by default — no behaviour changes for existing deployments.
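To opt in, set the new Spark config introduced in GlutenConfig.scala, for example:

```
--conf spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled=true
```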

How was this patch tested?

cpp/core/tests/FForCodecTest.cc covers:

  • Round-trip correctness for random, all-zero, monotonic, and near-max value patterns
  • maxCompressedLength boundary checks
  • Invalid input size rejection

cpp/velox/tests/VeloxShuffleWriterTest.cc: extended to exercise the TAC path end-to-end through
VeloxHashShuffleWriter.

Was this patch authored or co-authored using generative AI tooling?

Co-authored-by: Claude Sonnet 4.6

github-actions bot added the CORE (works for Gluten Core), VELOX, and DOCS labels Apr 9, 2026

github-actions bot commented Apr 9, 2026

Run Gluten Clickhouse CI on x86

guowangy changed the title from "shuffle: TypeAwareCompress(tac) for column-wise data compression like (U)INT64" to "[VL] shuffle: TypeAwareCompress(tac) for column-wise data compression like (U)INT64" Apr 9, 2026
guowangy force-pushed the type-aware-compress branch from d4db9f6 to 6d5f57f on April 9, 2026 02:27

github-actions bot commented Apr 9, 2026

Run Gluten Clickhouse CI on x86


FelixYBW commented Apr 9, 2026

We have enabled this in Gazelle. @marin-ma do you still remember why we didn't introduce it in Gluten?

@github-actions

Run Gluten Clickhouse CI on x86

@marin-ma

@FelixYBW In Gazelle we used the compression and Arrow IPC payload API and added FastPFOR (https://github.com/fast-pack/FastPFOR) compression for integer column types; it was also the default compression method for integers in Gazelle.

Because adding a new compression algorithm to the Arrow API requires extra patches, we removed it, mainly due to maintenance concerns. Using the default LZ4 algorithm did not cause a significant performance regression.
