GH-1116: [Java] Fix compressed buffer prefix write and ZSTD dstCapacity by LuciferYang · Pull Request #1119 · apache/arrow-java

LuciferYang · 2026-04-18T13:35:04Z

What

Two fixes in the compression codec:

AbstractCompressionCodec.compress(): capture writerIndex() once

The previous code read uncompressedBuffer.writerIndex() at multiple sites — for the size comparison and again after doCompress() to populate the 8-byte uncompressed-length prefix. Capture the value once at the top of compress() and reuse it for the empty-buffer check, the size comparison, and the prefix, so all three consumers see the same value.
ZstdCompressionCodec.doCompress(): dstCapacity overstated by 8 bytes

Zstd.compressUnsafe(dst, dstSize, ...) expects dstSize to be the available space from dst. The code offsets dst by 8 bytes past the prefix but passed 8 + maxSize instead of maxSize. The compressBound() headroom hides this in practice, but the parameter was semantically wrong. Pass maxSize.

Tests

Covered by the existing round-trip tests (testEmptyBuffer, testReadWriteStream, testReadWriteFile, etc.). I was not able to construct a minimal reproducer for the original declaredUncompressed=0 symptom on the unfixed code, so both fixes are conservative correctness improvements derived from code inspection rather than failing-then-green regression tests.

lidavidm · 2026-06-29T03:45:59Z

While I think this change is fine, nothing should be modifying the buffer during this method. I don't believe any vectors/etc. are safe to use concurrently, so the comment is misleading about the guarantees we can provide.

Good point — the original framing implied a concurrency guarantee we don't actually provide, and there's no in-tree doCompress implementation that mutates the source buffer's writerIndex. Tightened the comment to state only the engineering invariant (three in-method consumers must observe the same value) and dropped the "shared reference" / doCompress speculation:

// GH-1116: capture writerIndex() once so the empty-buffer check, size // comparison, and uncompressed-length prefix all see the same value. long uncompressedLength = uncompressedBuffer.writerIndex();

Pushed in d7b6a8f.

@lidavidm

The previous comment said the captured writerIndex() guarded against the buffer being a 'shared reference to a vector's internal buffer'. As @lidavidm pointed out, Arrow vectors are not safe for concurrent use, so that framing implied a guarantee we do not provide. Tighten the comment to state only the engineering invariant: the three in-method consumers (empty-buffer check, size comparison, and the uncompressed-length prefix) must all see the same value. Reference apacheGH-1116 for context.

lidavidm · 2026-06-29T06:37:32Z

@LuciferYang were you able to replicate the original issue, by the way?

LuciferYang · 2026-06-30T02:46:50Z

Honest answer: no. I built the two new tests around the shape of data the reporter described (all-null TimestampMilli, 3469 rows, VectorSchemaRoot reuse via clear()/allocateNew()), but I just re-checked with the source change reverted — both still pass. So they're round-trip coverage for that shape, not a direct regression test for the declaredUncompressed=0 symptom.

The fix came from reading the code: the old compress() reads writerIndex() twice — once for the size comparison, once for the 8-byte prefix after doCompress(). Reading once and reusing the value is the conservative invariant. The ZSTD dstCapacity change is similar — the compressBound() headroom hides it, but the parameter was semantically wrong.

If you'd rather hold until we have a failing-then-green test, that's fair. Otherwise happy to dig further — the reporter's env was Arrow 18.3.0/19.0.0 with zstd-jni 1.5.7-6/-7, but the allocator setup wasn't in the report.

lidavidm · 2026-06-30T02:52:55Z

I think we should document that the new tests don't actually reflect the original bug report, at least (or consider if they're necessary even). But the fixes themselves are worth having.

…H-1116 The two tests added in the previous commit didn't fail on the unfixed code, so they aren't a regression test for the reported symptom. The shapes they exercise (all-null fixed-width round-trip, multi-batch ZSTD stream) are already covered by testEmptyBuffer and testReadWriteStream. Drop them per review feedback.

LuciferYang · 2026-06-30T03:11:07Z

Fair. Dropped both tests in b8e6b46 — testEmptyBuffer and testReadWriteStream already cover those round-trip shapes.

lidavidm

Thanks!

LuciferYang · 2026-06-30T03:54:33Z

Thanks for your review @lidavidm

fix

4c754b1

LuciferYang requested review from jbonofre, laurentgo, lidavidm and wgtmac as code owners April 18, 2026 13:35

LuciferYang changed the title ~~# GH-1116: [Java] Fix compressed buffer prefix write and ZSTD dstCapacity~~ GH-1116: [Java] Fix compressed buffer prefix write and ZSTD dstCapacity Apr 18, 2026

This comment has been minimized.

Sign in to view

lidavidm added the bug-fix PRs that fix a big. label Jun 29, 2026

github-actions Bot added this to the 20.0.0 milestone Jun 29, 2026

lidavidm reviewed Jun 29, 2026

View reviewed changes

lidavidm approved these changes Jun 30, 2026

View reviewed changes

lidavidm merged commit b0e51af into apache:main Jun 30, 2026
25 of 26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

GH-1116: [Java] Fix compressed buffer prefix write and ZSTD dstCapacity#1119

GH-1116: [Java] Fix compressed buffer prefix write and ZSTD dstCapacity#1119
lidavidm merged 3 commits into
apache:mainfrom
LuciferYang:fix-zstd-compressed-buffer-prefix-corruption

LuciferYang commented Apr 18, 2026 •

edited by lidavidm

Loading

Uh oh!

This comment has been minimized.

lidavidm Jun 29, 2026

Uh oh!

LuciferYang Jun 29, 2026

Uh oh!

lidavidm commented Jun 29, 2026

Uh oh!

LuciferYang commented Jun 30, 2026

Uh oh!

lidavidm commented Jun 30, 2026 •

edited

Loading

Uh oh!

LuciferYang commented Jun 30, 2026

Uh oh!

lidavidm left a comment

Uh oh!

LuciferYang commented Jun 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

LuciferYang commented Apr 18, 2026 • edited by lidavidm Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Tests

Uh oh!

This comment has been minimized.

lidavidm Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

LuciferYang Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

lidavidm commented Jun 29, 2026

Uh oh!

LuciferYang commented Jun 30, 2026

Uh oh!

lidavidm commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LuciferYang commented Jun 30, 2026

Uh oh!

lidavidm left a comment

Choose a reason for hiding this comment

Uh oh!

LuciferYang commented Jun 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LuciferYang commented Apr 18, 2026 •

edited by lidavidm

Loading

lidavidm commented Jun 30, 2026 •

edited

Loading