Skip to content

GH-1116: [Java] Fix compressed buffer prefix write and ZSTD dstCapacity#1119

Merged
lidavidm merged 3 commits into
apache:mainfrom
LuciferYang:fix-zstd-compressed-buffer-prefix-corruption
Jun 30, 2026
Merged

GH-1116: [Java] Fix compressed buffer prefix write and ZSTD dstCapacity#1119
lidavidm merged 3 commits into
apache:mainfrom
LuciferYang:fix-zstd-compressed-buffer-prefix-corruption

Conversation

@LuciferYang

@LuciferYang LuciferYang commented Apr 18, 2026

Copy link
Copy Markdown
Contributor

What

Two fixes in the compression codec:

  1. AbstractCompressionCodec.compress(): capture writerIndex() once

    The previous code read uncompressedBuffer.writerIndex() at multiple sites — for the size comparison and again after doCompress() to populate the 8-byte uncompressed-length prefix. Capture the value once at the top of compress() and reuse it for the empty-buffer check, the size comparison, and the prefix, so all three consumers see the same value.

  2. ZstdCompressionCodec.doCompress(): dstCapacity overstated by 8 bytes

    Zstd.compressUnsafe(dst, dstSize, ...) expects dstSize to be the available space from dst. The code offsets dst by 8 bytes past the prefix but passed 8 + maxSize instead of maxSize. The compressBound() headroom hides this in practice, but the parameter was semantically wrong. Pass maxSize.

Tests

Covered by the existing round-trip tests (testEmptyBuffer, testReadWriteStream, testReadWriteFile, etc.). I was not able to construct a minimal reproducer for the original declaredUncompressed=0 symptom on the unfixed code, so both fixes are conservative correctness improvements derived from code inspection rather than failing-then-green regression tests.

@LuciferYang LuciferYang changed the title # GH-1116: [Java] Fix compressed buffer prefix write and ZSTD dstCapacity GH-1116: [Java] Fix compressed buffer prefix write and ZSTD dstCapacity Apr 18, 2026
@github-actions

This comment has been minimized.

@lidavidm lidavidm added the bug-fix PRs that fix a big. label Jun 29, 2026
@github-actions github-actions Bot added this to the 20.0.0 milestone Jun 29, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I think this change is fine, nothing should be modifying the buffer during this method. I don't believe any vectors/etc. are safe to use concurrently, so the comment is misleading about the guarantees we can provide.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — the original framing implied a concurrency guarantee we don't actually provide, and there's no in-tree doCompress implementation that mutates the source buffer's writerIndex. Tightened the comment to state only the engineering invariant (three in-method consumers must observe the same value) and dropped the "shared reference" / doCompress speculation:

// GH-1116: capture writerIndex() once so the empty-buffer check, size
// comparison, and uncompressed-length prefix all see the same value.
long uncompressedLength = uncompressedBuffer.writerIndex();

Pushed in d7b6a8f.

The previous comment said the captured writerIndex() guarded against the
buffer being a 'shared reference to a vector's internal buffer'. As
@lidavidm pointed out, Arrow vectors are not safe for concurrent use,
so that framing implied a guarantee we do not provide.

Tighten the comment to state only the engineering invariant: the three
in-method consumers (empty-buffer check, size comparison, and the
uncompressed-length prefix) must all see the same value. Reference
apacheGH-1116 for context.
@lidavidm

Copy link
Copy Markdown
Member

@LuciferYang were you able to replicate the original issue, by the way?

@LuciferYang

Copy link
Copy Markdown
Contributor Author

Honest answer: no. I built the two new tests around the shape of data the reporter described (all-null TimestampMilli, 3469 rows, VectorSchemaRoot reuse via clear()/allocateNew()), but I just re-checked with the source change reverted — both still pass. So they're round-trip coverage for that shape, not a direct regression test for the declaredUncompressed=0 symptom.

The fix came from reading the code: the old compress() reads writerIndex() twice — once for the size comparison, once for the 8-byte prefix after doCompress(). Reading once and reusing the value is the conservative invariant. The ZSTD dstCapacity change is similar — the compressBound() headroom hides it, but the parameter was semantically wrong.

If you'd rather hold until we have a failing-then-green test, that's fair. Otherwise happy to dig further — the reporter's env was Arrow 18.3.0/19.0.0 with zstd-jni 1.5.7-6/-7, but the allocator setup wasn't in the report.

@lidavidm

lidavidm commented Jun 30, 2026

Copy link
Copy Markdown
Member

I think we should document that the new tests don't actually reflect the original bug report, at least (or consider if they're necessary even). But the fixes themselves are worth having.

…H-1116

The two tests added in the previous commit didn't fail on the unfixed
code, so they aren't a regression test for the reported symptom. The
shapes they exercise (all-null fixed-width round-trip, multi-batch
ZSTD stream) are already covered by testEmptyBuffer and
testReadWriteStream. Drop them per review feedback.
@LuciferYang

Copy link
Copy Markdown
Contributor Author

Fair. Dropped both tests in b8e6b46testEmptyBuffer and testReadWriteStream already cover those round-trip shapes.

@lidavidm lidavidm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@LuciferYang

Copy link
Copy Markdown
Contributor Author

Thanks for your review @lidavidm

@lidavidm lidavidm merged commit b0e51af into apache:main Jun 30, 2026
25 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PRs that fix a big.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants