GH-3522: Add batch read APIs to ValuesReader hierarchy by iemejia · Pull Request #3535 · apache/parquet-java

iemejia · 2026-05-01T22:20:03Z

Summary

Add readIntegers(), readLongs(), readFloats(), readDoubles() batch methods to ValuesReader with default loop-based implementations
Override in specialized readers to amortize per-value overhead across batches
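
A minimal sketch of the pattern described above, assuming a simplified reader hierarchy. `SketchReader`, `ArrayBackedReader`, and the `(dest, offset, length)` signature are hypothetical stand-ins for illustration, not the actual parquet-java API from this PR. The base class supplies the loop-based default, and a specialized reader overrides it with a bulk copy.

```java
/** Hypothetical, simplified stand-in for a ValuesReader-style base class. */
abstract class SketchReader {
  /** Per-value read that every reader implements. */
  abstract int readInteger();

  /** Default batch read: a plain loop over the per-value method. */
  void readIntegers(int[] dest, int offset, int length) {
    for (int i = 0; i < length; i++) {
      dest[offset + i] = readInteger();
    }
  }
}

/** Reader over an already-decoded int[]; overrides the batch path with one bulk copy. */
final class ArrayBackedReader extends SketchReader {
  private final int[] decoded;
  private int position;

  ArrayBackedReader(int[] decoded) {
    this.decoded = decoded;
  }

  @Override
  int readInteger() {
    return decoded[position++];
  }

  @Override
  void readIntegers(int[] dest, int offset, int length) {
    // One arraycopy replaces `length` virtual calls and per-value bounds checks.
    System.arraycopy(decoded, position, dest, offset, length);
    position += length;
  }
}
```

Callers that only know the base type still get the fast path, because the override is selected once per batch rather than once per value.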

Overrides

RunLengthBitPackingHybridDecoder.readInts(): batch across RLE runs and packed groups using Arrays.fill/System.arraycopy
DictionaryValuesReader: batch-decode dictionary IDs first, then batch-lookup values (eliminates per-value IOException try/catch)
DeltaBinaryPackingValuesReader: System.arraycopy from pre-decoded buffer
PlainValuesReader (all types): loop over LittleEndianDataInputStream
ByteStreamSplitValuesReader (all types): indexed ByteBuffer bulk read
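
The DictionaryValuesReader item above (decode IDs in bulk, then look values up) might look roughly like the following. This is a sketch under assumptions: `IdDecoder` and `DictionaryBatchSketch` are hypothetical names, and `UncheckedIOException` stands in for whatever exception type the real reader throws.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a dictionary-ID decoder that offers a batch method. */
interface IdDecoder {
  void readInts(int[] dest, int offset, int length) throws IOException;
}

final class DictionaryBatchSketch {
  private final int[] dictionary;
  private final IdDecoder decoder;
  private int[] ids = new int[0]; // scratch buffer reused across calls

  DictionaryBatchSketch(int[] dictionary, IdDecoder decoder) {
    this.dictionary = dictionary;
    this.decoder = decoder;
  }

  void readIntegers(int[] dest, int offset, int length) {
    if (ids.length < length) {
      ids = new int[length];
    }
    try {
      // Step 1: decode all dictionary IDs at once; one try/catch per batch.
      decoder.readInts(ids, 0, length);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // Step 2: tight lookup loop with no exception handling inside it.
    for (int i = 0; i < length; i++) {
      dest[offset + i] = dictionary[ids[i]];
    }
  }
}
```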

Rationale

These APIs enable callers to amortize per-value overhead (virtual dispatch, bounds checks, mode switches) across batches. Combined with other optimizations in this series (ByteBuffer-based RLE decoder, etc.), batch reads yield significant throughput improvements over per-value loops.

All 576 parquet-column tests pass.

Add readIntegers(), readLongs(), readFloats(), readDoubles() batch methods to ValuesReader with default loop-based implementations.
Override in:
- RunLengthBitPackingHybridDecoder.readInts(): batch across RLE runs and packed groups using Arrays.fill/System.arraycopy
- DictionaryValuesReader: batch-decode dictionary IDs first, then batch-lookup values (eliminates per-value IOException try/catch)
- DeltaBinaryPackingValuesReader: System.arraycopy from pre-decoded buffer
- PlainValuesReader (all types): loop over LittleEndianDataInputStream
- ByteStreamSplitValuesReader (all types): indexed ByteBuffer bulk read
These APIs enable callers to amortize per-value overhead (virtual dispatch, bounds checks, mode switches) across batches. On the perf branch where the RLE decoder uses ByteBuffer, this yielded +148% RLE throughput and +67% dictionary decode throughput.

RunLengthBitPackingHybridValuesReader inherited the default loop from ValuesReader.readIntegers() which called readInteger() per value. Delegate to decoder.readInts() which uses Arrays.fill for RLE runs and System.arraycopy for packed groups.
Benchmark (100K values, SEQUENTIAL/RANDOM/LOW_CARDINALITY):
Before: ~556M ops/s (same as per-value path)
After: ~1,270M ops/s (+128%)
This matters for def/rep level decoding on every data page, BOOLEAN columns in V2 pages, and any direct RLE consumers using batch APIs.
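
To illustrate batching across RLE runs and packed groups, here is a minimal sketch. It assumes the runs have already been parsed into a hypothetical `Run` model (the real decoder reads run headers and unpacks bits from the page, which is omitted here). Only the batch copy logic is shown: `Arrays.fill` for RLE runs, `System.arraycopy` for packed groups.

```java
import java.util.Arrays;

final class HybridRunSketch {
  /** Hypothetical run model: a repeated value (RLE) or already-unpacked values. */
  static final class Run {
    final int value;      // repeated value for an RLE run
    final int[] unpacked; // values of a packed group, or null for an RLE run
    final int count;

    private Run(int value, int[] unpacked, int count) {
      this.value = value;
      this.unpacked = unpacked;
      this.count = count;
    }

    static Run rle(int value, int count) {
      return new Run(value, null, count);
    }

    static Run packed(int... values) {
      return new Run(0, values, values.length);
    }
  }

  private final Run[] runs;
  private int runIndex;  // run currently being consumed
  private int usedInRun; // values already consumed from that run

  HybridRunSketch(Run... runs) {
    this.runs = runs;
  }

  /** Fills dest[offset, offset + length), crossing run boundaries as needed. */
  void readInts(int[] dest, int offset, int length) {
    int remaining = length;
    while (remaining > 0) {
      if (runIndex >= runs.length) {
        throw new IllegalStateException("not enough values");
      }
      Run run = runs[runIndex];
      int n = Math.min(remaining, run.count - usedInRun);
      if (run.unpacked == null) {
        Arrays.fill(dest, offset, offset + n, run.value);
      } else {
        System.arraycopy(run.unpacked, usedInRun, dest, offset, n);
      }
      offset += n;
      remaining -= n;
      usedInRun += n;
      if (usedInRun == run.count) {
        runIndex++;
        usedInRun = 0;
      }
    }
  }
}
```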

Replace per-value getXxx(offset) loops with position()+asXxxBuffer().get() bulk copy in readFloats/readDoubles/readIntegers/readLongs. The decoded data buffer is a contiguous heap byte[] in LE order, making view buffer bulk reads a single memcpy via Unsafe.copyMemory.
Benchmark results (100K values, BSS FLOAT batch):
Before: ~1,228M ops/s
After: ~1,442M ops/s (+17%)
INT32/INT64/DOUBLE show negligible change because BSS invocation cost is dominated by page transposition in initFromPage, not the read loop.
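
The view-buffer bulk read can be sketched with standard `java.nio` calls. `ViewBufferReadSketch` and its method signature are illustrative assumptions. The point is that one `asFloatBuffer().get(...)` replaces a per-value `getFloat(offset)` loop.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ViewBufferReadSketch {
  /**
   * Copies `length` little-endian floats starting at decoded.position()
   * into dest, then advances the byte position past them.
   */
  static void readFloats(ByteBuffer decoded, float[] dest, int offset, int length) {
    // duplicate() shares content but has its own byte order, so set LE explicitly.
    decoded.duplicate()
        .order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer()
        .get(dest, offset, length);
    decoded.position(decoded.position() + length * Float.BYTES);
  }
}
```

The same shape applies to `asIntBuffer()`, `asLongBuffer()`, and `asDoubleBuffer()` with the matching element size.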

…ns()
Replace ByteBitPackingValuesReader delegation in BooleanPlainValuesReader with direct bit extraction from the page byte[]. The scalar path uses a single array access + shift + mask instead of the 8-element int[] buffer and packer dispatch. The batch path (readBooleans) unrolls 8 booleans per byte with constant masks.
For RLE (V2), add a native readBooleans() method that uses Arrays.fill for RLE runs (constant-time for uniform data) and direct int-to-boolean conversion for packed groups, avoiding the intermediate int[] allocation of the readInts() path.
Benchmark results (1M values, JDK 25, Compiler Blackholes):
- V1 PLAIN scalar: ~620M -> ~1,528-1,618M ops/s (+150%)
- V1 PLAIN batch: ALL_TRUE/FALSE ~5,000M (+680%), RANDOM 2,757M (+337%)
- V2 RLE batch: ALL_TRUE/FALSE ~190B (fill), RANDOM 1,335M (+93%)
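
A rough sketch of the direct bit extraction described in this commit, assuming PLAIN booleans are packed one bit per value, least significant bit first. `BooleanBitsSketch` and the parameter names are hypothetical, and the batch method assumes the batch starts on a byte boundary.

```java
final class BooleanBitsSketch {
  /** Scalar path: one array access, one shift, one mask. */
  static boolean readBoolean(byte[] page, int bitIndex) {
    return ((page[bitIndex >>> 3] >>> (bitIndex & 7)) & 1) != 0;
  }

  /** Batch path: 8 booleans per byte with constant masks, then a scalar tail. */
  static void readBooleans(byte[] page, int firstByte, boolean[] dest, int offset, int length) {
    int fullBytes = length >>> 3;
    for (int b = 0; b < fullBytes; b++) {
      int v = page[firstByte + b];
      int o = offset + (b << 3);
      dest[o] = (v & 0x01) != 0;
      dest[o + 1] = (v & 0x02) != 0;
      dest[o + 2] = (v & 0x04) != 0;
      dest[o + 3] = (v & 0x08) != 0;
      dest[o + 4] = (v & 0x10) != 0;
      dest[o + 5] = (v & 0x20) != 0;
      dest[o + 6] = (v & 0x40) != 0;
      dest[o + 7] = (v & 0x80) != 0;
    }
    // Remaining 0..7 values come from the last, partially used byte.
    for (int i = fullBytes << 3; i < length; i++) {
      dest[offset + i] = readBoolean(page, (firstByte << 3) + i);
    }
  }
}
```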

Replace the per-bit unrolled extraction loop with a static boolean[256][8] lookup table + System.arraycopy. Each byte maps to its 8 pre-decoded booleans, and the 8-byte copy is emitted by HotSpot as a single 64-bit load/store pair — the boolean equivalent of asIntBuffer().get() for ints.
For RLE PACKED groups (bitWidth=1), bypass the int[] intermediate and read directly from the raw packed bytes via the same lookup table.
This makes batch decode throughput independent of data pattern:
- V1 PLAIN batch RANDOM: 2,757M -> 5,047M ops/s (+83%)
- V2 RLE batch RANDOM: 1,335M -> 1,618M ops/s (+21%)
- V2 RLE batch MOSTLY_TRUE_99: 3,205M -> 3,745M ops/s (+17%)
- Uniform patterns (ALL_TRUE/FALSE): unchanged (still Arrays.fill)
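
The lookup-table variant could be sketched as follows, again assuming LSB-first packing and a byte-aligned batch start. `BooleanLookupSketch` is a hypothetical name. Whether HotSpot actually emits a single 64-bit load/store for the 8-element copy is the commit's claim and is not something this sketch demonstrates.

```java
final class BooleanLookupSketch {
  /** TABLE[b][i] is bit i (LSB first) of byte value b, pre-decoded once. */
  private static final boolean[][] TABLE = new boolean[256][8];

  static {
    for (int b = 0; b < 256; b++) {
      for (int bit = 0; bit < 8; bit++) {
        TABLE[b][bit] = ((b >>> bit) & 1) != 0;
      }
    }
  }

  static void readBooleans(byte[] page, int firstByte, boolean[] dest, int offset, int length) {
    int fullBytes = length >>> 3;
    for (int b = 0; b < fullBytes; b++) {
      // `& 0xFF` turns the signed byte into a 0..255 table index.
      System.arraycopy(TABLE[page[firstByte + b] & 0xFF], 0, dest, offset + (b << 3), 8);
    }
    int tail = length & 7;
    if (tail != 0) {
      System.arraycopy(TABLE[page[firstByte + fullBytes] & 0xFF], 0, dest, offset + (fullBytes << 3), tail);
    }
  }
}
```

Because every byte costs the same table lookup and copy, the work per byte does not depend on how many bits are set, which is consistent with the pattern-independent throughput reported above.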

…king
Refactor BooleanPlainValuesWriter to pack bits directly into bytes instead of delegating through ByteBitPackingValuesWriter and the generic int[8]-based ByteBasedBitPackingEncoder.
Add batch writeBooleans() API to ValuesWriter with optimized overrides:
- PLAIN: processes 8 booleans at a time into single bytes with OR/shift, eliminating the per-value method call chain and int[] intermediate.
- RLE: pre-scans for runs >= 8 to emit RLE directly, fills partial bit-packed groups from run boundaries to avoid spurious padding.
PLAIN scalar improves +69% (890M -> 1,500M ops/s) from the refactoring.
PLAIN batch: +184% over old scalar (2,528M for RANDOM).
RLE batch: +278% for ALL_FALSE, +95% for MOSTLY_*, +36% for ALTERNATING.
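
For the PLAIN writer side, a minimal sketch of packing 8 booleans into one byte with OR and shifts might look like this. `BooleanPackSketch` and its return-a-byte[] shape are assumptions for illustration (the real writer appends to an output stream), and the RLE run pre-scan is not shown.

```java
final class BooleanPackSketch {
  /** Packs values[offset, offset + length) one bit per value, LSB first. */
  static byte[] writeBooleans(boolean[] values, int offset, int length) {
    byte[] out = new byte[(length + 7) >>> 3];
    int i = 0;
    int o = 0;
    // Full groups: 8 booleans become one byte, no int[] intermediate.
    for (; i + 8 <= length; i += 8) {
      int p = offset + i;
      out[o++] = (byte) ((values[p] ? 0x01 : 0)
          | (values[p + 1] ? 0x02 : 0)
          | (values[p + 2] ? 0x04 : 0)
          | (values[p + 3] ? 0x08 : 0)
          | (values[p + 4] ? 0x10 : 0)
          | (values[p + 5] ? 0x20 : 0)
          | (values[p + 6] ? 0x40 : 0)
          | (values[p + 7] ? 0x80 : 0));
    }
    // Tail: fewer than 8 values left; unused high bits stay zero.
    if (i < length) {
      int bits = 0;
      for (int bit = 0; i < length; i++, bit++) {
        if (values[offset + i]) {
          bits |= 1 << bit;
        }
      }
      out[o] = (byte) bits;
    }
    return out;
  }
}
```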

…riteDoubles with bulk ByteBuffer view transfers
Add bulk write methods to CapacityByteArrayOutputStream (writeInts, writeLongs, writeFloats, writeDoubles) that use IntBuffer/LongBuffer/FloatBuffer/DoubleBuffer view puts to transfer entire arrays in one operation, amortizing capacity checks across the batch.
Add corresponding batch APIs to ValuesWriter (with scalar default) and optimized overrides in PlainValuesWriter.
Performance improvement (100K values, JDK 25):
INT32: 566M -> 2,809M ops/s (+396%)
FLOAT: 540M -> 2,818M ops/s (+422%)
INT64: 479M -> 1,306M ops/s (+173%)
DOUBLE: 442M -> 1,275M ops/s (+189%)
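
The bulk write idea can be sketched with a single growable `ByteBuffer`. This is an assumption-laden simplification: `BulkWriteSketch` is hypothetical and keeps one contiguous buffer, whereas CapacityByteArrayOutputStream manages its storage differently. What it shows is one capacity check per batch followed by one `IntBuffer` view put.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class BulkWriteSketch {
  private ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);

  void writeInts(int[] values, int offset, int length) {
    int bytes = length * Integer.BYTES;
    ensureCapacity(bytes); // one capacity check for the whole batch
    // The view starts at buf.position() and inherits the little-endian order.
    buf.asIntBuffer().put(values, offset, length);
    buf.position(buf.position() + bytes);
  }

  private void ensureCapacity(int extra) {
    if (buf.remaining() < extra) {
      int newCapacity = Math.max(buf.capacity() * 2, buf.position() + extra);
      ByteBuffer bigger = ByteBuffer.allocate(newCapacity).order(ByteOrder.LITTLE_ENDIAN);
      buf.flip();
      bigger.put(buf); // copies the bytes written so far
      buf = bigger;
    }
  }

  byte[] toByteArray() {
    return Arrays.copyOf(buf.array(), buf.position());
  }
}
```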

- ValuesReader.readBinaries() / ValuesWriter.writeBinaries() default impls
- FixedLenByteArrayPlainValuesReader: bulk slice() with fixed-offset Binary views
- FixedLenByteArrayPlainValuesWriter: chunked bulk write() amortizing stream overhead
- ByteStreamSplitValuesReader: optimized array-based decode with unrolled loops for element sizes 2, 4, 8, 12, 16
- ByteStreamSplitValuesReaderForFLBA: batch readBinaries() with single advanceByteOffset
- FixedLenByteArrayEncodingBenchmark: full FLBA benchmark suite with batch variants
- Add TestDataFactory and BenchmarkEncodingUtils helper classes
- Fix JMH annotation processor config in pom.xml for Maven Compiler 3.14+

… writes
Replace per-value scatterBytes() in FixedLenByteArrayByteStreamSplitValuesWriter with a BATCH_SIZE=64 buffered scatter pattern:
- Accumulate byte values into per-stream batch buffers
- Flush as bulk write(byte[], 0, count) to each stream
- Eliminates N*elementSize individual stream.write(byte) calls per batch
- Adds writeBinaries() batch override for FLBA BSS writer
Performance improvement: FLBA size=2 +85%, size=16 +160% (vs per-byte scatter).
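
A minimal sketch of the buffered scatter pattern, assuming one in-memory stream per byte position of the fixed-length value. `BufferedScatterSketch` and its methods are hypothetical, and `ByteArrayOutputStream` stands in for the writer's real per-stream output.

```java
import java.io.ByteArrayOutputStream;

final class BufferedScatterSketch {
  private static final int BATCH_SIZE = 64;

  private final int elementSize;
  private final ByteArrayOutputStream[] streams; // one stream per byte position
  private final byte[][] batch;                  // per-stream batch buffers
  private int count;                             // values buffered so far

  BufferedScatterSketch(int elementSize) {
    this.elementSize = elementSize;
    this.streams = new ByteArrayOutputStream[elementSize];
    this.batch = new byte[elementSize][BATCH_SIZE];
    for (int s = 0; s < elementSize; s++) {
      streams[s] = new ByteArrayOutputStream();
    }
  }

  /** Scatters byte s of the value into batch buffer s; flushes when full. */
  void write(byte[] value) {
    for (int s = 0; s < elementSize; s++) {
      batch[s][count] = value[s];
    }
    if (++count == BATCH_SIZE) {
      flush();
    }
  }

  /** One bulk write per stream instead of `count` single-byte writes. */
  void flush() {
    for (int s = 0; s < elementSize; s++) {
      streams[s].write(batch[s], 0, count);
    }
    count = 0;
  }

  /** Concatenates the streams in order, as byte-stream-split lays them out. */
  byte[] getBytes() {
    flush();
    ByteArrayOutputStream all = new ByteArrayOutputStream();
    for (ByteArrayOutputStream stream : streams) {
      byte[] part = stream.toByteArray();
      all.write(part, 0, part.length);
    }
    return all.toByteArray();
  }
}
```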

- Add FileReadBenchmark / FileWriteBenchmark with SS warmup=5, measurement=10
- Add RowGroupFlushBenchmark with warmup=3, measurement=5
- Add RleDictionaryIndexDecodingBenchmark with encodeDictionaryIds() and ValuesReader-level decode benchmarks (decodeValuesReader, decodeValuesReaderBatch)
- Add BlackHoleOutputFile for write benchmarks without I/O overhead
- Adapt RLE decoder instantiation to use InputStream (par13 API)

iemejia · 2026-05-17T22:42:32Z

Closing — the batch read APIs will be superseded by forthcoming PRs for column I/O (v2-par7-column-io) and level write batching (v2-par9-level-write-batching), which are not yet opened because they depend on #3565 (PLAIN), #3568 (RLE), and other encoding PRs being merged first.

I initially submitted a series of small, focused PRs thinking they'd be easier to review. In practice the sheer number (~16 PRs, with more pending) made things harder to follow — even for me. I've regrouped the changes by encoding type / performance area so that each PR is self-contained with its own benchmarks and test coverage, which should make review and performance analysis much more straightforward.

Apologies for the churn. The replacement PRs will be opened once their dependencies land. Thank you.

iemejia mentioned this pull request May 6, 2026

Apache Parquet Java Performance Improvements #3530

Open

iemejia force-pushed the perf-batch-read-api branch 2 times, most recently from 775d723 to bcf585a May 13, 2026 19:27

iemejia added 10 commits May 13, 2026 21:30

iemejia force-pushed the perf-batch-read-api branch from bcf585a to ec6408d May 13, 2026 19:30

iemejia marked this pull request as draft May 15, 2026 09:35

iemejia closed this May 17, 2026

iemejia deleted the perf-batch-read-api branch May 17, 2026 22:56

iemejia wants to merge 10 commits into apache:master from iemejia:perf-batch-read-api
