GH-3522: Add batch read APIs to ValuesReader hierarchy #3535
Force-pushed from 775d723 to bcf585a.
Add readIntegers(), readLongs(), readFloats(), readDoubles() batch methods to ValuesReader with default loop-based implementations.

Override in:
- RunLengthBitPackingHybridDecoder.readInts(): batch across RLE runs and packed groups using Arrays.fill/System.arraycopy
- DictionaryValuesReader: batch-decode dictionary IDs first, then batch-lookup values (eliminates per-value IOException try/catch)
- DeltaBinaryPackingValuesReader: System.arraycopy from pre-decoded buffer
- PlainValuesReader (all types): loop over LittleEndianDataInputStream
- ByteStreamSplitValuesReader (all types): indexed ByteBuffer bulk read

These APIs enable callers to amortize per-value overhead (virtual dispatch, bounds checks, mode switches) across batches. On the perf branch where the RLE decoder uses ByteBuffer, this yielded +148% RLE throughput and +67% dictionary decode throughput.
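As a rough illustration of the pattern described above (a loop-based default plus an override that copies from a pre-decoded buffer), here is a minimal sketch. The class names `SketchReader` and `BufferedSketchReader` and the `(int[] dest, int offset, int count)` signature are assumptions made for this example; they are not taken from the PR or from parquet-java.

```java
// Hypothetical stand-in for a ValuesReader-style base class (not the real parquet-java type).
abstract class SketchReader {
  abstract int readInteger();

  // Default batch path: one per-value call for each element.
  // The (dest, offset, count) signature is an assumption for this sketch.
  void readIntegers(int[] dest, int offset, int count) {
    for (int i = 0; i < count; i++) {
      dest[offset + i] = readInteger();
    }
  }
}

// Reader over an already decoded buffer; overrides the batch path with one bulk copy.
final class BufferedSketchReader extends SketchReader {
  private final int[] decoded;
  private int pos;

  BufferedSketchReader(int[] decoded) {
    this.decoded = decoded;
  }

  @Override
  int readInteger() {
    return decoded[pos++];
  }

  @Override
  void readIntegers(int[] dest, int offset, int count) {
    // A single arraycopy replaces `count` virtual calls.
    System.arraycopy(decoded, pos, dest, offset, count);
    pos += count;
  }
}
```

Callers written against the base class keep working with readers that do not override the batch method, which is presumably why a loop-based default is provided.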
RunLengthBitPackingHybridValuesReader inherited the default loop from ValuesReader.readIntegers() which called readInteger() per value. Delegate to decoder.readInts() which uses Arrays.fill for RLE runs and System.arraycopy for packed groups.

Benchmark (100K values, SEQUENTIAL/RANDOM/LOW_CARDINALITY):
Before: ~556M ops/s (same as per-value path)
After: ~1,270M ops/s (+128%)

This matters for def/rep level decoding on every data page, BOOLEAN columns in V2 pages, and any direct RLE consumers using batch APIs.
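The fill-versus-copy split can be sketched as follows. This is a simplified model, not the decoder from the PR: the `Run` type is hypothetical, run headers are not parsed, and packed groups are assumed to be unpacked into an int[] already.

```java
import java.util.Arrays;

final class RunSketch {
  // Hypothetical run model: a repeated value (values == null) or a group of unpacked values.
  static final class Run {
    final int value;
    final int count;
    final int[] values;

    Run(int value, int count) {
      this.value = value;
      this.count = count;
      this.values = null;
    }

    Run(int[] values) {
      this.value = 0;
      this.count = values.length;
      this.values = values;
    }
  }

  // Writes all runs into dest starting at offset and returns the number of values written.
  static int readInts(Run[] runs, int[] dest, int offset) {
    int pos = offset;
    for (Run run : runs) {
      if (run.values == null) {
        // RLE run: one fill call regardless of run length.
        Arrays.fill(dest, pos, pos + run.count, run.value);
      } else {
        // Packed group: bulk copy of the unpacked values.
        System.arraycopy(run.values, 0, dest, pos, run.count);
      }
      pos += run.count;
    }
    return pos - offset;
  }
}
```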
Replace per-value getXxx(offset) loops with position()+asXxxBuffer().get() bulk copy in readFloats/readDoubles/readIntegers/readLongs. The decoded data buffer is a contiguous heap byte[] in LE order, making view buffer bulk reads a single memcpy via Unsafe.copyMemory.

Benchmark results (100K values, BSS FLOAT batch):
Before: ~1,228M ops/s
After: ~1,442M ops/s (+17%)

INT32/INT64/DOUBLE show negligible change because BSS invocation cost is dominated by page transposition in initFromPage, not the read loop.
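A minimal sketch of the position()+asFloatBuffer().get() idiom, using only standard java.nio calls. The class and method names and the parameter list are illustrative assumptions, not the PR's code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BulkViewRead {
  // Copies `count` little-endian floats starting at byteOffset into dest in one bulk get.
  static void readFloats(ByteBuffer decoded, int byteOffset, float[] dest, int destOffset, int count) {
    // duplicate() leaves the caller's position untouched; the byte order is set explicitly on the copy.
    ByteBuffer view = decoded.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    view.position(byteOffset);
    // The float view starts at the current position and inherits the byte order.
    view.asFloatBuffer().get(dest, destOffset, count);
  }
}
```

The same shape applies to asIntBuffer(), asLongBuffer() and asDoubleBuffer() for the other primitive types.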
…ns()
Replace ByteBitPackingValuesReader delegation in BooleanPlainValuesReader with direct bit extraction from the page byte[]. The scalar path uses a single array access + shift + mask instead of the 8-element int[] buffer and packer dispatch. The batch path (readBooleans) unrolls 8 booleans per byte with constant masks.

For RLE (V2), add a native readBooleans() method that uses Arrays.fill for RLE runs (constant-time for uniform data) and direct int-to-boolean conversion for packed groups, avoiding the intermediate int[] allocation of the readInts() path.

Benchmark results (1M values, JDK 25, Compiler Blackholes):
- V1 PLAIN scalar: ~620M -> ~1,528-1,618M ops/s (+150%)
- V1 PLAIN batch: ALL_TRUE/FALSE ~5,000M (+680%), RANDOM 2,757M (+337%)
- V2 RLE batch: ALL_TRUE/FALSE ~190B (fill), RANDOM 1,335M (+93%)
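The scalar and unrolled batch paths might look roughly like this. It is a sketch under two assumptions: bits are stored least-significant-bit first within each byte, and the batch count is a multiple of 8. The names are illustrative and not from the PR.

```java
final class BitExtractSketch {
  // Scalar path: one array access, one shift, one mask (LSB-first bit order assumed).
  static boolean readBoolean(byte[] page, int bitIndex) {
    return ((page[bitIndex >>> 3] >>> (bitIndex & 7)) & 1) != 0;
  }

  // Batch path: 8 booleans per byte with constant masks.
  // For brevity this sketch requires count to be a multiple of 8.
  static void readBooleans(byte[] page, int byteOffset, boolean[] dest, int destOffset, int count) {
    for (int i = 0; i < count; i += 8) {
      int b = page[byteOffset + (i >>> 3)];
      int d = destOffset + i;
      dest[d] = (b & 0x01) != 0;
      dest[d + 1] = (b & 0x02) != 0;
      dest[d + 2] = (b & 0x04) != 0;
      dest[d + 3] = (b & 0x08) != 0;
      dest[d + 4] = (b & 0x10) != 0;
      dest[d + 5] = (b & 0x20) != 0;
      dest[d + 6] = (b & 0x40) != 0;
      dest[d + 7] = (b & 0x80) != 0;
    }
  }
}
```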
Replace the per-bit unrolled extraction loop with a static boolean[256][8] lookup table + System.arraycopy. Each byte maps to its 8 pre-decoded booleans, and the 8-byte copy is emitted by HotSpot as a single 64-bit load/store pair, the boolean equivalent of asIntBuffer().get() for ints.

For RLE PACKED groups (bitWidth=1), bypass the int[] intermediate and read directly from the raw packed bytes via the same lookup table.

This makes batch decode throughput independent of data pattern:
- V1 PLAIN batch RANDOM: 2,757M -> 5,047M ops/s (+83%)
- V2 RLE batch RANDOM: 1,335M -> 1,618M ops/s (+21%)
- V2 RLE batch MOSTLY_TRUE_99: 3,205M -> 3,745M ops/s (+17%)
- Uniform patterns (ALL_TRUE/FALSE): unchanged (still Arrays.fill)
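A lookup-table decode of this kind could be sketched as below, again assuming LSB-first bit order and whole bytes only. The class, field and method names are hypothetical.

```java
final class BooleanLookupSketch {
  // TABLE[b] holds the 8 booleans encoded by byte value b (LSB-first order assumed).
  private static final boolean[][] TABLE = new boolean[256][8];

  static {
    for (int b = 0; b < 256; b++) {
      for (int bit = 0; bit < 8; bit++) {
        TABLE[b][bit] = ((b >>> bit) & 1) != 0;
      }
    }
  }

  // Decodes byteCount packed bytes into 8 * byteCount booleans.
  static void readBooleans(byte[] packed, int byteOffset, boolean[] dest, int destOffset, int byteCount) {
    for (int i = 0; i < byteCount; i++) {
      // Mask with 0xFF so negative byte values index the table correctly.
      boolean[] row = TABLE[packed[byteOffset + i] & 0xFF];
      System.arraycopy(row, 0, dest, destOffset + (i << 3), 8);
    }
  }
}
```

Because every byte costs the same table load and 8-element copy, the work per byte does not depend on how many bits are set, which matches the pattern-independence claim in the commit message.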
…king
Refactor BooleanPlainValuesWriter to pack bits directly into bytes instead of delegating through ByteBitPackingValuesWriter and the generic int[8]-based ByteBasedBitPackingEncoder.

Add batch writeBooleans() API to ValuesWriter with optimized overrides:
- PLAIN: processes 8 booleans at a time into single bytes with OR/shift, eliminating the per-value method call chain and int[] intermediate.
- RLE: pre-scans for runs >= 8 to emit RLE directly, fills partial bit-packed groups from run boundaries to avoid spurious padding.

PLAIN scalar improves +69% (890M -> 1,500M ops/s) from the refactoring.
PLAIN batch: +184% over old scalar (2,528M for RANDOM).
RLE batch: +278% for ALL_FALSE, +95% for MOSTLY_*, +36% for ALTERNATING.
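The PLAIN batch path (8 booleans combined into one byte with OR/shift) can be sketched as follows. LSB-first bit order and zero padding of the last byte are assumptions, and the method here returns a fresh byte[] rather than writing to the writer's output stream.

```java
final class BitPackSketch {
  // Packs booleans into bytes, LSB-first; the final partial byte is zero padded.
  static byte[] writeBooleans(boolean[] values) {
    byte[] out = new byte[(values.length + 7) >>> 3];
    int full = values.length & ~7; // number of values that fill whole bytes
    int i = 0;
    for (; i < full; i += 8) {
      int b = (values[i] ? 0x01 : 0)
          | (values[i + 1] ? 0x02 : 0)
          | (values[i + 2] ? 0x04 : 0)
          | (values[i + 3] ? 0x08 : 0)
          | (values[i + 4] ? 0x10 : 0)
          | (values[i + 5] ? 0x20 : 0)
          | (values[i + 6] ? 0x40 : 0)
          | (values[i + 7] ? 0x80 : 0);
      out[i >>> 3] = (byte) b;
    }
    // Tail: fewer than 8 values remain, set their bits one at a time.
    for (; i < values.length; i++) {
      if (values[i]) {
        out[i >>> 3] |= (byte) (1 << (i & 7));
      }
    }
    return out;
  }
}
```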
…riteDoubles with bulk ByteBuffer view transfers
Add bulk write methods to CapacityByteArrayOutputStream (writeInts, writeLongs, writeFloats, writeDoubles) that use IntBuffer/LongBuffer/FloatBuffer/DoubleBuffer view puts to transfer entire arrays in one operation, amortizing capacity checks across the batch.

Add corresponding batch APIs to ValuesWriter (with scalar default) and optimized overrides in PlainValuesWriter.

Performance improvement (100K values, JDK 25):
INT32: 566M -> 2,809M ops/s (+396%)
FLOAT: 540M -> 2,818M ops/s (+422%)
INT64: 479M -> 1,306M ops/s (+173%)
DOUBLE: 442M -> 1,275M ops/s (+189%)
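To illustrate the view-put idea, here is a small self-contained sketch with one capacity check per batch. `BulkWriteSketch` is a hypothetical single-buffer stand-in; it does not model the internals of CapacityByteArrayOutputStream.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class BulkWriteSketch {
  private ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);

  // Writes count ints from values[offset..] with one capacity check and one bulk put.
  void writeInts(int[] values, int offset, int count) {
    int bytes = count * Integer.BYTES;
    ensureCapacity(bytes);
    // The int view starts at the current byte position and inherits the byte order.
    buf.asIntBuffer().put(values, offset, count);
    // Writing through the view does not move the byte buffer's position, so advance it here.
    buf.position(buf.position() + bytes);
  }

  private void ensureCapacity(int extra) {
    if (buf.remaining() >= extra) {
      return;
    }
    int newCapacity = Math.max(buf.capacity() * 2, buf.position() + extra);
    ByteBuffer grown = ByteBuffer.allocate(newCapacity).order(ByteOrder.LITTLE_ENDIAN);
    buf.flip();
    grown.put(buf);
    buf = grown;
  }

  byte[] toByteArray() {
    return Arrays.copyOf(buf.array(), buf.position());
  }
}
```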
- ValuesReader.readBinaries() / ValuesWriter.writeBinaries() default impls
- FixedLenByteArrayPlainValuesReader: bulk slice() with fixed-offset Binary views
- FixedLenByteArrayPlainValuesWriter: chunked bulk write() amortizing stream overhead
- ByteStreamSplitValuesReader: optimized array-based decode with unrolled loops for element sizes 2, 4, 8, 12, 16
- ByteStreamSplitValuesReaderForFLBA: batch readBinaries() with single advanceByteOffset
- FixedLenByteArrayEncodingBenchmark: full FLBA benchmark suite with batch variants
- Add TestDataFactory and BenchmarkEncodingUtils helper classes
- Fix JMH annotation processor config in pom.xml for Maven Compiler 3.14+
… writes
Replace per-value scatterBytes() in FixedLenByteArrayByteStreamSplitValuesWriter with a BATCH_SIZE=64 buffered scatter pattern:
- Accumulate byte values into per-stream batch buffers
- Flush as bulk write(byte[], 0, count) to each stream
- Eliminates N*elementSize individual stream.write(byte) calls per batch
- Adds writeBinaries() batch override for FLBA BSS writer

Performance improvement: FLBA size=2 +85%, size=16 +160% (vs per-byte scatter).
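The buffered scatter pattern listed above can be sketched like this. Only BATCH_SIZE=64 comes from the commit message; the class name, the use of ByteArrayOutputStream for the per-stream sinks, and the `stream(int)` accessor are assumptions for the example.

```java
import java.io.ByteArrayOutputStream;

final class BufferedScatterSketch {
  private static final int BATCH_SIZE = 64;

  private final int elementSize;
  private final ByteArrayOutputStream[] streams; // one output stream per byte position
  private final byte[][] batch;                  // one staging buffer per stream
  private int count;

  BufferedScatterSketch(int elementSize) {
    this.elementSize = elementSize;
    this.streams = new ByteArrayOutputStream[elementSize];
    this.batch = new byte[elementSize][BATCH_SIZE];
    for (int s = 0; s < elementSize; s++) {
      streams[s] = new ByteArrayOutputStream();
    }
  }

  // Scatters byte s of the value into staging buffer s; flushes when the batch is full.
  void write(byte[] value) {
    for (int s = 0; s < elementSize; s++) {
      batch[s][count] = value[s];
    }
    if (++count == BATCH_SIZE) {
      flush();
    }
  }

  // One bulk write per stream instead of one write(byte) call per value per stream.
  void flush() {
    for (int s = 0; s < elementSize; s++) {
      streams[s].write(batch[s], 0, count);
    }
    count = 0;
  }

  byte[] stream(int s) {
    flush();
    return streams[s].toByteArray();
  }
}
```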
- Add FileReadBenchmark / FileWriteBenchmark with SS warmup=5, measurement=10
- Add RowGroupFlushBenchmark with warmup=3, measurement=5
- Add RleDictionaryIndexDecodingBenchmark with encodeDictionaryIds() and ValuesReader-level decode benchmarks (decodeValuesReader, decodeValuesReaderBatch)
- Add BlackHoleOutputFile for write benchmarks without I/O overhead
- Adapt RLE decoder instantiation to use InputStream (par13 API)
Force-pushed from bcf585a to ec6408d.
Closing: the batch read APIs will be superseded by forthcoming PRs for column I/O (v2-par7-column-io) and level write batching (v2-par9-level-write-batching), which are not yet opened because they depend on #3565 (PLAIN), #3568 (RLE), and other encoding PRs being merged first.

I initially submitted a series of small, focused PRs thinking they'd be easier to review. In practice the sheer number (~16 PRs, with more pending) made things harder to follow, even for me. I've regrouped the changes by encoding type / performance area so that each PR is self-contained with its own benchmarks and test coverage, which should make review and performance analysis much more straightforward.

Apologies for the churn. The replacement PRs will be opened once their dependencies land. Thank you.
Summary
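- Add readIntegers(), readLongs(), readFloats(), readDoubles() batch methods to ValuesReader with default loop-based implementations
Overrides
- RunLengthBitPackingHybridDecoder.readInts(): batch across RLE runs and packed groups using Arrays.fill/System.arraycopy
- DeltaBinaryPackingValuesReader: System.arraycopy from pre-decoded buffer
Rationale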
These APIs enable callers to amortize per-value overhead (virtual dispatch, bounds checks, mode switches) across batches. Combined with other optimizations in this series (ByteBuffer-based RLE decoder, etc.), batch reads yield significant throughput improvements over per-value loops.
All 576 parquet-column tests pass.