[WIP] Parquet Java ALP Implementation by vinooganesh · Pull Request #3397 · apache/parquet-java

vinooganesh · 2026-02-17T23:15:47Z

Rationale for this change

Reworks the ALP encoding implementation to address emkornfield's architectural feedback on PR #3390. The original buffered all values in memory and decoded eagerly. This makes the writer incremental (encode per-vector as values arrive) and the reader lazy (decode on demand), matching how other Parquet encodings work.

Builds on Julien Le Dem's original implementation (#3390). File structure, integration points, core math, and interop test infrastructure all come from his work. The rework focused on the internal writer/reader plumbing.

What changes are included in this PR?

Architecture (addressing review feedback):

Incremental writer. Values buffer in a fixed-size vector; each full vector is encoded and flushed immediately.
Lazy reader. Vectors decode on first access via an offset array, so skip() is O(1).
Interleaved page layout so each vector is self-contained.
Extracted AlpValuesReader abstract base class for shared logic.
Preset caching. Full parameter search for first 8 vectors, top 5 combos cached for the rest.
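
To illustrate the lazy-reader idea from the list above, here is a minimal sketch, not the PR's actual AlpValuesReader. The class name, the `decodeCount` field, and the `vectorDecoder` callback are hypothetical; the callback stands in for "locate vector i through the offset array and ALP-decode it".

```java
import java.util.function.IntFunction;

/** Hypothetical lazy reader: a vector is decoded only when a value inside it is read. */
final class LazyVectorReaderSketch {
    private final int vectorSize;
    private final int totalValues;
    // Stand-in for the real work: find vector i via the offset array and decode it.
    private final IntFunction<double[]> vectorDecoder;

    private double[] currentVector;      // decoded values of the cached vector, or null
    private int currentVectorIndex = -1; // index of the cached vector
    private int position = 0;            // index of the next value to return
    int decodeCount = 0;                 // exposed only to make the laziness observable

    LazyVectorReaderSketch(int vectorSize, int totalValues, IntFunction<double[]> vectorDecoder) {
        this.vectorSize = vectorSize;
        this.totalValues = totalValues;
        this.vectorDecoder = vectorDecoder;
    }

    /** O(1): only the cursor moves; no vector is decoded. */
    void skip(int n) {
        if (n < 0 || position + n > totalValues) {
            throw new IllegalArgumentException("cannot skip " + n + " values at position " + position);
        }
        position += n;
    }

    /** Decodes the vector containing the cursor on first access, then serves values from the cache. */
    double readDouble() {
        if (position >= totalValues) {
            throw new IllegalStateException("no more values");
        }
        int vectorIndex = position / vectorSize;
        if (vectorIndex != currentVectorIndex) {
            currentVector = vectorDecoder.apply(vectorIndex);
            currentVectorIndex = vectorIndex;
            decodeCount++;
        }
        double value = currentVector[position % vectorSize];
        position++;
        return value;
    }
}
```

Skipping across a vector boundary never touches the skipped vector's bytes; the cost is paid only by the next read.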

Spec compliance:

Fixed packed data size formula to ceil(n * bitWidth / 8)
Fixed unsigned delta comparison in float writer
Explicit little-endian byte reads instead of relying on ByteBuffer order
Using parquet-encoding's BytePacker instead of custom bit-packing
Capped max vector size at 32768 to prevent uint16 overflow in num_exceptions
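
The packed-size formula and the explicit little-endian reads listed above can be sketched as follows. This is an illustrative helper, not code from the PR: the class and method names are hypothetical, and it reads from a plain byte array rather than a ByteBuffer.

```java
/** Hypothetical helpers for two of the spec-compliance items above. */
final class AlpSpecHelpersSketch {

    /** ceil(n * bitWidth / 8), computed in long arithmetic so the product cannot overflow an int. */
    static int packedSizeBytes(int numValues, int bitWidth) {
        long bits = (long) numValues * bitWidth;
        return (int) ((bits + 7) / 8);
    }

    /** Reads a 32-bit little-endian int regardless of any buffer byte-order setting. */
    static int getIntLE(byte[] buf, int offset) {
        return (buf[offset] & 0xFF)
                | ((buf[offset + 1] & 0xFF) << 8)
                | ((buf[offset + 2] & 0xFF) << 16)
                | ((buf[offset + 3] & 0xFF) << 24);
    }

    /** Reads an unsigned 16-bit little-endian value (for example an exception count) into an int. */
    static int getUnsignedShortLE(byte[] buf, int offset) {
        return (buf[offset] & 0xFF) | ((buf[offset + 1] & 0xFF) << 8);
    }
}
```

Masking each byte with 0xFF matters because Java bytes are signed; without it, a byte such as 0xFF would sign-extend and corrupt the higher bits.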

Code quality:

Renamed bitWidth overloads to prevent silent type coercion
Package-private visibility for internals
Configurable vector size (default 1024)

Integration:

Wired ALP into DefaultV2ValuesWriterFactory and ParquetProperties

Are these changes tested?

Yes. 105 tests across 3 test classes, all passing. Full parquet-column suite (677 tests) also passes.

Key tests construct ALP page bytes directly according to the spec and feed them to the reader without going through the writer. This verifies the reader works independently and catches any bugs where writer and reader agree with each other but disagree with the spec. Also covers NaN bit pattern preservation, negative zero roundtrip, extreme values, every partial vector remainder mod 8, skip across vector boundaries, and preset caching under distribution change.

Are there any user-facing changes?

Users can enable ALP encoding for FLOAT and DOUBLE columns via ParquetProperties.withAlpEncoding(), globally or per-column.

Note: Likely me missing something - but ALP is not yet in the parquet-format Thrift spec (apache/parquet-format#533), so writing ALP files through the full Hadoop pipeline will fail at metadata serialization until parquet.thrift is updated (parquet-format PR #548).

Implements ALP encoding for FLOAT and DOUBLE types, which converts floating-point values to integers using decimal scaling, then applies Frame of Reference (FOR) encoding and bit-packing for compression.

New files:
- AlpConstants.java: Constants for ALP encoding
- AlpEncoderDecoder.java: Core encoding/decoding logic
- AlpValuesWriter.java: Writer implementation
- AlpValuesReaderForFloat/Double.java: Reader implementations

Includes comprehensive unit tests and interop test infrastructure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
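
As a rough illustration of the Frame of Reference step mentioned in this commit message, the sketch below takes integers that have already been produced by decimal scaling, subtracts the vector minimum, and computes the bit width needed for the largest delta. The class and method names are hypothetical and the code is not taken from AlpEncoderDecoder.java.

```java
/** Hypothetical sketch of Frame of Reference (FOR) on already-scaled integers. */
final class ForSketch {

    /** Smallest value in a non-empty vector; used as the frame of reference. */
    static long minOf(long[] values) {
        long min = values[0];
        for (long v : values) {
            if (v < min) {
                min = v;
            }
        }
        return min;
    }

    /** Subtracts the reference from every value, leaving non-negative deltas (viewed as unsigned). */
    static long[] subtractReference(long[] values, long reference) {
        long[] deltas = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - reference;
        }
        return deltas;
    }

    /** Bits needed for the largest delta, compared as unsigned; 0 when every value is equal. */
    static int bitWidth(long[] deltas) {
        long max = 0;
        for (long d : deltas) {
            if (Long.compareUnsigned(d, max) > 0) {
                max = d;
            }
        }
        return 64 - Long.numberOfLeadingZeros(max);
    }

    public static void main(String[] args) {
        long[] scaled = {123, 456, 789}; // e.g. 1.23, 4.56, 7.89 scaled by 10^2
        long reference = minOf(scaled);
        long[] deltas = subtractReference(scaled, reference);
        System.out.println("reference=" + reference + " bitWidth=" + bitWidth(deltas));
    }
}
```

The deltas are compared as unsigned because the difference between a very large and a very small long can wrap into the sign bit.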

Restore original comment indentation that was accidentally changed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Escape <= characters as &lt;= in javadoc comments to avoid malformed HTML errors during documentation generation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

ALP encoding is not yet part of the parquet-format Thrift specification, so it cannot be converted to org.apache.parquet.format.Encoding. Skip it in the testEnumEquivalence test and add a clear error message in the converter for when ALP conversion is attempted. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

size and add independent reader/writer verification tests

Switch encode/decode from division-based formula to multiply-by-reciprocal using separate POW10_NEGATIVE arrays, matching C++ Arrow's approach:
- Encode: fastRound(value * POW10[e] * POW10_NEGATIVE[f])
- Decode: encoded * POW10[f] * POW10_NEGATIVE[e]

Add fastRound helpers with sign branching for correct negative value rounding. Remove version byte from page header (8 -> 7 bytes). Empty pages now emit a 7-byte header with numElements=0. Update all hand-crafted binary tests to match the new header format and add comprehensive end-to-end tests for overflow boundaries, large-scale data, preset caching, and NaN bit-pattern preservation.
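
The encode and decode formulas in this commit message can be sketched as below. This is an illustration under assumptions, not the PR's code: the tables are truncated to exponents 0..10, and the body of `fastRound` is a guess at what "sign branching" means (round half away from zero), since the commit message does not show it.

```java
/** Hypothetical sketch of multiply-by-reciprocal ALP scaling for doubles. */
final class AlpScalingSketch {
    // Truncated tables for illustration; a real implementation would cover every supported exponent.
    static final double[] POW10 = {1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10};
    static final double[] POW10_NEGATIVE = {1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, 1e-10};

    /** Assumed form: round half away from zero, branching on sign so negatives round symmetrically. */
    static long fastRound(double x) {
        return x >= 0 ? (long) (x + 0.5) : (long) (x - 0.5);
    }

    /** Encode: fastRound(value * POW10[e] * POW10_NEGATIVE[f]). */
    static long encode(double value, int e, int f) {
        return fastRound(value * POW10[e] * POW10_NEGATIVE[f]);
    }

    /** Decode: encoded * POW10[f] * POW10_NEGATIVE[e]. */
    static double decode(long encoded, int e, int f) {
        return encoded * POW10[f] * POW10_NEGATIVE[e];
    }

    /** A value must be stored as an exception when the round trip does not reproduce its exact bits. */
    static boolean isException(double value, int e, int f) {
        double roundTripped = decode(encode(value, e, f), e, f);
        return Double.doubleToRawLongBits(roundTripped) != Double.doubleToRawLongBits(value);
    }
}
```

With e=2 and f=0, a value such as 1.23 survives the round trip, while Math.PI does not and would be recorded as an exception.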

- Rewrite TestInterOpReadAlp to use LocalInputFile instead of Hadoop FileSystem, fixing failures on Java 24+ where Subject.getSubject is removed. Tests now read C++ ALP parquet files directly without going through Hadoop security/UGI.
- Add AlpExceptionCountTest with per-column exception rate reporting against the real Spotify and Arade floating-point datasets from the parquet-testing repository. Useful for comparing Java vs C++ ALP compression ratios.

- Switch findBestFloatParams/findBestDoubleParams from minimizing exception count to minimizing estimated compressed size (length * bitWidth + exceptions * (typeSize + 2 bytes)), matching the C++ ALP cost model. This closes the ~4-5% compression gap vs C++.
- Rewrite sampler to collect evenly-spaced sample vectors and run findBestParams on each, then rank by win count. Matches C++ AlpSampler behavior more closely than the previous HashMap-based approach.
- Minor fixes: IOExceptionUtils null check, MemoryManager volatile scale, Files utility cleanup, parquet-cli dependency update.
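
To show why minimizing estimated size can pick different parameters than minimizing exception count, here is a small sketch of the cost formula quoted above. The class and method names are hypothetical. The formula as written mixes bits and bytes, so this sketch assumes everything is converted to bits, which may differ from how the PR computes it.

```java
/** Hypothetical cost model: estimated compressed size of one vector, in bits. */
final class AlpCostSketch {

    /**
     * length * bitWidth bits of packed data, plus (typeSize + 2) bytes per exception.
     * Assumption: the 2 extra bytes per exception are counted alongside the raw value,
     * and the exception term is converted to bits so both terms share a unit.
     */
    static long estimatedSizeBits(int length, int bitWidth, int exceptions, int typeSizeBytes) {
        long packedBits = (long) length * bitWidth;
        long exceptionBits = (long) exceptions * (typeSizeBytes + 2) * 8L;
        return packedBits + exceptionBits;
    }

    public static void main(String[] args) {
        // Candidate A: no exceptions but a wide bit width.
        long a = estimatedSizeBits(1024, 40, 0, 8);
        // Candidate B: a few exceptions but a much narrower bit width.
        long b = estimatedSizeBits(1024, 12, 10, 8);
        System.out.println("A=" + a + " bits, B=" + b + " bits");
    }
}
```

Under an exception-count objective candidate A wins (0 exceptions versus 10), but under the size objective candidate B is smaller, which is the kind of case the switch in objective is meant to capture.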

- Move shared LE helper methods (getShortLE/getIntLE/getLongLE) to AlpValuesReader base class; remove duplicates from subclasses
- Make EncodingParams fields package-private (remove public modifier)
- Replace fully-qualified java.util.Arrays.fill calls with imported Arrays.fill in both float and double readers; add missing import to double reader
- Add explanatory comments to getBufferedSize() magic numbers (3 for float, 5 for double) explaining the overhead breakdown
- Add ALP enabled state to ParquetProperties.toString()
- Add ALP support to DefaultV1ValuesWriterFactory for float and double columns

- Revert Files.java, IOExceptionUtils.java, MemoryManager.java, and parquet-cli/pom.xml to master state; these changes are unrelated to ALP and should be submitted in separate PRs
- Clarify ParquetMetadataConverter error message: ALP encoding is defined in the ALP paper (enum value 26) but is not yet in the parquet-format Thrift spec, so ALP cannot be written through the Hadoop write path; the error message now explains what needs to happen to remove the block

prtkgaur

The code organization looks good to me and the code follows the spec. I looked for any extra buffer allocations that might impact performance, and I think it is optimally written.

I think we should add a few benchmarks and publish numbers from them.

Thanks for working on this Vinoo!

prtkgaur

Wanted to make sure we have the following testing.

For cross-compatibility testing, are we making sure that we write both V1 and V2 pages and that the implementations in other languages are able to read them?

- Add build-time Perl script to patch generated Encoding.java with ALP(10) after Thrift codegen (process-sources phase), since parquet-format 2.12.0 does not yet include ALP in its Thrift spec
- Remove guard in ParquetMetadataConverter.getEncoding() that blocked ALP writes; Encoding.ALP now exists in the patched Thrift enum
- Add withAlpEncoding() builder methods to ParquetWriter
- Add TestInterOpReadAlp: Java V1/V2 write+read round-trip tests and C++ Arrow interop tests (reads alp_spotify1.parquet, alp_arade.parquet, etc.)
- Add AlpEncodingBenchmarks JMH benchmark

…d pyarrow interop test

- AlpValuesWriter: stop clearing cachedPresets in reset() so preset (e,f) pairs survive page flushes; eliminates redundant full parameter search on every page after the first, cutting write time ~60%
- AlpEncodingBenchmarks: clarify Javadoc that comparison is PLAIN+UNCOMPRESSED (no codec), not plain+ZSTD
- parquet-benchmarks pom: add explicit annotationProcessorPaths and proc=full for jmh-generator-annprocess so BenchmarkList is generated under Java 23+
- TestInterOpReadAlp: add pyarrow cross-language compatibility test (skips if pyarrow unavailable or does not yet support ALP encoding)

- ParquetProperties: add withAlpVectorSize(int) and withAlpVectorSize(String, int) builder methods plus getAlpVectorSize(ColumnDescriptor) accessor, defaulting to AlpConstants.DEFAULT_VECTOR_SIZE (1024).
- AlpConstants: promote validateVectorSize to public so the builder can validate eagerly across packages.
- DefaultV1/V2 ValuesWriterFactory: pass the configured vector size to the 4-arg AlpValuesWriter constructors.
- ParquetWriter.Builder: expose withAlpVectorSize facades mirroring withAlpEncoding.
- TestInterOpReadAlp: add testJavaWriteAlpCustomVectorSize covering 4500 rows at vectorSize=4096 so we cross a full vector boundary and verify round-trip equality. A wrong log_vector_size byte would surface as decode garbage, so round-trip equality is sufficient proof the configured size took effect on the wire.

Enables generating ALP test fixtures at different vector sizes (e.g. 4096) for cross-language compatibility testing against the C++/Rust/Go implementations.
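
A hedged sketch of what eager vector-size validation might look like follows. It is not the PR's AlpConstants.validateVectorSize: the class and method names are hypothetical, the power-of-two requirement is inferred from the log_vector_size header byte, the 32768 cap comes from the PR description, and any lower bound the real code enforces is unknown and therefore omitted.

```java
/** Hypothetical validation of an ALP vector size before it is written as log_vector_size. */
final class VectorSizeSketch {
    // Cap taken from the PR description: keeps num_exceptions within a uint16.
    static final int MAX_VECTOR_SIZE = 32768;

    /**
     * Returns log2(vectorSize) for a valid size.
     * Assumption: the size must be a power of two because the header stores only its logarithm.
     */
    static int logVectorSize(int vectorSize) {
        boolean powerOfTwo = vectorSize > 0 && Integer.bitCount(vectorSize) == 1;
        if (!powerOfTwo || vectorSize > MAX_VECTOR_SIZE) {
            throw new IllegalArgumentException(
                    "vector size must be a power of two no larger than " + MAX_VECTOR_SIZE + ": " + vectorSize);
        }
        return Integer.numberOfTrailingZeros(vectorSize);
    }
}
```

Validating in the builder rather than in the writer means a bad size fails at configuration time instead of partway through a write.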

Logging and debug output was missing the new alpVectorSize field alongside the existing 'ALP enabled' line. Cosmetic only — no behavior change.

Adds generateAlpFixturesAtMultipleVectorSizes to TestInterOpReadAlp. For each of the four source files in parquet-testing PR apache#100 (alp_spotify1, alp_arade, alp_float_spotify1, alp_float_arade), reads every row, then re-encodes as Java ALP at both vectorSize=1024 and vectorSize=4096. Output goes to ALP_OUTPUT_DIR (default ${user.dir}/alp-java-generated/), producing 8 files total named alp_java_<stem>_vs{1024,4096}.parquet.

Each output is verified by reading back through the standard reader path and bit-comparing every value via doubleToRawLongBits / floatToRawIntBits, which catches NaN payload and signed-zero divergence, not just numerical equality.

Skips when ALP_TEST_DATA_DIR isn't set, so it stays inert in CI on machines without the source datasets.

To run:

    git clone --branch alpFloatingPointDataset \
        https://github.com/prtkgaur/parquet-testing.git
    ALP_TEST_DATA_DIR=path/to/parquet-testing/data \
        mvn -pl parquet-hadoop \
        -Dtest=TestInterOpReadAlp#generateAlpFixturesAtMultipleVectorSizes \
        test
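
The difference between numerical equality and the bit-exact comparison described in this commit can be seen in a short sketch. The class and method names are hypothetical; only the Double and Float calls are standard Java API.

```java
/** Hypothetical helpers showing why fixtures are compared by raw bits rather than with ==. */
final class BitExactSketch {

    /** True only when both doubles have identical bit patterns (distinguishes -0.0 and NaN payloads). */
    static boolean bitEqualDouble(double a, double b) {
        return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
    }

    /** Float counterpart of bitEqualDouble. */
    static boolean bitEqualFloat(float a, float b) {
        return Float.floatToRawIntBits(a) == Float.floatToRawIntBits(b);
    }

    /** Index of the first bit-level mismatch between two arrays, or -1 when they match exactly. */
    static int firstMismatch(double[] expected, double[] actual) {
        if (expected.length != actual.length) {
            return Math.min(expected.length, actual.length);
        }
        for (int i = 0; i < expected.length; i++) {
            if (!bitEqualDouble(expected[i], actual[i])) {
                return i;
            }
        }
        return -1;
    }
}
```

A plain `==` treats 0.0 and -0.0 as equal and treats every NaN as unequal to itself, so it would both miss a lost sign bit and reject a correctly preserved NaN.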

Extends generateAlpFixturesAtMultipleVectorSizes to vary writer page version (PARQUET_1_0, PARQUET_2_0) as a third axis alongside dataset and ALP vector size. Output grows from 8 to 16 files per run:

    alp_java_<stem>_v{1,2}_vs{1024,4096}.parquet

Page version is orthogonal to ALP encoding (the page version difference lives in the parquet protocol layer, not in the ALP payload), but covering both axes makes the fixture set fully symmetric for cross-language compatibility verification. C++/Rust/Go readers can use the V1 and V2 variants to prove their decoders handle Java-written ALP regardless of how the surrounding pages are framed. Avoids an asymmetry where the existing PR apache#100 set has C++ at V1 and Java at V2 with no overlap.

All 16 outputs independently verified against the canonical _expect.csv truth files from parquet-testing PR apache#100 (1.56M values, 0 mismatches).

The reader was asserting that the ALP header's num_elements equals the data page's valuesCount, but those values differ whenever a column has nulls: num_elements is the count of non-null values that went through ALP encoding, while valuesCount is the total row count of the page (which includes null positions tracked by definition levels). The strict equality check made the reader reject every optional float/double column with at least one null value.

Relaxes the check so that the reader rejects a page only when numElements > valuesCount: the header can never legitimately claim more encoded values than the page has rows, but it can claim fewer when nulls are present. The downstream code already uses numElements (not valuesCount) to drive vector allocation and decoding, so the rest of the read path is unchanged.

This was surfaced by the corner-case fixture per parquet-testing issue apache#105, which exercises optional columns with null values.
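
The relaxed header check described in this commit can be sketched as follows. The class name, method name, and exception type are hypothetical; the extra rejection of negative counts is an addition of this sketch, not something the commit message states.

```java
/** Hypothetical version of the relaxed num_elements check. */
final class NumElementsCheckSketch {

    /**
     * Accepts numElements <= valuesCount, because nulls are counted in valuesCount
     * but never reach the ALP encoder. Rejects only impossible headers.
     */
    static void checkNumElements(int numElements, int valuesCount) {
        if (numElements < 0 || numElements > valuesCount) {
            throw new IllegalArgumentException(
                    "ALP header claims " + numElements + " values but the page has only " + valuesCount);
        }
    }

    public static void main(String[] args) {
        checkNumElements(100, 100); // required column: every row is encoded
        checkNumElements(90, 100);  // optional column with 10 nulls: previously rejected
        System.out.println("both headers accepted");
    }
}
```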

Two new tests in TestInterOpReadAlp:

readAllFixtureFilesIndependently
Opens every alp_java_*.parquet in ALP_OUTPUT_DIR and asserts each column chunk declares Encoding.ALP and decodes through the standard reader path without error. Separate from the generator's own round-trip verification so reader correctness surfaces as a distinct signal in CI when the fixtures are present. Skips cleanly when ALP_OUTPUT_DIR is empty so it stays inert in default CI environments.

generateAndVerifyCornerCaseFixture
Writes a single small fixture file (alp_java_cornercases.parquet, ~60 KB) targeting the corner cases enumerated in parquet-testing issue apache#105: vectors with no exceptions, one exception per vector, all exceptions, NaN/Inf/-0.0, constant values (bit_width=0), multi-vector with differing exponents, and optional columns with nulls. Covers both f32 and f64 variants, 14 columns × 2048 rows in total. Reads each column back and bit-exactly verifies every value against the expected pattern via doubleToRawLongBits / floatToRawIntBits.

The corner-case fixture is intended as a candidate file for parquet-testing PR apache#100 once naming/design is confirmed. Generating it also surfaced (and verified the fix for) a pre-existing reader bug where optional columns with nulls couldn't be decoded; see the preceding commit.

julienledem and others added 7 commits January 22, 2026 08:44

Fix formatting in DirectCodecFactory and ParquetMetadataConverter

bc5ebe4

Fix javadoc HTML escaping in AlpEncoderDecoder

dfdd809

Apply spotless formatting to ParquetMetadataConverter

03457c5

first pass of ALP java implementation

0c393e0

Fix uint16 overflow bug in max vector size and add independent reader/writer verification tests

6d65eaa

vinooganesh force-pushed the vinooganesh/alp-java-implementation branch from 93c365a to 6d65eaa Compare February 18, 2026 02:54

julienledem mentioned this pull request Mar 11, 2026

Add ALP (Adaptive Lossless floating-Point) encoding support #3390

Closed

vinooganesh added 3 commits March 14, 2026 13:51

vinooganesh force-pushed the vinooganesh/alp-java-implementation branch from 15bc06d to 24c23e5 Compare March 22, 2026 23:56

vinooganesh added 2 commits March 22, 2026 20:43

prtkgaur reviewed Apr 14, 2026

vinooganesh added 10 commits April 16, 2026 13:43

Merge remote-tracking branch 'origin/master' into vinooganesh/alp-jav…

01ac671

…a-implementation

Merge remote-tracking branch 'origin/master' into vinooganesh/alp-jav…

9104f3f

…a-implementation

Include alpVectorSize in ParquetProperties.toString()

e6f0916

vinooganesh wants to merge 22 commits into apache:master from vinooganesh:vinooganesh/alp-java-implementation