Add raw LZ4 block format support for cross-language compatibility#39
Add raw LZ4 block format support for cross-language compatibility#3927Bslash6 wants to merge 1 commit intokjdev:masterfrom
Conversation
Add lz4_compress_raw() and lz4_uncompress_raw() functions to enable byte-exact compatibility with Python lz4.block, Rust lz4_flex, and Go pierrec/lz4. Changes: - Add php_lz4_compress_raw() static function (bypasses 4-byte size header) - Add php_lz4_uncompress_raw() static function (requires max_size parameter) - Add ZEND_FUNCTION wrappers for both new functions - Add argument info structures for parameter validation - Register new functions in lz4_functions array API: - string lz4_compress_raw(string $data, int $level = 0) Compresses data with NO size header (raw LZ4 block) - string lz4_uncompress_raw(string $data, int $max_size) Decompresses raw LZ4 block (max_size required) Tests: - tests/raw_001.phpt - Basic roundtrip functionality - tests/raw_002.phpt - Python compatibility test vectors - tests/raw_003.phpt - Error handling validation - tests/raw_004.phpt - Compression level testing Validation: - All 4 new tests pass - All 14 existing tests pass (backward compatibility maintained) - Bidirectional cross-language compatibility verified: * Python → PHP decompression: 4/4 test vectors pass * PHP → Python decompression: 4/4 test vectors pass Resolves incompatibility with ByteStorage envelope format used by CacheKit Python/Rust implementations.
WalkthroughThe PR adds raw (headerless) LZ4 block compression and decompression capabilities to the PHP extension. Two new public functions, Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant lz4_compress_raw
participant php_lz4_compress_raw
participant LZ4 Lib
participant User2
participant lz4_uncompress_raw
participant php_lz4_uncompress_raw
participant LZ4 Lib2
User->>lz4_compress_raw: (data, level)
lz4_compress_raw->>php_lz4_compress_raw: data, level
php_lz4_compress_raw->>LZ4 Lib: LZ4_compressBound
LZ4 Lib-->>php_lz4_compress_raw: max_len
php_lz4_compress_raw->>php_lz4_compress_raw: allocate buffer
alt level <= 0
php_lz4_compress_raw->>LZ4 Lib: LZ4_compress_default
else
php_lz4_compress_raw->>LZ4 Lib: LZ4_compress_HC
end
LZ4 Lib-->>php_lz4_compress_raw: compressed_len or error
php_lz4_compress_raw-->>lz4_compress_raw: raw compressed data
lz4_compress_raw-->>User: PHP string (compressed)
User2->>lz4_uncompress_raw: (data, max_size)
lz4_uncompress_raw->>php_lz4_uncompress_raw: data, max_size
php_lz4_uncompress_raw->>php_lz4_uncompress_raw: validate max_size > 0
php_lz4_uncompress_raw->>php_lz4_uncompress_raw: allocate buffer (max_size + 1)
php_lz4_uncompress_raw->>LZ4 Lib2: LZ4_decompress_safe
LZ4 Lib2-->>php_lz4_uncompress_raw: decompressed_len or error
php_lz4_uncompress_raw-->>lz4_uncompress_raw: raw decompressed data
lz4_uncompress_raw-->>User2: PHP string (decompressed)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lz4.c (1)
728-767: Consider validating max_size range before castingThe wrapper correctly implements the raw decompression API following existing patterns. However, line 754 casts
zend_longtoconst intwithout validating thatmax_sizefits withinINT_MAX. On 64-bit systems, if a user providesmax_size > INT_MAX, silent truncation could occur.Note: This is consistent with the existing
lz4_uncompressfunction (line 609), so it's a pre-existing pattern. Consider adding validation for both functions:+ if (max_size > INT_MAX) { + zend_error(E_WARNING, + "lz4_uncompress_raw : max_size exceeds maximum allowed value"); + RETURN_FALSE; + } + /* Call internal decompression function */ if (php_lz4_uncompress_raw(Z_STRVAL_P(data), Z_STRLEN_P(data), (const int)max_size,
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
lz4.c(5 hunks)tests/raw_001.phpt(1 hunks)tests/raw_002.phpt(1 hunks)tests/raw_003.phpt(1 hunks)tests/raw_004.phpt(1 hunks)
🔇 Additional comments (8)
lz4.c (4)
77-78: LGTM: Function declarations and registrationThe new raw API functions are properly declared and registered following the extension's existing patterns. The argument info correctly specifies required/optional parameters for each function.
Also applies to: 103-111, 123-124
302-349: LGTM: Raw compression implementationThe implementation correctly produces headerless LZ4 blocks by directly writing compressed data without the 4-byte size prefix. The level validation, error handling, and memory management are all consistent with the existing
php_lz4_compressfunction.
351-387: LGTM: Raw decompression implementation with one noteThe implementation correctly handles raw LZ4 blocks requiring an explicit
max_sizeparameter. Error handling and memory management follow existing patterns.The
max_size + 1allocation (line 367) matches the pattern inphp_lz4_uncompressand provides space for null termination.
691-726: LGTM: Public compress wrapperThe PHP-visible wrapper correctly parses parameters, validates input types, and manages memory. The implementation follows the existing extension patterns with proper version compatibility handling for PHP 5 vs 7+.
tests/raw_001.phpt (1)
1-26: LGTM: Comprehensive basic functionality testThe test correctly verifies:
- Basic compression/decompression roundtrip
- Exact byte-level output (hex validation) for cross-language compatibility
- Proper handling of the original data length requirement in
lz4_uncompress_rawtests/raw_002.phpt (1)
1-33: LGTM: Excellent cross-language compatibility validationThe test vectors validate byte-exact output against Python's
lz4.blockimplementation, which is essential for the stated goal of cross-language interoperability. The test includes varied data patterns (literals and repeated sequences) to exercise different compression paths.tests/raw_003.phpt (1)
1-43: LGTM: Thorough error handling coverageThe test comprehensively validates error handling for:
- Invalid
max_sizevalues (zero and negative)- Corrupted compressed data
- Incorrect
max_sizeparameter (too small for decompressed data)All error paths in
php_lz4_uncompress_raware exercised, ensuring robust validation.tests/raw_004.phpt (1)
1-46: LGTM: Complete compression level validationThe test validates:
- Default fast compression (level 0)
- High compression mode (levels 1 and 9)
- Roundtrip integrity across all compression levels
- Invalid level rejection with appropriate error message
The use of
--EXPECTF--with%d bytespatterns allows for variable compression ratios while still validating that compression occurred.
Add LZ4 Block Format Specification Compliance
Problem
The current
lz4_compress()andlz4_uncompress()functions violate the official LZ4 Block Format specification by embedding a 4-byte size header in the compressed output.From the LZ4 Block Format spec (Section: Metadata):
This deviation breaks interoperability with all standard LZ4 Block implementations:
lz4.block(https://pypi.org/project/lz4/)lz4_flex(https://crates.io/crates/lz4_flex)pierrec/lz4(https://github.com/pierrec/lz4)LZ4_compress_default()(raw output)Solution
Add two new functions that produce specification-compliant LZ4 blocks:
Technical Details
Current behavior (spec violation):
New raw functions (spec-compliant):
The size header approach was chosen for API convenience but creates a proprietary format incompatible with the LZ4 ecosystem.
Cross-Language Compatibility
Test Results:
Example (PHP ↔ Python):
Use Case: Multi-Language Cache Systems
Enables ByteStorage pattern for shared caches:
Compatible with Python, Rust, Go, Node.js - any language with standard LZ4 block support.
Testing
New tests: 4 PHPT tests (100% pass)
raw_001.phpt- Basic roundtripraw_002.phpt- Python compatibility (test vectors)raw_003.phpt- Error handlingraw_004.phpt- Compression levelsRegression tests: All 14 existing tests pass (100%)
Cross-language validation: 8/8 test vectors pass
Backward Compatibility
Zero breaking changes
Performance
Same as existing functions (both call
LZ4_compress_default()directly), but:Why Not Use Frame Format?
The LZ4 Frame Format (magic
0x184D2204) is designed for self-contained files/streams with embedded metadata. It adds:For cache systems and databases, Block format is preferred:
References
Implementation
lz4.c(179 lines added)This PR restores LZ4 specification compliance while maintaining full backward compatibility.
Summary by CodeRabbit
Release Notes