Skip to content

[GH-2662] Implement a pure Java single-thread COG writer#2663

Draft
jiayuasu wants to merge 3 commits intomasterfrom
feature/cog-writer
Draft

[GH-2662] Implement a pure Java single-thread COG writer#2663
jiayuasu wants to merge 3 commits intomasterfrom
feature/cog-writer

Conversation

@jiayuasu
Copy link
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

This PR adds a pure Java, single-threaded Cloud Optimized GeoTIFF (COG) writer to Apache Sedona. The implementation generates COG files by:

  1. Computing overview decimation factors (powers of 2), ported from GeoTrellis
  2. Generating downsampled overview images via GeoTools resample
  3. Writing each level as a tiled GeoTIFF via GeoTools GeoTiffWriter
  4. Parsing the TIFF IFD structure of each level (TiffIfdParser)
  5. Reassembling all levels into COG byte order with IFDs at file start and image data in reverse size order (CogAssembler)

New files

  • CogOptions.java - Immutable builder-pattern options class with 5 configurable parameters:
    • compression (default: Deflate) - compression algorithm
    • compressionQuality (default: 0.2) - quality from 0.0 to 1.0
    • tileSize (default: 256) - tile dimensions, must be power of 2
    • resampling (default: Nearest) - overview resampling: Nearest, Bilinear, or Bicubic
    • overviewCount (default: -1/auto) - number of overview levels, 0 for none
  • CogWriter.java - Main orchestrator with write(GridCoverage2D, CogOptions) primary method
  • CogAssembler.java - Reassembles parsed TIFFs into COG-compliant byte layout, injects NewSubfileType tag for overviews
  • TiffIfdParser.java - Parses TIFF byte arrays to extract IFD entries, image data, and overflow areas

Modified files

  • RasterOutputs.java - New public API methods:
    • asCloudOptimizedGeoTiff(GridCoverage2D raster)
    • asCloudOptimizedGeoTiff(GridCoverage2D raster, String compressionType, double compressionQuality)
    • asCloudOptimizedGeoTiff(GridCoverage2D raster, CogOptions options)

How was this patch tested?

  • 25 unit tests in CogWriterTest.java covering:
    • Overview decimation computation
    • Overview generation with all resampling modes (Nearest, Bilinear, Bicubic)
    • Small raster (no overviews), medium raster (with overviews), multiband raster
    • LZW compression, existing GeoTIFF conversion
    • TIFF IFD parsing, NewSubfileType tag presence in overview IFDs
    • Input validation (compressionQuality, tileSize, malformed TIFF)
    • COG tile offsets are forward-pointing
    • CogOptions builder defaults, validation, and resampling normalization
    • overviewCount=0 (no overviews), overviewCount=1 (specific count)
    • tileSize=512, RasterOutputs CogOptions API path
  • External validation: generated COGs confirmed valid by rio cogeo validate and gdalinfo

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation. The Spark SQL API surface will be added in a follow-up PR.

- TiffIfdParser: extracts IFD structure from TIFF byte arrays
- CogAssembler: reassembles parsed IFDs into COG byte order (ported from GeoTrellis)
- CogWriter: orchestrates decimation, overview generation, tiled writing, and COG assembly
- RasterOutputs: adds asCloudOptimizedGeoTiff() public API
- CogWriterTest: 8 unit tests covering decimation, overview, round-trip, multiband, compression
- Remove unused ReferencedEnvelope variable (Finding 5)
- Remove unused isOverview parameter from writeAsTiledGeoTiff (Finding 2)
- Add input validation: compressionQuality range, tileSize positive & power-of-2 (Finding 3)
- Add parser bounds checks: IFD offset, entries length, overflow/image data ranges (Finding 4)
- Inject NewSubfileType tag (254=1) into overview IFDs when missing (Finding 1)
- Add 4 new tests: NewSubfileType presence, invalid inputs, malformed TIFF rejection,
  forward-pointing tile offsets
- CogOptions: immutable builder with compression, compressionQuality,
  tileSize, resampling (Nearest/Bilinear/Bicubic), and overviewCount
- CogWriter: new write(raster, CogOptions) primary method; existing
  overloads delegate to it; generateOverview accepts Interpolation param
- RasterOutputs: new asCloudOptimizedGeoTiff(raster, CogOptions) overload
- 13 new tests (25 total): builder validation, resampling modes,
  overviewCount=0/1, tileSize=512, RasterOutputs CogOptions path
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a pure-Java Cloud Optimized GeoTIFF (COG) writer implementation to Sedona’s common raster module, exposing new RasterOutputs entrypoints and introducing a TIFF-IFD parser + COG byte-layout assembler.

Changes:

  • Introduces CogWriter, CogAssembler, TiffIfdParser, and CogOptions to generate COGs from GridCoverage2D.
  • Adds RasterOutputs.asCloudOptimizedGeoTiff(...) public APIs that delegate to the new writer.
  • Adds CogWriterTest coverage for overview generation, assembly structure, and option validation.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
common/src/main/java/org/apache/sedona/common/raster/cog/CogWriter.java Implements overview computation/resampling and writes tiled GeoTIFF levels for later COG assembly.
common/src/main/java/org/apache/sedona/common/raster/cog/CogAssembler.java Rebuilds multi-level TIFFs into COG-compliant byte layout and patches IFD offsets/tags.
common/src/main/java/org/apache/sedona/common/raster/cog/TiffIfdParser.java Parses TIFF header/IFD/overflow/image segment regions needed for assembly.
common/src/main/java/org/apache/sedona/common/raster/cog/CogOptions.java Adds an immutable options/builder API for controlling compression/tiling/resampling/overviews.
common/src/main/java/org/apache/sedona/common/raster/RasterOutputs.java Exposes new public asCloudOptimizedGeoTiff overloads to produce COG bytes.
common/src/test/java/org/apache/sedona/common/raster/cog/CogWriterTest.java Adds unit tests validating decimations, overview creation, COG layout properties, and options validation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +418 to +421
// Verify all IFDs are before image data (forward-pointing)
// The last IFD end should be well before the end of the file
assertTrue("IFD region should be at start of file", lastIfdEnd < cogBytes.length / 2);
}
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test’s “IFD region should be at start of file” assertion uses lastIfdEnd < cogBytes.length / 2, which can be flaky for highly compressible inputs (image data can be smaller than metadata/IFDs). Consider computing the minimum TileOffsets/StripOffsets across all IFDs and asserting lastIfdEnd <= minOffset instead of comparing to half the file length.

Copilot uses AI. Check for mistakes.
Comment on lines +234 to +240
GridGeometry2D gridGeometry =
new GridGeometry2D(
new GridEnvelope2D(0, 0, newWidth, newHeight),
PixelInCell.CELL_CORNER,
newTransform,
crs,
null);
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generateOverview hard-codes PixelInCell.CELL_CORNER when constructing the target GridGeometry2D. Sedona can create rasters with PixelInCell.CELL_CENTER (e.g., NetCDF reader / some constructors), so this can shift the georeferencing of overviews by half a pixel. Preserve the input raster’s pixel anchor when building the overview grid geometry (or derive the target transform via the existing GridGeometry APIs rather than forcing CELL_CORNER).

Copilot uses AI. Check for mistakes.
Comment on lines +285 to +312
ByteArrayOutputStream out = new ByteArrayOutputStream();
GridCoverageWriter writer = new GeoTiffWriter(out);

ParameterValueGroup defaultParams = writer.getFormat().getWriteParameters();
GeoTiffWriteParams params = new GeoTiffWriteParams();

// Set tiling — must use the 2-arg overload from GeoToolsWriteParams
// which delegates to the inner write param. The 4-arg ImageWriteParam.setTiling()
// writes to the wrong fields (parent vs inner param).
params.setTilingMode(ImageWriteParam.MODE_EXPLICIT);
params.setTiling(tileSize, tileSize);

// Set compression
params.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
params.setCompressionType(compressionType);
params.setCompressionQuality((float) compressionQuality);

defaultParams
.parameter(AbstractGridFormat.GEOTOOLS_WRITE_PARAMS.getName().toString())
.setValue(params);

GeneralParameterValue[] wps = defaultParams.values().toArray(new GeneralParameterValue[0]);

writer.write(raster, wps);
writer.dispose();
out.close();

return out.toByteArray();
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeAsTiledGeoTiff doesn’t guarantee writer.dispose() runs if writer.write(...) throws, which can leak native resources/file handles in GeoTools. Wrap the writer lifecycle in try/finally (or try-with-resources if applicable) so dispose() is always called on failure paths.

Suggested change
ByteArrayOutputStream out = new ByteArrayOutputStream();
GridCoverageWriter writer = new GeoTiffWriter(out);
ParameterValueGroup defaultParams = writer.getFormat().getWriteParameters();
GeoTiffWriteParams params = new GeoTiffWriteParams();
// Set tiling — must use the 2-arg overload from GeoToolsWriteParams
// which delegates to the inner write param. The 4-arg ImageWriteParam.setTiling()
// writes to the wrong fields (parent vs inner param).
params.setTilingMode(ImageWriteParam.MODE_EXPLICIT);
params.setTiling(tileSize, tileSize);
// Set compression
params.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
params.setCompressionType(compressionType);
params.setCompressionQuality((float) compressionQuality);
defaultParams
.parameter(AbstractGridFormat.GEOTOOLS_WRITE_PARAMS.getName().toString())
.setValue(params);
GeneralParameterValue[] wps = defaultParams.values().toArray(new GeneralParameterValue[0]);
writer.write(raster, wps);
writer.dispose();
out.close();
return out.toByteArray();
byte[] result;
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
GridCoverageWriter writer = null;
try {
writer = new GeoTiffWriter(out);
ParameterValueGroup defaultParams = writer.getFormat().getWriteParameters();
GeoTiffWriteParams params = new GeoTiffWriteParams();
// Set tiling — must use the 2-arg overload from GeoToolsWriteParams
// which delegates to the inner write param. The 4-arg ImageWriteParam.setTiling()
// writes to the wrong fields (parent vs inner param).
params.setTilingMode(ImageWriteParam.MODE_EXPLICIT);
params.setTiling(tileSize, tileSize);
// Set compression
params.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
params.setCompressionType(compressionType);
params.setCompressionQuality((float) compressionQuality);
defaultParams
.parameter(AbstractGridFormat.GEOTOOLS_WRITE_PARAMS.getName().toString())
.setValue(params);
GeneralParameterValue[] wps =
defaultParams.values().toArray(new GeneralParameterValue[0]);
writer.write(raster, wps);
} finally {
if (writer != null) {
writer.dispose();
}
}
result = out.toByteArray();
}
return result;

Copilot uses AI. Check for mistakes.
Comment on lines +213 to +242
int fieldType = buf.getShort(entryOffset + 2) & 0xFFFF;
int count = buf.getInt(entryOffset + 4);
int valueSize = count * getFieldTypeSize(fieldType);

if (tag == TAG_TILE_OFFSETS || tag == TAG_STRIP_OFFSETS) {
offsetsTag = tag;
segmentCount = count;
} else if (tag == TAG_TILE_BYTE_COUNTS || tag == TAG_STRIP_BYTE_COUNTS) {
byteCountsTag = tag;
} else if (tag == TAG_NEW_SUBFILE_TYPE) {
hasNewSubfileType = true;
}

// Track overflow data region (values > 4 bytes stored outside IFD entries)
if (valueSize > 4) {
int valOffset = buf.getInt(entryOffset + 8);
if (valOffset < 0 || valOffset + valueSize > tiffBytes.length) {
throw new IllegalArgumentException(
"Overflow data for tag "
+ tag
+ " out of range: offset="
+ valOffset
+ " size="
+ valueSize
+ " fileSize="
+ tiffBytes.length);
}
overflowStart = Math.min(overflowStart, valOffset);
overflowEnd = Math.max(overflowEnd, valOffset + valueSize);
}
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TiffIfdParser.parse treats the TIFF count field as a signed int and computes valueSize as count * typeSize in an int. For large/unsigned counts this can overflow to a negative valueSize, which can bypass the range checks and lead to unexpected IndexOutOfBoundsException/DoS instead of a controlled IllegalArgumentException. Use unsigned/long arithmetic for count/valueSize and explicitly reject counts that are negative or would exceed the input length.

Copilot uses AI. Check for mistakes.
Comment on lines +250 to +272
// Read segment offsets and byte counts
int[] segmentOffsets = readIntArray(buf, tiffBytes, entriesStart, tagCount, offsetsTag);
int[] segmentByteCounts = readIntArray(buf, tiffBytes, entriesStart, tagCount, byteCountsTag);

// Extract overflow data
byte[] overflowData;
int overflowDataStart;
if (overflowStart < Integer.MAX_VALUE) {
overflowDataStart = overflowStart;
overflowData = new byte[overflowEnd - overflowStart];
System.arraycopy(tiffBytes, overflowStart, overflowData, 0, overflowData.length);
} else {
overflowDataStart = 0;
overflowData = new byte[0];
}

// Find image data bounds
int imageDataStart = Integer.MAX_VALUE;
int imageDataEnd = 0;
for (int i = 0; i < segmentCount; i++) {
imageDataStart = Math.min(imageDataStart, segmentOffsets[i]);
imageDataEnd = Math.max(imageDataEnd, segmentOffsets[i] + segmentByteCounts[i]);
}
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading segmentOffsets and segmentByteCounts, the code assumes both arrays have the same length (segmentCount) and later indexes them together. If a malformed TIFF provides mismatched counts for offsets vs byte-counts, this can throw ArrayIndexOutOfBoundsException instead of a clear validation error. Add an explicit length/count equality check and throw IllegalArgumentException when they differ.

Copilot uses AI. Check for mistakes.
Comment on lines +209 to +229
// Handle TileOffsets/StripOffsets — rewrite to point to new image data location
if (tag == TiffIfdParser.TAG_TILE_OFFSETS || tag == TiffIfdParser.TAG_STRIP_OFFSETS) {
if (count == 1 && valueSize <= 4) {
// Single segment: offset stored inline
buf.putInt(offset + 8, newImageDataStart + pt.segmentOffsets[0]);
} else {
// Multiple segments: the entry points to an overflow array.
// We need to rewrite the overflow array with new absolute offsets.
// First, rebase the pointer to the overflow data.
int origPointer = buf.getInt(offset + 8);
int newPointer = origPointer + overflowDelta;
buf.putInt(offset + 8, newPointer);

// Now patch the overflow data array with new image data offsets
int overflowArrayOffset = origPointer - pt.overflowDataStart;
ByteBuffer overflowBuf = ByteBuffer.wrap(pt.overflowData).order(byteOrder);
for (int j = 0; j < count; j++) {
int newSegmentOffset = newImageDataStart + pt.segmentOffsets[j];
overflowBuf.putInt(overflowArrayOffset + j * 4, newSegmentOffset);
}
}
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

patchIfdEntries mutates pt.overflowData in place when rewriting TileOffsets/StripOffsets arrays. Since assemble(...) takes ParsedTiff objects from the caller, this creates surprising side effects (the parsed structures are no longer a faithful representation of the original TIFF after assembly). Prefer patching a copy of the overflow buffer (and writing the patched copy) so assemble doesn’t mutate its inputs.

Copilot uses AI. Check for mistakes.
Comment on lines +92 to +140
/**
* Creates a Cloud Optimized GeoTIFF (COG) byte array from the given raster. The COG format
* arranges tiles and overviews in an order optimized for HTTP range-request based access,
* enabling efficient partial reads from cloud storage.
*
* @param raster The input raster
* @param compressionType Compression type: "Deflate", "LZW", "JPEG", "PackBits", or null for
* default (Deflate)
* @param compressionQuality Quality 0.0 (max compression) to 1.0 (no compression)
* @return COG file as byte array
*/
public static byte[] asCloudOptimizedGeoTiff(
GridCoverage2D raster, String compressionType, double compressionQuality) {
try {
return CogWriter.write(raster, compressionType, compressionQuality);
} catch (IOException e) {
throw new RuntimeException("Failed to write Cloud Optimized GeoTIFF", e);
}
}

/**
* Creates a Cloud Optimized GeoTIFF (COG) byte array with default settings (Deflate compression,
* 256x256 tiles).
*
* @param raster The input raster
* @return COG file as byte array
*/
public static byte[] asCloudOptimizedGeoTiff(GridCoverage2D raster) {
try {
return CogWriter.write(raster);
} catch (IOException e) {
throw new RuntimeException("Failed to write Cloud Optimized GeoTIFF", e);
}
}

/**
* Creates a Cloud Optimized GeoTIFF (COG) byte array with the given options.
*
* @param raster The input raster
* @param options COG generation options (compression, tileSize, resampling, overviewCount)
* @return COG file as byte array
*/
public static byte[] asCloudOptimizedGeoTiff(GridCoverage2D raster, CogOptions options) {
try {
return CogWriter.write(raster, options);
} catch (IOException e) {
throw new RuntimeException("Failed to write Cloud Optimized GeoTIFF", e);
}
}
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says it “does not affect any public API”, but this change adds new public methods to RasterOutputs (asCloudOptimizedGeoTiff overloads). Either update the PR description to reflect the public API addition, or (if the intent is to keep it internal) reduce visibility / place it behind an internal API.

Copilot uses AI. Check for mistakes.
Comment on lines +148 to +152
// The next IFD should be before any image data (COG requirement)
assertTrue(
"Overview IFD should immediately follow first IFD region",
nextIfdOffset < cogBytes.length / 2);

Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion uses nextIfdOffset < cogBytes.length / 2 as a proxy for “IFDs are before image data”. That threshold is arbitrary and can be flaky if compression makes the image data very small (valid COGs can have IFD/metadata > 50% of the file). Prefer asserting nextIfdOffset (or the end of the IFD region) is strictly less than the minimum TileOffsets/StripOffsets value.

Copilot uses AI. Check for mistakes.
Comment on lines +201 to +225
public CogOptions build() {
if (compression == null || compression.isEmpty()) {
throw new IllegalArgumentException("compression must not be null or empty");
}
if (compressionQuality < 0 || compressionQuality > 1.0) {
throw new IllegalArgumentException(
"compressionQuality must be between 0.0 and 1.0, got: " + compressionQuality);
}
if (tileSize <= 0) {
throw new IllegalArgumentException("tileSize must be positive, got: " + tileSize);
}
if ((tileSize & (tileSize - 1)) != 0) {
throw new IllegalArgumentException("tileSize must be a power of 2, got: " + tileSize);
}
if (overviewCount < -1) {
throw new IllegalArgumentException(
"overviewCount must be -1 (auto), 0 (none), or positive, got: " + overviewCount);
}

// Normalize resampling to title-case for matching
String normalized = normalizeResampling(resampling);
if (!VALID_RESAMPLING.contains(normalized)) {
throw new IllegalArgumentException(
"resampling must be one of " + VALID_RESAMPLING + ", got: '" + resampling + "'");
}
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CogOptions.Builder validates resampling/tileSize/quality but does not validate compression against the compression types actually supported by the GeoTools GeoTIFF writer. As-is, invalid strings will pass build() and fail later at write time. Consider normalizing/validating compression (and ideally allowing the same set mentioned in RasterOutputs.asGeoTiff, e.g., "None", "Huffman", etc.) so errors surface early with a clear message.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement a pure Java single thread COG writer

1 participant

Comments