Skip to content

Improve GcsPinotFS null safety and contract compliance #17714

@jineshparakh

Description

@jineshparakh

While working on #17713, I identified several additional gaps in GcsPinotFS that should be tracked and fixed. Grouping them here for visibility.

1. Missing null checks on getBlob() callers

GcsPinotFS.getBlob() wraps Storage.get(BlobId) which returns null when the blob does not exist. Several callers do not handle this, leading to potential NPEs

Method Unguarded call Impact
lastModified() getBlob(new GcsUri(uri)).getUpdateTime() NPE if blob doesn't exist
touch() blob.getUpdateTime(), blob.toBuilder() NPE if blob doesn't exist
touch() getBlob(gcsUri).getUpdateTime() NPE if blob deleted between update and re-fetch
open() blob.reader() NPE if blob doesn't exist
copyFile() blob.copyTo(...) NPE on source blob; also leaves an empty destination blob (created on line 482) that needs cleanup

For comparison, the S3 SDK throws NoSuchKeyException for missing objects, so S3PinotFS gets meaningful errors without explicit null checks. The GCS SDK returns null silently, so GcsPinotFS needs to handle it explicitly.
Methods that already handle null correctly:

2. PinotFS contract violations

touch() does not create files

The PinotFS interface specifies:

Updates the last modified time of an existing file or directory to be current time. If the file system object does not exist, creates an empty file.

LocalPinotFS and S3PinotFS both correctly implement this. S3PinotFS catches NoSuchKeyException and creates an empty object. LocalPinotFS checks !file.exists() and calls file.createNewFile().
GcsPinotFS.touch() does not handle missing files at all. It would NPE instead of creating the file.

lastModified() return value for missing files

The PinotFS interface specifies:

Returns the time the file was last modified or 0L if the file does not exist or if an I/O error occurs

LocalPinotFS.lastModified( returns file.lastModified() which returns 0L for non-existent files. GcsPinotFS would NPE instead.

3. Deprecated getUpdateTime() usage

BlobInfo.getUpdateTime() is deprecated in the GCS SDK in favor of getUpdateTimeOffsetDateTime(). It is used in four places:
lastModified()
touch() (twice)
listFilesWithMetadata()

4. Minor: redundant getBlob call in delete()

The private delete() method calls exists(segmentUri), which internally calls getBlob() for file URIs. Then it calls getBlob(segmentUri) again. This results in two GCS round-trips for a single file deletion. Not a bug but an efficiency concern for high-throughput deletion.

We can have separate PRs for every sub-issue.

Metadata

Metadata

Assignees

Labels

pinot-filesystemRelated to PinotFS abstraction (S3, GCS, ADLS, HDFS)

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions