Skip to content

[CELEBORN-2317] Validate applicationId to prevent worker path traversal#3674

Open
afterincomparableyum wants to merge 2 commits into
apache:mainfrom
afterincomparableyum:celeborn-2317
Open

[CELEBORN-2317] Validate applicationId to prevent worker path traversal#3674
afterincomparableyum wants to merge 2 commits into
apache:mainfrom
afterincomparableyum:celeborn-2317

Conversation

@afterincomparableyum
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This change:

  • Adds Utils.validateAppId enforcing ^[A-Za-z0-9_-]+$, which matches Spark (application_<ts>_<n>, local-<ts>), Flink, and MR formats.
  • Calls it at every worker RPC entry point that takes an applicationId or shuffleKey from the wire: Controller (ReserveSlots, CommitFiles, DestroyWorkerSlots), PushDataHandler.handleCore, and the two checkAuth sites in FetchHandler.
  • Adds a canonical path containment check in StorageManager.createDiskFile (local disk branch) as defense in depth, before mkdirs() runs.

Why are the changes needed?

The worker builds local shuffle paths by concatenating applicationId received over RPC: <workingDir>/<appId>/<shuffleId>/<fileName>. The int32 and fileName is built from int id/epoch/mode, but it was never validated against a charset. With auth disabled, any client on the network could supply appId = "../foo" and have the worker mkdir, create, or delete files outside its working dir. With auth enabled, the SASL clientId == applicationId equality check in RpcEndpoint did not constrain format, so a tenant whose registered id contained .. could still escape.

Does this PR resolve a correctness bug?

Yes

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests/CI

@afterincomparableyum
Copy link
Copy Markdown
Contributor Author

I'll fix CI issues and push

@afterincomparableyum afterincomparableyum force-pushed the celeborn-2317 branch 2 times, most recently from 1a54265 to 6e65966 Compare April 30, 2026 06:02
@RexXiong
Copy link
Copy Markdown
Contributor

RexXiong commented May 14, 2026

Thanks for the fix! The path traversal issue is real and the approach is solid.

One suggestion: I think the validateAppId check is only needed in Controller (ReserveSlots, CommitFiles, DestroyWorkerSlots). The reason is that PushData and Fetch operations require the shuffle to already be registered via ReserveSlots — if the appId was never accepted by the Controller, subsequent push/fetch calls will fail anyway because the shuffle metadata doesn't exist.

So the checks added in PushDataHandler.handleCore and FetchHandler are effectively redundant — a malicious appId can never reach the filesystem layer through those paths without first passing through the Controller.

That said, the canonical path containment check in StorageManager is worth keeping as defense-in-depth — it's cheap and guards against potential future RPC entry points that might bypass the Controller.


Reviewed with Claude Code

@afterincomparableyum
Copy link
Copy Markdown
Contributor Author

ping @RexXiong I have addressed your comment.

The worker builds local shuffle paths by concatenating applicationId received over RPC: `<workingDir>/<appId>/<shuffleId>/<fileName>`. The int32 and fileName is built from int id/epoch/mode, but it was never validated against a charset. With auth disabled, any client on the network could supply `appId = "../foo"` and have the worker mkdir, create, or delete files outside its working dir. With auth enabled, the SASL clientId == applicationId equality check in RpcEndpoint did not constrain format, so a tenant whose registered id contained `..` could still escape.

This change:
  - Adds Utils.validateAppId enforcing `^[A-Za-z0-9_-]+$`, which matches
    Spark (`application_<ts>_<n>`, `local-<ts>`), Flink, and MR formats.
  - Calls it at every worker RPC entry point that takes an applicationId
    or shuffleKey from the wire: Controller (ReserveSlots, CommitFiles,
    DestroyWorkerSlots), PushDataHandler.handleCore, and the two
    checkAuth sites in FetchHandler.
  - Adds a canonical path containment check in
    StorageManager.createDiskFile (local disk branch) as defense in
    depth, before mkdirs() runs.
@afterincomparableyum
Copy link
Copy Markdown
Contributor Author

ping @SteNicholas @RexXiong could you please review this when you get the chance.

@SteNicholas SteNicholas force-pushed the main branch 2 times, most recently from 1d92a40 to cf8d472 Compare May 27, 2026 02:11
@SteNicholas
Copy link
Copy Markdown
Member

Review — Validate applicationId to prevent worker path traversal

Good catch on the underlying issue, and the fix targets the right sink. The worker builds <workingDir>/<appId>/<shuffleId>/<fileName> in StorageManager.createDiskFile, which is reached at ReserveSlots time — and that path is now validated. A few things to address before merge.

What works well

  • The canonical-path backstop is done correctly: dir.getCanonicalPath + File.separator with startsWith avoids the sibling-prefix false-negative (/work vs /work-evil), and moving shuffleDir.mkdirs() to after the check means a traversal directory is never created.
  • Allowlist charset is the right approach — rejecting anything outside [A-Za-z0-9_-] excludes ., /, \, and spaces, so .. and path separators can't appear at all.
  • DestroyWorkerSlots correctly splits the shuffleKey first, then validates the extracted appId.
  • Solid positive/negative test coverage.

Issues

  1. Description doesn't match the diff (please reconcile). The PR text says validateAppId is called "at PushDataHandler.handleCore, and the two checkAuth sites in FetchHandler," but the diff changes only 4 files — Utils.scala, UtilsSuite.scala, Controller.scala, StorageManager.scala. There are no changes to PushDataHandler or FetchHandler. Either those calls were dropped from the commit, or the description is stale.

    For what it's worth, the impact is limited: FetchHandler/PushDataHandler resolve files via StorageManager.getFileInfo(shuffleKey, fileName), which is a ConcurrentHashMap lookup — it does not reconstruct a filesystem path from appId. So those handlers aren't independent traversal sinks, and the missing calls are defense-in-depth/consistency rather than a hole in the core fix. But the description should reflect what actually ships.

  2. The regex accepts a trailing newline (low severity, but a validator should be airtight). ^[A-Za-z0-9_-]+$ evaluated with findFirstIn matches "validapp\n", because $ (without MULTILINE) matches before a final line terminator, not strictly end-of-input. I verified the anchor semantics — Java's java.util.regex behaves the same:

    "validapp\n"  -> matches (not rejected)
    

    A lone trailing \n in a directory name isn't itself a traversal sequence, so real-world impact is small, but the clean fix is a full-region match:

    if (applicationId == null || !appIdPattern.pattern.matcher(applicationId).matches()) { ... }

    (matches() anchors the whole input; equivalently use \A...\z.) Worth adding "app\n" to the negative test cases.

  3. Backstop is local-disk-only (defense-in-depth gap). The canonical-path containment check is in the local-disk branch of createDiskFile, but the HDFS/S3/OSS branches also build s"$appId/$shuffleId" paths (e.g. StorageManager.scala:1166, 1171) and rely solely on the upstream validateAppId. Given the Controller guard is the primary defense this is acceptable, but if the intent is layered defense, the DFS branches currently have none — at least worth a comment noting the backstop's scope.

  4. Consider centralizing the check (maintainability). Validation is currently sprinkled per-RPC-case, which is exactly why items 1 & 3 are easy to get wrong — a future handler can forget the call. Since every guarded site calls checkAuth(context, applicationId) right after, calling validateAppId inside checkAuth (so it runs even when auth is disabled) would cover all current and future entry points from one place.

Nice, well-scoped security fix overall — the main blocker is item 1 (make the description and code agree), and item 2 is a quick hardening of the validator itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants