Skip to content

Skip directory entries in RecursiveCopyableDataset to fix IOException on empty source dirs#4181

Open
agam-99 wants to merge 2 commits intoapache:masterfrom
agam-99:ETL-19035
Open

Skip directory entries in RecursiveCopyableDataset to fix IOException on empty source dirs#4181
agam-99 wants to merge 2 commits intoapache:masterfrom
agam-99:ETL-19035

Conversation

@agam-99
Copy link
Copy Markdown
Contributor

@agam-99 agam-99 commented Mar 30, 2026

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

RecursiveCopyableDataset.getCopyableFilesImpl iterates over filesInSource entries and calls
CopyableFile.resolveReplicatedOwnerAndPermissionsRecursively(fs, file.getPath().getParent(), replacedPrefix, ...)
for each. This method walks from file.getPath().getParent() up to replacedPrefix collecting
directory permissions to replicate at the destination. It requires replacedPrefix to be an
ancestor of file.getPath().getParent().

When source.path points to an empty directory, FileListUtils.listFilesToCopyAtPath
(with includeEmptyDirectories=true) returns the directory itself as the sole FileStatus entry.
For this entry file.getPath() == replacedPrefix, so file.getPath().getParent() is one level
above replacedPrefix. This inverts the ancestry check and throws:

IOException: toPath /jobs/.../edp_lts_opsreview must be an ancestor of fromPath hdfs://cluster/.../u_ltscso.db

Fix: Guard the fromPath argument passed to resolveReplicatedOwnerAndPermissionsRecursively
with an isAncestor check:

Path parentPath = file.getPath().getParent();
Path ancestorFromPath = PathUtils.isAncestor(replacedPrefix, parentPath) ? parentPath : file.getPath();

When parentPath is within replacedPrefix (normal files, nested empty subdirs) it is used
unchanged — no behavior change. When parentPath is above replacedPrefix (empty root dir),
file.getPath() is used instead, which equals replacedPrefix, causing the walk to produce an
empty ancestors list. The empty directory is still replicated at the destination.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Added testEmptySourceDirectoryProducesCopyEntityForDirectory to RecursiveCopyableDatasetTest:

  • Simulates the exact condition that triggered the bug: filesInSource contains only the
    empty-directory FileStatus entry (as returned by FileListUtils when includeEmptyDirectories=true)
  • Verifies that getCopyableFiles produces one copy entity mapping the empty directory to the
    target path, rather than throwing IOException

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits
    if they address the same issue. In addition, my commits follow the guidelines from
    "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

agam-99 and others added 2 commits March 30, 2026 12:07
…avoid IOException on empty source dirs

When source.path is an empty directory, FileListUtils.listFilesToCopyAtPath
(includeEmptyDirectories=true) returns the directory itself as the sole
FileStatus entry. The subsequent call to
resolveReplicatedOwnerAndPermissionsRecursively passes file.getPath().getParent()
as fromPath — which is *above* replacedPrefix — inverting the ancestry check
and throwing an IOException.

Skip FileStatus entries where isDirectory=true; empty source directories
produce no copy work units by design. Log a warning so operators can
diagnose misconfigured source paths.
… in RecursiveCopyableDataset

When includeEmptyDirectories=true and the source root is empty, FileListUtils
returns the root directory itself as a FileStatus entry. Calling .getParent()
on it produces a path above replacedPrefix, breaking the ancestry check in
resolveReplicatedOwnerAndPermissionsRecursively.

Fix: guard with isAncestor(replacedPrefix, parentPath) — fall back to the
file's own path only when parentPath is above the dataset root. This preserves
correct ancestor permission replication for nested empty subdirs (where
.getParent() is still within replacedPrefix) and ensures the empty root
directory is replicated at the destination rather than skipped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.30%. Comparing base (26406a8) to head (2852359).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4181      +/-   ##
============================================
+ Coverage     43.26%   43.30%   +0.03%     
- Complexity     2583     2587       +4     
============================================
  Files           516      516              
  Lines         22087    22101      +14     
  Branches       2505     2505              
============================================
+ Hits           9557     9570      +13     
- Misses        11566    11569       +3     
+ Partials        964      962       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants