Skip to content

feat!: secure-by-default symlink handling for ByteStream and converters#11505

Closed
ritikraj2425 wants to merge 1 commit into
deepset-ai:mainfrom
ritikraj2425:fix/secure-symlink-handling
Closed

feat!: secure-by-default symlink handling for ByteStream and converters#11505
ritikraj2425 wants to merge 1 commit into
deepset-ai:mainfrom
ritikraj2425:fix/secure-symlink-handling

Conversation

@ritikraj2425
Copy link
Copy Markdown
Contributor

Related Issues

Proposed Changes:

Currently, ByteStream.from_file_path and the get_bytestream_from_source utility follow symbolic links by default when reading files from disk. This "blind follow" behavior could lead to accidental Local File Inclusion (LFI) or Information Disclosure if a developer passes a user-supplied path that happens to be a symlink to a sensitive system file.

This PR implements a "Secure by Default" approach:

  • Added a follow_symlinks: bool = False parameter to ByteStream.from_file_path.
  • If follow_symlinks is False (the new default), the method checks filepath.is_symlink() and raises a ValueError if a link is detected.
  • Propagated this setting to the get_bytestream_from_source utility so that components reading file paths (like routers or converters) inherit this secure behavior.

How did you test it?

  • Added unit tests in test/dataclasses/test_byte_stream.py to verify that ValueError is raised when trying to read a symlink with default parameters, and that it successfully reads it when follow_symlinks=True is explicitly passed.
  • Added corresponding unit tests in test/components/converters/test_utils.py for get_bytestream_from_source.
  • Verified changes locally using hatch run test:unit and hatch run test:types.

Notes for the reviewer

  • The change uses .is_symlink() rather than aggressively resolving all parent paths to avoid breaking legitimate environments (like macOS /tmp directories which are often symlinked to /private/tmp).
  • This change is isolated to the core dataclasses and utilities to keep the scope focused and avoid breaking the API of all individual converter components at once.

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

@ritikraj2425 ritikraj2425 requested a review from a team as a code owner June 4, 2026 03:44
@ritikraj2425 ritikraj2425 requested review from anakin87 and removed request for a team June 4, 2026 03:44
@vercel
Copy link
Copy Markdown

vercel Bot commented Jun 4, 2026

@ritikraj2425 is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions github-actions Bot added topic:tests type:documentation Improvements on the docs labels Jun 4, 2026
@anakin87
Copy link
Copy Markdown
Member

anakin87 commented Jun 4, 2026

See #11252 (comment)

@anakin87 anakin87 closed this Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

topic:tests type:documentation Improvements on the docs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Proposal] Security Hardening: Add symlink protection to ByteStream and Converters

2 participants