Skip to content

Improve URL parsing in VortexMultiFileScan to handle file paths more …#6842

Open
iceboundrock wants to merge 10 commits intovortex-data:developfrom
Lychee-Technology:develop
Open

Improve URL parsing in VortexMultiFileScan to handle file paths more …#6842
iceboundrock wants to merge 10 commits intovortex-data:developfrom
Lychee-Technology:develop

Conversation

@iceboundrock
Copy link
Contributor

@iceboundrock iceboundrock commented Mar 8, 2026

Summary

This PR fixes the issue that relative paths cause Neither URL nor path error

Closes: #6841

This PR utilizes AI (codex from OpenAI) to generate code.

API Changes

None

Testing

cargo test -p vortex-duckdb --lib

@AdamGS AdamGS added the changelog/fix A bug fix label Mar 8, 2026
@AdamGS AdamGS self-assigned this Mar 8, 2026
…robustly

Fixes vortex-data#6841 where relative file paths caused a "Neither URL nor path" error.

Relative paths are now resolved to absolute paths via `std::fs::canonicalize`
before being converted to a `file://` URL. Glob patterns containing `*`, `?`,
or `[` skip canonicalization because the pattern itself does not correspond to
an existing path.

The URL parsing logic is extracted into a `parse_glob_url` helper with unit
tests covering s3 URLs, `file://` URLs, absolute glob patterns, absolute file
paths, relative file paths, and the error case for non-existent paths.

Signed-off-by: Ruoshi Li <iceboundrock@gmail.com>
iceboundrock and others added 3 commits March 8, 2026 13:14
- Extract URL parsing into parse_glob_url and path resolution into
  canonicalize_path_prefix for testability.
- Fix relative glob paths (e.g. ./data/*.vortex): canonicalize only
  the directory prefix before the first glob character, then append
  the glob suffix, so relative patterns resolve to absolute file URLs.
- Replace repeated generic "Neither URL nor path" errors with specific
  messages that include the underlying IO error (e.g. "No such file or
  directory"), making diagnosis easier.
- Add unit tests covering s3/file URLs, absolute/relative paths,
  absolute/relative glob patterns, and the not-found error case.

Signed-off-by: Ruoshi Li <iceboundrock@gmail.com>
Signed-off-by: Ruoshi Li <iceboundrock@gmail.com>
/// is canonicalized and the remaining glob suffix is appended unchanged. This allows relative glob
/// patterns such as `./data/*.vortex` to be resolved correctly.
fn canonicalize_path_prefix(path_str: &str) -> VortexResult<PathBuf> {
let first_glob_char_pos = path_str.find(['*', '?', '[']);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the glob crate to detect these pattern? this feel brittle

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will use it in the future. For this PR, I will follow @myrrc 's comment to remove the implementation and just keep tests.

@codspeed-hq
Copy link

codspeed-hq bot commented Mar 9, 2026

Merging this PR will degrade performance by 11%

⚡ 4 improved benchmarks
❌ 2 regressed benchmarks
✅ 994 untouched benchmarks
⏩ 1466 skipped benchmarks1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation patched_take_200k_first_chunk_only 5.4 ms 4.8 ms +11.97%
Simulation patched_take_200k_dispersed 5.6 ms 4.7 ms +19.86%
Simulation take_200k_dispersed 4.5 ms 3.6 ms +24.35%
Simulation take_200k_first_chunk_only 4.2 ms 3.3 ms +26.42%
Simulation map_each[BufferMut<i32>, 128] 770.6 ns 858.1 ns -10.2%
Simulation bitwise_not_vortex_buffer_mut[128] 471.9 ns 530.3 ns -11%

Comparing Lychee-Technology:develop (1b252df) with develop (0309e57)

Open in CodSpeed

Footnotes

  1. 1466 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@AdamGS AdamGS requested a review from myrrc March 9, 2026 13:57
Copy link
Contributor

@myrrc myrrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duckdb resolves symlinks in its filesystem module, so we do a double job here with canonicalization. Same for manual glob checks. The error was that we needed to pass Url an absolute path which I've fixed in #6851. We do want to merge tests from this PR though, increased code coverage is always a good thing.

Can you please keep the tests but replace canonicalization/globbing checks with the solution present upstream?

@myrrc myrrc assigned myrrc and unassigned AdamGS Mar 9, 2026
@myrrc myrrc added the ext/duckdb Relates to the DuckDB integration label Mar 9, 2026
@iceboundrock
Copy link
Contributor Author

iceboundrock commented Mar 9, 2026

Duckdb resolves symlinks in its filesystem module, so we do a double job here with canonicalization. Same for manual glob checks. The error was that we needed to pass Url an absolute path which I've fixed in #6851. We do want to merge tests from this PR though, increased code coverage is always a good thing.

Can you please keep the tests but replace canonicalization/globbing checks with the solution present upstream?

Done

iceboundrock and others added 2 commits March 9, 2026 10:03
Merges upstream changes including PR vortex-data#6851 (duckdb: read from local
path, correct err on invalid path). Resolves conflict in multi_file.rs
by keeping the parse_glob_url wrapper function (with its tests) from
this branch, but replacing the canonicalize_path_prefix implementation
with the simpler std::path::absolute approach from upstream.

Signed-off-by: iceboundrock <iceboundrock@outlook.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@iceboundrock iceboundrock requested a review from myrrc March 9, 2026 17:19
}

#[test]
fn test_parse_glob_url_relative_glob_path() -> VortexResult<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I believe the PR is nearly ready.

Can you also add a test for relative paths like "../test.vortex", please?
I've tested and it fails, likely a flaw in my implementation.

$ duckdb -c 'SELECT * FROM "../vortex/vortex-bench/data/tpch/1.0/vortex-file-compressed/customer_0.vortex"'
Binder Error:
Object store error: Object at location /home/myrrc/duckdb-vortex/%2E%2E/vortex/vortex-bench/data/tpch/1.0/vortex-file-compressed/customer_0.vortex not found: No such file or directory (os error 2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Root cause

std::path::absolute() makes a path absolute but does not normalize .. and/or . components. So ../test.vortex from /home/user/vortex/ becomes /home/user/vortex/../test.vortex, and Url::from_file_path then percent-encodes the dots as %2E%2E, causing DuckDB to look up a wrong/encoded path.

Fix

Added a normalize_path helper function that walks the path components and resolves . and .. without touching the filesystem. This is called after absolute() before creating the Url.

Tests

  • added: test_parse_glob_url_parent_relative_path — creates a temp file in .., constructs a ../filename relative path, and asserts the resulting URL is absolute, has no %2E encoding, and ends with the correct filename.
  • added test_parse_glob_url_dot_normalization, just test the normalize_path algorithm without creating any files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/fix A bug fix ext/duckdb Relates to the DuckDB integration

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Relative paths cause Neither URL nor path error because Url::from_file_path requires absolute paths

3 participants