Improve URL parsing in VortexMultiFileScan to handle file paths more …#6842
Improve URL parsing in VortexMultiFileScan to handle file paths more …#6842iceboundrock wants to merge 10 commits intovortex-data:developfrom
Conversation
…robustly Fixes vortex-data#6841 where relative file paths caused a "Neither URL nor path" error. Relative paths are now resolved to absolute paths via `std::fs::canonicalize` before being converted to a `file://` URL. Glob patterns containing `*`, `?`, or `[` skip canonicalization because the pattern itself does not correspond to an existing path. The URL parsing logic is extracted into a `parse_glob_url` helper with unit tests covering s3 URLs, `file://` URLs, absolute glob patterns, absolute file paths, relative file paths, and the error case for non-existent paths. Signed-off-by: Ruoshi Li <iceboundrock@gmail.com>
- Extract URL parsing into parse_glob_url and path resolution into canonicalize_path_prefix for testability. - Fix relative glob paths (e.g. ./data/*.vortex): canonicalize only the directory prefix before the first glob character, then append the glob suffix, so relative patterns resolve to absolute file URLs. - Replace repeated generic "Neither URL nor path" errors with specific messages that include the underlying IO error (e.g. "No such file or directory"), making diagnosis easier. - Add unit tests covering s3/file URLs, absolute/relative paths, absolute/relative glob patterns, and the not-found error case. Signed-off-by: Ruoshi Li <iceboundrock@gmail.com>
Signed-off-by: Ruoshi Li <iceboundrock@gmail.com>
vortex-duckdb/src/multi_file.rs
Outdated
| /// is canonicalized and the remaining glob suffix is appended unchanged. This allows relative glob | ||
| /// patterns such as `./data/*.vortex` to be resolved correctly. | ||
| fn canonicalize_path_prefix(path_str: &str) -> VortexResult<PathBuf> { | ||
| let first_glob_char_pos = path_str.find(['*', '?', '[']); |
There was a problem hiding this comment.
can we use the glob crate to detect these pattern? this feel brittle
There was a problem hiding this comment.
I will use it in the future. For this PR, I will follow @myrrc 's comment to remove the implementation and just keep tests.
Merging this PR will degrade performance by 11%
Performance Changes
Comparing Footnotes
|
myrrc
left a comment
There was a problem hiding this comment.
Duckdb resolves symlinks in its filesystem module, so we do a double job here with canonicalization. Same for manual glob checks. The error was that we needed to pass Url an absolute path which I've fixed in #6851. We do want to merge tests from this PR though, increased code coverage is always a good thing.
Can you please keep the tests but replace canonicalization/globbing checks with the solution present upstream?
Done |
Merges upstream changes including PR vortex-data#6851 (duckdb: read from local path, correct err on invalid path). Resolves conflict in multi_file.rs by keeping the parse_glob_url wrapper function (with its tests) from this branch, but replacing the canonicalize_path_prefix implementation with the simpler std::path::absolute approach from upstream. Signed-off-by: iceboundrock <iceboundrock@outlook.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| } | ||
|
|
||
| #[test] | ||
| fn test_parse_glob_url_relative_glob_path() -> VortexResult<()> { |
There was a problem hiding this comment.
Thanks, I believe the PR is nearly ready.
Can you also add a test for relative paths like "../test.vortex", please?
I've tested and it fails, likely a flaw in my implementation.
$ duckdb -c 'SELECT * FROM "../vortex/vortex-bench/data/tpch/1.0/vortex-file-compressed/customer_0.vortex"'
Binder Error:
Object store error: Object at location /home/myrrc/duckdb-vortex/%2E%2E/vortex/vortex-bench/data/tpch/1.0/vortex-file-compressed/customer_0.vortex not found: No such file or directory (os error 2)
There was a problem hiding this comment.
Root cause
std::path::absolute() makes a path absolute but does not normalize .. and/or . components. So ../test.vortex from /home/user/vortex/ becomes /home/user/vortex/../test.vortex, and Url::from_file_path then percent-encodes the dots as %2E%2E, causing DuckDB to look up a wrong/encoded path.
Fix
Added a normalize_path helper function that walks the path components and resolves . and .. without touching the filesystem. This is called after absolute() before creating the Url.
Tests
- added:
test_parse_glob_url_parent_relative_path— creates a temp file in.., constructs a../filenamerelative path, and asserts the resulting URL is absolute, has no%2Eencoding, and ends with the correct filename. - added
test_parse_glob_url_dot_normalization, just test thenormalize_pathalgorithm without creating any files.
Signed-off-by: Ruoshi Li <iceboundrock@gmail.com>
Summary
This PR fixes the issue that relative paths cause Neither URL nor path error
Closes: #6841
This PR utilizes AI (codex from OpenAI) to generate code.
API Changes
None
Testing
cargo test -p vortex-duckdb --lib