[Variant] Support ['fieldName'] in VariantPath parser#9276
[Variant] Support ['fieldName'] in VariantPath parser#9276klion26 wants to merge 7 commits intoapache:mainfrom
['fieldName'] in VariantPath parser#9276Conversation
88e1964 to
3201b0e
Compare
parquet-variant/src/path.rs
Outdated
|
|
||
| /// Parses a path string, panics on invalid input. | ||
| /// Only use for tests for known-valid input. | ||
| pub fn from_str_unchecked(s: &'a str) -> Self { |
There was a problem hiding this comment.
In rust and other parts of this crate the "unchecked" word is often used to denote unsafe code -- specifically that could lead to undefined behavior
I suggest we don't call it "unchecked"
I think just from_str is probably fine ?
There was a problem hiding this comment.
Thanks for the feedback. changed to from_str_or_panic. because clippy discourage to implement a from_str manually, and the std::str::FromStr::from_str returns Result<Self, Self::Err>; which is the same as TryFrom.
error: method `from_str` can be confused for the standard trait method `std::str::FromStr::from_str
There was a problem hiding this comment.
How about just using try_from("foo").unwrap() ?
Out of curiosity, why would we need to force escaping anything other than |
d76d9e7 to
8e2dcdb
Compare
8e2dcdb to
634f236
Compare
klion26
left a comment
There was a problem hiding this comment.
@alamb @scovich thanks for the review and sorry for the late reply, I've been busy with other things lately.
Also support escaping
\inside brackets. If we force users to use brackets when the field contains special characters, maybe we can also close #8954Out of curiosity, why would we need to force escaping anything other than
]? Even a\.wouldn't be enough to make splitting on.safe, so we'd just need a simple parser that scans the string looking for matching]after encountering a[?
@scovich Sorry for the confusion. In the description, I want to say we only support special characters like '.' between '[' and ']', if the user wants to use special characters, he/she nees to use them between '[' and ']', so the path parser is simpler and easier to maintain. If we prefer to support special characters in addition to '[' and ']', we can continue with #8954 and then implement it.
parquet-variant/src/path.rs
Outdated
|
|
||
| /// Parses a path string, panics on invalid input. | ||
| /// Only use for tests for known-valid input. | ||
| pub fn from_str_unchecked(s: &'a str) -> Self { |
There was a problem hiding this comment.
Thanks for the feedback. changed to from_str_or_panic. because clippy discourage to implement a from_str manually, and the std::str::FromStr::from_str returns Result<Self, Self::Err>; which is the same as TryFrom.
error: method `from_str` can be confused for the standard trait method `std::str::FromStr::from_str
|
I'll try and review this one over the next day or two to get it into the 58 release |
| /// let shredding_type = ShreddedSchemaBuilder::default() | ||
| /// // store the "time" field as a separate UTC timestamp | ||
| /// .with_path("time", (&DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())), true)) | ||
| /// .with_path(VariantPath::from_str_or_panic("time"), (&DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())), true)) |
There was a problem hiding this comment.
This looks so ugly compared to the previous version
It seems if we are going to do this, we should probably also change the signature of with_path because now P: Into<VariantPath<'a>>, is unlikely to match anything other than VariantPath
What if we changed the signature to take a TryFrom and return the correct error?
Then the example could look like
/// let shredding_type = ShreddedSchemaBuilder::default()
/// // store the "time" field as a separate UTC timestamp
/// .with_path("time", &DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())), true))? <-- note the ? here
/// ...
There was a problem hiding this comment.
Fixed, this's indeed a better way.
| single_variant_get_test( | ||
| r#"{"some_field": 1234}"#, | ||
| VariantPath::from("some_field"), | ||
| VariantPath::from_str_or_panic("some_field"), |
There was a problem hiding this comment.
I think this would be clearer to be honest:
| VariantPath::from_str_or_panic("some_field"), | |
| VariantPath::try_from("some_field").unwrap(), |
Or maybe a helper like
fn parse_path(p: &str) -> VariantPath<'a> {
VariantPath::try_from(p).unwrap(),
}And then
parse_path("some_field")There was a problem hiding this comment.
changed to try_from(..).unwrap()
parquet-variant/src/path.rs
Outdated
| #[test] | ||
| fn test_variant_path_empty_str() { | ||
| let path = VariantPath::from(""); | ||
| let path = VariantPath::from_str_or_panic(""); |
There was a problem hiding this comment.
I think
let path = VariantPath::try_from("").unwrap()Is no less verbose and is easier to understand (as it uses the standard rust traits)
6e3c4ce to
7a1d447
Compare
| let path: VariantPath<'a> = path.into(); | ||
| let path: VariantPath<'a> = path | ||
| .try_into() | ||
| .map_err(|e| ArrowError::InvalidArgumentError(format!("{:?}", e)))?; |
There was a problem hiding this comment.
Use this line and L544, because we'll have VaraintPath -> VariantPath, which is Infallible, and there is no From<Infallible> for ArrowError, ArrowError is located in arrow-schema, not sure if it's appropriate to make the change there. If add From<Infallible> for ArrowError is better, I can change it.
Which issue does this PR close?
..and['fieldName']syntax in the VariantPath parser #9050.VariantPathparsing #8954Rationale for this change
What changes are included in this PR?
Add
[fieldName]support inVariantPathparser, will throw an error if the parser fails.Also support escaping
\inside brackets. If we force users to use brackets when the field contains special characters, maybe we can also close #8954Sample behaviors(read more on the code doc)
[foo]-> filedfoo[2]-> index 2[a.b]-> fielda.b[a\]b]-> fielda]b[a\xb]-> fieldaxbAre these changes tested?
Added tests
Are there any user-facing changes?
Yes, there are some user-facing changes, but
parquet-variantis still experient for now, so maybe we don't need to wait for a major version.