Improve ergonomics - rebuild Reader around native utf-8 string types#963
Improve ergonomics - rebuild Reader around native utf-8 string types#963dralley wants to merge 13 commits into
Conversation
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #963 +/- ##
==========================================
+ Coverage 55.08% 56.05% +0.96%
==========================================
Files 44 46 +2
Lines 16911 18036 +1125
==========================================
+ Hits 9316 10110 +794
- Misses 7595 7926 +331
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| Cow::Owned(owned) => CowRef::Owned(owned), | ||
| }, | ||
| Cow::Borrowed(b) => { | ||
| let name_str = std::str::from_utf8(&b[..start.name_len]) |
There was a problem hiding this comment.
There will be a handful of these temporary from_utf8() calls, but they should be able to be removed by by subsequent commits as additional types are switched over.
e028123 to
4a01f13
Compare
|
@Mingun Would you be satisfied if edit: well, that's what I implemented. Sidenote: maybe |
|
These 3 particular commits are ready for review, with the caveat that there will be (probably) 6-8 additional commits coming. |
| } else { | ||
| break; | ||
| } | ||
| pub fn trim_xml_start(s: &str) -> &str { |
There was a problem hiding this comment.
@Mingun Keeping this as const would, I think, require bumping the MSRV to 1.86 (for const fn split_at) unless we use unsafe here.
There was a problem hiding this comment.
That would look like
pub const fn trim_xml_start(s: &str) -> &str {
let mut bytes = s.as_bytes();
while let [first, rest @ ..] = bytes {
if is_whitespace(*first) {
bytes = rest;
} else {
break;
}
}
// SAFETY: trimming ASCII whitespace from a valid UTF-8 string always yields valid UTF-8
unsafe { core::str::from_utf8_unchecked(bytes) }
}I went ahead with the safe split_at implementation however
a198941 to
27e09f5
Compare
| impl<'a> Attribute<'a> { | ||
| /// Returns the attribute value normalized as per [the XML specification] (or [for 1.0]). | ||
| /// | ||
| /// The document **must** be UTF-8 encoded, or pre-processed using [`DecodingReader`]. |
There was a problem hiding this comment.
This is still true, but it doesn't really need to be said here, since the Reader itself now has this requirement.
5474a71 to
266bf6f
Compare
|
Remaining design questions, not all of which actually need to be dealt with in this PR:
|
|
Also, I can improve the commit messages and Changelog entries if needed. The initial are pretty... concise. |
4ba0d68 to
cd8da9b
Compare
|
First, I would prefer to keep the ability to parse non-utf8 encoded documents without recoding. XML itself can be parsed without knowing the exact encoding, it is enough if it is XML-compatible (which is all legacy 1-byte encodings that we support). So, is it possible to create a separate reader and event which will be always UTF-8 encoded and keep the current ones for advanced usage? It is fine to promote the new UTF-8-based reader as default, but keep the ability to work with non-UTF-8 input without recoding. Here is the same situation as for regexp -- although it is defined in terms of strings, nothing prevents it from running on top of any byte arrays. The author of regexp engine even created a |
IMO, it is not worth the ergonomic and maintenance costs. If you look at all the major XML parsing libraries like e.g. libxml2 parses & handles UTF-8 only, performs a streaming decode of other encodings expat selects either UTF-8 or UTF-16 as an internal encoding at compile time, decodes to that, returns whichever type of string was selected encoding/xml is the same as libxml2 - utf-8 only Decoding is very very fast relative to XML parsing - it varies depending on encoding and the precise makeup of the document of course, but generally between 15 and 90 Gbps, whereas XML parsing is currently in the ballpark of 0.5 Gbps and often slower, so I don't really think that's a reason to avoid it either. I would maybe accept the argument that it's a huge API change and it might be warranted to support both for some time to allow a migration, but even then it would likely be easier to just maintain an older branch for a longer period of time. Duplicating the reader would, I think, be way way more work than it's worth. |
|
Also, the reason the XML libraries work that way, apart from overall simplicity, is that the XML standard effectively requires working that way. The standard actually said that all XML processors should be able to handle I'm not a complete stickler for compliance, and we do provide a handful of features catering to noncompliant XML and XML-derived document formats (which is fine), but in this case I really don't see a good reason to go out of our way to break with it. It's just more complexity for a use case of (IMO) very questionable value. Section 2.2
Section 4.3.3 - Character Encoding in Entities
...
|
Document that Reader expects UTF-8 input.
Required for const fn split_at - needed to keep trim_xml* functions const.
Make xml*_content() methods infalliable as they no longer handle decoding.
Deprecate decode_and* methods, since they no longer serve a purpose.
It is now impossible for ReaderState to receive unvalidated bytes. This avoids some redundant validation and allows making different decisions about how to validate for different types of XmlSource.
Eliminates some duplicitous validation
|
@Mingun Thoughts? I understand your view, but duplicating the readers would end up being a lot of work I think, and I don't see a reason why we should have a different architecture than every other major XML parsing library. Even trying to put myself in the position of someone that needs to parse legacy encodings, I can't see why I would want to have to deal with those encodings in Rust. |
|
Are you having any issues with Github access? |
No, just not every day enough time to read everything that appeared and answer.
It seems to me that it is difficult to judge this without trying. It doesn't seem like that to me. The core reader will work with bytes, only event and and attribute types will need to be duplicated. It's not a lot of code. For editing purposes we may want to be able to resave part of document, keeping the original encoding. We also may want to report byte offsets into the original source from which the document is parsed, which may be hard if we will reencode document always. I think, we need to think about this problems and how to handle them (or not handle at all). I would prefer not to pessimize ahead of time. I need some time (perhaps of several months, because sometimes you need to take a break from the project in order to then take on it with renewed vigor) to review this and think about / experiment with possible alternative implementations. I will try to finish review for this PR as soon as possible.
For example, if we want to save memory. UTF-8 could be more that 2x times more memory than using legacy encodings. This can be critical if you need to keep large files in memory for some reason. |
|
Take your time on the review. However, re: your points - Counterarguments:
The amount of code that works with those types is pretty significant though. They'd still be different types ultimately, so even if the meat of the implementation is shared, you still need either duplicate definitions of all the user-facing methods of those types including duplicate doc comments that need to be kept in-sync; or else making a much larger portion of the codebase into macros than is already the case (which I wouldn't want to do). Tests are also an issue, because they can only test one API at a time, unless again everything becomes a macro, or you just pick one variant to test thoroughly and only trivially exercise the other. Either way, I think that ends up being a significant maintenance cost and challenge. That's just my impression, maybe I'm wrong.
quick-xml is already a streaming parser, which already puts it ahead of most parser options. Memory use would only be a major concern if the user needed to accumulate large portions of the document in memory anyway. And I would have to imagine it unlikely that a document in a legacy encoding would be so large in the first place, because such documents are usually somewhat constrained by the resources available during the time of its vintage. There's not a lot of new documents of those encodings being written since the 2000s. Even if that was not the case - due to the streaming nature and the fact that no other XML library I can find permits this (much less streaming ones, specifically), quick-xml would still be one of the best options.
I take this to mean "parse to position, obtain byte position, acquire a handle and seek + overwrite within the file". I guess that's possible, but has limited utility (it could only be used to overwrite exactly the same # of bytes) and comes at higher risk of corrupting the file. (On top of possible mistakes, it's also more susceptible to TOCTOU - maybe not so much in the "exploit" sense but in the correctness sense). However it's also not that difficult or expensive to reverse-engineer the input buffer positions from a little bit of state that we already need to keep track of. We know the position of the start of the input buffer, we know the parser position in the output stream, it's just an exercise of counting characters on both sides, which can be optimized in various ways to minimize re-counting. e.g. with single byte input encodings, the character count between the start of the output buffer and the current parser position is the byte offset (which you can add to the position of the beginning of the input buffer). The infrastructure for this is mostly identical to what needs to be added for proper reporting of line + column as well - something we can do much more reliably with a uniform input encoding. Anyway, I'd like to not pessimize the common cases in favor of uncommon ones - I can't say that nobody would use those things, but I strongly suspect that more users would appreciate improved spec compliance and more convenient APIs. |
Cow<[u8]>toCow<str>, remove DecoderCow<[u8]>toCow<str>