Skip to content

Improve ergonomics - rebuild Reader around native utf-8 string types#963

Open
dralley wants to merge 13 commits into
tafia:masterfrom
dralley:ergonomics-str
Open

Improve ergonomics - rebuild Reader around native utf-8 string types#963
dralley wants to merge 13 commits into
tafia:masterfrom
dralley:ergonomics-str

Conversation

@dralley
Copy link
Copy Markdown
Collaborator

@dralley dralley commented May 11, 2026

  • Add UTF-8 validation in Reader internals for parsing events
  • Change name / namespace types from &[u8] to &str
  • Change event types from Cow<[u8]> to Cow<str>, remove Decoder
  • Change attribute types from Cow<[u8]> to Cow<str>
  • Remove Decoder & methods from public API

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 11, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 75.42857% with 172 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.05%. Comparing base (a759d65) to head (ba067e3).
⚠️ Report is 54 commits behind head on master.

Files with missing lines Patch % Lines
src/events/mod.rs 55.94% 63 Missing ⚠️
src/events/attributes.rs 80.58% 20 Missing ⚠️
src/name.rs 86.02% 19 Missing ⚠️
src/reader/state.rs 84.05% 11 Missing ⚠️
examples/read_nodes.rs 0.00% 9 Missing ⚠️
benches/macrobenches.rs 0.00% 8 Missing ⚠️
examples/custom_entities.rs 0.00% 6 Missing ⚠️
src/reader/buffered_reader.rs 79.31% 6 Missing ⚠️
src/reader/slice_reader.rs 77.77% 6 Missing ⚠️
examples/nested_readers.rs 0.00% 5 Missing ⚠️
... and 10 more
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #963      +/-   ##
==========================================
+ Coverage   55.08%   56.05%   +0.96%     
==========================================
  Files          44       46       +2     
  Lines       16911    18036    +1125     
==========================================
+ Hits         9316    10110     +794     
- Misses       7595     7926     +331     
Flag Coverage Δ
unittests 56.05% <75.42%> (+0.96%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dralley dralley changed the title Improve ergonomics - rebuild Reader around native &str types Improve ergonomics - rebuild Reader around native utf-8 string types May 11, 2026
Comment thread src/de/key.rs Outdated
Cow::Owned(owned) => CowRef::Owned(owned),
},
Cow::Borrowed(b) => {
let name_str = std::str::from_utf8(&b[..start.name_len])
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be a handful of these temporary from_utf8() calls, but they should be able to be removed by by subsequent commits as additional types are switched over.

@dralley dralley force-pushed the ergonomics-str branch 4 times, most recently from e028123 to 4a01f13 Compare May 11, 2026 19:31
@dralley
Copy link
Copy Markdown
Collaborator Author

dralley commented May 11, 2026

@Mingun Would you be satisfied if BinaryStream / raw byte buffers were only supported when the rest of the document apart from those buffers is UTF-8?

edit: well, that's what I implemented.

Sidenote: maybe decoded_and_normalized_value() (etc.) ought to be marked deprecated in 0.40.1, to point people in the direction of using DecodingReader

@dralley
Copy link
Copy Markdown
Collaborator Author

dralley commented May 11, 2026

These 3 particular commits are ready for review, with the caveat that there will be (probably) 6-8 additional commits coming.

Comment thread src/utils.rs Outdated
} else {
break;
}
pub fn trim_xml_start(s: &str) -> &str {
Copy link
Copy Markdown
Collaborator Author

@dralley dralley May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mingun Keeping this as const would, I think, require bumping the MSRV to 1.86 (for const fn split_at) unless we use unsafe here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would look like

pub const fn trim_xml_start(s: &str) -> &str {
    let mut bytes = s.as_bytes();
    while let [first, rest @ ..] = bytes {
        if is_whitespace(*first) {
            bytes = rest;
        } else {
            break;
        }
    }
    // SAFETY: trimming ASCII whitespace from a valid UTF-8 string always yields valid UTF-8
    unsafe { core::str::from_utf8_unchecked(bytes) }
}

I went ahead with the safe split_at implementation however

@dralley dralley force-pushed the ergonomics-str branch 3 times, most recently from a198941 to 27e09f5 Compare May 12, 2026 18:50
Comment thread src/events/attributes.rs
impl<'a> Attribute<'a> {
/// Returns the attribute value normalized as per [the XML specification] (or [for 1.0]).
///
/// The document **must** be UTF-8 encoded, or pre-processed using [`DecodingReader`].
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still true, but it doesn't really need to be said here, since the Reader itself now has this requirement.

@dralley dralley force-pushed the ergonomics-str branch 3 times, most recently from 5474a71 to 266bf6f Compare May 12, 2026 20:58
@dralley dralley marked this pull request as ready for review May 12, 2026 21:11
@dralley
Copy link
Copy Markdown
Collaborator Author

dralley commented May 12, 2026

Remaining design questions, not all of which actually need to be dealt with in this PR:

  • Should Reader and / or Deserializer wrap a DecodingReader automatically if the encoding feature is enabled?
    • If we do, should we use from_utf8_unchecked() (since input is pre-validated)?
    • Should some form of built-in stream decoding or stream validation be the "default", with a special feature to disable it for access to BinaryStream / Reader::stream()?
    • Should slice_reader be converted to &str-only, with a reader over &[u8] using the buffered_reader path?
  • Should BytesStart, BytesEnd, BytesText (etc.) be renamed since they are no longer raw bytes but rather guaranteed UTF-8?
    • e.g. XmlStart, XmlEnd, XmlText or just Start, End, Text (which would maybe be ambiguous given Event::Text, Event::Start etc.)
  • Would introducing a Utf8ValidatingReader be worthwhile, or is the encoding_rs dependency not that big a deal?
  • Should a similar built-in wrapper of the inner Reader be used to track position within the file globally (line numbers, etc.) and perform EOL normalization?

@dralley dralley requested a review from Mingun May 12, 2026 21:37
@dralley
Copy link
Copy Markdown
Collaborator Author

dralley commented May 12, 2026

Also, I can improve the commit messages and Changelog entries if needed. The initial are pretty... concise.

@dralley dralley force-pushed the ergonomics-str branch 2 times, most recently from 4ba0d68 to cd8da9b Compare May 12, 2026 22:40
@Mingun
Copy link
Copy Markdown
Collaborator

Mingun commented May 13, 2026

First, I would prefer to keep the ability to parse non-utf8 encoded documents without recoding. XML itself can be parsed without knowing the exact encoding, it is enough if it is XML-compatible (which is all legacy 1-byte encodings that we support). So, is it possible to create a separate reader and event which will be always UTF-8 encoded and keep the current ones for advanced usage? It is fine to promote the new UTF-8-based reader as default, but keep the ability to work with non-UTF-8 input without recoding.

Here is the same situation as for regexp -- although it is defined in terms of strings, nothing prevents it from running on top of any byte arrays. The author of regexp engine even created a bstr crate to add useful string-based methods to byte arrays.

@dralley
Copy link
Copy Markdown
Collaborator Author

dralley commented May 13, 2026

First, I would prefer to keep the ability to parse non-utf8 encoded documents without recoding.

IMO, it is not worth the ergonomic and maintenance costs. If you look at all the major XML parsing libraries like libxml2, expat, encoding/xml (Go), and Jackson (Java) etc, they all do internal transcoding and throw errors if they encounter something that can't be decoded (or or replaced) - with no escape hatch. I suspect if this was a significant use case we would likely not be the only ones catering to it.

e.g.

libxml2 parses & handles UTF-8 only, performs a streaming decode of other encodings
https://dev.w3.org/XInclude-Test-Suite/libxml2-2.4.24/libxml2-2.4.24/doc/encoding.html

expat selects either UTF-8 or UTF-16 as an internal encoding at compile time, decodes to that, returns whichever type of string was selected
https://libexpat.github.io/doc/expat-internals-encodings/

encoding/xml is the same as libxml2 - utf-8 only
https://pkg.go.dev/encoding/xml (search CharsetReader)

Decoding is very very fast relative to XML parsing - it varies depending on encoding and the precise makeup of the document of course, but generally between 15 and 90 Gbps, whereas XML parsing is currently in the ballpark of 0.5 Gbps and often slower, so I don't really think that's a reason to avoid it either.

I would maybe accept the argument that it's a huge API change and it might be warranted to support both for some time to allow a migration, but even then it would likely be easier to just maintain an older branch for a longer period of time.

Duplicating the reader would, I think, be way way more work than it's worth.

@dralley
Copy link
Copy Markdown
Collaborator Author

dralley commented May 13, 2026

Also, the reason the XML libraries work that way, apart from overall simplicity, is that the XML standard effectively requires working that way. The standard actually said that all XML processors should be able to handle
either UTF-16 or UTF-8, and have mandatory fatal decoding errors in many situations, the easiest way to satisfy the requirements is to just do what everyone does, which is decode the document up front, and build a parser against one canonical encoding.

I'm not a complete stickler for compliance, and we do provide a handful of features catering to noncompliant XML and XML-derived document formats (which is fine), but in this case I really don't see a good reason to go out of our way to break with it. It's just more complexity for a use case of (IMO) very questionable value.

https://www.w3.org/TR/xml/

Section 2.2

The mechanism for encoding character code points into bit patterns may vary from entity to entity. **All XML processors MUST accept **the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.

Section 4.3.3 - Character Encoding in Entities

Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings.

...

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

dralley added 13 commits May 13, 2026 17:02
Document that Reader expects UTF-8 input.
Required for const fn split_at - needed to keep trim_xml* functions
const.
Make xml*_content() methods infalliable as they no longer handle
decoding.
Deprecate decode_and* methods, since they no longer serve a purpose.
It is now impossible for ReaderState to receive unvalidated bytes.
This avoids some redundant validation and allows making different
decisions about how to validate for different types of XmlSource.
Eliminates some duplicitous validation
@dralley
Copy link
Copy Markdown
Collaborator Author

dralley commented May 16, 2026

@Mingun Thoughts?

I understand your view, but duplicating the readers would end up being a lot of work I think, and I don't see a reason why we should have a different architecture than every other major XML parsing library. Even trying to put myself in the position of someone that needs to parse legacy encodings, I can't see why I would want to have to deal with those encodings in Rust.

@dralley
Copy link
Copy Markdown
Collaborator Author

dralley commented May 19, 2026

Are you having any issues with Github access?

@Mingun
Copy link
Copy Markdown
Collaborator

Mingun commented May 19, 2026

Are you having any issues with Github access?

No, just not every day enough time to read everything that appeared and answer.

duplicating the readers would end up being a lot of work I think

It seems to me that it is difficult to judge this without trying. It doesn't seem like that to me. The core reader will work with bytes, only event and and attribute types will need to be duplicated. It's not a lot of code.

For editing purposes we may want to be able to resave part of document, keeping the original encoding. We also may want to report byte offsets into the original source from which the document is parsed, which may be hard if we will reencode document always. I think, we need to think about this problems and how to handle them (or not handle at all). I would prefer not to pessimize ahead of time.

I need some time (perhaps of several months, because sometimes you need to take a break from the project in order to then take on it with renewed vigor) to review this and think about / experiment with possible alternative implementations. I will try to finish review for this PR as soon as possible.

I can't see why I would want to have to deal with those encodings in Rust.

For example, if we want to save memory. UTF-8 could be more that 2x times more memory than using legacy encodings. This can be critical if you need to keep large files in memory for some reason.

@dralley
Copy link
Copy Markdown
Collaborator Author

dralley commented May 19, 2026

Take your time on the review. However, re: your points -

Counterarguments:

It seems to me that it is difficult to judge this without trying. It doesn't seem like that to me. The core reader will work with bytes, only event and and attribute types will need to be duplicated. It's not a lot of code.

The amount of code that works with those types is pretty significant though. They'd still be different types ultimately, so even if the meat of the implementation is shared, you still need either duplicate definitions of all the user-facing methods of those types including duplicate doc comments that need to be kept in-sync; or else making a much larger portion of the codebase into macros than is already the case (which I wouldn't want to do). Tests are also an issue, because they can only test one API at a time, unless again everything becomes a macro, or you just pick one variant to test thoroughly and only trivially exercise the other.

Either way, I think that ends up being a significant maintenance cost and challenge. That's just my impression, maybe I'm wrong.

For example, if we want to save memory. UTF-8 could be more that 2x times more memory than using legacy encodings. This can be critical if you need to keep large files in memory for some reason.

quick-xml is already a streaming parser, which already puts it ahead of most parser options. Memory use would only be a major concern if the user needed to accumulate large portions of the document in memory anyway. And I would have to imagine it unlikely that a document in a legacy encoding would be so large in the first place, because such documents are usually somewhat constrained by the resources available during the time of its vintage. There's not a lot of new documents of those encodings being written since the 2000s.

Even if that was not the case - due to the streaming nature and the fact that no other XML library I can find permits this (much less streaming ones, specifically), quick-xml would still be one of the best options.

For editing purposes we may want to be able to resave part of document, keeping the original encoding. We also may want to report byte offsets into the original source from which the document is parsed, which may be hard if we will reencode document always. I think, we need to think about this problems and how to handle them (or not handle at all). I would prefer not to pessimize ahead of time.

I take this to mean "parse to position, obtain byte position, acquire a handle and seek + overwrite within the file". I guess that's possible, but has limited utility (it could only be used to overwrite exactly the same # of bytes) and comes at higher risk of corrupting the file. (On top of possible mistakes, it's also more susceptible to TOCTOU - maybe not so much in the "exploit" sense but in the correctness sense).

However it's also not that difficult or expensive to reverse-engineer the input buffer positions from a little bit of state that we already need to keep track of. We know the position of the start of the input buffer, we know the parser position in the output stream, it's just an exercise of counting characters on both sides, which can be optimized in various ways to minimize re-counting.

e.g. with single byte input encodings, the character count between the start of the output buffer and the current parser position is the byte offset (which you can add to the position of the beginning of the input buffer). The infrastructure for this is mostly identical to what needs to be added for proper reporting of line + column as well - something we can do much more reliably with a uniform input encoding.

Anyway, I'd like to not pessimize the common cases in favor of uncommon ones - I can't say that nobody would use those things, but I strongly suspect that more users would appreciate improved spec compliance and more convenient APIs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants