Improve ergonomics - rebuild Reader around native utf-8 string types by dralley · Pull Request #963 · tafia/quick-xml

dralley · 2026-05-11T15:01:37Z

Add UTF-8 validation in Reader internals for parsing events
Change name / namespace types from &[u8] to &str
Change event types from Cow<[u8]> to Cow<str>, remove Decoder
Change attribute types from Cow<[u8]> to Cow<str>
Remove Decoder & methods from public API

codecov-commenter · 2026-05-11T15:16:18Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 75.42857% with 172 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.05%. Comparing base (a759d65) to head (ba067e3).
⚠️ Report is 54 commits behind head on master.

Files with missing lines	Patch %	Lines
src/events/mod.rs	55.94%	63 Missing ⚠️
src/events/attributes.rs	80.58%	20 Missing ⚠️
src/name.rs	86.02%	19 Missing ⚠️
src/reader/state.rs	84.05%	11 Missing ⚠️
examples/read_nodes.rs	0.00%	9 Missing ⚠️
benches/macrobenches.rs	0.00%	8 Missing ⚠️
examples/custom_entities.rs	0.00%	6 Missing ⚠️
src/reader/buffered_reader.rs	79.31%	6 Missing ⚠️
src/reader/slice_reader.rs	77.77%	6 Missing ⚠️
examples/nested_readers.rs	0.00%	5 Missing ⚠️
... and 10 more
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #963      +/-   ##
==========================================
+ Coverage   55.08%   56.05%   +0.96%     
==========================================
  Files          44       46       +2     
  Lines       16911    18036    +1125     
==========================================
+ Hits         9316    10110     +794     
- Misses       7595     7926     +331

Flag	Coverage Δ
unittests	`56.05% <75.42%> (+0.96%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

dralley · 2026-05-11T16:44:45Z

-                Cow::Owned(owned) => CowRef::Owned(owned),
-            },
+            Cow::Borrowed(b) => {
+                let name_str = std::str::from_utf8(&b[..start.name_len])


There will be a handful of these temporary from_utf8() calls, but they should be able to be removed by by subsequent commits as additional types are switched over.

dralley · 2026-05-11T19:50:22Z

@Mingun Would you be satisfied if BinaryStream / raw byte buffers were only supported when the rest of the document apart from those buffers is UTF-8?

edit: well, that's what I implemented.

Sidenote: maybe decoded_and_normalized_value() (etc.) ought to be marked deprecated in 0.40.1, to point people in the direction of using DecodingReader

dralley · 2026-05-11T22:15:36Z

These 3 particular commits are ready for review, with the caveat that there will be (probably) 6-8 additional commits coming.

dralley · 2026-05-12T18:27:38Z

-        } else {
-            break;
-        }
+pub fn trim_xml_start(s: &str) -> &str {


@Mingun Keeping this as const would, I think, require bumping the MSRV to 1.86 (for const fn split_at) unless we use unsafe here.

That would look like

pub const fn trim_xml_start(s: &str) -> &str { let mut bytes = s.as_bytes(); while let [first, rest @ ..] = bytes { if is_whitespace(*first) { bytes = rest; } else { break; } } // SAFETY: trimming ASCII whitespace from a valid UTF-8 string always yields valid UTF-8 unsafe { core::str::from_utf8_unchecked(bytes) } }

I went ahead with the safe split_at implementation however

dralley · 2026-05-12T18:53:42Z

 impl<'a> Attribute<'a> {
    /// Returns the attribute value normalized as per [the XML specification] (or [for 1.0]).
    ///
-    /// The document **must** be UTF-8 encoded, or pre-processed using [`DecodingReader`].


This is still true, but it doesn't really need to be said here, since the Reader itself now has this requirement.

dralley · 2026-05-12T21:27:23Z

Remaining design questions, not all of which actually need to be dealt with in this PR:

Should Reader and / or Deserializer wrap a DecodingReader automatically if the encoding feature is enabled?
- If we do, should we use from_utf8_unchecked() (since input is pre-validated)?
- Should some form of built-in stream decoding or stream validation be the "default", with a special feature to disable it for access to BinaryStream / Reader::stream()?
- Should slice_reader be converted to &str-only, with a reader over &[u8] using the buffered_reader path?
Should BytesStart, BytesEnd, BytesText (etc.) be renamed since they are no longer raw bytes but rather guaranteed UTF-8?
- e.g. XmlStart, XmlEnd, XmlText or just Start, End, Text (which would maybe be ambiguous given Event::Text, Event::Start etc.)
Would introducing a Utf8ValidatingReader be worthwhile, or is the encoding_rs dependency not that big a deal?
Should a similar built-in wrapper of the inner Reader be used to track position within the file globally (line numbers, etc.) and perform EOL normalization?

dralley · 2026-05-12T21:46:57Z

Also, I can improve the commit messages and Changelog entries if needed. The initial are pretty... concise.

Mingun · 2026-05-13T17:15:48Z

First, I would prefer to keep the ability to parse non-utf8 encoded documents without recoding. XML itself can be parsed without knowing the exact encoding, it is enough if it is XML-compatible (which is all legacy 1-byte encodings that we support). So, is it possible to create a separate reader and event which will be always UTF-8 encoded and keep the current ones for advanced usage? It is fine to promote the new UTF-8-based reader as default, but keep the ability to work with non-UTF-8 input without recoding.

Here is the same situation as for regexp -- although it is defined in terms of strings, nothing prevents it from running on top of any byte arrays. The author of regexp engine even created a bstr crate to add useful string-based methods to byte arrays.

dralley · 2026-05-13T19:27:21Z

First, I would prefer to keep the ability to parse non-utf8 encoded documents without recoding.

IMO, it is not worth the ergonomic and maintenance costs. If you look at all the major XML parsing libraries like libxml2, expat, encoding/xml (Go), and Jackson (Java) etc, they all do internal transcoding and throw errors if they encounter something that can't be decoded (or or replaced) - with no escape hatch. I suspect if this was a significant use case we would likely not be the only ones catering to it.

e.g.

libxml2 parses & handles UTF-8 only, performs a streaming decode of other encodings
https://dev.w3.org/XInclude-Test-Suite/libxml2-2.4.24/libxml2-2.4.24/doc/encoding.html

expat selects either UTF-8 or UTF-16 as an internal encoding at compile time, decodes to that, returns whichever type of string was selected
https://libexpat.github.io/doc/expat-internals-encodings/

encoding/xml is the same as libxml2 - utf-8 only
https://pkg.go.dev/encoding/xml (search CharsetReader)

Decoding is very very fast relative to XML parsing - it varies depending on encoding and the precise makeup of the document of course, but generally between 15 and 90 Gbps, whereas XML parsing is currently in the ballpark of 0.5 Gbps and often slower, so I don't really think that's a reason to avoid it either.

I would maybe accept the argument that it's a huge API change and it might be warranted to support both for some time to allow a migration, but even then it would likely be easier to just maintain an older branch for a longer period of time.

Duplicating the reader would, I think, be way way more work than it's worth.

dralley · 2026-05-13T19:57:58Z

Also, the reason the XML libraries work that way, apart from overall simplicity, is that the XML standard effectively requires working that way. The standard actually said that all XML processors should be able to handle
either UTF-16 or UTF-8, and have mandatory fatal decoding errors in many situations, the easiest way to satisfy the requirements is to just do what everyone does, which is decode the document up front, and build a parser against one canonical encoding.

I'm not a complete stickler for compliance, and we do provide a handful of features catering to noncompliant XML and XML-derived document formats (which is fine), but in this case I really don't see a good reason to go out of our way to break with it. It's just more complexity for a use case of (IMO) very questionable value.

https://www.w3.org/TR/xml/

Section 2.2

The mechanism for encoding character code points into bit patterns may vary from entity to entity. **All XML processors MUST accept **the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.

Section 4.3.3 - Character Encoding in Entities

Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings.

...

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

Document that Reader expects UTF-8 input.

Required for const fn split_at - needed to keep trim_xml* functions const.

Make xml*_content() methods infalliable as they no longer handle decoding.

Deprecate decode_and* methods, since they no longer serve a purpose.

It is now impossible for ReaderState to receive unvalidated bytes. This avoids some redundant validation and allows making different decisions about how to validate for different types of XmlSource.

Eliminates some duplicitous validation

dralley · 2026-05-16T23:33:45Z

@Mingun Thoughts?

I understand your view, but duplicating the readers would end up being a lot of work I think, and I don't see a reason why we should have a different architecture than every other major XML parsing library. Even trying to put myself in the position of someone that needs to parse legacy encodings, I can't see why I would want to have to deal with those encodings in Rust.

dralley · 2026-05-19T13:50:36Z

Are you having any issues with Github access?

Mingun · 2026-05-19T18:52:16Z

Are you having any issues with Github access?

No, just not every day enough time to read everything that appeared and answer.

duplicating the readers would end up being a lot of work I think

It seems to me that it is difficult to judge this without trying. It doesn't seem like that to me. The core reader will work with bytes, only event and and attribute types will need to be duplicated. It's not a lot of code.

For editing purposes we may want to be able to resave part of document, keeping the original encoding. We also may want to report byte offsets into the original source from which the document is parsed, which may be hard if we will reencode document always. I think, we need to think about this problems and how to handle them (or not handle at all). I would prefer not to pessimize ahead of time.

I need some time (perhaps of several months, because sometimes you need to take a break from the project in order to then take on it with renewed vigor) to review this and think about / experiment with possible alternative implementations. I will try to finish review for this PR as soon as possible.

I can't see why I would want to have to deal with those encodings in Rust.

For example, if we want to save memory. UTF-8 could be more that 2x times more memory than using legacy encodings. This can be critical if you need to keep large files in memory for some reason.

dralley · 2026-05-19T21:13:22Z

Take your time on the review. However, re: your points -

Counterarguments:

It seems to me that it is difficult to judge this without trying. It doesn't seem like that to me. The core reader will work with bytes, only event and and attribute types will need to be duplicated. It's not a lot of code.

The amount of code that works with those types is pretty significant though. They'd still be different types ultimately, so even if the meat of the implementation is shared, you still need either duplicate definitions of all the user-facing methods of those types including duplicate doc comments that need to be kept in-sync; or else making a much larger portion of the codebase into macros than is already the case (which I wouldn't want to do). Tests are also an issue, because they can only test one API at a time, unless again everything becomes a macro, or you just pick one variant to test thoroughly and only trivially exercise the other.

Either way, I think that ends up being a significant maintenance cost and challenge. That's just my impression, maybe I'm wrong.

For example, if we want to save memory. UTF-8 could be more that 2x times more memory than using legacy encodings. This can be critical if you need to keep large files in memory for some reason.

quick-xml is already a streaming parser, which already puts it ahead of most parser options. Memory use would only be a major concern if the user needed to accumulate large portions of the document in memory anyway. And I would have to imagine it unlikely that a document in a legacy encoding would be so large in the first place, because such documents are usually somewhat constrained by the resources available during the time of its vintage. There's not a lot of new documents of those encodings being written since the 2000s.

Even if that was not the case - due to the streaming nature and the fact that no other XML library I can find permits this (much less streaming ones, specifically), quick-xml would still be one of the best options.

For editing purposes we may want to be able to resave part of document, keeping the original encoding. We also may want to report byte offsets into the original source from which the document is parsed, which may be hard if we will reencode document always. I think, we need to think about this problems and how to handle them (or not handle at all). I would prefer not to pessimize ahead of time.

I take this to mean "parse to position, obtain byte position, acquire a handle and seek + overwrite within the file". I guess that's possible, but has limited utility (it could only be used to overwrite exactly the same # of bytes) and comes at higher risk of corrupting the file. (On top of possible mistakes, it's also more susceptible to TOCTOU - maybe not so much in the "exploit" sense but in the correctness sense).

However it's also not that difficult or expensive to reverse-engineer the input buffer positions from a little bit of state that we already need to keep track of. We know the position of the start of the input buffer, we know the parser position in the output stream, it's just an exercise of counting characters on both sides, which can be optimized in various ways to minimize re-counting.

e.g. with single byte input encodings, the character count between the start of the output buffer and the current parser position is the byte offset (which you can add to the position of the beginning of the input buffer). The infrastructure for this is mostly identical to what needs to be added for proper reporting of line + column as well - something we can do much more reliably with a uniform input encoding.

Anyway, I'd like to not pessimize the common cases in favor of uncommon ones - I can't say that nobody would use those things, but I strongly suspect that more users would appreciate improved spec compliance and more convenient APIs.

dralley force-pushed the ergonomics-str branch from 82cd028 to 6959046 Compare May 11, 2026 15:03

dralley changed the title ~~Improve ergonomics - rebuild Reader around native &str types~~ Improve ergonomics - rebuild Reader around native utf-8 string types May 11, 2026

dralley commented May 11, 2026

View reviewed changes

dralley force-pushed the ergonomics-str branch 4 times, most recently from e028123 to 4a01f13 Compare May 11, 2026 19:31

dralley force-pushed the ergonomics-str branch from c8749de to 1939e95 Compare May 12, 2026 18:20

dralley commented May 12, 2026

View reviewed changes

dralley force-pushed the ergonomics-str branch 3 times, most recently from a198941 to 27e09f5 Compare May 12, 2026 18:50

dralley commented May 12, 2026

View reviewed changes

dralley force-pushed the ergonomics-str branch 3 times, most recently from 5474a71 to 266bf6f Compare May 12, 2026 20:58

dralley marked this pull request as ready for review May 12, 2026 21:11

dralley requested a review from Mingun May 12, 2026 21:37

dralley force-pushed the ergonomics-str branch 2 times, most recently from 4ba0d68 to cd8da9b Compare May 12, 2026 22:40

dralley mentioned this pull request May 13, 2026

Support for embedded raw binary with serde API #792

Open

dralley mentioned this pull request May 13, 2026

ergonomics & encodings #158

Open

dralley added 13 commits May 13, 2026 17:02

Reader now verifies that events are UTF-8

dc08dfc

Document that Reader expects UTF-8 input.

Make Name types (QName, LocalName, etc.) str-based

379e2cc

Remove Decoder from Event type / Attribute structs

010bd07

Convert Event types to str

dd2c2f3

Bump MSRV to 1.86

78cac9e

Required for const fn split_at - needed to keep trim_xml* functions const.

Change Deref<Target=..> from [u8] to &str

0ccaa71

Remove decode() method from BytesText, etc.

f233b67

Make xml*_content() methods infalliable as they no longer handle decoding.

Convert Attributes to Cow<str>

728f487

Deprecate decode_and* methods, since they no longer serve a purpose.

Remove remanents of Decoder from the Reader API

1111638

Fix fuzzing error

8790cf2

Convert ReaderState.opened_buffer from Vec<u8> to String

d4edacc

Moved UTF-8 validation from ReaderState::emit_* to XmlSource

62a5b84

It is now impossible for ReaderState to receive unvalidated bytes. This avoids some redundant validation and allows making different decisions about how to validate for different types of XmlSource.

Convert NamespaceResolver::buffer from Vec<u8> to String

ba067e3

Eliminates some duplicitous validation

dralley force-pushed the ergonomics-str branch from 61986bb to ba067e3 Compare May 13, 2026 21:03

Conversation

dralley commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dralley May 11, 2026

Choose a reason for hiding this comment

Uh oh!

dralley commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dralley commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dralley May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dralley May 12, 2026

Choose a reason for hiding this comment

Uh oh!

dralley May 12, 2026

Choose a reason for hiding this comment

Uh oh!

dralley commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dralley commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Mingun commented May 13, 2026

Uh oh!

dralley commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dralley commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dralley commented May 16, 2026

Uh oh!

dralley commented May 19, 2026

Uh oh!

Mingun commented May 19, 2026

Uh oh!

dralley commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dralley commented May 11, 2026 •

edited

Loading

codecov-commenter commented May 11, 2026 •

edited

Loading

dralley commented May 11, 2026 •

edited

Loading

dralley commented May 11, 2026 •

edited

Loading

dralley May 12, 2026 •

edited

Loading

dralley commented May 12, 2026 •

edited

Loading

dralley commented May 12, 2026 •

edited

Loading

dralley commented May 13, 2026 •

edited

Loading

dralley commented May 13, 2026 •

edited

Loading

dralley commented May 19, 2026 •

edited

Loading