fix: decode header values per-ecosystem (Python/JS) and expose raw header bytes#492
Open
Pijukatel wants to merge 20 commits into
What we're solving: response header values are decoded as ISO-8859-1 (b as char), which garbles UTF-8 header values into mojibake (#479). That decode was introduced intentionally by #434 to stop non-ASCII header bytes from crashing Node / emptying Python (#430), so the two positions conflict.
How: proposed fix is to decode UTF-8 first and fall back to the existing byte-preserving ISO-8859-1 decode only when the bytes are not valid UTF-8, in one shared core-crate helper used by both bindings. This fixes #479 while keeping #434's latin-1 case and #430's non-crash guarantee.
Alternatives considered: the issue's suggested from_utf8_lossy was rejected because it regresses #434 (turns the bare 0xE4 test byte into U+FFFD).
This commit contains only the devforge investigation artifacts under .devforge/; no source has been changed and the run is paused at the design gate awaiting human approval before any source edit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
What we're solving: response header values were decoded byte-for-byte as ISO-8859-1 (b as char), which garbled UTF-8 header values such as Content-Disposition: attachment; filename="naïve.pdf" into mojibake (#479). That decode was introduced deliberately in #434 to stop non-ASCII header bytes from crashing the Node bindings / emptying the Python ones (#430), so a naive switch to UTF-8 would regress those.
How: added a shared decode_header_value helper in the core crate (re-exported via impit::utils) that decodes the bytes as UTF-8 when they are valid UTF-8 and otherwise falls back to the byte-preserving ISO-8859-1 decode. Both the Node and Python bindings now call it. This fixes the common UTF-8 case, keeps #434's genuine ISO-8859-1 values intact, and never emits U+FFFD replacement characters, so #430's non-crash / non-empty guarantee holds.
Alternatives considered: the issue's suggested String::from_utf8_lossy was rejected because it turns invalid-UTF-8 latin-1 bytes (e.g. a lone 0xE4) into replacement characters, reintroducing the corruption #434 fixed. Exposing raw header bytes for signature/HMAC callers is left as a separate follow-up.
Note: final review is still in progress; the full workspace build and JS/Py suites must run in CI as the pinned github.com/apify/h2 git dependency is not reachable from this environment.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Follow-up to the header-decode fix: the rustdoc example used an invalid byte literal (b've') that would fail cargo test --doc, and a test array literal plus the reformatted binding call sites were not run through rustfmt (a required CI job). Corrected the doctest to spell the bytes out individually and ran rustfmt across the touched files. Also strengthened the local verification to run rustfmt --check and rustdoc --test alongside the unit tests, since the doctest error was invisible to a plain rustc --test run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Investigation/review evidence only; no source changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Validate the header bytes against the borrow with str::from_utf8 instead of copying them into an owned Vec first. The common UTF-8 path now allocates once (the owned String) and the ISO-8859-1 fallback allocates once (the collect), with no intermediate copy. Byte semantics are unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Review evidence only; no source changes. Awaiting human create-PR decision. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
…essor What we're solving: keep #479's UTF-8 fix but make each binding faithful to its reference client - httpx (UTF-8-first) in Python, Fetch (strict latin-1) in JS - and add a raw-header-bytes accessor to both for HMAC/signature callers who need exact wire bytes. Reopens the design gate: the approval marker is removed and no source is changed until re-approved. Awaiting human approval of the revised design and the raw_headers / rawHeaders API shapes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
What we're solving: response header values were decoded as ISO-8859-1, garbling UTF-8 values (#479). Rather than impose one behavior everywhere, each binding now follows the reference client it emulates, and both gain a way to read the exact wire bytes for signature/HMAC use.
How: Python decodes header values httpx-style (UTF-8 first, ISO-8859-1 fallback) via a shared core helper; JS keeps strict ISO-8859-1 to match the Fetch API, so its string values stay byte-recoverable. Both bindings expose the untouched header bytes - Python `Response.raw_headers` as (bytes, bytes) pairs (httpx Headers.raw parity), JS `response.rawHeaders` as [name, Uint8Array] pairs.
Alternatives considered: symmetric UTF-8-first in both (deviates from Fetch on JS and breaks the Buffer.from(v,'latin1') recovery idiom); latin-1 everywhere (leaves Python worse than httpx); skipping the raw accessor (HMAC callers have no correct alternative once decoding is lossy).
Note: the Rust binding glue could not be compiled in the authoring environment; binding compilation and the JS/Python test runs are verified in CI.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
… socket Adds a wire-level integration test (raw-socket server) asserting httpx-style UTF-8 decoding, ISO-8859-1 fallback, and exact raw_headers bytes on the real fetch path, mirroring the JS mock-server test. Also records the ecosystem consistency references for the PR description. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
… raw-header docs
Addresses final-review findings on the raw-header accessor:
- clone() now carries the impit rawHeaders bytes onto the cloned response
- ImpitResponse.rawHeaders is declared in index.d.ts (public TS surface)
- docs no longer claim original wire order or header-name casing: reqwest's HeaderMap lowercases names and does not retain wire order, so raw_headers / rawHeaders is httpx-.raw-like, with exact header VALUE bytes (what matters for signature/HMAC) and duplicate values preserved
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
…behavior Removes a leftover "in wire order" phrase from the internal field comments so they match the getter docs: values are exact bytes, names are lowercased, and original wire order is not preserved. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
This unrelated archive was accidentally staged by a `git add -A`; it is not part of this change and not present on master. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
These are internal devforge run artifacts, not part of the change. Untracked and ignored so they no longer appear in the PR diff. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
mypy flagged `Response.raw_headers` as an unknown attribute because the .pyi stub did not declare the new getter. Adds it as a read-only property returning list[tuple[bytes, bytes]], matching the pyo3 getter. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Pijukatel
commented
Jul 1, 2026
/target
/artifacts
.idea
.devforge/
Contributor
Author
Just for the convenience of using the /devforge
Ignoring devforge's local run files belongs in a personal/local exclude, not the shared repo. Kept out of the working tree via .git/info/exclude instead. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Context
Response header values were decoded as ISO-8859-1 (`b as char`, introduced in #434 to stop
non-ASCII header bytes from crashing Node / emptying Python, #430). That corrupts the common case
of UTF-8 header values (e.g. `Content-Disposition: filename="naïve.pdf"`) into mojibake
(#479). The two positions genuinely conflict: the same non-ASCII bytes decode to different
strings under ISO-8859-1 and UTF-8, and the "right" answer differs by ecosystem.
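The corruption described above can be reproduced in a few lines. This is a minimal Python sketch, not impit code: the header value is the example from the commit message, and the ISO-8859-1 decode stands in for the per-byte `b as char` conversion.

```python
# Bytes a server might put on the wire for this value, encoded as UTF-8.
wire = 'attachment; filename="naïve.pdf"'.encode("utf-8")

# Byte-for-byte ISO-8859-1 decode: every byte becomes the code point of equal value.
as_latin1 = wire.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 recovers the intended text.
as_utf8 = wire.decode("utf-8")

print(as_latin1)  # attachment; filename="naÃ¯ve.pdf"
print(as_utf8)    # attachment; filename="naïve.pdf"
```

The `ï` is sent as the two bytes `0xC3 0xAF`, which ISO-8859-1 reads as the two characters `Ã` and `¯`.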
How
Rather than force one behavior on both bindings, each binding now follows the reference client
it emulates, and both gain a way to read the exact header bytes for signature/HMAC use:
- Python decodes header values UTF-8-first with an ISO-8859-1 fallback (httpx semantics), via a shared `decode_header_value` helper in the core crate. This fixes #479 for Python while keeping #434's latin-1 case (a lone `0xE4` → `ä`) and #430's never-crash/never-empty guarantee (no `U+FFFD`).
- JavaScript keeps strict ISO-8859-1 (Fetch semantics), so its string values stay byte-recoverable via `Buffer.from(v, 'latin1')`.
- Both expose the untouched header value bytes:
  - Python: `Response.raw_headers` → `list[tuple[bytes, bytes]]`
  - JS: `response.rawHeaders` → `Array<[string, Uint8Array]>` (declared in `index.d.ts`, preserved across `clone()`)
  Header values are exact; names are lowercased and original wire order is not preserved (a `reqwest::HeaderMap` limitation); duplicate values for a name are kept.
Net effect on #479
JavaScript callers needing the decoded UTF-8 value read `rawHeaders` and decode with `TextDecoder`.
Consistency with ecosystem
impit's two bindings each emulate a reference client, so header decoding is deliberately
asymmetric — and each side matches its reference exactly. Both bindings additionally expose
the raw header bytes, following the byte-access pattern each ecosystem already relies on.
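Why a raw-bytes accessor matters once decoding is UTF-8-first can be sketched in Python. Everything here is illustrative: `decode_utf8_first`, `sign`, the key, and the two byte strings are hypothetical stand-ins, and nothing calls impit itself.

```python
import hashlib
import hmac


def decode_utf8_first(raw: bytes) -> str:
    """Hypothetical stand-in for a UTF-8-first decode with ISO-8859-1 fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


KEY = b"demo-secret"  # placeholder key, for illustration only


def sign(value: bytes) -> str:
    """HMAC-SHA256 over the exact bytes that were signed."""
    return hmac.new(KEY, value, hashlib.sha256).hexdigest()


# Two different wire encodings of the same text "café".
utf8_wire = b"caf\xc3\xa9"   # é as UTF-8
latin1_wire = b"caf\xe9"     # é as ISO-8859-1 (not valid UTF-8, so the fallback applies)

# Both decode to the same string, so the string no longer identifies the wire bytes.
same_text = decode_utf8_first(utf8_wire) == decode_utf8_first(latin1_wire)

# The signatures over the wire bytes differ, so a verifier needs the bytes themselves.
distinct_signatures = sign(utf8_wire) != sign(latin1_wire)
```

A caller verifying a signature from the decoded string would have to guess which encoding was on the wire; reading the value bytes directly avoids the guess.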
Python: matches `httpx` (which impit-python implements)
impit-python advertises the httpx interface ("drop-in replacement for `httpx.AsyncClient`"), and
httpx decodes header values UTF-8-first with an ISO-8859-1 fallback, which is what this PR
does via the shared `decode_header_value` helper:
- `Headers.encoding` tries `ascii`, then `utf-8`, then falls back to `iso-8859-1` (`httpx/_models.py` @ v0.28.1, `encoding` property): "Header encoding is mandated as ascii, but we allow fallbacks to utf-8 or iso-8859-1."
- `Headers.raw: list[tuple[bytes, bytes]]` (`httpx/_models.py` @ v0.28.1, `raw` property). Our new `Response.raw_headers` returns the same `list[tuple[bytes, bytes]]` shape.
So Python callers get the same decoded strings and a raw-bytes escape hatch like httpx's.
JavaScript: matches the Fetch API / undici (which impit-node implements)
impit-node is "API-compatible with the Fetch API `Response`". In Fetch, header values are a
byte sequence exposed to JS as a `ByteString`, i.e. via isomorphic decode: each byte
`0x00`–`0xFF` maps to the code point of equal value (ISO-8859-1). This PR keeps impit-node on that
exact behavior (`b as char`):
- The `Headers` interface types names/values as `ByteString` (`ByteString get(ByteString name)`). `ByteString` is the isomorphic (byte ↔ code-point) mapping, i.e. ISO-8859-1.
- undici (Node's `fetch`) implements exactly this: nodejs/undici#1560 "ByteString checks & conversion in Headers" and #1317 confirm header values are handled as Latin-1 `ByteString`s.
- Node's `http` parser likewise decodes header values as `latin1`/`binary` (nodejs/node#17390, #58240); axios inherits this because its Node adapter reads `http.IncomingMessage` headers and its browser adapter reads `XMLHttpRequest`/Fetch headers.
Because ISO-8859-1 is isomorphic, the JS string stays byte-recoverable: the standard Fetch
workaround `Buffer.from(value, 'latin1')` (or `Uint8Array.from(value, c => c.charCodeAt(0))` in
the browser) reproduces the exact wire bytes, so a UTF-8 header can be recovered with
`Buffer.from(value, 'latin1').toString('utf8')`.
Tests
- Core: `decode_header_value` (ASCII, UTF-8, invalid-UTF-8 latin-1 fallback, no-replacement-char round-trip, empty).
- JavaScript: `rawHeaders` yields the exact UTF-8 bytes, the latin-1 string round-trips to those bytes, and `rawHeaders` survives `clone()`.
- Python: `raw_headers` shape/bytes via the constructor, plus a wire-level integration test (raw socket) covering the real fetch path: httpx-style UTF-8 decode, latin-1 fallback, and exact `raw_headers` bytes.
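The round-trip properties these tests rely on can be sketched in plain Python. This is not the PR's test code; it is a small illustration of why a strict ISO-8859-1 string stays byte-recoverable, using Python's codecs in place of `Buffer.from(value, 'latin1')`.

```python
# Every possible byte value, decoded as ISO-8859-1.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("iso-8859-1")

# Isomorphic mapping: byte N becomes code point N, and encoding reverses it exactly.
code_points = [ord(ch) for ch in as_text]
round_tripped = as_text.encode("iso-8859-1")

# A UTF-8 header value seen through a latin-1 string can still be recovered,
# mirroring Buffer.from(value, 'latin1').toString('utf8') on the JS side.
wire = "naïve.pdf".encode("utf-8")
latin1_string = wire.decode("iso-8859-1")
recovered = latin1_string.encode("iso-8859-1").decode("utf-8")
```

Because no byte is dropped or replaced on the way in, the latin-1 string carries the same information as the wire bytes, which is what makes the recovery step possible.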