fix: decode header values per-ecosystem (Python/JS) and expose raw header bytes #492

Open
Pijukatel wants to merge 20 commits into master from claude/issue-479-fixes-r2554a

Conversation

@Pijukatel Pijukatel commented Jul 1, 2026

Contributor

Context

Response header values were decoded as ISO-8859-1 (b as char, introduced in #434 to stop
non-ASCII header bytes from crashing Node / emptying Python, #430). That corrupts the common case
of UTF-8 header values (e.g. Content-Disposition: filename="naïve.pdf") into mojibake
(#479). The two positions genuinely conflict: a byte sequence can't be decoded as both ISO-8859-1
and UTF-8, and the "right" answer differs by ecosystem.
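
To make the conflict concrete, here is a minimal Python sketch of the failure mode described above. It is only an illustration: `bytes.decode("latin-1")` stands in for the Rust `b as char` decode, and the header value is the example from this description.

```python
# Wire bytes of a UTF-8 encoded header value (the example used in this PR).
wire = 'filename="naïve.pdf"'.encode("utf-8")

# Byte-for-byte ISO-8859-1 decoding: each byte becomes the code point of equal
# value. This is the Python analogue of Rust's `b as char`.
as_latin1 = wire.decode("latin-1")

# Decoding the very same bytes as UTF-8 instead.
as_utf8 = wire.decode("utf-8")

print(as_latin1)  # filename="naÃ¯ve.pdf"  (mojibake: 0xC3 0xAF shown as two chars)
print(as_utf8)    # filename="naïve.pdf"
```

The two results differ for any non-ASCII UTF-8 value, which is why a single decode cannot satisfy both #434 and #479.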

How

Rather than force one behavior on both bindings, each binding now follows the reference client
it emulates, and both gain a way to read the exact header bytes for signature/HMAC use.

Net effect on #479

  • Python: fully fixed — UTF-8 header values decode correctly.
  • JavaScript: string values remain ISO-8859-1 by design (Fetch parity, byte-recoverable);
    callers needing the decoded UTF-8 value read rawHeaders and decode with TextDecoder.

Consistency with ecosystem

impit's two bindings each emulate a reference client, so header decoding is deliberately
asymmetric — and each side matches its reference exactly. Both bindings additionally expose
the raw header bytes, following the byte-access pattern each ecosystem already relies on.
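
A short Python sketch of why signature/HMAC callers need the raw bytes rather than a decoded string. The key, header name, and header value are hypothetical; only the `(bytes, bytes)` pair shape is taken from this PR.

```python
import hashlib
import hmac

# Hypothetical shared secret and raw header pairs, shaped like raw_headers.
key = b"hypothetical-shared-secret"
raw_headers = [(b"x-signed-note", b"caf\xe9")]  # "café" in ISO-8859-1, invalid as UTF-8

_, raw_value = raw_headers[0]

# Signature computed over the exact header value bytes.
expected = hmac.new(key, raw_value, hashlib.sha256).digest()

# A caller holding only a decoded string has to guess how to re-encode it.
decoded = raw_value.decode("latin-1")  # what a latin-1 decode (or fallback) yields
reencoded = decoded.encode("utf-8")    # b"caf\xc3\xa9", no longer the wire bytes
from_string = hmac.new(key, reencoded, hashlib.sha256).digest()

matches = hmac.compare_digest(expected, from_string)
print(matches)  # False
```

Reading the value bytes directly avoids having to guess which encoding the decoded string came from.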

Python — matches httpx (which impit-python implements)

impit-python advertises the httpx interface ("drop-in replacement for httpx.AsyncClient"), and
httpx decodes header values UTF-8-first with an ISO-8859-1 fallback, which is exactly what this
PR does via the shared decode_header_value helper.

So Python callers get the same decoded strings and a raw-bytes escape hatch like httpx's.
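
The decode rule can be sketched in Python as follows. This is an illustration of the described behavior, not the Rust helper itself; it reuses the name decode_header_value only for readability.

```python
def decode_header_value(raw: bytes) -> str:
    """UTF-8 first; fall back to ISO-8859-1 when the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte 0x00-0xFF to the code point of equal value,
        # so this branch cannot fail and never produces U+FFFD.
        return raw.decode("latin-1")


print(decode_header_value("naïve.pdf".encode("utf-8")))  # naïve.pdf
print(decode_header_value(b"\xe4"))                      # ä
```

The cases mirror the core unit tests listed under Tests: ASCII, UTF-8, invalid-UTF-8 latin-1 fallback, and empty input.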

JavaScript — matches the Fetch API / undici (which impit-node implements)

impit-node is "API-compatible with the Fetch API Response". In Fetch, header values are a
byte sequence exposed to JS as a ByteString, i.e. via isomorphic decode — each byte
0x00–0xFF maps to the code point of equal value (ISO-8859-1). This PR keeps impit-node on that
exact behavior (b as char):

  • Fetch Standard: a header value is a byte sequence,
    and the Headers interface types names/values as
    ByteString (ByteString get(ByteString name)).
  • WebIDL ByteString is the isomorphic
    (byte ↔ code-point) mapping — i.e. ISO-8859-1.
  • undici (Node's fetch) implements exactly this: nodejs/undici#1560 "ByteString checks &
    conversion in Headers" and #1317 confirm header values are handled as Latin-1 ByteStrings.
  • Node's core http parser likewise decodes header values as latin1/binary
    (nodejs/node#17390,
    #58240); axios inherits this because its
    Node adapter reads http.IncomingMessage headers and its browser adapter reads
    XMLHttpRequest/Fetch headers.

Because ISO-8859-1 is isomorphic, the JS string stays byte-recoverable — the standard Fetch
workaround Buffer.from(value, 'latin1') (or Uint8Array.from(value, c => c.charCodeAt(0)) in
the browser) reproduces the exact wire bytes, so a UTF-8 header can be recovered with
Buffer.from(value, 'latin1').toString('utf8').
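
The same round trip, expressed in Python purely to show the byte arithmetic (the JS idiom above is what impit-node callers would actually use):

```python
# Wire bytes of a UTF-8 header value, then the string a Fetch-style
# isomorphic decode would expose for them.
wire = 'filename="naïve.pdf"'.encode("utf-8")
value = wire.decode("latin-1")

# Encoding back as latin-1 reproduces the exact wire bytes ...
recovered = value.encode("latin-1")

# ... which can then be decoded as UTF-8 by a caller who knows the real encoding.
print(recovered.decode("utf-8"))  # filename="naïve.pdf"

# The mapping is lossless for every possible byte value, not just this example.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("latin-1").encode("latin-1")
```

Because all 256 byte values survive the round trip, no information is lost by keeping the string in ISO-8859-1.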

Tests

  • Core: unit tests for decode_header_value (ASCII, UTF-8, invalid-UTF-8 latin-1 fallback,
    no-replacement-char round-trip, empty).
  • JS: existing latin-1 regression test kept; new test asserting the string stays ISO-8859-1, that
    rawHeaders yields the exact UTF-8 bytes, that the latin-1 string round-trips to those bytes,
    and that rawHeaders survives clone().
  • Python: raw_headers shape/bytes via the constructor, plus a wire-level integration test (raw
    socket) covering the real fetch path — httpx-style UTF-8 decode, latin-1 fallback, and exact
    raw_headers bytes.

claude added 16 commits July 1, 2026 07:56
What we're solving: response header values are decoded as ISO-8859-1
(b as char), which garbles UTF-8 header values into mojibake (#479). That
decode was introduced intentionally by #434 to stop non-ASCII header bytes
from crashing Node / emptying Python (#430), so the two positions conflict.

How: proposed fix is to decode UTF-8 first and fall back to the existing
byte-preserving ISO-8859-1 decode only when the bytes are not valid UTF-8,
in one shared core-crate helper used by both bindings. This fixes #479 while
keeping #434's latin-1 case and #430's non-crash guarantee.

Alternatives considered: the issue's suggested from_utf8_lossy was rejected
because it regresses #434 (turns the bare 0xE4 test byte into U+FFFD).

This commit contains only the devforge investigation artifacts under
.devforge/; no source has been changed and the run is paused at the design
gate awaiting human approval before any source edit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
What we're solving: response header values were decoded byte-for-byte as
ISO-8859-1 (b as char), which garbled UTF-8 header values such as
Content-Disposition: attachment; filename="naïve.pdf" into mojibake (#479).
That decode was introduced deliberately in #434 to stop non-ASCII header
bytes from crashing the Node bindings / emptying the Python ones (#430), so a
naive switch to UTF-8 would regress those.

How: added a shared decode_header_value helper in the core crate
(re-exported via impit::utils) that decodes the bytes as UTF-8 when they are
valid UTF-8 and otherwise falls back to the byte-preserving ISO-8859-1
decode. Both the Node and Python bindings now call it. This fixes the common
UTF-8 case, keeps #434's genuine ISO-8859-1 values intact, and never emits
U+FFFD replacement characters, so #430's non-crash / non-empty guarantee
holds.

Alternatives considered: the issue's suggested String::from_utf8_lossy was
rejected because it turns invalid-UTF-8 latin-1 bytes (e.g. a lone 0xE4) into
replacement characters, reintroducing the corruption #434 fixed. Exposing raw
header bytes for signature/HMAC callers is left as a separate follow-up.
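
For illustration, the rejected alternative can be compared with the fallback in Python, where `errors="replace"` plays roughly the role of Rust's String::from_utf8_lossy (an analogy, not the actual core code):

```python
lone = b"\xe4"  # "ä" in ISO-8859-1; not valid UTF-8 on its own

# Lossy UTF-8 decoding substitutes U+FFFD for the invalid byte.
lossy = lone.decode("utf-8", errors="replace")

# UTF-8 first with a byte-preserving ISO-8859-1 fallback keeps the value intact.
try:
    fallback = lone.decode("utf-8")
except UnicodeDecodeError:
    fallback = lone.decode("latin-1")

print(repr(lossy))     # '\ufffd'
print(repr(fallback))  # 'ä'
```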

Note: final review is still in progress; the full workspace build and JS/Py
suites must run in CI as the pinned github.com/apify/h2 git dependency is not
reachable from this environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Follow-up to the header-decode fix: the rustdoc example used an invalid byte
literal (b've') that would fail cargo test --doc, and a test array literal plus
the reformatted binding call sites were not run through rustfmt (a required CI
job). Corrected the doctest to spell the bytes out individually and ran rustfmt
across the touched files.

Also strengthened the local verification to run rustfmt --check and
rustdoc --test alongside the unit tests, since the doctest error was invisible
to a plain rustc --test run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Investigation/review evidence only; no source changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Validate the header bytes against the borrow with str::from_utf8 instead of
copying them into an owned Vec first. The common UTF-8 path now allocates once
(the owned String) and the ISO-8859-1 fallback allocates once (the collect),
with no intermediate copy. Byte semantics are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Review evidence only; no source changes. Awaiting human create-PR decision.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
…essor

What we're solving: keep #479's UTF-8 fix but make each binding faithful to
its reference client - httpx (UTF-8-first) in Python, Fetch (strict latin-1)
in JS - and add a raw-header-bytes accessor to both for HMAC/signature
callers who need exact wire bytes.

Reopens the design gate: the approval marker is removed and no source is
changed until re-approved. Awaiting human approval of the revised design and
the raw_headers / rawHeaders API shapes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
What we're solving: response header values were decoded as ISO-8859-1,
garbling UTF-8 values (#479). Rather than impose one behavior everywhere,
each binding now follows the reference client it emulates, and both gain a
way to read the exact wire bytes for signature/HMAC use.

How: Python decodes header values httpx-style (UTF-8 first, ISO-8859-1
fallback) via a shared core helper; JS keeps strict ISO-8859-1 to match the
Fetch API, so its string values stay byte-recoverable. Both bindings expose
the untouched header bytes - Python `Response.raw_headers` as (bytes, bytes)
pairs (httpx Headers.raw parity), JS `response.rawHeaders` as
[name, Uint8Array] pairs.

Alternatives considered: symmetric UTF-8-first in both (deviates from Fetch
on JS and breaks the Buffer.from(v,'latin1') recovery idiom); latin-1
everywhere (leaves Python worse than httpx); skipping the raw accessor
(HMAC callers have no correct alternative once decoding is lossy).

Note: the Rust binding glue could not be compiled in the authoring
environment; binding compilation and the JS/Python test runs are verified in
CI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
… socket

Adds a wire-level integration test (raw-socket server) asserting httpx-style
UTF-8 decoding, ISO-8859-1 fallback, and exact raw_headers bytes on the real
fetch path, mirroring the JS mock-server test. Also records the ecosystem
consistency references for the PR description.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
… raw-header docs

Addresses final-review findings on the raw-header accessor:
- clone() now carries the impit rawHeaders bytes onto the cloned response
- ImpitResponse.rawHeaders is declared in index.d.ts (public TS surface)
- docs no longer claim original wire order or header-name casing: reqwest's
  HeaderMap lowercases names and does not retain wire order, so raw_headers /
  rawHeaders is httpx-.raw-like, with exact header VALUE bytes (what matters
  for signature/HMAC) and duplicate values preserved
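
A sketch of how a caller might look up values under these guarantees. The helper name raw_values and the sample pairs are hypothetical; only the lowercased names, the (bytes, bytes) shape, and the preserved duplicates come from the notes above.

```python
def raw_values(raw_headers: list[tuple[bytes, bytes]], name: str) -> list[bytes]:
    """Return every raw value stored under `name`, matching case-insensitively."""
    # Names arrive lowercased, so normalise the requested name the same way.
    wanted = name.lower().encode("ascii")
    return [value for key, value in raw_headers if key == wanted]


# Hypothetical pairs in the shape exposed by raw_headers.
raw_headers = [
    (b"set-cookie", b"a=1"),
    (b"content-type", b"text/plain"),
    (b"set-cookie", b"b=2"),
]

cookies = raw_values(raw_headers, "Set-Cookie")
print(cookies)  # [b'a=1', b'b=2']
```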

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
…behavior

Removes a leftover "in wire order" phrase from the internal field comments so
they match the getter docs: values are exact bytes, names are lowercased, and
original wire order is not preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
@Pijukatel Pijukatel marked this pull request as draft July 1, 2026 09:56
claude added 3 commits July 1, 2026 09:58
This unrelated archive was accidentally staged by a `git add -A`; it is not
part of this change and not present on master.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
These are internal devforge run artifacts, not part of the change. Untracked
and ignored so they no longer appear in the PR diff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
mypy flagged `Response.raw_headers` as an unknown attribute because the .pyi
stub did not declare the new getter. Adds it as a read-only property returning
list[tuple[bytes, bytes]], matching the pyo3 getter.
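
Roughly, the stub entry has this shape (a sketch based on the description above; the real .pyi declares many more members on Response):

```python
class Response:
    @property
    def raw_headers(self) -> list[tuple[bytes, bytes]]:
        """Header (name, value) pairs as raw bytes; read-only, no setter defined."""
        ...
```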

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
@Pijukatel Pijukatel changed the title from "fix: decode header values per-ecosystem (httpx/Fetch) and expose raw header bytes" to "fix: decode header values per-ecosystem (Python/JS) and expose raw header bytes" Jul 1, 2026
Comment thread .gitignore Outdated
/target
/artifacts
.idea
.devforge/

Contributor Author

Added just for the convenience of using /devforge.

@Pijukatel Pijukatel marked this pull request as ready for review July 1, 2026 11:31
Ignoring devforge's local run files belongs in a personal/local exclude, not
the shared repo. Kept out of the working tree via .git/info/exclude instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE
@barjin barjin self-requested a review July 2, 2026 13:27
Development

Successfully merging this pull request may close these issues.

Header values decoded with 'b as char' (Latin-1) corrupt UTF-8 header values into mojibake

3 participants