fix: decode header values per-ecosystem (Python/JS) and expose raw header bytes by Pijukatel · Pull Request #492 · apify/impit

Pijukatel · 2026-07-01T09:55:21Z

Context

Response header values were decoded as ISO-8859-1 (b as char, introduced in #434 to stop
non-ASCII header bytes from crashing Node / emptying Python, #430). That corrupts the common case
of UTF-8 header values (e.g. Content-Disposition: filename="naïve.pdf") into mojibake
(#479). The two positions genuinely conflict: a byte sequence can't be decoded as both ISO-8859-1
and UTF-8, and the "right" answer differs by ecosystem.
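
The conflict can be reproduced with nothing but Python's built-in codecs. This is an illustrative sketch only, not impit code; the `latin-1` codec stands in for the `b as char` decode.

```python
# Illustrative only: models the decode conflict with Python's built-in codecs.
value = 'filename="naïve.pdf"'
wire_bytes = value.encode("utf-8")      # ï becomes the two bytes 0xC3 0xAF

# Old behavior (b as char): every byte becomes the code point of equal value.
as_latin1 = wire_bytes.decode("latin-1")
print(as_latin1)                        # filename="naÃ¯ve.pdf"

# The reverse problem: a lone 0xE4 is a valid latin-1 'ä' but not valid UTF-8.
lone = b"\xe4"
print(lone.decode("latin-1"))           # ä
try:
    lone.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False
print(strict_utf8_ok)                   # False
```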

How

Rather than force one behavior on both bindings, each binding now follows the reference client
it emulates, and both gain a way to read the exact header bytes for signature/HMAC use:

- Python decodes header values UTF-8-first with an ISO-8859-1 fallback (httpx semantics),
  via a shared decode_header_value helper in the core crate. This fixes #479 for Python while
  keeping #434's latin-1 case (a lone 0xE4 → ä) and #430's never-crash/never-empty guarantee
  (no U+FFFD).
- JavaScript keeps strict ISO-8859-1 (Fetch semantics), so its string values stay
  byte-recoverable via Buffer.from(v, 'latin1').
- Both expose the untouched header value bytes:
  - Python Response.raw_headers → list[tuple[bytes, bytes]]
  - JS response.rawHeaders → Array<[string, Uint8Array]> (declared in index.d.ts, preserved
    across clone())

Header values are exact; names are lowercased and original wire order is not preserved (a
reqwest::HeaderMap limitation); duplicate values for a name are kept.
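
The UTF-8-first rule can be modeled in a few lines of Python. The real decode_header_value helper lives in the Rust core crate; the function below only reuses its name to mirror the described semantics, and `decode_lossy` is a hypothetical stand-in for the rejected lossy approach, included for contrast.

```python
def decode_header_value(raw: bytes) -> str:
    """Model of the described semantics: UTF-8 first, ISO-8859-1 fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte 0x00-0xFF to a code point, so this cannot fail
        # and never produces U+FFFD.
        return raw.decode("latin-1")


def decode_lossy(raw: bytes) -> str:
    """Hypothetical contrast: lossy UTF-8 decode that substitutes U+FFFD."""
    return raw.decode("utf-8", errors="replace")


print(decode_header_value(b"na\xc3\xafve"))  # valid UTF-8 input
print(decode_header_value(b"\xe4"))          # invalid UTF-8, latin-1 fallback
print(decode_lossy(b"\xe4") == "\ufffd")     # lossy decode loses the byte
```

The fallback branch is what keeps the lone 0xE4 case intact: the byte survives as ä instead of being replaced.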

Net effect on #479

- Python: fully fixed; UTF-8 header values decode correctly.
- JavaScript: string values remain ISO-8859-1 by design (Fetch parity, byte-recoverable);
  callers needing the decoded UTF-8 value read rawHeaders and decode with TextDecoder.

Consistency with ecosystem

impit's two bindings each emulate a reference client, so header decoding is deliberately
asymmetric — and each side matches its reference exactly. Both bindings additionally expose
the raw header bytes, following the byte-access pattern each ecosystem already relies on.

Python — matches `httpx` (which impit-python implements)

impit-python advertises the httpx interface ("drop-in replacement for httpx.AsyncClient"), and
httpx decodes header values UTF-8-first with an ISO-8859-1 fallback — exactly what this PR
does via the shared decode_header_value helper:

- httpx Headers.encoding tries ascii, then utf-8, then falls back to iso-8859-1
  (httpx/_models.py @ v0.28.1, encoding property): "Header encoding is mandated as ascii,
  but we allow fallbacks to utf-8 or iso-8859-1."
- httpx exposes raw bytes via Headers.raw: list[tuple[bytes, bytes]]
  (httpx/_models.py @ v0.28.1, raw property). Our new Response.raw_headers returns the same
  list[tuple[bytes, bytes]] shape.

So Python callers get the same decoded strings and a raw-bytes escape hatch like httpx's.
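
A sketch of why the raw-bytes accessor matters for signature/HMAC callers. It does not import impit: `raw_headers` below is a hand-written list in the list[tuple[bytes, bytes]] shape described above, and the header name `x-signed-value`, the secret, and the helper names are hypothetical.

```python
import hashlib
import hmac

# Hand-written stand-in for Response.raw_headers: lowercased names, exact value bytes.
raw_headers = [
    (b"content-type", b"text/plain"),
    (b"x-signed-value", b"na\xc3\xafve"),  # UTF-8 bytes of "naïve"
]


def header_values(pairs, name: bytes) -> list:
    """Return every value recorded for a header name (duplicates are kept)."""
    return [value for key, value in pairs if key == name.lower()]


def sign(secret: bytes, payload: bytes) -> str:
    """HMAC-SHA256 over the exact bytes, hex-encoded."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


secret = b"example-secret"  # hypothetical shared secret
wire_value = header_values(raw_headers, b"X-Signed-Value")[0]

# Signing the exact wire bytes matches what the sender signed.
good = sign(secret, wire_value)

# Signing a latin-1-decoded string re-encoded as UTF-8 signs different bytes.
mojibake_bytes = wire_value.decode("latin-1").encode("utf-8")
bad = sign(secret, mojibake_bytes)

print(hmac.compare_digest(good, sign(secret, "naïve".encode("utf-8"))))  # True
print(hmac.compare_digest(good, bad))                                    # False
```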

JavaScript — matches the Fetch API / undici (which impit-node implements)

impit-node is "API-compatible with the Fetch API Response". In Fetch, header values are a
byte sequence exposed to JS as a ByteString, i.e. via isomorphic decode — each byte
0x00–0xFF maps to the code point of equal value (ISO-8859-1). This PR keeps impit-node on that
exact behavior (b as char):

- Fetch Standard: a header value is a byte sequence, and the Headers interface types
  names/values as ByteString (ByteString get(ByteString name)).
- WebIDL ByteString is the isomorphic (byte ↔ code-point) mapping, i.e. ISO-8859-1.
- undici (Node's fetch) implements exactly this: nodejs/undici#1560 "ByteString checks &
  conversion in Headers" and #1317 confirm header values are handled as Latin-1 ByteStrings.
- Node's core http parser likewise decodes header values as latin1/binary
  (nodejs/node#17390, #58240); axios inherits this because its Node adapter reads
  http.IncomingMessage headers and its browser adapter reads XMLHttpRequest/Fetch headers.

Because ISO-8859-1 is isomorphic, the JS string stays byte-recoverable — the standard Fetch
workaround Buffer.from(value, 'latin1') (or Uint8Array.from(value, c => c.charCodeAt(0)) in
the browser) reproduces the exact wire bytes, so a UTF-8 header can be recovered with
Buffer.from(value, 'latin1').toString('utf8').
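
The byte-recoverability argument can be checked independently of Node. The sketch below models the same round trip with Python's `latin-1` codec as a stand-in for the JS idiom; it is an analogy, not the binding's code.

```python
# Every byte value survives a latin-1 decode/encode round trip unchanged.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1").encode("latin-1")
print(round_trip == all_bytes)           # True

# A UTF-8 header value as a Fetch-style (latin-1) string, then recovered.
wire = "naïve.pdf".encode("utf-8")
fetch_style_string = wire.decode("latin-1")            # what the string API exposes
recovered_bytes = fetch_style_string.encode("latin-1")  # Buffer.from(v, 'latin1') analogue
recovered_text = recovered_bytes.decode("utf-8")        # .toString('utf8') analogue
print(recovered_bytes == wire)           # True
print(recovered_text)                    # naïve.pdf
```

A UTF-8-first decode would not offer this guarantee on the string alone, since the caller could no longer tell which of the two decodes produced it.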

Tests

- Core: unit tests for decode_header_value (ASCII, UTF-8, invalid-UTF-8 latin-1 fallback,
  no-replacement-char round-trip, empty).
- JS: existing latin-1 regression test kept; new test asserting the string stays ISO-8859-1, that
  rawHeaders yields the exact UTF-8 bytes, that the latin-1 string round-trips to those bytes,
  and that rawHeaders survives clone().
- Python: raw_headers shape/bytes via the constructor, plus a wire-level integration test (raw
  socket) covering the real fetch path: httpx-style UTF-8 decode, latin-1 fallback, and exact
  raw_headers bytes.
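
A rough sketch of the idea behind a wire-level test: hand-written response bytes go over a real socket so no HTTP library can sanitize the header values on the way. This is not the PR's test. It uses `socket.socketpair()` instead of a listening server, and `parse_raw_headers` is a hypothetical minimal parser standing in for the client under test.

```python
import socket

# Hand-written response with one UTF-8 value and one bare latin-1 byte.
WIRE = (
    b"HTTP/1.1 200 OK\r\n"
    b"X-Utf8: na\xc3\xafve\r\n"
    b"X-Latin1: \xe4\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)


def parse_raw_headers(response: bytes) -> list:
    """Hypothetical minimal parser: lowercased names, exact value bytes."""
    head, _, _ = response.partition(b"\r\n\r\n")
    pairs = []
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        pairs.append((name.strip().lower(), value.strip()))
    return pairs


# "Server" end writes the bytes and closes; "client" end reads until EOF.
server, client = socket.socketpair()
server.sendall(WIRE)
server.close()

received = b""
while chunk := client.recv(4096):
    received += chunk
client.close()

raw_headers = parse_raw_headers(received)
print(raw_headers[0])
print(raw_headers[1])
```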

What we're solving: response header values are decoded as ISO-8859-1 (b as char), which garbles UTF-8 header values into mojibake (#479). That decode was introduced intentionally by #434 to stop non-ASCII header bytes from crashing Node / emptying Python (#430), so the two positions conflict.
How: proposed fix is to decode UTF-8 first and fall back to the existing byte-preserving ISO-8859-1 decode only when the bytes are not valid UTF-8, in one shared core-crate helper used by both bindings. This fixes #479 while keeping #434's latin-1 case and #430's non-crash guarantee.
Alternatives considered: the issue's suggested from_utf8_lossy was rejected because it regresses #434 (turns the bare 0xE4 test byte into U+FFFD).
This commit contains only the devforge investigation artifacts under .devforge/; no source has been changed and the run is paused at the design gate awaiting human approval before any source edit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

What we're solving: response header values were decoded byte-for-byte as ISO-8859-1 (b as char), which garbled UTF-8 header values such as Content-Disposition: attachment; filename="naïve.pdf" into mojibake (#479). That decode was introduced deliberately in #434 to stop non-ASCII header bytes from crashing the Node bindings / emptying the Python ones (#430), so a naive switch to UTF-8 would regress those.
How: added a shared decode_header_value helper in the core crate (re-exported via impit::utils) that decodes the bytes as UTF-8 when they are valid UTF-8 and otherwise falls back to the byte-preserving ISO-8859-1 decode. Both the Node and Python bindings now call it. This fixes the common UTF-8 case, keeps #434's genuine ISO-8859-1 values intact, and never emits U+FFFD replacement characters, so #430's non-crash / non-empty guarantee holds.
Alternatives considered: the issue's suggested String::from_utf8_lossy was rejected because it turns invalid-UTF-8 latin-1 bytes (e.g. a lone 0xE4) into replacement characters, reintroducing the corruption #434 fixed. Exposing raw header bytes for signature/HMAC callers is left as a separate follow-up.
Note: final review is still in progress; the full workspace build and JS/Py suites must run in CI as the pinned github.com/apify/h2 git dependency is not reachable from this environment.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

Follow-up to the header-decode fix: the rustdoc example used an invalid byte literal (b've') that would fail cargo test --doc, and a test array literal plus the reformatted binding call sites were not run through rustfmt (a required CI job). Corrected the doctest to spell the bytes out individually and ran rustfmt across the touched files. Also strengthened the local verification to run rustfmt --check and rustdoc --test alongside the unit tests, since the doctest error was invisible to a plain rustc --test run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

Investigation/review evidence only; no source changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

Validate the header bytes against the borrow with str::from_utf8 instead of copying them into an owned Vec first. The common UTF-8 path now allocates once (the owned String) and the ISO-8859-1 fallback allocates once (the collect), with no intermediate copy. Byte semantics are unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

Review evidence only; no source changes. Awaiting human create-PR decision. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

…essor What we're solving: keep #479's UTF-8 fix but make each binding faithful to its reference client - httpx (UTF-8-first) in Python, Fetch (strict latin-1) in JS - and add a raw-header-bytes accessor to both for HMAC/signature callers who need exact wire bytes. Reopens the design gate: the approval marker is removed and no source is changed until re-approved. Awaiting human approval of the revised design and the raw_headers / rawHeaders API shapes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

What we're solving: response header values were decoded as ISO-8859-1, garbling UTF-8 values (#479). Rather than impose one behavior everywhere, each binding now follows the reference client it emulates, and both gain a way to read the exact wire bytes for signature/HMAC use.
How: Python decodes header values httpx-style (UTF-8 first, ISO-8859-1 fallback) via a shared core helper; JS keeps strict ISO-8859-1 to match the Fetch API, so its string values stay byte-recoverable. Both bindings expose the untouched header bytes - Python `Response.raw_headers` as (bytes, bytes) pairs (httpx Headers.raw parity), JS `response.rawHeaders` as [name, Uint8Array] pairs.
Alternatives considered: symmetric UTF-8-first in both (deviates from Fetch on JS and breaks the Buffer.from(v,'latin1') recovery idiom); latin-1 everywhere (leaves Python worse than httpx); skipping the raw accessor (HMAC callers have no correct alternative once decoding is lossy).
Note: the Rust binding glue could not be compiled in the authoring environment; binding compilation and the JS/Python test runs are verified in CI.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

… socket Adds a wire-level integration test (raw-socket server) asserting httpx-style UTF-8 decoding, ISO-8859-1 fallback, and exact raw_headers bytes on the real fetch path, mirroring the JS mock-server test. Also records the ecosystem consistency references for the PR description. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

… raw-header docs Addresses final-review findings on the raw-header accessor:
- clone() now carries the impit rawHeaders bytes onto the cloned response
- ImpitResponse.rawHeaders is declared in index.d.ts (public TS surface)
- docs no longer claim original wire order or header-name casing: reqwest's HeaderMap lowercases names and does not retain wire order, so raw_headers / rawHeaders is httpx-.raw-like, with exact header VALUE bytes (what matters for signature/HMAC) and duplicate values preserved
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

…behavior Removes a leftover "in wire order" phrase from the internal field comments so they match the getter docs: values are exact bytes, names are lowercased, and original wire order is not preserved. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

This unrelated archive was accidentally staged by a `git add -A`; it is not part of this change and not present on master. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

These are internal devforge run artifacts, not part of the change. Untracked and ignored so they no longer appear in the PR diff. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

mypy flagged `Response.raw_headers` as an unknown attribute because the .pyi stub did not declare the new getter. Adds it as a read-only property returning list[tuple[bytes, bytes]], matching the pyo3 getter. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

Pijukatel · 2026-07-01T11:31:18Z

 /target
 /artifacts
 .idea
+.devforge/


Just for the convenience of using the /devforge

Ignoring devforge's local run files belongs in a personal/local exclude, not the shared repo. Kept out of the working tree via .git/info/exclude instead. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

claude added 16 commits July 1, 2026 07:56

chore(devforge): record iter-2 reviewer pass and final-review phase

5233130

Investigation/review evidence only; no source changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

chore(devforge): record final-review pass (loop converged)

0d0ec07

Review evidence only; no source changes. Awaiting human create-PR decision. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

chore(devforge): enter final review (inner loop converged, rev 2)

b54d354

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

chore(devforge): record thermonuclear final review (rev 2, round 1)

750e31a

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

chore(devforge): final review converged (rev 2, both reviewers PASS)

e8289e5

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

chore(devforge): record create-PR approval

3bc991f

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

chore(devforge): record PR #492 and finish run

dd95522

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VrUiE5CzcJ9TiRTqvqb1JE

Pijukatel marked this pull request as draft July 1, 2026 09:56

claude added 3 commits July 1, 2026 09:58

Pijukatel changed the title ~~fix: decode header values per-ecosystem (httpx/Fetch) and expose raw header bytes~~ fix: decode header values per-ecosystem (Python/JS) and expose raw header bytes Jul 1, 2026

Pijukatel commented Jul 1, 2026

Pijukatel marked this pull request as ready for review July 1, 2026 11:31

barjin self-requested a review July 2, 2026 13:27

Pijukatel wants to merge 20 commits into master from claude/issue-479-fixes-r2554a

3 participants
