fix: support encoding of special characters in path params for gen2 f…#1911

Open
IzaakGough wants to merge 2 commits into master from @invertase/fix-gen2-support-encoding-of-special-characters-in-path-params

@IzaakGough IzaakGough commented Jun 17, 2026

Contributor

Summary

Resolves #1459

Fix Gen 2 path parameter decoding when Eventarc delivers UTF-8 mojibake instead of already-correct Unicode, so wildcard params like Helvétios, 中文, and emoji are exposed correctly in event.params.

Problem/Root Cause

Firestore Gen 2 triggers could receive path wildcard values with special characters mangled into mojibake, such as HelvÃ©tios instead of Helvétios. The existing path param decoding only handled percent-encoded values via decodeURIComponent, so mojibake that was not percent-encoded passed through unchanged.
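
To make the failure mode concrete, the sketch below reproduces this kind of mojibake by encoding a string as UTF-8 and then misreading each byte as one Latin-1 character. The helper name `toLatin1Mojibake` is hypothetical, and the exact corruption path in the delivery pipeline is an assumption; the point is only that `decodeURIComponent` finds nothing to decode in such a string.

```typescript
// Hypothetical helper (not part of firebase-functions): simulates UTF-8 bytes
// being misread as Latin-1, one character per byte.
function toLatin1Mojibake(text: string): string {
  const bytes = new TextEncoder().encode(text);
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// "é" is 0xC3 0xA9 in UTF-8, which reads as "Ã©" in Latin-1.
const mangled = toLatin1Mojibake("Helv\u00E9tios");
console.log(mangled === "Helv\u00C3\u00A9tios"); // true

// No "%" escapes are present, so percent-decoding leaves the value untouched.
console.log(decodeURIComponent(mangled) === mangled); // true
```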

Solution/Changes

Update the shared path pattern decoder to:

  • keep the existing decodeURIComponent behavior for properly encoded values
  • preserve malformed percent-encoded input
  • detect byte sequences that look like UTF-8 mojibake and re-decode them as UTF-8 before returning path params
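
The three behaviours listed above could be composed roughly as follows. This is a minimal sketch rather than the code in this PR: the function names and the detection regex `MOJIBAKE_HINT` are illustrative assumptions.

```typescript
// Assumed heuristic: a UTF-8 lead byte (0xC2-0xF4) followed by a continuation
// byte (0x80-0xBF), each seen as a single Latin-1 character.
const MOJIBAKE_HINT = /[\u00C2-\u00F4][\u0080-\u00BF]/;

// Keeps decodeURIComponent behaviour, but preserves malformed input as-is.
function percentDecodeOrKeep(raw: string): string {
  try {
    return decodeURIComponent(raw);
  } catch (_e) {
    return raw;
  }
}

// Reinterprets the characters as bytes and decodes them as strict UTF-8.
function repairIfMojibake(value: string): string {
  if (!MOJIBAKE_HINT.test(value)) {
    return value;
  }
  const bytes = new Uint8Array(value.length);
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code > 0xff) {
      return value; // not a pure byte string, leave it alone
    }
    bytes[i] = code;
  }
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (_e) {
    return value; // the bytes were not valid UTF-8 after all
  }
}

function decodePathParam(raw: string): string {
  return repairIfMojibake(percentDecodeOrKeep(raw));
}

console.log(decodePathParam("caf%C3%A9") === "caf\u00E9"); // true
console.log(decodePathParam("100%") === "100%"); // true, malformed escape kept
console.log(decodePathParam("Helv\u00C3\u00A9tios") === "Helv\u00E9tios"); // true
```

The strict (`fatal: true`) decoder is what keeps already-correct Latin-1 text such as `café` from being rewritten in this sketch: a lone `é` byte is not valid UTF-8, so the original string is returned.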

Added shared regression coverage for accented Latin text, CJK, emoji, Hindi, Arabic, already-correct Unicode input, and multi-segment captures.

Testing

  • npm test
  • Manual verification in a sample Gen 2 Firestore trigger with accented Latin text, emoji, CJK, Hindi, and Arabic path params

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a mechanism to detect and repair UTF-8 mojibake in path parameters within path-pattern.ts, supported by comprehensive unit tests. The review feedback suggests optimizing the tryDecodeUtf8Mojibake function to avoid intermediate array allocations by using a single-pass loop with a pre-allocated Uint8Array.

Comment on lines +48 to +54
const bytes = Array.from(value, (char) => char.charCodeAt(0));
if (bytes.some((byte) => byte > 0xff)) {
return value;
}

try {
return new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(bytes));

Contributor

medium

The current implementation of tryDecodeUtf8Mojibake is inefficient because Array.from(value) creates an intermediate array of strings (one for each character/code point), and Uint8Array.from(bytes) creates another copy. Since this function is called for every path segment during pattern matching, we can optimize it by using a single pass with a Uint8Array and charCodeAt to avoid these allocations.

  const bytes = new Uint8Array(value.length);
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code > 0xff) {
      return value;
    }
    bytes[i] = code;
  }

  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
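
Both excerpts above are cut off before the `catch` branch. As a quick sanity check that the suggestion preserves behaviour, the sketch below completes each variant with an assumed fallback that returns the input unchanged, then compares them on a few sample strings. The function names are illustrative.

```typescript
// Variant from the diff; the catch branch is an assumed completion.
function decodeWithArrays(value: string): string {
  const bytes = Array.from(value, (char) => char.charCodeAt(0));
  if (bytes.some((byte) => byte > 0xff)) {
    return value;
  }
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(bytes));
  } catch (_e) {
    return value;
  }
}

// Suggested single-pass variant, with the same assumed completion.
function decodeSinglePass(value: string): string {
  const bytes = new Uint8Array(value.length);
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code > 0xff) {
      return value;
    }
    bytes[i] = code;
  }
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (_e) {
    return value;
  }
}

const samples = [
  "Helv\u00C3\u00A9tios",     // mojibake for "Helvétios"
  "Helv\u00E9tios",           // already-correct Latin-1 text
  "\u4E2D\u6587",             // already-correct CJK
  "plain",                    // ASCII only
  "\u00F0\u009F\u0098\u0080", // mojibake for a 4-byte emoji
];
const agree = samples.every((s) => decodeWithArrays(s) === decodeSinglePass(s));
console.log(agree); // true
```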

@IzaakGough IzaakGough marked this pull request as ready for review June 18, 2026 15:39
@cabljac cabljac requested review from CorieW and cabljac June 25, 2026 13:05
}

/** @internal */
export function tryDecodeURIComponent(uri: string): string {

Member

Would the following work?

export function tryDecodeURIComponent(uri: string): string {
  try {
    return decodeURIComponent(uri);
  } catch (_e) {
    // `decoded` is not in scope here; fall back to the raw input instead.
    return tryDecodeUtf8Mojibake(uri);
  }
}

Contributor Author

yeah that looks cleaner

IzaakGough commented Jun 25, 2026

Contributor Author

AI just made a good point:

It treats some raw strings as “obviously broken UTF-8 mojibake” when they are not obviously broken.

Current code in src/common/utilities/path-pattern.ts:

  • runs decodeURIComponent
  • then always runs tryDecodeUtf8Mojibake
  • if string matches byte-looking pattern, it reinterprets chars as UTF-8 bytes

Bug:
same observed string can mean 2 different things.

Example:

  • Ã© might be:
    • legit literal path segment Ã©
    • broken form of intended é

Code cannot know which one user meant.
Current implementation always picks “broken UTF-8” and converts:

  • Ã© -> é
  • Â£ -> £
  • Â© -> ©
  • fooÃ©bar -> fooébar

So ambiguity is:

  • fix real mojibake from issue #1459
  • but corrupt legit IDs/path segments that happen to look like mojibake bytes

Why impossible in current layer:

  • PathPattern only sees final string
  • no extra metadata says “this came in broken”
  • Firestore/RTDB/Data Connect current param path gives no trusted second source

So current implementation has no safe rule for:

  • repair all mojibake-like strings
    vs
  • preserve all raw strings

Tests prove repair cases, but miss legit ambiguous cases.
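
The ambiguous cases can be written down as a small table. The sketch below assumes a repair step modelled on the logic discussed in this thread (the name `repairAsUtf8` is illustrative) and shows that each listed input is rewritten, whether or not the document ID really contains those characters.

```typescript
// Assumed repair step: reinterpret characters as bytes, decode as strict UTF-8.
function repairAsUtf8(value: string): string {
  const bytes = new Uint8Array(value.length);
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code > 0xff) {
      return value;
    }
    bytes[i] = code;
  }
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (_e) {
    return value;
  }
}

// [input as received, output after repair]
const ambiguous: Array<[string, string]> = [
  ["\u00C3\u00A9", "\u00E9"],             // "Ã©" -> "é"
  ["\u00C2\u00A3", "\u00A3"],             // "Â£" -> "£"
  ["\u00C2\u00A9", "\u00A9"],             // "Â©" -> "©"
  ["foo\u00C3\u00A9bar", "foo\u00E9bar"], // "fooÃ©bar" -> "fooébar"
];

const allRewritten = ambiguous.every(
  ([input, repaired]) => repairAsUtf8(input) === repaired
);
console.log(allRewritten); // true
```

A regression test for the "preserve raw strings" rule would have to assert the opposite result for these same inputs, which is the conflict described above.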

Development

Successfully merging this pull request may close these issues.

Gen2: Encoding issue with special characters in path parameters
