fix(web_search): decode HTML entities in Bing result URLs (default backend returns 0 results) #2245

Open
h3c-hexin wants to merge 1 commit into Hmbown:main from h3c-hexin:fix/bing-ckurl-html-entity-decode

@h3c-hexin h3c-hexin commented May 27, 2026

Problem

With the default Bing backend, web_search returns 0 results for normal queries.

Bing's SERP wraps each result URL in a /ck/a?...&u=<base64> click-tracking redirect, and the raw HTML encodes the separators as &amp; entities. normalize_bing_url extracts the u query param without decoding HTML entities first, so it looks for key u while the actual key in the string is amp;u. The base64 redirect target is never decoded, so every result's URL collapses to the bing.com root domain. is_likely_spam_results then sees an all-bing.com batch and rejects it → 0 results.
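
The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the module's code: `extract_query_param` here is a hypothetical stand-in that splits the query string on `&` and `=`, and `RAW_HREF` is an invented example whose payload is the base64 of `https://example.com`.

```rust
/// Invented example of a Bing click-tracking href as it appears in raw HTML,
/// with `&amp;` entities as separators.
const RAW_HREF: &str =
    "https://www.bing.com/ck/a?p=abc&amp;u=a1aHR0cHM6Ly9leGFtcGxlLmNvbQ&amp;ntb=1";

/// Hypothetical stand-in for the module's `extract_query_param`:
/// returns the raw value of `key` from the query string of `url`.
fn extract_query_param<'a>(url: &'a str, key: &str) -> Option<&'a str> {
    let (_, query) = url.split_once('?')?;
    query.split('&').find_map(|pair| {
        let (k, v) = pair.split_once('=')?;
        // Keys are compared verbatim, so `amp;u` does not match `u`.
        (k == key).then_some(v)
    })
}
```

With this stand-in, looking up `u` in `RAW_HREF` finds nothing, while looking up `amp;u` returns the payload. That is the mismatch the PR describes: splitting on `&` leaves the `amp;` remainder glued to the key.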

Fix

Decode HTML entities (decode_html_entities, already used elsewhere in this module) before parsing the redirect, restoring the real target URLs.
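
The ordering of the fix can be sketched as follows. Both helpers are hypothetical stand-ins written for this example: the real `decode_html_entities` presumably handles more entities than the handful shown, and `redirect_payload` is an invented name, not a function from the module.

```rust
/// Hypothetical, deliberately minimal stand-in for `decode_html_entities`;
/// it handles only a handful of entities.
fn decode_html_entities(input: &str) -> String {
    input
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        // `&amp;` goes last so that `&amp;lt;` becomes `&lt;`, not `<`.
        .replace("&amp;", "&")
}

/// Hypothetical stand-in for `extract_query_param`.
fn extract_query_param(url: &str, key: &str) -> Option<String> {
    let (_, query) = url.split_once('?')?;
    query.split('&').find_map(|pair| {
        let (k, v) = pair.split_once('=')?;
        (k == key).then(|| v.to_string())
    })
}

/// Sketch of the fixed ordering: decode entities first, then parse the query.
fn redirect_payload(raw_href: &str) -> Option<String> {
    let decoded_href = decode_html_entities(raw_href);
    extract_query_param(&decoded_href, "u")
}
```

Once the entities are decoded, the separators are plain `&` characters again and the `u` key is found under its real name.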

Test

Adds bing_ckurl_with_html_entities_decodes_real_url, covering a realistic &amp;-encoded /ck/a href.

Notes

  • Affects the default search backend, so it hits any user who hasn't switched providers.
  • The DuckDuckGo path (normalize_url, uddg param) may have a similar entity issue; left out of this PR to keep it focused.

Greptile Summary

This PR fixes a zero-results bug in the default Bing backend caused by HTML entity encoding (&amp;) in click-tracking redirect URLs not being decoded before query-parameter parsing. The two-line fix adds a decode_html_entities call at the top of normalize_bing_url, exactly where the raw href is first processed, and a targeted regression test validates the full decode → base64-extract flow.

  • Root-cause fix: normalize_bing_url now calls decode_html_entities before extract_query_param, so amp;u is no longer mistaken for the parameter key and the real URL is correctly extracted from the base64 payload.
  • Test coverage: bing_ckurl_with_html_entities_decodes_real_url exercises a realistic &amp;-encoded Bing /ck/a href end-to-end, confirming the decoded URL matches the expected target.
  • Known gap (out of scope): normalize_url (the DuckDuckGo path) has the same entity-decoding omission for its uddg parameter, acknowledged in the PR description and deferred to a follow-up.

Confidence Score: 5/5

Safe to merge — the change is a two-line, single-function addition that restores correct URL extraction for the default Bing backend.

The fix is minimal and correctly placed: decode_html_entities is already used elsewhere in the same module for the same class of problem, the new code follows the established variable-shadowing pattern, and the regression test exercises the full decode path end-to-end. No existing behavior is removed or altered for non-Bing URL shapes.

normalize_url (the DuckDuckGo path) has an analogous missing entity-decode step for its uddg parameter — acknowledged in the PR description and deferred. No other files need attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/tui/src/tools/web_search.rs | Adds decode_html_entities before extract_query_param in normalize_bing_url (two lines), plus a new unit test; fix is minimal, correct, and consistent with how normalize_text already uses the same helper. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Raw Bing SERP HTML href"] --> B["decode_html_entities(href) 🆕"]
    B --> C["Decoded href with real & separators"]
    C --> D["extract_query_param(href, 'u')"]
    D --> E["percent_decode(encoded)"]
    E --> F["strip_prefix('a1')"]
    F --> G["base64 pad & decode"]
    G --> H{Valid http/https URL?}
    H -- Yes --> I["✅ Return real target URL"]
    H -- No --> J["Fallback: return href as-is"]
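
The payload steps in the flowchart (percent-decode, strip the `a1` prefix, base64-decode, validate the scheme) can be sketched with the standard library alone. Everything here is an assumption made for illustration: the function names are hypothetical, the base64 decoder accepts both the standard and URL-safe alphabets, and it tolerates missing padding instead of re-padding as the flowchart describes. A `None` result corresponds to the flowchart's fallback branch, where the caller would return the href as-is.

```rust
/// Percent-decode `%XX` escapes; malformed escapes are kept verbatim.
fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            let hex = input
                .get(i + 1..i + 3)
                .filter(|h| h.bytes().all(|b| b.is_ascii_hexdigit()));
            if let Some(byte) = hex.and_then(|h| u8::from_str_radix(h, 16).ok()) {
                out.push(byte);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}

/// Minimal base64 decoder: standard or URL-safe alphabet, padding optional.
fn base64_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf: u32 = 0;
    let mut bits: u32 = 0;
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' | b'-' => 62,
            b'/' | b'_' => 63,
            b'=' => break, // padding marks the end of the data
            _ => return None,
        };
        buf = (buf << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1u32 << bits) - 1; // keep only the leftover bits
        }
    }
    Some(out)
}

/// Recover the target URL from the value of the `u` parameter.
fn decode_bing_target(encoded: &str) -> Option<String> {
    let decoded = percent_decode(encoded);
    let payload = decoded.strip_prefix("a1")?;
    let url = String::from_utf8(base64_decode(payload)?).ok()?;
    (url.starts_with("http://") || url.starts_with("https://")).then_some(url)
}
```

Returning `Option` keeps the fallback decision with the caller, which mirrors the two exits of the flowchart without committing to how the real function handles them.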

Bing wraps every SERP result URL in a `/ck/a?...&u=<base64>` click-tracking
redirect, and in the raw HTML the separators are `&amp;` entities.
normalize_bing_url parsed the href without decoding entities first, so
extract_query_param looked for `u` while the actual key was `amp;u`. The
base64 redirect target was never recovered: every result collapsed to a
`bing.com` root domain, is_likely_spam_results rejected the whole batch,
and Bing — the default backend — returned zero results.

Decode HTML entities before extracting the redirect target. Adds a
regression test.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request fixes an issue with the Bing search backend where click-tracking redirect URLs containing HTML entities (&amp;) failed to parse correctly, causing them to collapse to bing.com and get flagged as spam. The fix decodes HTML entities before extracting the target URL, and a regression test has been added. The reviewer suggested improving code readability in normalize_bing_url by avoiding double shadowing of the href variable.

Comment on lines +910 to +911
let href = decode_html_entities(href);
let href = href.as_str();

medium

Shadowing the href parameter first with an owned String and then immediately shadowing it again with a &str slice of itself can be confusing to read and maintain. It is cleaner and more idiomatic to use a distinct name for the intermediate owned String (e.g., decoded_href) and then bind href to its slice.

    let decoded_href = decode_html_entities(href);
    let href = decoded_href.as_str();

h3c-hexin pushed a commit to h3c-hexin/DeepSeek-TUI that referenced this pull request May 27, 2026
Bing SERP hrefs use &amp; entity encoding; without decoding them first, extract_query_param cannot find u=, so the whole batch is misclassified as spam. Submitted upstream as PR Hmbown#2245

@Hmbown Hmbown left a comment

Clean two-line fix. Root cause is that Bing's /ck/a redirect URLs encode & as &amp;, making extract_query_param miss the base64 payload and collapse all results to bing.com. The fix calls the existing decode_html_entities at the top of normalize_bing_url, and the targeted regression test validates the full decode-extract-decode pipeline.

APPROVE — ready to merge.

@Hmbown Hmbown left a comment

Clean two-line fix. Bing's redirect URLs encode & as HTML entities, making query-param extraction miss the payload. The fix calls the existing decode function at the top of normalize_bing_url, and the targeted regression test validates the full pipeline.

APPROVE — ready to merge.

@Hmbown Hmbown left a comment

Reviewed the diff and green CI. Decoding HTML entities before extracting Bing redirect targets fixes the default backend returning no useful results, with a focused regression test.

@Hmbown Hmbown left a comment

Focused fix for the default Bing backend: decoding &amp; before extracting the u redirect parameter matches the documented failure mode and the regression test covers the zero-results path. CI is green across Ubuntu/macOS/Windows plus GitGuardian/Greptile. Good v0.8.48 bugfix candidate.

Hmbown commented May 27, 2026

Verified locally on top of origin/main (54151a4):

The inline comment in normalize_bing_url precisely explains the failure mode (the &amp; entity prevented extract_query_param from finding u), and the new test pins both the input shape and the recovered URL.
