fix(web_search): decode HTML entities in Bing result URLs (default backend returns 0 results)#2245
h3c-hexin wants to merge 1 commit into
Bing wraps every SERP result URL in a `/ck/a?...&u=<base64>` click-tracking redirect, and in the raw HTML the separators are encoded as `&amp;` entities. `normalize_bing_url` parsed the href without decoding entities first, so `extract_query_param` looked for `u` while the actual key was `amp;u`. The base64 redirect target was never recovered: every result collapsed to the `bing.com` root domain, `is_likely_spam_results` rejected the whole batch, and Bing, the default backend, returned zero results. This PR decodes HTML entities before extracting the redirect target and adds a regression test.
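The key mismatch described above can be reproduced in isolation. The sketch below is an illustration only: `extract_query_param` and `decode_amp` are hypothetical stand-ins written for this example (the real helpers in the module may behave differently), and the href payload is made up.

```rust
/// Hypothetical stand-in for the module's query-param helper: returns the
/// raw value of the first `key=value` pair whose key matches exactly.
fn extract_query_param<'a>(url: &'a str, key: &str) -> Option<&'a str> {
    let (_, query) = url.split_once('?')?;
    query.split('&').find_map(|pair| {
        let (k, v) = pair.split_once('=')?;
        (k == key).then_some(v)
    })
}

/// Minimal entity decoder for this demo; it only handles `&amp;`.
fn decode_amp(s: &str) -> String {
    s.replace("&amp;", "&")
}
```

On an href shaped like `https://www.bing.com/ck/a?p=abc&amp;u=a1...`, splitting on `&` yields the pairs `p=abc` and `amp;u=a1...`, so a lookup for `u` finds nothing until `decode_amp` has run.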
Code Review
This pull request fixes an issue with the Bing search backend where click-tracking redirect URLs containing HTML entities (`&amp;`) failed to parse correctly, causing results to collapse to `bing.com` and get flagged as spam. The fix decodes HTML entities before extracting the target URL, and a regression test has been added. The reviewer suggested improving code readability in `normalize_bing_url` by avoiding double shadowing of the `href` variable.
let href = decode_html_entities(href);
let href = href.as_str();
Shadowing the href parameter first with an owned String and then immediately shadowing it again with a &str slice of itself can be confusing to read and maintain. It is cleaner and more idiomatic to use a distinct name for the intermediate owned String (e.g., decoded_href) and then bind href to its slice.
let decoded_href = decode_html_entities(href);
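let href = decoded_href.as_str();

Bing SERP hrefs are `&amp;`-entity-encoded; without decoding them first, `extract_query_param` cannot find `u=`, so the whole batch is misjudged as spam. Submitted upstream as PR Hmbown#2245.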
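For comparison, here are the two binding styles discussed in the review side by side. This is a sketch, not the module's code: `decode_html_entities` is a hypothetical stand-in that handles only `&amp;`, and the two wrapper functions exist purely for illustration.

```rust
/// Hypothetical stand-in for the module's entity decoder; handles `&amp;` only.
fn decode_html_entities(s: &str) -> String {
    s.replace("&amp;", "&")
}

/// Variant from the diff: `href` is shadowed twice.
fn with_double_shadow(href: &str) -> usize {
    let href = decode_html_entities(href); // now an owned String
    let href = href.as_str(); // now a &str borrowing that String
    href.matches('&').count()
}

/// Variant from the review suggestion: the owned String gets its own name.
fn with_named_owner(href: &str) -> usize {
    let decoded_href = decode_html_entities(href);
    let href = decoded_href.as_str();
    href.matches('&').count()
}
```

Both variants compile and behave identically; the difference is readability only. Either way the owned `String` needs a binding of its own, because collapsing the two lines into `decode_html_entities(href).as_str()` would borrow from a temporary that is dropped at the end of the statement.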
Hmbown left a comment
Clean two-line fix. The root cause is that Bing's `/ck/a` redirect URLs encode `&` as `&amp;`, making `extract_query_param` miss the base64 payload and collapse all results to `bing.com`. The fix calls the existing `decode_html_entities` at the top of `normalize_bing_url`, and the targeted regression test validates the full decode-extract-decode pipeline.
APPROVE — ready to merge.
Hmbown left a comment
Clean two-line fix. Bing's redirect URLs encode & as HTML entities, making query-param extraction miss the payload. The fix calls the existing decode function at the top of normalize_bing_url, and the targeted regression test validates the full pipeline.
APPROVE — ready to merge.
Hmbown left a comment
Reviewed the diff and green CI. Decoding HTML entities before extracting Bing redirect targets fixes the default backend returning no useful results, with a focused regression test.
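To illustrate why undecoded redirect URLs lead to no useful results, the sketch below shows one way a same-host spam heuristic could reject such a batch. It is an assumption made for illustration: the real `is_likely_spam_results` is not shown in this PR and may use different inputs and rules, and `host_of` is a hypothetical helper.

```rust
use std::collections::HashSet;

/// Extracts the host part of an absolute URL; a simplified helper for this sketch.
fn host_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    let host = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    (!host.is_empty()).then_some(host)
}

/// Hypothetical heuristic: a multi-result batch whose URLs all share one
/// host is treated as spam. The real `is_likely_spam_results` may differ.
fn is_likely_spam_results(urls: &[&str]) -> bool {
    if urls.len() < 2 {
        return false;
    }
    let hosts: HashSet<&str> = urls.iter().copied().filter_map(host_of).collect();
    hosts.len() <= 1
}
```

Under this assumed rule, a batch in which every URL is still a `www.bing.com/ck/a?...` redirect is rejected as a whole, while a batch of distinct target hosts passes.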
Hmbown left a comment
Focused fix for the default Bing backend: decoding `&amp;` before extracting the `u` redirect parameter matches the documented failure mode, and the regression test covers the zero-results path. CI is green across Ubuntu/macOS/Windows plus GitGuardian/Greptile. Good v0.8.48 bugfix candidate.
Verified locally on top of
The inline comment in |
Problem
With the default Bing backend, `web_search` returns 0 results for normal queries.

Bing's SERP wraps each result URL in a `/ck/a?...&u=<base64>` click-tracking redirect, and the raw HTML encodes the separators as `&amp;` entities. `normalize_bing_url` extracts the `u` query param without decoding HTML entities first, so it looks for key `u` while the actual key in the string is `amp;u`. The base64 redirect target is never decoded, so every result's URL collapses to the `bing.com` root domain. `is_likely_spam_results` then sees an all-`bing.com` batch and rejects it, yielding 0 results.

Fix
Decode HTML entities (`decode_html_entities`, already used elsewhere in this module) before parsing the redirect, restoring the real target URLs.

Test
Adds `bing_ckurl_with_html_entities_decodes_real_url`, covering a realistic `&amp;`-encoded `/ck/a` href.

Notes
The DuckDuckGo path (`normalize_url`, `uddg` param) may have a similar entity issue; left out of this PR to keep it focused.

Greptile Summary
This PR fixes a zero-results bug in the default Bing backend caused by HTML entity encoding (`&amp;`) in click-tracking redirect URLs not being decoded before query-parameter parsing. The two-line fix adds a `decode_html_entities` call at the top of `normalize_bing_url`, exactly where the raw href is first processed, and a targeted regression test validates the full decode → base64-extract flow.

- `normalize_bing_url` now calls `decode_html_entities` before `extract_query_param`, so `amp;u` is no longer mistaken for the parameter key and the real URL is correctly extracted from the base64 payload.
- `bing_ckurl_with_html_entities_decodes_real_url` exercises a realistic `&amp;`-encoded Bing `/ck/a` href end-to-end, confirming the decoded URL matches the expected target.
- `normalize_url` (the DuckDuckGo path) has the same entity-decoding omission for its `uddg` parameter, acknowledged in the PR description and deferred to a follow-up.

Confidence Score: 5/5
Safe to merge — the change is a two-line, single-function addition that restores correct URL extraction for the default Bing backend.
The fix is minimal and correctly placed: `decode_html_entities` is already used elsewhere in the same module for the same class of problem, the new code follows the established variable-shadowing pattern, and the regression test exercises the full decode path end-to-end. No existing behavior is removed or altered for non-Bing URL shapes.

`normalize_url` (the DuckDuckGo path) has an analogous missing entity-decode step for its `uddg` parameter, acknowledged in the PR description and deferred. No other files need attention.

Important Files Changed
Adds `decode_html_entities` before `extract_query_param` in `normalize_bing_url` (two lines), plus a new unit test; the fix is minimal, correct, and consistent with how `normalize_text` already uses the same helper.

Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Raw Bing SERP HTML href"] --> B["decode_html_entities(href) 🆕"]
    B --> C["Decoded href with real & separators"]
    C --> D["extract_query_param(href, 'u')"]
    D --> E["percent_decode(encoded)"]
    E --> F["strip_prefix('a1')"]
    F --> G["base64 pad & decode"]
    G --> H{Valid http/https URL?}
    H -- Yes --> I["✅ Return real target URL"]
    H -- No --> J["Fallback: return href as-is"]

Reviews (1): Last reviewed commit: "fix(web_search): decode HTML entities in..."
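The flow in the chart can be sketched end to end using only the standard library. This is an approximation under stated assumptions, not the module's implementation: all helpers below are hypothetical re-implementations, the entity decoder handles only `&amp;`, the percent-decoding step is omitted, and the hand-written base64 decoder accepts unpadded input instead of re-padding it.

```rust
/// Hypothetical entity decoder for this sketch; handles `&amp;` only.
fn decode_html_entities(s: &str) -> String {
    s.replace("&amp;", "&")
}

/// Hypothetical query-param lookup: exact key match on `&`-separated pairs.
fn extract_query_param<'a>(url: &'a str, key: &str) -> Option<&'a str> {
    let (_, query) = url.split_once('?')?;
    query.split('&').find_map(|pair| {
        let (k, v) = pair.split_once('=')?;
        (k == key).then_some(v)
    })
}

/// Std-only base64 decoder accepting both the standard and URL-safe
/// alphabets; `=` padding is optional, so no explicit re-padding is needed.
fn base64_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let (mut acc, mut bits) = (0u32, 0u32);
    for c in input.bytes().filter(|&c| c != b'=') {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' | b'-' => 62,
            b'/' | b'_' => 63,
            _ => return None,
        };
        acc = (acc << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the leftover low bits
        }
    }
    Some(out)
}

/// Sketch of the charted flow: decode entities, pull out `u`, strip the
/// `a1` prefix, base64-decode, and fall back to the decoded href otherwise.
fn normalize_bing_url(href: &str) -> String {
    let decoded_href = decode_html_entities(href);
    let href = decoded_href.as_str();
    let target = extract_query_param(href, "u")
        .and_then(|encoded| encoded.strip_prefix("a1"))
        .and_then(base64_decode)
        .and_then(|bytes| String::from_utf8(bytes).ok())
        .filter(|url| url.starts_with("http://") || url.starts_with("https://"));
    target.unwrap_or_else(|| href.to_string())
}
```

With a made-up href whose `u` value is `a1` followed by the base64 of `https://example.com/`, the sketch returns that target URL; an href without a usable `u` parameter comes back entity-decoded but otherwise unchanged.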