Conversation
Richmond upon Thames doesn't work anymore. Bin collections can now be queried with the Unique Property Reference Number (UPRN) in the query string. Use "https://www.richmond.gov.uk/my_richmond" as the URL; the scraper will request https://www.richmond.gov.uk/my_richmond/?pid=<YourUKPropertyID> and parse the response. "https://www.findmyaddress.co.uk/search" can be used to find the UPRN.
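A minimal sketch of the proposed flow, assuming the UPRN can be passed directly as the `pid` query-string value (URL and parameter name are taken from the report above; the helper name is illustrative, not part of the codebase):

```python
import requests

BASE_URL = "https://www.richmond.gov.uk/my_richmond"

def fetch_property_page(uprn: str) -> str:
    """Fetch the 'My Richmond' page for a property, identified by UPRN/pid."""
    response = requests.get(BASE_URL, params={"pid": uprn}, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text  # raw HTML, to be parsed for the my_waste section
```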
📝 Walkthrough

Replaced Selenium-based HTML parsing with HTTP requests and regex patterns for the London Borough of Richmond Upon Thames council scraper. Added URL construction logic supporting multiple PID sourcing methods, HTTP fetching, defensive HTML parsing, and robust date extraction with fallback message handling.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pull request overview
Updates the Richmond upon Thames council scraper to work with the new Richmond “My Richmond” property page flow by switching from Selenium-driven navigation to a direct HTTP fetch and HTML parsing approach.
Changes:
- Replaces Selenium interaction with a `requests.get` call to fetch the property page HTML.
- Adds regex-based parsing to extract bin types and next collection dates from the "my_waste" section.
- Introduces PID/identifier handling to build the final `?pid=...` URL.
```diff
+        base_url = kwargs.get("url") or page
+        pid_arg = kwargs.get("pid")
+        paon = kwargs.get("paon")

-    def get_collection_days(self, bin_types, collection_days):
-        bindata = {"bins": []}
-        WEEKLY_COLLECTION = 0
-        GARDEN_COLLECTION = 1
+        # work out final URL, but DO NOT add #my_waste
+        pid_from_url = self._pid_from_url(base_url)
+        pid_from_paon = self._pid_from_paon(paon)

-        for index, bin_type in enumerate(bin_types):
-            # currently only handled weekly and garden collection, special collections like Christmas Day need to be added
-            if index == WEEKLY_COLLECTION:
-                next_collection_date = get_next_day_of_week(
-                    collection_days[index].text.strip(), date_format
-                )
-            elif index == GARDEN_COLLECTION:
-                split_date_part = collection_days[index].text.split("More dates")[0]
-                next_collection_date = datetime.strptime(
-                    split_date_part.strip(), "%d %B %Y"
-                ).strftime(date_format)
-            else:
-                next_collection_date = datetime.strptime(
-                    collection_days[index].text.strip(), "%d %B %Y"
-                ).strftime(date_format)

-            dict_data = {
-                "type": bin_type.text.strip(),
-                "collectionDate": next_collection_date,
-            }
-            bindata["bins"].append(dict_data)
+        if "pid=" in (base_url or ""):
+            target_url = base_url
+        elif pid_arg or pid_from_paon:
+            pid = pid_arg or pid_from_paon
+            sep = "&" if "?" in (base_url or "") else "?"
+            target_url = f"{base_url}{sep}pid={pid}"
+        else:
+            raise ValueError(
+                "Richmond: supply a URL that already has ?pid=... OR put PID in the House Number field."
+            )
```
`parse_data` looks for `kwargs["pid"]`, but the CLI/test harness only passes `uprn` (-u) and `paon` (-n) (see uk_bin_collection/uk_bin_collection/collect_data.py:104-107 and uk_bin_collection/tests/step_defs/test_validate_council.py:75-98). As a result, this council can't be configured with UPRN via the existing `uprn` field, and the current integration-test metadata for Richmond (house_number: "March Road") will now error out. Consider reading the identifier from `kwargs.get("uprn")` (and/or continuing to accept the old `paon` behavior), and update the error message to reference the actual supported inputs.
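A hedged sketch of the identifier fallback this comment suggests; `resolve_pid` is a hypothetical helper, not code from the PR. It simply prefers the fields the harness actually passes:

```python
def resolve_pid(**kwargs) -> str | None:
    """Prefer an explicit pid, then the standard uprn field (-u),
    then a digit-only paon (-n) as a legacy fallback."""
    for key in ("pid", "uprn"):
        value = kwargs.get(key)
        if value:
            return str(value).strip()
    paon = kwargs.get("paon")
    if paon and str(paon).strip().isdigit():
        return str(paon).strip()
    return None
```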
```diff
+        # work out final URL, but DO NOT add #my_waste
+        pid_from_url = self._pid_from_url(base_url)
+        pid_from_paon = self._pid_from_paon(paon)

-        for index, bin_type in enumerate(bin_types):
-            # currently only handled weekly and garden collection, special collections like Christmas Day need to be added
-            if index == WEEKLY_COLLECTION:
-                next_collection_date = get_next_day_of_week(
-                    collection_days[index].text.strip(), date_format
-                )
-            elif index == GARDEN_COLLECTION:
-                split_date_part = collection_days[index].text.split("More dates")[0]
-                next_collection_date = datetime.strptime(
-                    split_date_part.strip(), "%d %B %Y"
-                ).strftime(date_format)
-            else:
-                next_collection_date = datetime.strptime(
-                    collection_days[index].text.strip(), "%d %B %Y"
-                ).strftime(date_format)

-            dict_data = {
-                "type": bin_type.text.strip(),
-                "collectionDate": next_collection_date,
-            }
-            bindata["bins"].append(dict_data)
+        if "pid=" in (base_url or ""):
+            target_url = base_url
+        elif pid_arg or pid_from_paon:
+            pid = pid_arg or pid_from_paon
+            sep = "&" if "?" in (base_url or "") else "?"
+            target_url = f"{base_url}{sep}pid={pid}"
```
`pid_from_url` is computed but never used, and the presence check uses a raw substring test (`"pid=" in base_url`). This can lead to incorrect behavior (e.g. different casing, a URL-encoded query, or `pid` present without `pid=`) and is also dead code. Use `_pid_from_url(base_url)` as the source of truth (and remove the unused variable), e.g. treat `pid_from_url` as satisfying the "already has pid" case.
```python
        if paon and str(paon).isdigit() and 10 <= len(str(paon)) <= 14:
            return str(paon)
```
`_pid_from_paon` only accepts all-digit values with length 10-14. UPRNs are not guaranteed to be at least 10 digits (and may include leading zeros), so this can reject valid identifiers and force users into the unsupported `pid` kwarg. Consider accepting any non-empty digit string (possibly after stripping whitespace), or preferentially using `kwargs.get("uprn")` for UPRN while keeping this heuristic only as a fallback.
```diff
-        if paon and str(paon).isdigit() and 10 <= len(str(paon)) <= 14:
-            return str(paon)
+        # accept any non-empty, digit-only string (after stripping whitespace),
+        # to support UPRNs/PIDs that may be shorter than 10 digits or include leading zeros
+        if paon is None:
+            return None
+        paon_str = str(paon).strip()
+        if paon_str and paon_str.isdigit():
+            return paon_str
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py`:

- Around lines 150-166: In `_first_date_or_message`, the regex `date_rx` can match strings that `datetime.strptime` cannot parse, so wrap the `datetime.strptime(ds, fmt)` call in a try/except ValueError block: catch ValueError, optionally log a warning that `ds` failed to parse (include `ds` and `fmt`), and then continue to the next line rather than letting the exception propagate; keep returning `dt.strftime(date_format)` when parsing succeeds.
- Around lines 20-44: `parse_data` currently computes `pid_from_url` but never uses it and always issues a second HTTP GET via `_fetch_html` instead of using the already-fetched page; update `parse_data` to use `pid_arg or pid_from_url or pid_from_paon` when building `target_url`, and avoid the double request by using the provided `page` content when it is already HTML (use `page` if present/truthy), only calling `_fetch_html(target_url)` when `page` is empty or `skip_get_url` is not set; also ensure the integration can set `skip_get_url=True` (`get_and_parse_data`/`get_data`) if you prefer to prevent the base-class initial GET.
🧹 Nitpick comments (4)
uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py (4)
1-1: Incorrect filename in comment. The comment references `richmond_gov_uk.py`, but this file is `LondonBoroughOfRichmondUponThames.py`.
48-57: Duplicates the base class `get_data` method. `AbstractGetBinDataClass.get_data` already performs a GET with a User-Agent and returns a `requests.Response`. This private method re-implements the same logic with slightly different settings (different UA, `timeout=30` vs `120`, SSL verify `True` vs `False`). If you address the double-request issue above (by using `skip_get_url`), this method is fine to keep. Otherwise, consider reusing the base class infrastructure to avoid divergence.
61-111: Parsing logic looks reasonable for the described markup. The regex-based section extraction (`<h4>` → next `<h4>`) with `<ul>`/`<li>` and `<p>` fallback is well-structured. One consideration: bins whose date lines don't match the regex and aren't "no collection" messages are silently dropped (line 103). If the council changes date formats, this could cause all bins to disappear without any warning. Consider logging a warning for unrecognized date lines, per the project learning about preferring explicit failures.
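As a sketch of that suggestion (the names `extract_or_warn` and `bin_name` are hypothetical, not from the PR), a thin wrapper could surface unmatched date lines instead of dropping the bin silently:

```python
import logging

logger = logging.getLogger(__name__)

def extract_or_warn(bin_name: str, lines: list[str], extractor) -> str | None:
    """Run the date/message extractor and warn when nothing is recognised,
    so a format change on the council site shows up in logs rather than
    as silently missing bins."""
    result = extractor(lines)
    if result is None:
        logger.warning(
            "Richmond: no date or 'no collection' message recognised for %r: %r",
            bin_name, lines,
        )
    return result
```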
126-133: Unused return value aside, consider narrowing the exception catch. Ruff flags the bare `except Exception` (BLE001). `urlparse`/`parse_qs` typically raise `ValueError` at most — narrowing to that would avoid masking unexpected errors.
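A sketch of the narrower variant this nitpick points at; the helper name matches the PR, but the body here is illustrative rather than the file's actual code:

```python
from urllib.parse import parse_qs, urlparse

def _pid_from_url(url: str | None) -> str | None:
    """Extract a pid query parameter from a URL, catching only the
    error urlparse/parse_qs are known to raise."""
    if not url:
        return None
    try:
        query = parse_qs(urlparse(url).query)
    except ValueError:  # narrower than the bare `except Exception` Ruff flags
        return None
    values = query.get("pid")
    return values[0] if values else None
```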
```diff
     def parse_data(self, page: str, **kwargs) -> dict:
-        print(f"Arguments are f{kwargs}")
-        driver = None
-        try:
-            page = kwargs["url"]
-            street_name = kwargs.get("paon")
-            web_driver = kwargs.get("web_driver")
-            headless = kwargs.get("headless")
-
-            driver = create_webdriver(web_driver, headless, None, __name__)
-            driver.get(page)
-
-            wait = WebDriverWait(driver, 60)
-
-            self.dismiss_cookie_banner(wait)
-            self.input_street_name(street_name, wait)
-            self.submit(wait)
-            bin_types, collection_days = self.get_bins(driver)
-            bindata = self.get_collection_days(bin_types, collection_days)
-
-            print(bindata)
-
-        except Exception as e:
-            # Here you can log the exception if needed
-            print(f"An error occurred: {e}")
-            # Optionally, re-raise the exception if you want it to propagate
-            raise
-        finally:
-            # This block ensures that the driver is closed regardless of an exception
-            if driver:
-                driver.quit()
-        return bindata
+        base_url = kwargs.get("url") or page
+        pid_arg = kwargs.get("pid")
+        paon = kwargs.get("paon")

-    def get_collection_days(self, bin_types, collection_days):
-        bindata = {"bins": []}
-        WEEKLY_COLLECTION = 0
-        GARDEN_COLLECTION = 1
+        # work out final URL, but DO NOT add #my_waste
+        pid_from_url = self._pid_from_url(base_url)
+        pid_from_paon = self._pid_from_paon(paon)

-        for index, bin_type in enumerate(bin_types):
-            # currently only handled weekly and garden collection, special collections like Christmas Day need to be added
-            if index == WEEKLY_COLLECTION:
-                next_collection_date = get_next_day_of_week(
-                    collection_days[index].text.strip(), date_format
-                )
-            elif index == GARDEN_COLLECTION:
-                split_date_part = collection_days[index].text.split("More dates")[0]
-                next_collection_date = datetime.strptime(
-                    split_date_part.strip(), "%d %B %Y"
-                ).strftime(date_format)
-            else:
-                next_collection_date = datetime.strptime(
-                    collection_days[index].text.strip(), "%d %B %Y"
-                ).strftime(date_format)

-            dict_data = {
-                "type": bin_type.text.strip(),
-                "collectionDate": next_collection_date,
-            }
-            bindata["bins"].append(dict_data)
+        if "pid=" in (base_url or ""):
+            target_url = base_url
+        elif pid_arg or pid_from_paon:
+            pid = pid_arg or pid_from_paon
+            sep = "&" if "?" in (base_url or "") else "?"
+            target_url = f"{base_url}{sep}pid={pid}"
+        else:
+            raise ValueError(
+                "Richmond: supply a URL that already has ?pid=... OR put PID in the House Number field."
+            )
+
+        html = self._fetch_html(target_url)
+        bindata = self._parse_html_for_waste(html)
+        if not bindata["bins"]:
+            raise RuntimeError("Richmond: no bins found in page HTML.")
+        return bindata
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's examine the complete file to understand the class structure
fd -n "LondonBoroughOfRichmondUponThames.py" -x cat -n {}
```

Repository: robbrad/UKBinCollectionData
Length of output: 301

🏁 Script executed:

```shell
# Find the base class to understand get_and_parse_data
fd -n "\.py" -path "*/uk_bin_collection/*" | grep -i "base\|council" | head -20
```

Repository: robbrad/UKBinCollectionData
Length of output: 301

🏁 Script executed:

```shell
# Search for how skip_get_url is used across the codebase
rg -n "skip_get_url" --type py -B 2 -A 2
```

Repository: robbrad/UKBinCollectionData
Length of output: 15505

🏁 Script executed:

```shell
# Let's examine the Richmond Upon Thames file directly
cat uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py
```

Repository: robbrad/UKBinCollectionData
Length of output: 6122

🏁 Script executed:

```shell
# Let's look at the base class get_and_parse_data method to understand the flow
cat uk_bin_collection/uk_bin_collection/get_bin_data.py | head -120
```

Repository: robbrad/UKBinCollectionData
Length of output: 3737

🏁 Script executed:

```shell
# Check if pid_from_url is used anywhere after line 26 in the Richmond file
rg "pid_from_url" uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py
```

Repository: robbrad/UKBinCollectionData
Length of output: 182

🏁 Script executed:

```shell
# Check if LondonBoroughOfRichmondUponThames.py exists as a separate file
fd "LondonBoroughOfRichmondUponThames" --type f
```

Repository: robbrad/UKBinCollectionData
Length of output: 152

🏁 Script executed:

```shell
# Also search for any Council files related to Richmond
fd "Richmond\|richmond" --type f uk_bin_collection/uk_bin_collection/councils/
```

Repository: robbrad/UKBinCollectionData
Length of output: 53

🏁 Script executed:

```shell
# Now let's check the actual LondonBoroughOfRichmondUponThames.py file
cat uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py
```

Repository: robbrad/UKBinCollectionData
Length of output: 6122

🏁 Script executed:

```shell
# Verify the content matches what was shown in the review comment
cat uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py | head -50
```

Repository: robbrad/UKBinCollectionData
Length of output: 1757

🏁 Script executed:

```shell
# Check what's actually in the LondonBoroughOfRichmondUponThames.py file
file uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py
```

Repository: robbrad/UKBinCollectionData
Length of output: 115

🏁 Script executed:

```shell
# Try to see if it's a symlink or something
ls -la uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py
```

Repository: robbrad/UKBinCollectionData
Length of output: 201

🏁 Script executed:

```shell
# Let's check the first few lines of both files to understand
echo "=== LondonBoroughOfRichmondUponThames.py ===" && head -5 uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py && echo "" && echo "=== richmond_gov_uk.py ===" && head -5 uk_bin_collection/uk_bin_collection/councils/richmond_gov_uk.py
```

Repository: robbrad/UKBinCollectionData
Length of output: 419

🏁 Script executed:

```shell
# Let's get the full content of the actual file
cat uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py
```

Repository: robbrad/UKBinCollectionData
Length of output: 6122
Bug: `pid_from_url` is computed but never used; double HTTP request.

Two issues in `parse_data`:

- `pid_from_url` (line 26) is assigned but never referenced — it looks like line 31 was intended to include it in the fallback chain (e.g., `pid_arg or pid_from_url or pid_from_paon`), but it was accidentally omitted. As-is, the only check for an existing PID in the URL is the fragile string test `"pid=" in base_url` at line 29. Since `_pid_from_url` properly parses the URL parameter, it should be used.
- Double HTTP request — the base class `get_and_parse_data` calls `self.get_data(address_url)` (which performs a GET), then passes the result as `page` to `parse_data`. This code ignores `page` and makes a second request via `_fetch_html(target_url)` at line 40. Either use the already-fetched page content, or set `skip_get_url` in the kwargs to prevent the base class from making the first request.
Proposed fix

```diff
 def parse_data(self, page: str, **kwargs) -> dict:
     base_url = kwargs.get("url") or page
     pid_arg = kwargs.get("pid")
     paon = kwargs.get("paon")

     # work out final URL, but DO NOT add `#my_waste`
     pid_from_url = self._pid_from_url(base_url)
     pid_from_paon = self._pid_from_paon(paon)

     if "pid=" in (base_url or ""):
         target_url = base_url
-    elif pid_arg or pid_from_paon:
-        pid = pid_arg or pid_from_paon
+    elif pid_arg or pid_from_url or pid_from_paon:
+        pid = pid_arg or pid_from_url or pid_from_paon
         sep = "&" if "?" in (base_url or "") else "?"
         target_url = f"{base_url}{sep}pid={pid}"
     else:
         raise ValueError(
             "Richmond: supply a URL that already has ?pid=... OR put PID in the House Number field."
         )
```

To avoid the double request, set `skip_get_url=True` in your integration config to prevent the base class from making the initial request.
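For context, a simplified sketch of the base-class flow being described; this is an approximation of what `get_and_parse_data` does, not the repo's verbatim code:

```python
import requests

class AbstractGetBinDataClass:  # simplified stand-in for the real base class
    def get_data(self, url: str) -> requests.Response:
        return requests.get(url, timeout=30)

    def parse_data(self, page: str, **kwargs) -> dict:
        raise NotImplementedError  # each council scraper overrides this

    def get_and_parse_data(self, address_url: str, **kwargs) -> dict:
        if not kwargs.get("skip_get_url"):
            page = self.get_data(address_url).text  # initial HTTP GET happens here
        else:
            page = address_url  # parse_data is expected to fetch for itself
        return self.parse_data(page, url=address_url, **kwargs)
```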
🧰 Tools

🪛 Ruff (0.14.14)

[error] 26-26: Local variable `pid_from_url` is assigned to but never used. Remove assignment to unused variable `pid_from_url` (F841)

[warning] 36-38: Avoid specifying long messages outside the exception class (TRY003)

[warning] 43-43: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents

In `uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py` around lines 20-44: `parse_data` currently computes `pid_from_url` but never uses it and always issues a second HTTP GET via `_fetch_html` instead of using the already-fetched page; update `parse_data` to use `pid_arg or pid_from_url or pid_from_paon` when building `target_url`, and avoid the double request by using the provided `page` content when it is already HTML (use `page` if present/truthy), only calling `_fetch_html(target_url)` when `page` is empty or `skip_get_url` is not set; also ensure the integration can set `skip_get_url=True` (`get_and_parse_data`/`get_data`) if you prefer to prevent the base-class initial GET.
```diff
+    def _first_date_or_message(self, lines) -> str | None:
+        # match "Thursday 23 October 2025" or "23 October 2025"
+        date_rx = re.compile(
+            r"(?:(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\s+)?"
+            r"(\d{1,2}\s+[A-Za-z]+\s+\d{4})"
+        )
-            cookie_banner.send_keys(Keys.ENTER)
+        for line in lines:
+            m = date_rx.search(line)
+            if m:
+                ds = m.group(0)
+                fmt = "%A %d %B %Y" if m.group(1) else "%d %B %Y"
+                dt = datetime.strptime(ds, fmt)
+                return dt.strftime(date_format)
+
+            lower = line.lower()
+            if "no collection" in lower or "no contract" in lower or "no subscription" in lower:
+                return line
```
`datetime.strptime` can raise ValueError on regex-matched but unparseable strings.

The regex `\d{1,2}\s+[A-Za-z]+\s+\d{4}` is intentionally loose — it will match text like "23 Octobar 2025" or "1 TBC 2025", which then causes `datetime.strptime` at line 161 to raise an unhandled ValueError, crashing the entire parse.

Wrap the `strptime` call in a try/except ValueError so parsing can continue to the next line or bin.
Proposed fix

```diff
         for line in lines:
             m = date_rx.search(line)
             if m:
                 ds = m.group(0)
                 fmt = "%A %d %B %Y" if m.group(1) else "%d %B %Y"
-                dt = datetime.strptime(ds, fmt)
-                return dt.strftime(date_format)
+                try:
+                    dt = datetime.strptime(ds, fmt)
+                    return dt.strftime(date_format)
+                except ValueError:
+                    continue  # regex matched but date was not valid
```

Based on learnings, functions should prefer explicit failures with clear error causes over silently swallowed errors — here a `continue` with an optional log warning strikes the right balance between not crashing and not silently hiding the issue.
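A self-contained illustration of the failure mode and the fix, using the same regex as the PR (the sample strings are made up):

```python
import re
from datetime import datetime

date_rx = re.compile(
    r"(?:(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\s+)?"
    r"(\d{1,2}\s+[A-Za-z]+\s+\d{4})"
)

for line in ["Thursday 23 October 2025", "23 Octobar 2025"]:
    m = date_rx.search(line)  # both lines match the loose pattern
    fmt = "%A %d %B %Y" if m.group(1) else "%d %B %Y"
    try:
        print(datetime.strptime(m.group(0), fmt).date())  # first line parses fine
    except ValueError:
        # "Octobar" is not a month name: continue instead of crashing the parse
        print(f"regex matched but strptime rejected: {line!r}")
```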
🤖 Prompt for AI Agents

In `uk_bin_collection/uk_bin_collection/councils/LondonBoroughOfRichmondUponThames.py` around lines 150-166: in `_first_date_or_message`, the regex `date_rx` can match strings that `datetime.strptime` cannot parse, so wrap the `datetime.strptime(ds, fmt)` call in a try/except ValueError block: catch ValueError, optionally log a warning that `ds` failed to parse (include `ds` and `fmt`), and then continue to the next line rather than letting the exception propagate; keep returning `dt.strftime(date_format)` when parsing succeeds.