Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #23 +/- ##
=========================================
Coverage ? 97.04%
=========================================
Files ? 8
Lines ? 1254
Branches ? 0
=========================================
Hits ? 1217
Misses ? 37
Partials ? 0 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Pull request overview
Adds a fallback path for DOIs that don’t support application/x-bibtex content negotiation (notably ChinaDOI/CNKI), and updates the Streamlit UI to surface a manual-visit link when automatic BibTeX lookup isn’t possible.
Changes:
- Switch DOI resolution to use GET + content validation for BibTeX responses, with a browser-header redirect probe for ChinaDOI/CNKI targets.
- Add Streamlit handling for “manual check required” lookup results (warn + link to resolved article page).
- Add an integration test case for a ChinaDOI-style DOI and document the fix in the changelog.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
bib_lookup/bib_lookup.py |
Implements GET-based DOI lookup and a redirect-fallback message path for ChinaDOI/CNKI. |
streamlit_app.py |
Detects redirect-flag results and shows a warning + link button instead of treating it as BibTeX. |
test/test_doi_bib_lookup.py |
Adds coverage for the new ChinaDOI fallback behavior. |
CHANGELOG.rst |
Documents the ChinaDOI/CNKI fallback fix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Standard Attempt: Use API (Crossref/DataCite) | ||
| try: | ||
| r = self.session.post(**feed_content) | ||
| res = r.content.decode("utf-8") | ||
| except requests.Timeout: | ||
| res = self.timeout_err | ||
| except requests.RequestException: | ||
| res = self.network_err | ||
| if self.verbose > 3: | ||
| print_func(f"via `_handle_doi`, fetched content = {res}") | ||
| return res | ||
| r = self.session.get(**feed_content) | ||
|
|
||
| if r.status_code == 200: | ||
| res = r.content.decode("utf-8") | ||
|
|
||
| if is_requesting_bibtex: | ||
| # If we asked for BibTeX, strictly check if it looks like BibTeX (starts with @) | ||
| # This filters out ChinaDOI's HTML pages when they return status 200. | ||
| if res.strip().startswith("@"): | ||
| if self.verbose > 3: | ||
| print_func(f"via `_handle_doi`, fetched content = {res}") | ||
| return res | ||
| else: | ||
| # If we asked for Text/XML/RIS, trust the 200 OK response. | ||
| # This fixes the bug where format="text" was returning Network Error. | ||
| if self.verbose > 3: | ||
| print_func(f"via `_handle_doi`, fetched content = {res}") | ||
| return res | ||
|
|
||
| except (requests.Timeout, requests.RequestException): | ||
| pass | ||
|
|
There was a problem hiding this comment.
_handle_doi currently swallows requests.Timeout / RequestException (just pass) and later returns self.network_err, which loses the timeout vs network distinction and changes previous behavior. Also, by only decoding/returning the body on status_code == 200, responses like 404 that include "DOI Not Found" won’t reach _handle_network_error and may be misreported as Network Error. Consider: (1) returning self.timeout_err on Timeout and self.network_err on other RequestException; and (2) decoding the response body (or handling specific status codes) even when status_code != 200 so error mapping still works.
| try: | ||
| r_probe = requests.get( | ||
| url, headers=browser_headers, timeout=feed_content.get("timeout", 10), allow_redirects=True | ||
| ) | ||
|
|
||
| final_url = r_probe.url | ||
|
|
||
| # Check for ChinaDOI / CNKI | ||
| if "chndoi.org" in final_url or "cnki" in final_url: | ||
| msg = f"[Manual Check Required] Automatic lookup failed. Please visit: {final_url}" | ||
| if self.verbose >= 0: | ||
| print("-" * 60) | ||
| print(f"[!] {msg}") | ||
| print("-" * 60) | ||
| return msg |
There was a problem hiding this comment.
The fallback probe uses requests.get(...) instead of self.session.get(...), which bypasses the session configuration this class sets up (connection pooling/retries/proxies, and any user customization of self.session). It also prints unconditionally for the manual-check case because self.verbose >= 0 is always true at default verbosity, which will produce unexpected stdout in normal library usage. Use self.session.get(...) and gate printing behind the same verbosity convention used elsewhere (e.g., self.verbose >= 1).
| print_func(f"via `_handle_doi`, fetched content = {res}") | ||
| return res | ||
|
|
||
| except (requests.Timeout, requests.RequestException): |
There was a problem hiding this comment.
'except' clause does nothing but pass and there is no explanatory comment.
| print("-" * 60) | ||
| return msg | ||
|
|
||
| except Exception: |
There was a problem hiding this comment.
'except' clause does nothing but pass and there is no explanatory comment.
| except Exception: | |
| except Exception: | |
| # Intentionally ignore all errors from this best-effort probe. | |
| # If this browser-style request fails, we simply fall back to `self.network_err` below. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR addresses the issue where DOIs registered via ChinaDOI (e.g., CNKI publications like
10.13748/j.cnki...) failed to resolve. These sources often do not supportapplication/x-bibtexcontent negotiation and return 403 Forbidden or redirect to an intermediate HTML page.The streamlit app is also updated along with this change.