Modular PDF & full-text fetching framework#48
Conversation
Approach A: keep prefix-keyed metadata sources, add an ordered full-text provider chain (pmc -> unpaywall -> openalex -> custom) plus an extractor registry for downloading and text-extracting PDFs (and HTML/XML). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
18 TDD tasks: models/config, extractor registry (pdf/html/xml), content acquirer, full-text provider chain (pmc/unpaywall/openalex/custom), identifier crosswalk, orchestration + caching, URL->PDF, CLI flag, docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t provenance fields Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ext CLI flag Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Drop unused `import pytest` (test_acquire, test_fulltext_loader) and an unused `ReferenceIdentifiers` import (test_reference_fetcher) flagged by ruff F401, so `just test` / `ruff check .` is clean for feature files. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Claude finished @cmungall's task —— View job PR Review: Modular PDF & full-text fetching frameworkThis is a substantial and well-designed PR that implements a modular PDF and full-text fetching framework. After comprehensive analysis of the codebase, I'm providing the following detailed review. ✅ Overall Assessment: Excellent Architecture and ImplementationStrengths1. Outstanding Architecture Design
2. Robust Implementation
3. Excellent Security Practices
4. Comprehensive Test Coverage
Minor Issues & RecommendationsPerformance Considerations
Code Quality Improvements
Security Hardening
Excellent Practices Observed1. Test-Driven Development
2. Configuration Management
3. Documentation
Performance & ScalabilityStrengths:
Considerations:
Notable Implementation Details
✅ Final Recommendation: APPROVEThis PR represents exemplary software engineering:
The minor suggestions above are optimizations rather than blockers. This implementation follows the repository's TDD approach perfectly and significantly enhances the validator's capabilities. |
…ulltext-fetching # Conflicts: # src/linkml_reference_validator/cli/validate.py
|
Claude finished @cmungall's task in 2m 43s —— View job PR Review: Modular PDF & full-text fetching framework
I read the full diff against 🔴 Correctness / behavior1. Transient full-text failures get cached permanently as So a one-off Unpaywall/PMC outage (or a rate-limit blip) bakes in "no full text" indefinitely. There's no negative-result TTL or "full-text attempted but unresolved" marker to distinguish "we have the abstract and that's all that exists" from "we failed to fetch full text this time." Worth at least recording the attempt (e.g. a 2. Streaming download leaks the connection on the size-cap path. 3. 🟡 Robustness4. No content sniffing in the chain's extract path — a 5. 6. Custom-provider header interpolation silently drops missing env vars. 🟢 Minor / polish
Test coverageStrong — new dedicated suites for acquire/extractors/loader/providers/identifiers plus an end-to-end mocked chain, and the existing fetcher unit tests were correctly isolated by setting
Overall: solid, mergeable architecture. Issue #1 (cached negative results) is the one I'd most want addressed or explicitly documented before merge, since it changes long-run behavior in a non-obvious way; #2 and #3 are small, contained fixes. Nice work. |
…t sniffing Correctness: - Don't permanently cache transient full-text failures (#1). Add a full_text_attempted marker to ReferenceContent, set only when the chain is consulted without a transient (exception) error. On a disk-cache hit the chain is re-run when the record still needs full text and was never cleanly attempted, so a one-off provider outage no longer bakes in "abstract_only". Persisted to/from the markdown cache. - Close the streamed response on the size-cap path in ContentAcquirer (#2) by wrapping requests.get(stream=True) in a `with` block. - Route URLSource through ContentAcquirer (#3) so the size cap, rate-limit delay, and User-Agent apply uniformly (no more whole-PDF-in-memory). Robustness: - Add sniff_format() magic-byte detection; _materialize and url.py trust the bytes over server content-type / provider hint, so a "pdf" hint at an HTML landing page routes to HTML instead of pypdf (#4/#5). - Warn on an unset ${VAR} in custom-provider header interpolation (#6). Polish: - Annotate _materialize(location: FullTextLocation); build the PDFExtractor once in __init__ (fail-fast backend validation + no per-call rebuild); document the MIN_FULL_TEXT_CHARS / _MIN_PMC_FULLTEXT_CHARS thresholds. Tests: - Cover transient-retry vs clean-exhaustion, attempted-flag round-trip, sniff-over-hint, and response-close-on-cap. - Fix a pre-existing sys.modules isolation leak in test_optional_linkml_dependency that this change exposed (snapshot/restore). Dead PMC code in sources/pmid.py tracked separately in #52. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Claude finished @cmungall's task in 1m 54s —— View job PR Review: Modular PDF & full-text fetching framework (follow-up on
|
Summary
Adds a modular, ordered full-text provider chain that downloads and text-extracts PDFs (and HTML/XML) into
ReferenceContent.content, so supporting-text validation can run against full text instead of just abstracts. Approach A from the design spec: the existing prefix-keyed metadata sources are untouched; three small new layers are added on top.etl/fulltext/): ordered, configurable, try-until-success. Built-in providerspmc → unpaywall → openalex(default-on for DOIs), plus declarative custom providers via YAML (mirrors the existing JSON-API source loader).etl/acquire.py,etl/extract/): streaming download honoring the existing size cap; a format-keyedExtractorregistry (PDF/HTML/XML) with a pluggable PDF backend behind one interface (defaultpypdf, swappable to docling/grobid later).etl/identifiers.py): builds DOI/PMID/PMCID/url from fetched content so DOI-keyed providers work regardless of input prefix.ReferenceFetcher): when metadata yields no full text, walk the chain → acquire → extract → assemble, recording provenance (full_text_provider,full_text_url,oa_status,license,local_pdf_path) and caching the downloaded PDF. Provenance round-trips through the markdown cache.url:reference pointing at a PDF is detected and text-extracted.fetch_full_text,full_text_providers,pdf_backend,download_pdfs,full_text_providers_file;--full-text/--no-full-textflag; reusesemailandmax_supplementary_file_size.docs/how-to/fetch-full-text-and-pdfs.md.Design spec and plan are included under
docs/superpowers/.Test Plan
uv run pytest -q— 551 passeduv run mypy src— clean (44 files)uv run ruff check .— clean for feature filesuv run pytest --doctest-modules src— 173 passeduv run mkdocs build -q— cleanKnown follow-ups (deliberately deferred in the plan)
sources/pmid.py(logic was duplicated intoPMCFullTextProvider); a follow-up can delete them and migrate the_get_pmciderror-handling test to the provider.files_cache_diris hardcoded ascache_dir/filesrather than exposed as config; lazy PMID→DOI via NCBI ID Converter not implemented (YAGNI — esummary already supplies the DOI on the happy path).Note
A small pre-existing uncommitted edit to
cli/validate.py(adatacommand summary message tweak that predated this work) was inadvertently folded into one of the commits. It's cosmetic; flagging in case you want it separated.🤖 Generated with Claude Code