Skip to content

Modular PDF & full-text fetching framework#48

Merged
cmungall merged 25 commits into
mainfrom
feature/modular-pdf-fulltext-fetching
Jun 17, 2026
Merged

Modular PDF & full-text fetching framework#48
cmungall merged 25 commits into
mainfrom
feature/modular-pdf-fulltext-fetching

Conversation

@cmungall

Copy link
Copy Markdown
Member

Summary

Adds a modular, ordered full-text provider chain that downloads and text-extracts PDFs (and HTML/XML) into ReferenceContent.content, so supporting-text validation can run against full text instead of just abstracts. Approach A from the design spec: the existing prefix-keyed metadata sources are untouched; three small new layers are added on top.

  • Full-text provider chain (etl/fulltext/): ordered, configurable, try-until-success. Built-in providers pmc → unpaywall → openalex (default-on for DOIs), plus declarative custom providers via YAML (mirrors the existing JSON-API source loader).
  • Content acquisition + extraction (etl/acquire.py, etl/extract/): streaming download honoring the existing size cap; a format-keyed Extractor registry (PDF/HTML/XML) with a pluggable PDF backend behind one interface (default pypdf, swappable to docling/grobid later).
  • Identifier crosswalk (etl/identifiers.py): builds DOI/PMID/PMCID/url from fetched content so DOI-keyed providers work regardless of input prefix.
  • Orchestration (ReferenceFetcher): when metadata yields no full text, walk the chain → acquire → extract → assemble, recording provenance (full_text_provider, full_text_url, oa_status, license, local_pdf_path) and caching the downloaded PDF. Provenance round-trips through the markdown cache.
  • Generic URL→PDF: a url: reference pointing at a PDF is detected and text-extracted.
  • Config + CLI: fetch_full_text, full_text_providers, pdf_backend, download_pdfs, full_text_providers_file; --full-text/--no-full-text flag; reuses email and max_supplementary_file_size.
  • Docs: docs/how-to/fetch-full-text-and-pdfs.md.

Design spec and plan are included under docs/superpowers/.

Test Plan

  • uv run pytest -q — 551 passed
  • uv run mypy src — clean (44 files)
  • uv run ruff check . — clean for feature files
  • uv run pytest --doctest-modules src — 173 passed
  • uv run mkdocs build -q — clean
  • End-to-end integration test: metadata → chain fall-through → PDF download+extract → cached PDF (all HTTP mocked)
  • Reviewer: spot-check a live DOI/PMID fetch against real Unpaywall/OpenAlex/PMC

Known follow-ups (deliberately deferred in the plan)

  • Dead PMC helper methods remain in sources/pmid.py (logic was duplicated into PMCFullTextProvider); a follow-up can delete them and migrate the _get_pmcid error-handling test to the provider.
  • files_cache_dir is hardcoded as cache_dir/files rather than exposed as config; lazy PMID→DOI via NCBI ID Converter not implemented (YAGNI — esummary already supplies the DOI on the happy path).

Note

A small pre-existing uncommitted edit to cli/validate.py (a data command summary message tweak that predated this work) was inadvertently folded into one of the commits. It's cosmetic; flagging in case you want it separated.

🤖 Generated with Claude Code

cmungall and others added 23 commits June 12, 2026 16:54
Approach A: keep prefix-keyed metadata sources, add an ordered full-text
provider chain (pmc -> unpaywall -> openalex -> custom) plus an extractor
registry for downloading and text-extracting PDFs (and HTML/XML).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
18 TDD tasks: models/config, extractor registry (pdf/html/xml), content
acquirer, full-text provider chain (pmc/unpaywall/openalex/custom),
identifier crosswalk, orchestration + caching, URL->PDF, CLI flag, docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t provenance fields

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ext CLI flag

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Drop unused `import pytest` (test_acquire, test_fulltext_loader) and an
unused `ReferenceIdentifiers` import (test_reference_fetcher) flagged by
ruff F401, so `just test` / `ruff check .` is clean for feature files.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@claude

claude Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Claude finished @cmungall's task —— View job


PR Review: Modular PDF & full-text fetching framework

This is a substantial and well-designed PR that implements a modular PDF and full-text fetching framework. After comprehensive analysis of the codebase, I'm providing the following detailed review.

Overall Assessment: Excellent Architecture and Implementation

Strengths

1. Outstanding Architecture Design

  • Modular Provider Chain: Clean fail-over pattern (PMC → Unpaywall → OpenAlex) with configurable ordering
  • Registry Pattern: Both FullTextProviderRegistry and ExtractorRegistry enable extensibility without code changes
  • Interface Segregation: Well-defined abstractions (FullTextProvider, Extractor) with single responsibilities
  • Configuration-Driven: Custom providers via YAML without requiring Python code
  • Separation of Concerns: Clear boundaries between acquisition, extraction, identification, and orchestration

2. Robust Implementation

  • Error Handling: Provider failures don't abort the chain (line 175-179 in reference_fetcher.py)
  • Rate Limiting: Consistent rate_limit_delay across all providers
  • Content Validation: Minimum character thresholds prevent empty/truncated results
  • Format Detection: Multi-layer approach (content-type → URL suffix → provider hint)
  • Provenance Tracking: Complete metadata chain preserved through caching

3. Excellent Security Practices

  • Environment Variable Interpolation: Safe ${VAR} syntax for API keys (json_api.py:108)
  • Input Sanitization: Proper URL parsing and validation
  • Size Limits: max_supplementary_file_size prevents DoS attacks
  • User-Agent Headers: Proper identification in all HTTP requests
  • No Credential Hardcoding: All authentication properly externalized

4. Comprehensive Test Coverage

  • Unit Tests: All core components tested with mocks
  • Integration Tests: End-to-end chain testing with HTTP mocking
  • Edge Cases: Error conditions, empty responses, malformed data
  • Caching: Serialization/deserialization round-trip tests
  • Provider Fallback: Chain failure scenarios covered

Minor Issues & Recommendations

Performance Considerations

  1. Sequential Provider Calls (reference_fetcher.py:169-196): Currently providers are tried sequentially. For better UX, consider parallel requests with first-success cancellation for non-deterministic providers.

  2. PDF Backend Initialization (pdf.py:53-59): Backend is instantiated per PDFExtractor instance. Consider caching initialized backends.

Code Quality Improvements

  1. Magic Numbers (pmc.py:28, reference_fetcher.py:39): Extract constants like _MIN_PMC_FULLTEXT_CHARS = 1000 to configuration.

  2. Exception Handling Scope (pmc.py:83): The broad except Exception is appropriate for external API boundaries, but consider logging the exception type for better debugging.

  3. URL Format Detection (acquire.py:50-55): The suffix-based detection could be more robust - consider using pathlib.PurePath for better path parsing.

Security Hardening

  1. Request Timeouts: All HTTP requests have timeouts ✅
  2. Content-Type Validation: Could add MIME type validation before extraction
  3. URL Scheme Validation: Consider restricting to https:// for production deployments

Excellent Practices Observed

1. Test-Driven Development

  • Comprehensive test suite with 551 passing tests
  • Doctests serve dual purpose as documentation and tests
  • Proper use of mocks for external dependencies

2. Configuration Management

  • Pydantic models with validation and defaults
  • Backward-compatible field additions
  • Environment-aware configuration loading

3. Documentation

  • Extensive docstrings with examples
  • How-to guide for users (docs/how-to/fetch-full-text-and-pdfs.md)
  • Design documentation in docs/superpowers/

Performance & Scalability

Strengths:

  • Streaming downloads with size caps
  • Memory-efficient PDF processing
  • Disk caching reduces redundant API calls
  • Rate limiting respects API terms

Considerations:

  • Provider chain is CPU-bound for PDF extraction
  • No async support (appropriate for current use case)
  • Cache invalidation strategy could be more sophisticated

Notable Implementation Details

  1. Identifier Crosswalk (identifiers.py): Clean mapping between PMID/DOI/PMCID/URL for provider compatibility

  2. Format Resolution (acquire.py:30-56): Smart format detection prioritizes content-type over URL hints

  3. Content Assembly (reference_fetcher.py:188): Preserves abstracts when adding full text - excellent UX

  4. Cache Serialization: YAML quoting handles special characters properly (reference_fetcher.py:144-180)

Final Recommendation: APPROVE

This PR represents exemplary software engineering:

  • Clean Architecture: Modular, extensible, testable design
  • Security-Conscious: Proper input validation and credential management
  • Well-Tested: Comprehensive coverage including edge cases
  • Production-Ready: Error handling, rate limiting, monitoring hooks
  • Future-Proof: Plugin architecture supports evolution

The minor suggestions above are optimizations rather than blockers. This implementation follows the repository's TDD approach perfectly and significantly enhances the validator's capabilities.


…ulltext-fetching

# Conflicts:
#	src/linkml_reference_validator/cli/validate.py
@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude finished @cmungall's task in 2m 43s —— View job


PR Review: Modular PDF & full-text fetching framework

  • Gather context (commits, diffstat)
  • Review core orchestration (reference_fetcher.py, acquire.py, identifiers.py)
  • Review provider chain (fulltext/*)
  • Review extractors (extract/*)
  • Review models & config changes
  • Assess test coverage
  • Post consolidated review

I read the full diff against origin/main. This is a clean, well-factored design — the provider/extractor registries, the FullTextProvider/Extractor interfaces, the identifier crosswalk, and the consistent "external boundary" try/except placement all line up nicely with the repo conventions (and with CLAUDE.md's guidance to reserve try/except for real external boundaries). Build & test CI is green. Below are the issues I think are worth addressing, ordered by impact.


🔴 Correctness / behavior

1. Transient full-text failures get cached permanently as abstract_only.
fetch() short-circuits on the disk cache before _enrich_with_full_text runs (reference_fetcher.py:119-144). When the chain fails or yields nothing usable, _enrich_with_full_text returns the content unchanged (reference_fetcher.py:198) and it is saved with its original content_type (e.g. abstract_only). On the next run the cache hit returns that record and the chain never re-runs unless the caller passes force_refresh=True.

So a one-off Unpaywall/PMC outage (or a rate-limit blip) bakes in "no full text" indefinitely. There's no negative-result TTL or "full-text attempted but unresolved" marker to distinguish "we have the abstract and that's all that exists" from "we failed to fetch full text this time." Worth at least recording the attempt (e.g. a full_text_attempted flag / timestamp) so a later run can retry, or documenting that recovering requires force_refresh. Fix this →

2. Streaming download leaks the connection on the size-cap path.
ContentAcquirer.fetch_bytes (acquire.py:80-99) opens requests.get(..., stream=True) but never closes the response. On the happy path iter_content runs to completion (connection released), but the early return None, content_type when the size cap is exceeded (acquire.py:93-97) abandons a partially-read streamed response without draining or closing it — that holds the socket open until GC. Wrap in with requests.get(...) as response: (or response.close() before the early return). Fix this →

3. url: PDF path bypasses the size cap that the rest of the framework enforces.
URLSource.fetch (sources/url.py:78-96) does a non-streamed requests.get(...) and feeds response.content straight into PDFExtractor, ignoring max_supplementary_file_size. The whole point of ContentAcquirer is to stream with a cap; a url: reference pointing at a multi-GB PDF will be fully buffered into memory. Consider routing this through ContentAcquirer.fetch_bytes so the cap (and rate-limit/User-Agent handling) is applied uniformly. It also re-implements the %PDF- magic-byte sniff that the chain path lacks (see #4).


🟡 Robustness

4. No content sniffing in the chain's extract path — a format_hint="pdf" pointing at an HTML landing page feeds junk to pypdf.
resolve_format (acquire.py:30-56) trusts content-type → URL suffix → provider hint, but never inspects the bytes. Unpaywall/OpenAlex frequently return a url_for_pdf that 200s with text/html (paywall interstitial, Cloudflare, or a redirect to a landing page). _materialize will then either mis-route to the HTML extractor or hand HTML to pypdf. Failures are caught (reference_fetcher.py:229-233), so it's not fatal — but a 2-line magic-byte check (data[:5] == b"%PDF-" → force pdf; <html/<?xml → html/xml), like the one already in url.py:84, would make detection far more reliable and avoid silently storing boilerplate HTML as "full text."

5. resolve_format prefers server content-type over the provider hint.
Closely related to #4: when a provider knows a URL is a PDF but the server mislabels it, the content-type wins. Given how often publisher servers send wrong/generic types (application/octet-stream, text/html), consider letting a strong format_hint win when the body sniffs as PDF, or at least sniffing as a tie-breaker.

6. Custom-provider header interpolation silently drops missing env vars.
json_api.py:113-118 substitutes ${VAR} with os.environ.get(name, "") — a missing/misspelled API-key var produces an empty Authorization header rather than an error or warning, yielding confusing 401s. A logger.warning when a referenced var is unset would save debugging time.


🟢 Minor / polish

  • Dead PMC code in sources/pmid.py:257-390 (_fetch_pmc_fulltext, _get_pmcid, _fetch_pmc_xml, _fetch_pmc_html) is now duplicated into PMCFullTextProvider — already flagged in the PR description as a follow-up. 👍 Worth a tracking issue so it isn't forgotten.
  • _materialize(self, location) (reference_fetcher.py:200) is the one new function missing a parameter type annotation (location: FullTextLocation); the rest of the diff is consistently typed.
  • PDF backend validation runs inside the chain, not at init. PDFExtractor(backend=self.config.pdf_backend) is constructed in _materialize (reference_fetcher.py:223) outside the surrounding try, so an invalid pdf_backend raises ValueError mid-fetch rather than failing fast at config load. Validating pdf_backend once in ReferenceFetcher.__init__ (or a config validator) would surface misconfiguration earlier.
  • Magic numbers MIN_FULL_TEXT_CHARS = 500 (reference_fetcher.py:39) and _MIN_PMC_FULLTEXT_CHARS = 1000 (pmc.py:28) are two different thresholds for "is this enough text" — fine as constants, but a brief comment on why PMC uses 2× the global floor would help future readers.

Test coverage

Strong — new dedicated suites for acquire/extractors/loader/providers/identifiers plus an end-to-end mocked chain, and the existing fetcher unit tests were correctly isolated by setting fetch_full_text=False in the fixture (tests/test_reference_fetcher.py:20-24). Two gaps map to the issues above and would be worth adding:


Overall: solid, mergeable architecture. Issue #1 (cached negative results) is the one I'd most want addressed or explicitly documented before merge, since it changes long-run behavior in a non-obvious way; #2 and #3 are small, contained fixes. Nice work.

…t sniffing

Correctness:
- Don't permanently cache transient full-text failures (#1). Add a
  full_text_attempted marker to ReferenceContent, set only when the chain
  is consulted without a transient (exception) error. On a disk-cache hit
  the chain is re-run when the record still needs full text and was never
  cleanly attempted, so a one-off provider outage no longer bakes in
  "abstract_only". Persisted to/from the markdown cache.
- Close the streamed response on the size-cap path in ContentAcquirer (#2)
  by wrapping requests.get(stream=True) in a `with` block.
- Route URLSource through ContentAcquirer (#3) so the size cap, rate-limit
  delay, and User-Agent apply uniformly (no more whole-PDF-in-memory).

Robustness:
- Add sniff_format() magic-byte detection; _materialize and url.py trust
  the bytes over server content-type / provider hint, so a "pdf" hint at an
  HTML landing page routes to HTML instead of pypdf (#4/#5).
- Warn on an unset ${VAR} in custom-provider header interpolation (#6).

Polish:
- Annotate _materialize(location: FullTextLocation); build the PDFExtractor
  once in __init__ (fail-fast backend validation + no per-call rebuild);
  document the MIN_FULL_TEXT_CHARS / _MIN_PMC_FULLTEXT_CHARS thresholds.

Tests:
- Cover transient-retry vs clean-exhaustion, attempted-flag round-trip,
  sniff-over-hint, and response-close-on-cap.
- Fix a pre-existing sys.modules isolation leak in
  test_optional_linkml_dependency that this change exposed (snapshot/restore).

Dead PMC code in sources/pmid.py tracked separately in #52.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude finished @cmungall's task in 1m 54s —— View job


PR Review: Modular PDF & full-text fetching framework (follow-up on c3b7876)

I read c3b7876 against origin/main. All six issues from the previous review are addressed, and each fix ships with a real test (no test-relaxation — consistent with CLAUDE.md's TDD guidance). Verdict below.


✅ Resolved

#1 — transient failures no longer cached as permanent abstract_only. The new full_text_attempted flag is the right model: _enrich_with_full_text distinguishes a clean "nothing available" (sets the flag) from a download/provider exception (had_error → flag stays unset), and _maybe_retry_full_text re-runs the chain on a disk-cache hit only when the record still needs full text and was never cleanly attempted (reference_fetcher.py:185-197). The retry re-saves only when the record actually changed. Round-trips through the markdown cache (models.py, _save_to_disk:495, _load_markdown_format:654). test_transient_full_text_failure_is_retried_on_next_run / test_clean_exhaustion_marks_attempted_and_is_not_retried / test_full_text_attempted_round_trips_through_disk_cache cover exactly the three paths. Nicely done — and the "once per process" semantics (memory cache short-circuits before the retry) are intentional and documented.

#2 — streamed response now closed on the size-cap path. requests.get(..., stream=True) is wrapped in with (acquire.py:115), so the early cap return (:132) releases the socket. test_fetch_bytes_closes_response_on_size_cap asserts it.

#3url: PDF path routed through ContentAcquirer. URLSource.fetch now calls ContentAcquirer().fetch_bytes (url.py:74), so the size cap, rate-limit delay, and User-Agent apply uniformly — no more whole-PDF-in-memory.

#4/#5 — content sniffing. sniff_format() (acquire.py:30-60) magic-byte-checks %PDF-/<html>/<?xml, and _materialize prefers it over the server content-type and provider hint (reference_fetcher.py:275), so a format_hint="pdf" at an HTML landing page routes to the HTML extractor instead of pypdf. test_materialize_trusts_sniffed_format_over_hint confirms it.

#6 — unset ${VAR} now warns instead of silently substituting empty (json_api.py:122-133), with test_custom_provider_warns_on_unset_env_var.

Polish: _materialize(location: FullTextLocation) is annotated; PDFExtractor is built once in __init__ (fail-fast pdf_backend validation + no per-call rebuild, reference_fetcher.py:93); the two MIN_*_CHARS thresholds are now documented.


🟢 Minor residual observations (non-blocking)

  1. Boilerplate-HTML can still be stored as "full text." Sniffing correctly keeps an HTML interstitial out of pypdf, but if that landing page has >MIN_FULL_TEXT_CHARS (500) of boilerplate it's stored as full_text_html and marks full_text_attempted=True, ending the chain. Sniffing solved the crash/garbage half of Bump actions/checkout from 4 to 6 #4; the false-positive full text half remains. Not worth blocking on, but a follow-up could add a light "looks like a paywall/landing page" heuristic (e.g. presence of <form, "subscribe", very low text-to-markup ratio) before accepting HTML as full text.

  2. URLSource still rebuilds PDFExtractor per call (url.py:83) and re-implements the magic-byte sniff inline rather than reusing sniff_format. Harmless (it's not on a hot loop), but reusing sniff_format(data) here would keep the detection logic in one place. Fix this →

  3. Dead PMC code in sources/pmid.py is now tracked in Remove dead PMC full-text code from sources/pmid.py (superseded by PMCFullTextProvider) #52 per the commit message — 👍, good to see it filed rather than lingering.


Overall: the correctness concern (#1) that I most wanted addressed before merge is handled cleanly and with the right test matrix, and #2/#3/#4–6 are all solid, contained fixes. The residual items above are polish, not blockers. This is mergeable. Nice work.

(Note: I reviewed the diff and tests statically; I couldn't run the suite in this job due to command-approval restrictions, but the PR's CI run is the source of truth for green.)

@cmungall cmungall merged commit 728c7c2 into main Jun 17, 2026
5 checks passed
@cmungall cmungall deleted the feature/modular-pdf-fulltext-fetching branch June 17, 2026 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant