Modular PDF & full-text fetching framework by cmungall · Pull Request #48 · linkml/linkml-reference-validator

cmungall · 2026-06-14T02:30:34Z

Summary

Adds a modular, ordered full-text provider chain that downloads and text-extracts PDFs (and HTML/XML) into ReferenceContent.content, so supporting-text validation can run against full text instead of just abstracts. Approach A from the design spec: the existing prefix-keyed metadata sources are untouched; three small new layers are added on top.

Full-text provider chain (etl/fulltext/): ordered, configurable, try-until-success. Built-in providers pmc → unpaywall → openalex (default-on for DOIs), plus declarative custom providers via YAML (mirrors the existing JSON-API source loader).
Content acquisition + extraction (etl/acquire.py, etl/extract/): streaming download honoring the existing size cap; a format-keyed Extractor registry (PDF/HTML/XML) with a pluggable PDF backend behind one interface (default pypdf, swappable to docling/grobid later).
Identifier crosswalk (etl/identifiers.py): builds DOI/PMID/PMCID/url from fetched content so DOI-keyed providers work regardless of input prefix.
Orchestration (ReferenceFetcher): when metadata yields no full text, walk the chain → acquire → extract → assemble, recording provenance (full_text_provider, full_text_url, oa_status, license, local_pdf_path) and caching the downloaded PDF. Provenance round-trips through the markdown cache.
Generic URL→PDF: a url: reference pointing at a PDF is detected and text-extracted.
Config + CLI: fetch_full_text, full_text_providers, pdf_backend, download_pdfs, full_text_providers_file; --full-text/--no-full-text flag; reuses email and max_supplementary_file_size.
Docs: docs/how-to/fetch-full-text-and-pdfs.md.

Design spec and plan are included under docs/superpowers/.

Test Plan

uv run pytest -q — 551 passed
uv run mypy src — clean (44 files)
uv run ruff check . — clean for feature files
uv run pytest --doctest-modules src — 173 passed
uv run mkdocs build -q — clean
End-to-end integration test: metadata → chain fall-through → PDF download+extract → cached PDF (all HTTP mocked)
Reviewer: spot-check a live DOI/PMID fetch against real Unpaywall/OpenAlex/PMC

Known follow-ups (deliberately deferred in the plan)

Dead PMC helper methods remain in sources/pmid.py (logic was duplicated into PMCFullTextProvider); a follow-up can delete them and migrate the _get_pmcid error-handling test to the provider.
files_cache_dir is hardcoded as cache_dir/files rather than exposed as config; lazy PMID→DOI via NCBI ID Converter not implemented (YAGNI — esummary already supplies the DOI on the happy path).

Note

A small pre-existing uncommitted edit to cli/validate.py (a data command summary message tweak that predated this work) was inadvertently folded into one of the commits. It's cosmetic; flagging in case you want it separated.

🤖 Generated with Claude Code

Approach A: keep prefix-keyed metadata sources, add an ordered full-text provider chain (pmc -> unpaywall -> openalex -> custom) plus an extractor registry for downloading and text-extracting PDFs (and HTML/XML). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

18 TDD tasks: models/config, extractor registry (pdf/html/xml), content acquirer, full-text provider chain (pmc/unpaywall/openalex/custom), identifier crosswalk, orchestration + caching, URL->PDF, CLI flag, docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…t provenance fields Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…wnload+extract

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ext CLI flag Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Drop unused `import pytest` (test_acquire, test_fulltext_loader) and an unused `ReferenceIdentifiers` import (test_reference_fetcher) flagged by ruff F401, so `just test` / `ruff check .` is clean for feature files. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

claude · 2026-06-14T22:32:38Z

Claude finished @cmungall's task —— View job

PR Review: Modular PDF & full-text fetching framework

This is a substantial and well-designed PR that implements a modular PDF and full-text fetching framework. After comprehensive analysis of the codebase, I'm providing the following detailed review.

✅ Overall Assessment: Excellent Architecture and Implementation

Strengths

1. Outstanding Architecture Design

Modular Provider Chain: Clean fail-over pattern (PMC → Unpaywall → OpenAlex) with configurable ordering
Registry Pattern: Both FullTextProviderRegistry and ExtractorRegistry enable extensibility without code changes
Interface Segregation: Well-defined abstractions (FullTextProvider, Extractor) with single responsibilities
Configuration-Driven: Custom providers via YAML without requiring Python code
Separation of Concerns: Clear boundaries between acquisition, extraction, identification, and orchestration

2. Robust Implementation

Error Handling: Provider failures don't abort the chain (line 175-179 in reference_fetcher.py)
Rate Limiting: Consistent rate_limit_delay across all providers
Content Validation: Minimum character thresholds prevent empty/truncated results
Format Detection: Multi-layer approach (content-type → URL suffix → provider hint)
Provenance Tracking: Complete metadata chain preserved through caching

3. Excellent Security Practices

Environment Variable Interpolation: Safe ${VAR} syntax for API keys (json_api.py:108)
Input Sanitization: Proper URL parsing and validation
Size Limits: max_supplementary_file_size prevents DoS attacks
User-Agent Headers: Proper identification in all HTTP requests
No Credential Hardcoding: All authentication properly externalized

4. Comprehensive Test Coverage

Unit Tests: All core components tested with mocks
Integration Tests: End-to-end chain testing with HTTP mocking
Edge Cases: Error conditions, empty responses, malformed data
Caching: Serialization/deserialization round-trip tests
Provider Fallback: Chain failure scenarios covered

Minor Issues & Recommendations

Performance Considerations

Sequential Provider Calls (reference_fetcher.py:169-196): Currently providers are tried sequentially. For better UX, consider parallel requests with first-success cancellation for non-deterministic providers.
PDF Backend Initialization (pdf.py:53-59): Backend is instantiated per PDFExtractor instance. Consider caching initialized backends.

Code Quality Improvements

Magic Numbers (pmc.py:28, reference_fetcher.py:39): Extract constants like _MIN_PMC_FULLTEXT_CHARS = 1000 to configuration.
Exception Handling Scope (pmc.py:83): The broad except Exception is appropriate for external API boundaries, but consider logging the exception type for better debugging.
URL Format Detection (acquire.py:50-55): The suffix-based detection could be more robust - consider using pathlib.PurePath for better path parsing.

Security Hardening

Request Timeouts: All HTTP requests have timeouts ✅
Content-Type Validation: Could add MIME type validation before extraction
URL Scheme Validation: Consider restricting to https:// for production deployments

Excellent Practices Observed

1. Test-Driven Development

Comprehensive test suite with 551 passing tests
Doctests serve dual purpose as documentation and tests
Proper use of mocks for external dependencies

2. Configuration Management

Pydantic models with validation and defaults
Backward-compatible field additions
Environment-aware configuration loading

3. Documentation

Extensive docstrings with examples
How-to guide for users (docs/how-to/fetch-full-text-and-pdfs.md)
Design documentation in docs/superpowers/

Performance & Scalability

Strengths:

Streaming downloads with size caps
Memory-efficient PDF processing
Disk caching reduces redundant API calls
Rate limiting respects API terms

Considerations:

Provider chain is CPU-bound for PDF extraction
No async support (appropriate for current use case)
Cache invalidation strategy could be more sophisticated

Notable Implementation Details

Identifier Crosswalk (identifiers.py): Clean mapping between PMID/DOI/PMCID/URL for provider compatibility
Format Resolution (acquire.py:30-56): Smart format detection prioritizes content-type over URL hints
Content Assembly (reference_fetcher.py:188): Preserves abstracts when adding full text - excellent UX
Cache Serialization: YAML quoting handles special characters properly (reference_fetcher.py:144-180)

✅ Final Recommendation: APPROVE

This PR represents exemplary software engineering:

Clean Architecture: Modular, extensible, testable design
Security-Conscious: Proper input validation and credential management
Well-Tested: Comprehensive coverage including edge cases
Production-Ready: Error handling, rate limiting, monitoring hooks
Future-Proof: Plugin architecture supports evolution

The minor suggestions above are optimizations rather than blockers. This implementation follows the repository's TDD approach perfectly and significantly enhances the validator's capabilities.

…ulltext-fetching # Conflicts: # src/linkml_reference_validator/cli/validate.py

claude · 2026-06-17T02:06:38Z

Claude finished @cmungall's task in 2m 43s —— View job

PR Review: Modular PDF & full-text fetching framework

Gather context (commits, diffstat)
Review core orchestration (reference_fetcher.py, acquire.py, identifiers.py)
Review provider chain (fulltext/*)
Review extractors (extract/*)
Review models & config changes
Assess test coverage
Post consolidated review

I read the full diff against origin/main. This is a clean, well-factored design — the provider/extractor registries, the FullTextProvider/Extractor interfaces, the identifier crosswalk, and the consistent "external boundary" try/except placement all line up nicely with the repo conventions (and with CLAUDE.md's guidance to reserve try/except for real external boundaries). Build & test CI is green. Below are the issues I think are worth addressing, ordered by impact.

🔴 Correctness / behavior

1. Transient full-text failures get cached permanently as abstract_only.
fetch() short-circuits on the disk cache before _enrich_with_full_text runs (reference_fetcher.py:119-144). When the chain fails or yields nothing usable, _enrich_with_full_text returns the content unchanged (reference_fetcher.py:198) and it is saved with its original content_type (e.g. abstract_only). On the next run the cache hit returns that record and the chain never re-runs unless the caller passes force_refresh=True.

So a one-off Unpaywall/PMC outage (or a rate-limit blip) bakes in "no full text" indefinitely. There's no negative-result TTL or "full-text attempted but unresolved" marker to distinguish "we have the abstract and that's all that exists" from "we failed to fetch full text this time." Worth at least recording the attempt (e.g. a full_text_attempted flag / timestamp) so a later run can retry, or documenting that recovering requires force_refresh. Fix this →

2. Streaming download leaks the connection on the size-cap path.
ContentAcquirer.fetch_bytes (acquire.py:80-99) opens requests.get(..., stream=True) but never closes the response. On the happy path iter_content runs to completion (connection released), but the early return None, content_type when the size cap is exceeded (acquire.py:93-97) abandons a partially-read streamed response without draining or closing it — that holds the socket open until GC. Wrap in with requests.get(...) as response: (or response.close() before the early return). Fix this →

3. url: PDF path bypasses the size cap that the rest of the framework enforces.
URLSource.fetch (sources/url.py:78-96) does a non-streamed requests.get(...) and feeds response.content straight into PDFExtractor, ignoring max_supplementary_file_size. The whole point of ContentAcquirer is to stream with a cap; a url: reference pointing at a multi-GB PDF will be fully buffered into memory. Consider routing this through ContentAcquirer.fetch_bytes so the cap (and rate-limit/User-Agent handling) is applied uniformly. It also re-implements the %PDF- magic-byte sniff that the chain path lacks (see #4).

🟡 Robustness

4. No content sniffing in the chain's extract path — a format_hint="pdf" pointing at an HTML landing page feeds junk to pypdf.
resolve_format (acquire.py:30-56) trusts content-type → URL suffix → provider hint, but never inspects the bytes. Unpaywall/OpenAlex frequently return a url_for_pdf that 200s with text/html (paywall interstitial, Cloudflare, or a redirect to a landing page). _materialize will then either mis-route to the HTML extractor or hand HTML to pypdf. Failures are caught (reference_fetcher.py:229-233), so it's not fatal — but a 2-line magic-byte check (data[:5] == b"%PDF-" → force pdf; <html/<?xml → html/xml), like the one already in url.py:84, would make detection far more reliable and avoid silently storing boilerplate HTML as "full text."

5. resolve_format prefers server content-type over the provider hint.
Closely related to #4: when a provider knows a URL is a PDF but the server mislabels it, the content-type wins. Given how often publisher servers send wrong/generic types (application/octet-stream, text/html), consider letting a strong format_hint win when the body sniffs as PDF, or at least sniffing as a tie-breaker.

6. Custom-provider header interpolation silently drops missing env vars.
json_api.py:113-118 substitutes ${VAR} with os.environ.get(name, "") — a missing/misspelled API-key var produces an empty Authorization header rather than an error or warning, yielding confusing 401s. A logger.warning when a referenced var is unset would save debugging time.

🟢 Minor / polish

Dead PMC code in sources/pmid.py:257-390 (_fetch_pmc_fulltext, _get_pmcid, _fetch_pmc_xml, _fetch_pmc_html) is now duplicated into PMCFullTextProvider — already flagged in the PR description as a follow-up. 👍 Worth a tracking issue so it isn't forgotten.
_materialize(self, location) (reference_fetcher.py:200) is the one new function missing a parameter type annotation (location: FullTextLocation); the rest of the diff is consistently typed.
PDF backend validation runs inside the chain, not at init. PDFExtractor(backend=self.config.pdf_backend) is constructed in _materialize (reference_fetcher.py:223) outside the surrounding try, so an invalid pdf_backend raises ValueError mid-fetch rather than failing fast at config load. Validating pdf_backend once in ReferenceFetcher.__init__ (or a config validator) would surface misconfiguration earlier.
Magic numbers MIN_FULL_TEXT_CHARS = 500 (reference_fetcher.py:39) and _MIN_PMC_FULLTEXT_CHARS = 1000 (pmc.py:28) are two different thresholds for "is this enough text" — fine as constants, but a brief comment on why PMC uses 2× the global floor would help future readers.

Test coverage

Strong — new dedicated suites for acquire/extractors/loader/providers/identifiers plus an end-to-end mocked chain, and the existing fetcher unit tests were correctly isolated by setting fetch_full_text=False in the fixture (tests/test_reference_fetcher.py:20-24). Two gaps map to the issues above and would be worth adding:

a test that an exhausted/failed chain doesn't poison the cache (issue Bump actions/checkout from 4 to 5 #1);
a test that the size cap is honored on the streaming path and that the response is closed on the cap-exceeded branch (issue Bump actions/setup-python from 5 to 6 #2).

Overall: solid, mergeable architecture. Issue #1 (cached negative results) is the one I'd most want addressed or explicitly documented before merge, since it changes long-run behavior in a non-obvious way; #2 and #3 are small, contained fixes. Nice work.

…t sniffing Correctness: - Don't permanently cache transient full-text failures (#1). Add a full_text_attempted marker to ReferenceContent, set only when the chain is consulted without a transient (exception) error. On a disk-cache hit the chain is re-run when the record still needs full text and was never cleanly attempted, so a one-off provider outage no longer bakes in "abstract_only". Persisted to/from the markdown cache. - Close the streamed response on the size-cap path in ContentAcquirer (#2) by wrapping requests.get(stream=True) in a `with` block. - Route URLSource through ContentAcquirer (#3) so the size cap, rate-limit delay, and User-Agent apply uniformly (no more whole-PDF-in-memory). Robustness: - Add sniff_format() magic-byte detection; _materialize and url.py trust the bytes over server content-type / provider hint, so a "pdf" hint at an HTML landing page routes to HTML instead of pypdf (#4/#5). - Warn on an unset ${VAR} in custom-provider header interpolation (#6). Polish: - Annotate _materialize(location: FullTextLocation); build the PDFExtractor once in __init__ (fail-fast backend validation + no per-call rebuild); document the MIN_FULL_TEXT_CHARS / _MIN_PMC_FULLTEXT_CHARS thresholds. Tests: - Cover transient-retry vs clean-exhaustion, attempted-flag round-trip, sniff-over-hint, and response-close-on-cap. - Fix a pre-existing sys.modules isolation leak in test_optional_linkml_dependency that this change exposed (snapshot/restore). Dead PMC code in sources/pmid.py tracked separately in #52. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude · 2026-06-17T03:23:41Z

Claude finished @cmungall's task in 1m 54s —— View job

PR Review: Modular PDF & full-text fetching framework (follow-up on `c3b7876`)

Review caching/retry fix (Bump actions/checkout from 4 to 5 #1)
Review size-cap connection-close fix (Bump actions/setup-python from 5 to 6 #2)
Review URL→ContentAcquirer routing (Bump astral-sh/setup-uv from 6 to 7 #3)
Review content sniffing (Bump actions/checkout from 4 to 6 #4/Add fuzzy matching suggestions for validation errors #5)
Review env-var warning (Verify capitalization detection in test_suggests_capitalization_fix #6)
Review polish items + tests

I read c3b7876 against origin/main. All six issues from the previous review are addressed, and each fix ships with a real test (no test-relaxation — consistent with CLAUDE.md's TDD guidance). Verdict below.

✅ Resolved

#1 — transient failures no longer cached as permanent abstract_only. The new full_text_attempted flag is the right model: _enrich_with_full_text distinguishes a clean "nothing available" (sets the flag) from a download/provider exception (had_error → flag stays unset), and _maybe_retry_full_text re-runs the chain on a disk-cache hit only when the record still needs full text and was never cleanly attempted (reference_fetcher.py:185-197). The retry re-saves only when the record actually changed. Round-trips through the markdown cache (models.py, _save_to_disk:495, _load_markdown_format:654). test_transient_full_text_failure_is_retried_on_next_run / test_clean_exhaustion_marks_attempted_and_is_not_retried / test_full_text_attempted_round_trips_through_disk_cache cover exactly the three paths. Nicely done — and the "once per process" semantics (memory cache short-circuits before the retry) are intentional and documented.

#2 — streamed response now closed on the size-cap path. requests.get(..., stream=True) is wrapped in with (acquire.py:115), so the early cap return (:132) releases the socket. test_fetch_bytes_closes_response_on_size_cap asserts it.

#3 — url: PDF path routed through ContentAcquirer. URLSource.fetch now calls ContentAcquirer().fetch_bytes (url.py:74), so the size cap, rate-limit delay, and User-Agent apply uniformly — no more whole-PDF-in-memory.

#4/#5 — content sniffing. sniff_format() (acquire.py:30-60) magic-byte-checks %PDF-/<html>/<?xml, and _materialize prefers it over the server content-type and provider hint (reference_fetcher.py:275), so a format_hint="pdf" at an HTML landing page routes to the HTML extractor instead of pypdf. test_materialize_trusts_sniffed_format_over_hint confirms it.

#6 — unset ${VAR} now warns instead of silently substituting empty (json_api.py:122-133), with test_custom_provider_warns_on_unset_env_var.

Polish: _materialize(location: FullTextLocation) is annotated; PDFExtractor is built once in __init__ (fail-fast pdf_backend validation + no per-call rebuild, reference_fetcher.py:93); the two MIN_*_CHARS thresholds are now documented.

🟢 Minor residual observations (non-blocking)

Boilerplate-HTML can still be stored as "full text." Sniffing correctly keeps an HTML interstitial out of pypdf, but if that landing page has >MIN_FULL_TEXT_CHARS (500) of boilerplate it's stored as full_text_html and marks full_text_attempted=True, ending the chain. Sniffing solved the crash/garbage half of Bump actions/checkout from 4 to 6 #4; the false-positive full text half remains. Not worth blocking on, but a follow-up could add a light "looks like a paywall/landing page" heuristic (e.g. presence of <form, "subscribe", very low text-to-markup ratio) before accepting HTML as full text.
URLSource still rebuilds PDFExtractor per call (url.py:83) and re-implements the magic-byte sniff inline rather than reusing sniff_format. Harmless (it's not on a hot loop), but reusing sniff_format(data) here would keep the detection logic in one place. Fix this →
Dead PMC code in sources/pmid.py is now tracked in Remove dead PMC full-text code from sources/pmid.py (superseded by PMCFullTextProvider) #52 per the commit message — 👍, good to see it filed rather than lingering.

Overall: the correctness concern (#1) that I most wanted addressed before merge is handled cleanly and with the right test matrix, and #2/#3/#4–6 are all solid, contained fixes. The residual items above are polish, not blockers. This is mergeable. Nice work.

(Note: I reviewed the diff and tests statically; I couldn't run the suite in this job due to command-approval restrictions, but the PR's CI run is the source of truth for green.)

cmungall and others added 23 commits June 12, 2026 16:54

feat: add ReferenceIdentifiers, FullTextLocation, and ReferenceConten…

2eb5ce6

…t provenance fields Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat: add full-text fetching config fields and files cache dir

74395f7

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat: add Extractor base class and ExtractorRegistry

2d6fe92

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat: add HTML and JATS/XML extractors

4dd326d

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat: add pluggable PDF extractor with pypdf default backend

c29c881

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat: add ContentAcquirer with size cap and format resolution

f9a0b70

feat: add FullTextProvider base class and registry

8313753

feat: add UnpaywallProvider

be510ee

feat: add OpenAlexProvider

82b537e

feat: add PMCFullTextProvider; route PMID full text through the chain

902c9d9

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat: add identifier crosswalk for full-text providers

b8fd5a6

feat: wire full-text provider chain into ReferenceFetcher with PDF do…

58f42bc

…wnload+extract

feat: persist full-text provenance fields in the disk cache

bed4c5a

test: isolate fetcher unit tests from full-text enrichment default

4ebc0e9

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat: extract text from PDFs served at a url: reference

95733dd

feat: add declarative custom full-text providers and YAML loader

4b4b950

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat: register custom full-text providers at init and add --no-full-t…

b2d02c0

…ext CLI flag Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test: end-to-end full-text chain integration (metadata -> chain -> PDF)

d3054ab

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

docs: how-to for full-text and PDF fetching

882663f

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge branch 'main' into feature/modular-pdf-fulltext-fetching

5220d36

Merge remote-tracking branch 'origin/main' into feature/modular-pdf-f…

cdbf876

…ulltext-fetching # Conflicts: # src/linkml_reference_validator/cli/validate.py

This was referenced Jun 17, 2026

test: cover deeply nested evidence traversal in the reference plugin #51

Merged

Remove dead PMC full-text code from sources/pmid.py (superseded by PMCFullTextProvider) #52

Open

cmungall merged commit 728c7c2 into main Jun 17, 2026
5 checks passed

cmungall deleted the feature/modular-pdf-fulltext-fetching branch June 17, 2026 17:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modular PDF & full-text fetching framework#48

Modular PDF & full-text fetching framework#48
cmungall merged 25 commits into
mainfrom
feature/modular-pdf-fulltext-fetching

cmungall commented Jun 14, 2026

Uh oh!

claude Bot commented Jun 14, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 17, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 17, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cmungall commented Jun 14, 2026

Summary

Test Plan

Known follow-ups (deliberately deferred in the plan)

Note

Uh oh!

claude Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Modular PDF & full-text fetching framework

✅ Overall Assessment: Excellent Architecture and Implementation

Strengths

1. Outstanding Architecture Design

2. Robust Implementation

3. Excellent Security Practices

4. Comprehensive Test Coverage

Minor Issues & Recommendations

Performance Considerations

Code Quality Improvements

Security Hardening

Excellent Practices Observed

1. Test-Driven Development

2. Configuration Management

3. Documentation

Performance & Scalability

Notable Implementation Details

✅ Final Recommendation: APPROVE

Uh oh!

claude Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Modular PDF & full-text fetching framework

🔴 Correctness / behavior

🟡 Robustness

🟢 Minor / polish

Test coverage

Uh oh!

claude Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Modular PDF & full-text fetching framework (follow-up on c3b7876)

✅ Resolved

🟢 Minor residual observations (non-blocking)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Jun 14, 2026 •

edited

Loading

claude Bot commented Jun 17, 2026 •

edited

Loading

claude Bot commented Jun 17, 2026 •

edited

Loading

PR Review: Modular PDF & full-text fetching framework (follow-up on `c3b7876`)