Merged
25 commits
b5ce42f
docs: design spec for modular PDF & full-text fetching framework
cmungall Jun 12, 2026
2eabf06
docs: implementation plan for modular PDF & full-text fetching
cmungall Jun 13, 2026
2eb5ce6
feat: add ReferenceIdentifiers, FullTextLocation, and ReferenceConten…
cmungall Jun 13, 2026
74395f7
feat: add full-text fetching config fields and files cache dir
cmungall Jun 13, 2026
2d6fe92
feat: add Extractor base class and ExtractorRegistry
cmungall Jun 13, 2026
4dd326d
feat: add HTML and JATS/XML extractors
cmungall Jun 13, 2026
c29c881
feat: add pluggable PDF extractor with pypdf default backend
cmungall Jun 13, 2026
f9a0b70
feat: add ContentAcquirer with size cap and format resolution
cmungall Jun 13, 2026
8313753
feat: add FullTextProvider base class and registry
cmungall Jun 13, 2026
be510ee
feat: add UnpaywallProvider
cmungall Jun 13, 2026
82b537e
feat: add OpenAlexProvider
cmungall Jun 13, 2026
902c9d9
feat: add PMCFullTextProvider; route PMID full text through the chain
cmungall Jun 13, 2026
b8fd5a6
feat: add identifier crosswalk for full-text providers
cmungall Jun 13, 2026
58f42bc
feat: wire full-text provider chain into ReferenceFetcher with PDF do…
cmungall Jun 13, 2026
bed4c5a
feat: persist full-text provenance fields in the disk cache
cmungall Jun 13, 2026
4ebc0e9
test: isolate fetcher unit tests from full-text enrichment default
cmungall Jun 13, 2026
95733dd
feat: extract text from PDFs served at a url: reference
cmungall Jun 13, 2026
4b4b950
feat: add declarative custom full-text providers and YAML loader
cmungall Jun 13, 2026
b2d02c0
feat: register custom full-text providers at init and add --no-full-t…
cmungall Jun 13, 2026
d3054ab
test: end-to-end full-text chain integration (metadata -> chain -> PDF)
cmungall Jun 13, 2026
882663f
docs: how-to for full-text and PDF fetching
cmungall Jun 13, 2026
6ad01ac
style: remove unused imports in full-text test modules
cmungall Jun 13, 2026
5220d36
Merge branch 'main' into feature/modular-pdf-fulltext-fetching
cmungall Jun 14, 2026
cdbf876
Merge remote-tracking branch 'origin/main' into feature/modular-pdf-f…
cmungall Jun 17, 2026
c3b7876
fix(etl): address PR #48 review — full-text caching, size cap, conten…
cmungall Jun 17, 2026
139 changes: 139 additions & 0 deletions docs/how-to/fetch-full-text-and-pdfs.md
@@ -0,0 +1,139 @@
# Fetching Full Text and PDFs

This guide explains how the validator obtains the **full text** of a reference
(not just its abstract or metadata) by trying a chain of full-text providers
and, when the located copy is a PDF, downloading it and extracting its text.

## Overview

When a metadata source (Crossref, PubMed, DataCite, etc.) returns only an
abstract or title, the validator can fall through to a **full-text provider
chain**. Each provider attempts to *locate* an open-access copy of the
reference. The first provider that yields usable full text wins; the located
resource is downloaded and, if it is a PDF, the text is extracted and cached
alongside the reference.

This behaviour is **on by default**.

## The provider chain

The validator tries providers in order until one returns usable full text.
The default order is:

```
pmc → unpaywall → openalex
```

- **`pmc`** — PubMed Central open-access subset (XML/HTML full text for many
biomedical articles).
- **`unpaywall`** — Unpaywall open-access lookup by DOI (often a publisher or
repository PDF).
- **`openalex`** — OpenAlex open-access location (PDF/HTML).

Providers are skipped silently if they have nothing for a given reference, so
the chain degrades gracefully: a miss in `pmc` simply moves on to `unpaywall`,
and so on. If no provider yields full text, the reference keeps whatever
metadata-only content it already had.
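
The first-hit-wins behaviour described above can be illustrated with a minimal sketch. This is not the validator's actual implementation: the `Provider` callable type, `first_full_text`, and the three stand-in provider functions are hypothetical names used only to show the fall-through logic.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical provider shape: takes a reference id, returns text or None on a miss.
Provider = Callable[[str], Optional[str]]


def first_full_text(
    reference_id: str, providers: Sequence[Tuple[str, Provider]]
) -> Optional[Tuple[str, str]]:
    """Try providers in order; return (provider_name, text) for the first hit."""
    for name, provider in providers:
        text = provider(reference_id)
        if text:  # a miss (None or empty string) silently moves on
            return name, text
    return None  # no provider had full text; caller keeps metadata-only content


# Stand-in providers for illustration only.
def pmc_miss(ref: str) -> Optional[str]:
    return None


def unpaywall_hit(ref: str) -> Optional[str]:
    return f"full text for {ref}"


def openalex_hit(ref: str) -> Optional[str]:
    return "never reached in this example"


chain = [("pmc", pmc_miss), ("unpaywall", unpaywall_hit), ("openalex", openalex_hit)]
result = first_full_text("DOI:10.1234/example", chain)
print(result)
```

Because the loop returns on the first non-empty result, reordering `full_text_providers` changes which source is preferred without changing any provider code.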

## Configuration

Full-text fetching is controlled by the following configuration keys (set in
your config YAML or on the `ReferenceValidationConfig` object):

| Key | Default | Description |
|-----|---------|-------------|
| `fetch_full_text` | `true` | Attempt to obtain full text via the provider chain when a metadata source does not already return full text. |
| `full_text_providers` | `[pmc, unpaywall, openalex]` | Ordered list of provider names to try until one yields usable full text. |
| `pdf_backend` | `pypdf` | Name of the PDF text-extraction backend. |
| `download_pdfs` | `true` | If true, persist downloaded PDFs to the files cache directory. |
| `full_text_providers_file` | `null` | Optional path to a YAML file defining custom full-text providers. |

Two existing keys are also reused by the full-text machinery:

| Key | Description |
|-----|-------------|
| `email` | Used for "polite pool" access to providers such as Unpaywall and OpenAlex (and for Crossref/Entrez). Set this for more reliable access. |
| `max_supplementary_file_size` | Upper bound (in bytes) on individual file downloads; a downloaded full-text PDF larger than this limit is not persisted. |

Example config YAML:

```yaml
email: you@example.org
fetch_full_text: true
full_text_providers:
- pmc
- unpaywall
- openalex
pdf_backend: pypdf
download_pdfs: true
```
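
How `download_pdfs` and `max_supplementary_file_size` might combine when deciding whether to keep a downloaded PDF can be sketched as follows. The function name `should_persist_pdf` is hypothetical, and the sketch covers only the persistence decision stated in the tables above, not how the validator actually enforces the cap during download.

```python
def should_persist_pdf(pdf_bytes: bytes, download_pdfs: bool, max_size: int) -> bool:
    """Return True if a downloaded PDF would be written to the files cache.

    Assumption: a PDF is persisted only when `download_pdfs` is enabled and
    its size does not exceed `max_supplementary_file_size` (in bytes).
    """
    return download_pdfs and len(pdf_bytes) <= max_size


small_pdf = b"%PDF-1.7" + b"\x00" * 100  # placeholder bytes, not a real PDF
print(should_persist_pdf(small_pdf, download_pdfs=True, max_size=1_000))
print(should_persist_pdf(small_pdf, download_pdfs=True, max_size=50))
print(should_persist_pdf(small_pdf, download_pdfs=False, max_size=1_000))
```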

## CLI flag

Every `validate` subcommand accepts a `--full-text/--no-full-text` flag that
toggles `fetch_full_text` for that run (default: on):

```bash
# Default: full-text chain is tried
linkml-reference-validator validate data data.yaml \
  --schema schema.yaml --target-class Statement

# Disable full-text fetching (metadata only)
linkml-reference-validator validate data data.yaml \
  --schema schema.yaml --target-class Statement --no-full-text
```

## Custom providers

You can declare your own JSON-API-backed provider in a YAML file and point
`full_text_providers_file` at it. Each entry under `full_text_providers` is
keyed by the provider name and must supply a `url_template`. The template may
reference `{doi}`, `{pmid}`, or `{pmcid}`. `location_field` is a JSONPath into
the response that holds the URL of the full-text resource; `format_hint` tells
the validator how to treat it; and `headers` are sent with the request, with
`${ENV_VAR}` placeholders interpolated from the environment.

```yaml
full_text_providers:
  myrepo:
    url_template: https://api.example.org/fulltext/{doi}
    location_field: $.links.pdf
    format_hint: pdf
    headers:
      Authorization: Bearer ${MYREPO_TOKEN}
```
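
To make the roles of `url_template`, `headers`, and `location_field` concrete, here is a rough sketch of how such an entry could be resolved. It is an illustration under stated assumptions, not the validator's loader: `expand_env`, `build_request`, and `lookup` are hypothetical helpers, unset environment variables are replaced with an empty string, and `lookup` handles only simple dotted paths such as `$.links.pdf` rather than full JSONPath.

```python
import os
import re
from typing import Any, Dict, Mapping, Optional, Tuple


def expand_env(value: str, env: Mapping[str, str] = os.environ) -> str:
    """Replace ${NAME} placeholders with values from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env.get(m.group(1), ""), value)


def build_request(
    entry: Dict[str, Any],
    identifiers: Dict[str, str],
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Dict[str, str]]:
    """Fill {doi}/{pmid}/{pmcid} in the template and interpolate header values."""
    url = entry["url_template"].format(**identifiers)
    headers = {k: expand_env(v, env) for k, v in entry.get("headers", {}).items()}
    return url, headers


def lookup(response: Any, path: str) -> Optional[Any]:
    """Follow a simple dotted path like '$.links.pdf' through nested dicts."""
    node = response
    for key in path.removeprefix("$.").split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


entry = {
    "url_template": "https://api.example.org/fulltext/{doi}",
    "location_field": "$.links.pdf",
    "format_hint": "pdf",
    "headers": {"Authorization": "Bearer ${MYREPO_TOKEN}"},
}
url, headers = build_request(
    entry, {"doi": "10.1234/abc"}, env={"MYREPO_TOKEN": "s3cret"}
)
# A made-up API response, used only to show the field lookup.
fake_response = {"links": {"pdf": "https://api.example.org/files/abc.pdf"}}
pdf_url = lookup(fake_response, entry["location_field"])
print(url, headers, pdf_url)
```

Keeping the token in an environment variable rather than in the YAML file means the provider definition can be committed without exposing credentials.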

To actually use a custom provider, add its name to the `full_text_providers`
chain (it is not enabled merely by being defined), for example:

```yaml
full_text_providers_file: providers.yaml
full_text_providers:
- myrepo
- pmc
- unpaywall
- openalex
```

## Provenance fields

When a reference is enriched with full text, the cached reference records where
the text came from. These provenance fields are written alongside the content:

| Field | Description |
|-------|-------------|
| `full_text_provider` | Name of the provider that supplied the full text (e.g. `pmc`, `unpaywall`, or a custom name). |
| `oa_status` | Open-access status reported by the provider (e.g. `gold`, `green`, `bronze`). |
| `license` | License of the open-access copy, when known. |
| `local_pdf_path` | Path (relative to the cache directory) to the downloaded PDF, when one was persisted. |

When full text is obtained from a PDF, the reference's `content_type` becomes
`full_text_pdf` and the extracted text replaces the abstract-only content. The
downloaded PDF is stored under the cache directory and can be located at
`cache_dir / local_pdf_path`.
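
Resolving the persisted PDF from a cached record can be sketched with `pathlib`. The record here is a plain dictionary with made-up values, and `resolve_pdf` is a hypothetical helper; how the validator actually exposes cached references is not shown in this guide.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_pdf(cache_dir: Union[str, Path], record: dict) -> Optional[Path]:
    """Return cache_dir / local_pdf_path, or None if no PDF was persisted."""
    relative = record.get("local_pdf_path")
    if not relative:
        return None
    return Path(cache_dir) / relative


# Illustrative record; field names follow the provenance table, values are invented.
record = {
    "full_text_provider": "unpaywall",
    "content_type": "full_text_pdf",
    "local_pdf_path": "files/example.pdf",
}
pdf = resolve_pdf("references_cache", record)
print(pdf)
```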

## See Also

- [Validating DOIs](validate-dois.md) — DOI metadata sources and supplementary files
- [Validating Entrez Accessions](validate-entrez.md) — PubMed/PMC references
- [CLI Reference](../reference/cli.md) — complete command documentation