17 changes: 12 additions & 5 deletions AGENTS.md
@@ -9,10 +9,8 @@ Use this file primarily when operating as a coding agent. Its intent is to captu
- When instructions here conflict with new information, trust the current codebase and update AGENTS.md alongside your change. If critical context is still missing, pause and ask the maintainer rather than guessing.

## Project Snapshot
- CLI-first tool for normalizing taxonomy: ingest (Parquet/CSV) → parse/group (`TaxonomicEntry`/`EntryGroupRef`) → plan + run GNVerifier queries → classify via strategy profiles → write
resolved & unsolved outputs → optional common-name enrichment.
- Source layout: CLI entry (`src/taxonopy/cli.py`), parsing/grouping/cache (`input_parser`, `entry_grouper`, `cache_manager`), query stack (`query/planner|executor|gnverifier_client`),
resolution logic (`resolution/attempt_manager` + profiles), outputs (`output_manager`), tracing (`trace/entry.py`).
- CLI-first tool for normalizing taxonomy: ingest (Parquet/CSV) → parse/group (`TaxonomicEntry`/`EntryGroupRef`) → plan + run GNVerifier queries → classify via strategy profiles → write resolved & unsolved outputs → optional common-name enrichment.
- Source layout: CLI entry (`src/taxonopy/cli.py`), parsing/grouping/cache (`input_parser`, `entry_grouper`, `cache_manager`), query stack (`query/planner|executor|gnverifier_client`), resolution logic (`resolution/attempt_manager` + profiles), outputs (`output_manager`), manifest tracking (`manifest.py`), tracing (`trace/entry.py`).
- Dependencies (see `pyproject.toml`): Python ≥ 3.10, Polars, Pandas/PyArrow, Pydantic v2, tqdm, requests; dev extras provide Ruff, pytest scaffolding, datamodel-code-generator, pre-commit.

## Environment Setup
@@ -73,13 +71,17 @@ taxonopy common-names \
- `--clear-cache`
- `--refresh-cache` (per run) to ignore stale grouping/parsing caches.
- Don’t delete cache files manually unless instructed; prefer the flags above.
- `--full-rerun` clears the input-scoped cache namespace and deletes only the files listed in `taxonopy_resolve_manifest.json` (written before any output on every run). Non-TaxonoPy files in the output directory are never touched. If no manifest is found (pre-v0.3.0 output), a warning is logged and no files are removed.

## Validation & QA
- Run `ruff check .` after modifying Python files (requires the `dev` extra).
- Run `pytest` even though the suite is sparse today; it protects future additions and should pass cleanly.
- Validate functional changes by running `taxonopy resolve` against `examples/input` (or issue-specific datasets) and reviewing outputs/logs, plus `taxonopy trace entry ...` when touching parsing/grouping logic.

## Coding Conventions
- Don't hard-wrap comments. Only use line breaks for new paragraphs. Let the editor soft-wrap content.
- Don't hard-wrap string literals. Keep each log or user-facing message in a single source line and rely on soft wrapping when reading it.
- Don't hard-wrap markdown prose in documentation. Let the renderer wrap lines as needed.
- Prefer frozen dataclasses (`types/data_classes.py`) for shared structures; mutate via new objects rather than in-place edits.
- Rely on strong typing + Pydantic models for external data (`types/gnverifier.py`); regenerate via the helper script instead of editing generated files.
- Log through the standard logging config (`logging_config.setup_logging`) and keep tqdm progress bars for long-running loops.
@@ -100,7 +102,12 @@ taxonopy common-names \
- Follow best version control practices including, but not limited to, the following:
- At the start of a session, ensure that work is done on a relevant branch (not `main`), and pull the latest changes from `main` before starting.
- Make commit messages imperative, one line, and descriptive of the change's "what" and "why" (not "how"). Any needed description beyond this can go in the extended body.
- For every commit you produce, append "[AI-assisted session]" as a final line in the extended commit message body.
- For every commit you produce, add a `Co-Authored-By` trailer in the extended commit message body identifying the model and provider, e.g.:
```
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: GPT-4o <noreply@openai.com>
Co-Authored-By: Gemini 2.0 Flash <noreply@google.com>
```
- Do not use Git or the GitHub CLI for any destructive actions like `git reset --hard`, `git rebase`, `git push --force`, `git branch -D`, `gh repo delete`, `gh issue delete`, and so on, nor commands like `rm -rf` that delete files or directories. If you consider a destructive command to be necessary, stop and discuss the situation with a maintainer.
- When modifying CLI behavior, resolution strategies, or caching semantics, update this AGENTS file so future agents follow the latest contract.
- Run `ruff check .`, `pytest`, and the sample `taxonopy resolve` workflow before handing off changes or opening discussions with maintainers.
8 changes: 3 additions & 5 deletions docs/user-guide/io/cache.md
@@ -1,7 +1,6 @@
# Cache

TaxonoPy caches intermediate results (like parsed inputs and grouped entries) to
speed up repeated runs on the same dataset.
TaxonoPy caches intermediate results (like parsed inputs and grouped entries) to speed up repeated runs on the same dataset.

## Location

@@ -40,7 +39,6 @@ This keeps caches isolated across datasets and releases.
- `--cache-stats` — show cache statistics and exit.
- `--clear-cache` — remove cached objects.
- `--refresh-cache` (resolve only) — ignore cached parse/group results.
- `--full-rerun` (resolve only) — clear cache for the input and overwrite outputs.
- `--full-rerun` (resolve only) — clear the input-scoped cache and remove TaxonoPy-specific output files before rerunning. See [Reruns](reruns.md) for full details.

If you change input files or want to force a clean run, use `--refresh-cache` or
`--full-rerun`.
If you change input files or want to force a clean run, use `--refresh-cache` or `--full-rerun`.
5 changes: 2 additions & 3 deletions docs/user-guide/io/index.md
@@ -1,9 +1,8 @@
# IO

TaxonoPy accepts CSV or Parquet inputs with the same schema. Use the pages below
for the exact input columns, the structure of resolved/unsolved outputs, and how
the cache supports provenance and transparency throughout the resolution process.
TaxonoPy accepts CSV or Parquet inputs with the same schema. Use the pages below for the exact input columns, the structure of resolved/unsolved outputs, and how the cache supports provenance and transparency throughout the resolution process.

- [Input](input.md)
- [Output](output.md)
- [Cache](cache.md)
- [Reruns](reruns.md)
15 changes: 8 additions & 7 deletions docs/user-guide/io/output.md
@@ -5,20 +5,21 @@ When you run `taxonopy resolve`, TaxonoPy writes two outputs for each input file
- **Resolved**: `<input_name>.resolved.<csv|parquet>`
- **Unsolved**: `<input_name>.unsolved.<csv|parquet>`

The output directory mirrors the input directory structure. Output format is
controlled by the `--output-format` flag (`csv` or `parquet`).
The output directory mirrors the input directory structure. Output format is controlled by the `--output-format` flag (`csv` or `parquet`).

## What’s Inside
TaxonoPy also writes a manifest file to the output directory before creating any other files. This manifest lists every file the run intends to produce, and `--full-rerun` uses it to remove exactly those files. Each command writes its own manifest (`taxonopy_resolve_manifest.json` for `resolve`, `taxonopy_common_names_manifest.json` for `common-names`), so the two coexist safely if both commands share an output directory. See [Reruns](reruns.md) for details.

Each output row corresponds to one input record. Resolved entries contain the
standardized taxonomy where available, while unsolved entries preserve the
original input ranks. Both outputs include resolution metadata such as status
and strategy information.
## What's Inside

Each output row corresponds to one input record. Resolved entries contain the standardized taxonomy where available, while unsolved entries preserve the original input ranks. Both outputs include resolution metadata such as status and strategy information.

Running through the sample resolution results in the following core files:

- `taxonopy resolve`:
- `examples/resolved/sample.resolved.parquet`
- `examples/resolved/sample.unsolved.parquet`
- `examples/resolved/resolution_stats.json`
- `examples/resolved/taxonopy_resolve_manifest.json`
- `taxonopy common-names`:
- `examples/resolved/common/sample.resolved.parquet`
- `examples/resolved/common/taxonopy_common_names_manifest.json`
77 changes: 77 additions & 0 deletions docs/user-guide/io/reruns.md
@@ -0,0 +1,77 @@
# Reruns

## The Guard

TaxonoPy checks for existing output before processing. If a prior run is detected for the current input, it exits with a warning rather than silently overwriting:

```
Existing cache (...) and/or output (...) detected for this input.
Rerun with --full-rerun to replace them.
```

Existing output is detected by either of two signals (a non-empty cache namespace for the input also triggers the guard):

- the presence of a `taxonopy_resolve_manifest.json` in the output directory (written by any run using TaxonoPy v0.3.0 or later), or
- `.resolved.*` files in the output directory root (legacy fallback for output produced by earlier versions).
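
The two output signals can be sketched as follows. This is a minimal illustration modeled on the check in `run_resolve`; the helper name `has_existing_output` and the `RESOLVE_MANIFEST` constant are hypothetical and not part of TaxonoPy's API.

```python
from pathlib import Path

# Manifest filename for the resolve command, as documented on this page.
RESOLVE_MANIFEST = "taxonopy_resolve_manifest.json"


def has_existing_output(output_dir: Path) -> bool:
    """Return True if output_dir appears to hold output from a prior resolve run.

    Hypothetical helper, for illustration only.
    """
    # Signal 1: a manifest written by TaxonoPy v0.3.0 or later.
    if (output_dir / RESOLVE_MANIFEST).exists():
        return True
    # Signal 2 (legacy fallback): resolved files in the directory root.
    return any(output_dir.glob("*.resolved.*"))
```

Note that the glob is not recursive, so only files in the output directory root count toward the legacy signal.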

## `--full-rerun`

`--full-rerun` is the explicit escape hatch through the guard. It clears the input-scoped cache namespace and removes the output files listed in the manifest before proceeding.

```console
taxonopy resolve \
--input examples/input \
--output-dir examples/resolved \
--full-rerun
```

### What it touches

- **Cache**: the namespace scoped to the current command, TaxonoPy version, and input fingerprint. Other namespaces (different inputs, different versions) are not affected.
- **Output files**: only the files listed in `taxonopy_resolve_manifest.json`. Any other files in the output directory are left untouched.

### What it does not touch

- Files not listed in the manifest — including any non-TaxonoPy files you have placed in the output directory.
- Cache namespaces from other runs.

### No manifest found

If `--full-rerun` is used but no manifest is present (e.g. output from a pre-v0.3.0 run, or a manually populated directory), TaxonoPy logs a warning and proceeds without removing any files:

```
--full-rerun: no manifest found in <output-dir>; no output files were removed.
```

The run then writes fresh output and a new manifest.
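
The manifest-driven cleanup described above could look roughly like the sketch below. It is not TaxonoPy's actual `delete_from_manifest`; the function name `delete_listed_files`, its signature, and the check that skips entries resolving outside the output directory are assumptions made for illustration.

```python
import json
from pathlib import Path


def delete_listed_files(output_dir: Path, manifest_name: str) -> bool:
    """Delete only the files named in the manifest.

    Returns False when no manifest exists, so the caller can log a warning.
    Hypothetical sketch, not TaxonoPy's implementation.
    """
    manifest_path = output_dir / manifest_name
    if not manifest_path.exists():
        # No manifest: remove nothing.
        return False

    # Read the file list first; the manifest may list itself for deletion.
    manifest = json.loads(manifest_path.read_text())
    root = output_dir.resolve()
    for rel in manifest.get("files", []):
        target = (output_dir / rel).resolve()
        # Assumption: ignore entries that resolve outside the output directory.
        if root not in target.parents:
            continue
        if target.is_file():
            target.unlink()
    return True
```

Files that are absent from the `files` list are never visited, which is what keeps unrelated content in the output directory safe.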

## The Manifest

Every TaxonoPy run writes a manifest file to the output directory **before** creating any output. Because the manifest lists every file the run intends to produce, even an interrupted run leaves a complete record of what should be cleaned up, and `--full-rerun` deletes exactly those files and nothing else.

Manifest files are command-scoped so they coexist safely if multiple commands share an output directory:

| Command | Manifest file |
|---|---|
| `resolve` | `taxonopy_resolve_manifest.json` |
| `common-names` | `taxonopy_common_names_manifest.json` |

### Schema

```json
{
"taxonopy_version": "0.3.0",
"command": "resolve",
"created_at": "2025-07-19T10:38:04.123456",
"input": "examples/input",
"cache_namespace": "~/.cache/taxonopy/resolve_v0.3.0_a3f9b2c1d4e5f678",
"files": [
"sample.resolved.parquet",
"sample.unsolved.parquet",
"resolution_stats.json",
"taxonopy_resolve_manifest.json"
]
}
```

All paths in `files` are relative to the output directory. `cache_namespace` is `null` for `common-names`, which does not use an input-scoped cache.
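
As a rough illustration of the schema, the sketch below builds and writes a manifest with the same fields. The function name `write_manifest_sketch`, its signature, deriving the filename from the command name, and appending the manifest's own name to `files` are assumptions for illustration; TaxonoPy's real `write_manifest` may differ.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import List, Optional


def write_manifest_sketch(
    output_dir: Path,
    command: str,
    version: str,
    input_path: str,
    cache_namespace: Optional[str],
    files: List[str],
) -> Path:
    """Write a command-scoped manifest before any output is produced.

    Hypothetical sketch, not TaxonoPy's implementation.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: "common-names" maps to "taxonopy_common_names_manifest.json".
    manifest_name = f"taxonopy_{command.replace('-', '_')}_manifest.json"

    listed = list(files)
    if manifest_name not in listed:
        # The documented example lists the manifest among its own files.
        listed.append(manifest_name)

    manifest = {
        "taxonopy_version": version,
        "command": command,
        "created_at": datetime.now().isoformat(),
        "input": input_path,
        "cache_namespace": cache_namespace,  # serialized as null when None
        "files": listed,  # paths relative to output_dir
    }
    path = output_dir / manifest_name
    path.write_text(json.dumps(manifest, indent=2))
    return path
```

Passing `None` as `cache_namespace` yields `null` in the JSON, matching the `common-names` case described above.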
4 changes: 3 additions & 1 deletion docs/user-guide/quick-reference.md
@@ -88,7 +88,7 @@ taxonopy common-names \
--output-dir examples/resolved/common
```

This command uses GBIF Backbone data only and applies deterministic fallback: species to kingdom, with English names preferred at each rank.
This command uses GBIF Backbone data only and applies deterministic fallback: species to kingdom, with English names preferred at each rank. It also writes a `taxonopy_common_names_manifest.json` to the output directory.

_**Sample common-name output (`examples/resolved/common/sample.resolved.parquet`)**; the last two rows (both Laelia rosea) fall back to family-level common names—none available at species or genus rank._
<div class="table-cell-scroll" markdown>
@@ -120,6 +120,8 @@ The `resolution_stats.json` file summarizes counts of how many entries from the

TaxonoPy also writes cache data to disk (default: `~/.cache/taxonopy`) so it can trace provenance and avoid reprocessing. Use `--show-cache-path`, `--cache-stats`, or `--clear-cache` if you want to inspect or manage it, or see the [Cache](io/cache.md) guide for details.

If TaxonoPy detects existing output for your input, it exits with a warning. Use `--full-rerun` to clear the cache and remove previous outputs before rerunning. TaxonoPy records the files it produces in a manifest, so only the files listed there are removed; nothing else in your output directory is touched. See [Reruns](io/reruns.md) for details.

## Trace an Entry

You can trace how a single UUID was resolved. For example, let's trace one of the _Laelia rosea_ entries:
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -15,6 +15,7 @@ nav:
- Input: user-guide/io/input.md
- Output: user-guide/io/output.md
- Cache: user-guide/io/cache.md
- Reruns: user-guide/io/reruns.md
- Development:
- Contributing:
- development/contributing/index.md
34 changes: 28 additions & 6 deletions src/taxonopy/cli.py
@@ -11,7 +11,12 @@
from pathlib import Path
from typing import List, Optional
import json
import shutil
from taxonopy.manifest import (
MANIFEST_FILENAMES,
delete_from_manifest,
get_intended_files_for_resolve,
write_manifest,
)

from taxonopy import __version__
from taxonopy.config import config
@@ -182,7 +187,10 @@ def run_resolve(args: argparse.Namespace) -> int:

namespace_stats = get_cache_stats()
existing_namespace = namespace_stats["entry_count"] > 0 and not cache_cleared_via_flag
existing_output = any(output_dir.glob("*.resolved.*"))
existing_output = (
(output_dir / MANIFEST_FILENAMES["resolve"]).exists()
or any(output_dir.glob("*.resolved.*"))
)
if (existing_namespace or existing_output) and not args.full_rerun:
logging.warning(
"Existing cache (%s) and/or output (%s) detected for this input. Rerun with --full-rerun to replace them.",
@@ -191,11 +199,13 @@ def run_resolve(args: argparse.Namespace) -> int:
)
return 0
if args.full_rerun:
logging.info("--full-rerun set: clearing cache and output directory before proceeding.")
logging.info("--full-rerun set: clearing cache and TaxonoPy-specific output files before proceeding.")
clear_cache()
if output_dir.exists():
shutil.rmtree(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
if not delete_from_manifest(str(output_dir), "resolve"):
logging.warning(
"--full-rerun: no manifest found in %s; no output files were removed.",
output_dir,
)

try:
start_time = time.time()
@@ -204,6 +214,12 @@ def run_resolve(args: argparse.Namespace) -> int:

if args.force_input:
logging.info("Skipping resolution due to --force-input flag")
write_manifest(
str(output_dir), "resolve", __version__, args.input, str(cache_path),
get_intended_files_for_resolve(
args.input, input_files, str(output_dir), args.output_format, force_input=True
),
)
generated_files = generate_forced_output(args.input, args.output_dir, args.output_format)
elapsed_time = time.time() - start_time
logging.info(f"Forced output completed in {elapsed_time:.2f} seconds. Files: {generated_files}")
@@ -259,6 +275,12 @@ def run_resolve(args: argparse.Namespace) -> int:


# 4. Generate output
write_manifest(
str(output_dir), "resolve", __version__, args.input, str(cache_path),
get_intended_files_for_resolve(
args.input, input_files, str(output_dir), args.output_format, force_input=False
),
)
logging.info("Generating output files...")
# Pass only the manager and entry group map
resolved_files, unsolved_files = generate_resolution_output(