Commit 51d2eea

Copilot authored and alexdryden committed
feat(arclight#29): Refactor run orchestration for threaded and single-scope runs
Restructured the pipeline for collections and creators to run independently with their own timestamps, proper cleanup, and parallel execution orchestrated via ThreadPoolExecutor.

Changes:
- Split last_updated into last_updated_collections and last_updated_creators
- Extract run_collections() and run_creators() from monolithic run()
- Add run_all() that orchestrates both via ThreadPoolExecutor
- Scope Solr cleanup to record type using is_creator flag
- Update process_deleted_records() to accept scope parameter
- Move update_repositories() into run_all() (only runs for full updates)
- Fix timestamp comparisons to use min() where needed
- Add directory creation safeguards (os.makedirs with exist_ok)
- Change is_creator from string 'true' to boolean true
- Add proper exception handling in parallel execution

Benefits:
- Collections and creators can be rebuilt independently (--collections-only, --agents-only)
- Full runs execute both pipelines in parallel (faster)
- Each record type maintains its own timestamp state
- Solr cleanup is scoped to avoid deleting unrelated records
1 parent 5952798 commit 51d2eea
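
The orchestration described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `state` dictionary, the stubbed bodies of `run_collections()` and `run_creators()`, the `cleanup_query()` helper, and the exact Solr query strings are all assumptions made for the example. Only the function names `run_collections`, `run_creators`, `run_all`, the two timestamp names, and the `is_creator` field come from the message.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

# Hypothetical in-memory state. The commit only states that last_updated was
# split into last_updated_collections and last_updated_creators.
state = {"last_updated_collections": None, "last_updated_creators": None}


def cleanup_query(scope):
    """Build a Solr delete query limited to one record type (assumed syntax)."""
    if scope == "creators":
        return "is_creator:true"
    if scope == "collections":
        # Match every document that is not flagged as a creator.
        return "*:* -is_creator:true"
    raise ValueError(f"unknown scope: {scope}")


def run_collections():
    # Stub standing in for the real collection pipeline.
    state["last_updated_collections"] = datetime.now(timezone.utc)
    return "collections"


def run_creators():
    # Stub standing in for the real creator pipeline.
    state["last_updated_creators"] = datetime.now(timezone.utc)
    return "creators"


def run_all(tasks=(run_collections, run_creators)):
    """Run the given pipelines in parallel and collect per-task errors."""
    results, errors = [], {}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {pool.submit(task): task.__name__ for task in tasks}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # One failing pipeline should not hide the other's result.
                errors[futures[future]] = exc
    return sorted(results), errors
```

Because each pipeline writes only its own timestamp key, a failure in one leaves the other's state untouched, which is presumably the motivation for splitting `last_updated`.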

3 files changed

Lines changed: 191 additions & 90 deletions

README.md

Lines changed: 5 additions & 4 deletions
@@ -61,9 +61,9 @@ This filtering ensures that only legitimate archival creators are discoverable i
 
 ### How Creator Records Work
 
-1. **Extraction**: `get_all_agents()` fetches all agents from ArchivesSpace
-2. **Filtering**: `is_target_agent()` filters out system users, donors, and non-creator agents
-3. **Processing**: `task_agent()` generates an EAC-CPF XML document for each target agent with bioghist notes
+1. **Extraction**: Agent data is exported from ArchivesSpace for use in creator records
+2. **Filtering**: Creator vs. non-creator agents are determined via Solr queries built from `_get_target_agent_criteria()` and `_get_nontarget_agent_criteria()`, which exclude system users, donors, and other non-creator agents
+3. **Processing**: For each target creator agent, ArcFlow generates an EAC-CPF XML document that includes bioghist notes
 4. **Linking**: Handled via Solr using the persistent_id field (agents and collections linked through bioghist references)
 5. **Indexing**: Creator XML files are indexed to Solr using `traject_config_eac_cpf.rb`
 
@@ -182,7 +182,8 @@ python -m arcflow.main --arclight-dir /path --aspace-dir /path --solr-url http:/
 Required arguments:
 - `--arclight-dir` - Path to ArcLight installation directory
 - `--aspace-dir` - Path to ArchivesSpace installation directory
-- `--solr-url` - URL of the Solr core (e.g., http://localhost:8983/solr/blacklight-core)
+- `--solr-url` - URL of the ArcLight Solr core (e.g., http://localhost:8983/solr/blacklight-core)
+- `--aspace-solr-url` URL of the ASpace Solr core
 
 Optional arguments:
 - `--force-update` - Force update of all data (recreates everything from scratch)
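
For readers who want to see how the documented flags might fit together, here is a hedged `argparse` sketch. It is not taken from `arcflow.main`. The flag names come from the README diff and the commit message, but making `--collections-only` and `--agents-only` mutually exclusive, the `select_scope()` helper, and the scope labels it returns are assumptions for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the flags named in the README and commit message."""
    parser = argparse.ArgumentParser(prog="arcflow.main")
    parser.add_argument("--arclight-dir", required=True,
                        help="Path to ArcLight installation directory")
    parser.add_argument("--aspace-dir", required=True,
                        help="Path to ArchivesSpace installation directory")
    parser.add_argument("--solr-url", required=True,
                        help="URL of the ArcLight Solr core")
    parser.add_argument("--aspace-solr-url", required=True,
                        help="URL of the ASpace Solr core")
    parser.add_argument("--force-update", action="store_true",
                        help="Force update of all data")
    # Assumption: the two single-scope flags cannot be combined.
    scope = parser.add_mutually_exclusive_group()
    scope.add_argument("--collections-only", action="store_true")
    scope.add_argument("--agents-only", action="store_true")
    return parser


def select_scope(args):
    """Map parsed flags to a hypothetical scope label."""
    if args.collections_only:
        return "collections"
    if args.agents_only:
        return "creators"
    return "all"
```

Under this sketch, a run with neither single-scope flag selects `"all"`, which corresponds to the full run that the commit message says executes both pipelines in parallel.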
