Production-readiness refactor: drop MinIO, make Crate IDs more robust, add async validation and offline cache#186
Production-readiness refactor: drop MinIO, make Crate IDs more robust, add async validation and offline cache#186alexhambley wants to merge 16 commits into
Conversation
- Replace the import-time Config classes with a validated Settings dataclass built in create_app(). - Fails early with an aggregated error when storage is enabled but the required S3/Celery vars are missing. - Gate storage routes on settings.storage_enabled; Settings injected in tests instead of monkeypatching. - Add app/storage as a "vendor-neutral" abstraction (so, works with MinIO, RustFS, or any other S3 service) - Closes #171, #172
Implement S3Backend using boto3. Supports MinIO SDK features such as pagination. Includes tests Refs #172
Add app/crates/ids.py: validate_crate_id / is_valid_crate_id and an InvalidCrateId exception. IDs are constrained to a safe charset (start alphanumeric, [A-Za-z0-9._-], max 128, no '/' or '..'). Fixes the ".zip in crate ID" fragility. IDs are no longer parsed. Closes #174
Add app/crates/layout.py (separate crate/result key prefixes) and resolver.py. `resolve_crate()` locates a crate by direct stat checks on "canonical keys" instead of prefix-listing. - Distinguishes zip vs directory, - confirms ro-crate-metadata.json for directories, and - reports "AmbiguousCrate" / "CrateNotFound" - Fixes sibling false-match, .zip-in-id, and some missing-metadata gaps. Closes #175
- Add `app/validation/results.py` (`ValidationOutcome` with valid/invalid/error status, detail/error, serialisation) and `runner.py` wrapping rocrate_validator so both entry points always return an outcome; - Replaces the ValidationResult|str / isinstance pattern. Closes #176
- replaces metadata path with `run_metadata_validation()`, which validates inline via the runner and returns a `ValidationOutcome` (200 valid/invalid, 422 unvalidatable). - remove the metadata Celery task and `perform_metadata_validation`, which addresses the result.get() blocking and the return-in-finally bug. closes #169
- Move POST/task/GET onto StorageBackend + resolver + runner with server-side credentials. POST resolves before queueing (400/404/409/503 via app error handlers); - the Celery task (`run_validation_job`) runs fetch->validate->persist->webhook with retries and persists error outcomes; GET reads the results prefix. - Add s3_crate_prefix/s3_results_prefix to Settings. Delete minio_utils and test_minio, remove minio from requirements and recompile the lock. - Covers the store-backed validation flow Closes #177, #178, #179
- Refactor `send_webhook_notification` to retry failures with exponential backoff, add a request timeout, and raise `WebhookDeliveryError` on terminal failure instead of accepting it. - Results are persisted so a delivery failure can be surfaced without losing the result. Closes #180
Rewrite logging_service with a JsonFormatter, a RedactionFilter that masks configured S3 credentials, and a RequestIdFilter closes #177
- Adds endpoints to check if the S3 is reachable and Celery / broker connected; - Returns 503 with a per-check breakdown when a dependency is down. - When storage is disabled both checks report "disabled". Closes #181
- Consolidate packaging and tooling into pyproject.toml and standardise linting and formatting with ruff. - Move metadata and deps into pyproject.toml; compile requirements.txt (runtime) and requirements-dev.txt (runtime + dev: pytest, pytest-mock, moto, ruff) from it via pip-tools; retire requirements.in. - Move pytest config into [tool.pytest.ini_options] and remove pytest.ini. - Add [tool.ruff] config; apply ruff check --fix and ruff format across code. - CI: unit_tests installs requirements-dev.txt; add a lint workflow running ruff check and ruff format --check on PRs to develop. Closes #183, #173
- Replace the MinIO dev container with a vendor-neutral "objectstore" service running rustfs/rustfs. - Update .env/example.env to S3_* app config plus RUSTFS_* container credentials; the app/worker stay vendor-neutral (S3_* only). - Start with `docker compose --profile objectstore up`.
- Move unit tests into app-mirroring sub-packages (storage/, crates/, etc.) and switch pytest to importlib import mode with pythonpath=".". - Rewrite the integration test for the new stack - Update the integration workflow deps Closes #184
… and image A few final things to bundle together: - bump `roc-validator` 0.9.0 -> 0.10.0 to use offline/cache/extra-profiles - rename the WSGI entrypoint and align image/label naming. - Add `VALIDATION_OFFLINE`, `CACHE_PATH` and `EXTRA_PROFILES_PATH` settings - Offline validation is opt-in (default off). - `extra_profiles_path` ADDS to the bundled profiles (roc-validator 0.10.0 fix) - Warm validator HTTP cache at image build (rocrate-validator cache warm) - Replace the site-packages profile bake with extraction to /app/extra-profiles loaded via `EXTRA_PROFILES_PATH` - This is passed as a build arg from the profiles workflow. - Rename entrypoint `cratey.py` -> `wsgi.py` to fit convention - Rename the published image and the five-safes label to ro-crate-validation-service; point image.source at the renamed repo. Closes #162, #163, #164
|
@alexhambley to make this slightly more manageable to review, are you able to pop the last three commits (particularly the last one because it touches many files) off into a separate PR that depends on this one? |
AnnieZQH
left a comment
There was a problem hiding this comment.
I was able to bring the dev stack up locally and successfully tested the API endpoints. I've also read through the changes and the tests. The coverage report looks good.
The only issue I've found: the renamed image is not on eScienceLab's container registry yet, so using docker-compose.yml would throw an error.
| flask: | ||
| platform: linux/x86_64 | ||
| image: "ghcr.io/esciencelab/cratey-validator:0.1" | ||
| image: "ghcr.io/esciencelab/ro-crate-validation-service:0.1" |
douglowe
left a comment
There was a problem hiding this comment.
I've a few build comments, and some small swagger api comments.
I do wish that this PR was a series of separate PR's for each feature. I've not reviewed the async validation and offline cache changes yet, because I've been checking other changes.
I like the health checks - I think these are a good addition. And the making crate ID's more robust seems sensible. As does moving to more generic S3 libraries.
I am concerned about the removal of the S3 secrets from the API call. These were added at the explicit request of our project partners, because they access S3 services using ephemeral secrets. We should have a design conversation about this, to decide if this is something that needs to be supported or not.
Regarding the logging checks - I agree we should try to avoid recording secrets in the logs. This will be more difficult if we are allowing ephemeral secrets to be used, as we won't be able to directly grep for known secrets. But we could add field name redaction, or entropy / length checks, so that most secrets which might be passed are screened out.
|
|
||
| steps: | ||
| - name: Checkout code | ||
| uses: actions/checkout@v4 |
There was a problem hiding this comment.
Within the RSE team we're moving to specifying the exact hash of the action we want to use, rather than only the version number (c.f.: https://github.com/UoMResearchIT/Actions/blob/20365fa75ab643043ec188ef0b67fd5e537d326c/reuse/action.yml#L67). Should we do the same here?
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install pytest requests minio docker | ||
| pip install pytest requests boto3 |
There was a problem hiding this comment.
Should we pin our pytest, requests, boto3 versions here? The most recent tests for this PR ran with pytest-9.1.0, requests-2.33.1, and boto3-1.43.29
| --strip-components=3 \ | ||
| "rocrate-validator-${FIVE_SAFES_PROFILE_VERSION}/rocrate_validator/profiles/five-safes-crate" && \ | ||
| rm /tmp/profiles.tar.gz ; \ | ||
| fi |
There was a problem hiding this comment.
We can, and should, organise this RUN command better:
RUN <<EOF_PROFILES
if [ -n "$PROFILES_ARCHIVE_URL" ]; then
mkdir -p /app/extra-profiles
wget -O /tmp/profiles.tar.gz "$PROFILES_ARCHIVE_URL"
tar -xzf /tmp/profiles.tar.gz
-C /app/extra-profiles \
--strip-components=3 \
"rocrate-validator-${FIVE_SAFES_PROFILE_VERSION}/rocrate_validator/profiles/five-safes-crate"
rm /tmp/profiles.tar.gz
fi
EOF_PROFILES
| --extra-profiles-path "$EXTRA_PROFILES_PATH" --cache-path "$CACHE_PATH" ; \ | ||
| else \ | ||
| rocrate-validator cache warm --all-profiles --cache-path "$CACHE_PATH" ; \ | ||
| fi |
There was a problem hiding this comment.
Same as above, let's do this better:
RUN <<<EOF_PROFILES_PATH
if [ -n "$EXTRA_PROFILES_PATH" ]; then
rocrate-validator cache warm --all-profiles \
--extra-profiles-path "$EXTRA_PROFILES_PATH" --cache-path "$CACHE_PATH"
else
rocrate-validator cache warm --all-profiles --cache-path "$CACHE_PATH"
fi
EOF_PROFILES_PATH
| if: always() | ||
| run: docker compose down | ||
| run: > | ||
| docker compose -f docker-compose-develop.yml -p cratey_integration |
There was a problem hiding this comment.
Do we need the project name string here? (-p cratey_integration)
| accesskey = String(required=True) | ||
| secret = String(required=True) | ||
| ssl = Boolean(required=True) | ||
| bucket = String(required=True) |
There was a problem hiding this comment.
This class was originally added because of a request from the Nottingham team. They manage access to MinIO using temporary access and secret keys - so requested that we have such a schema in the API for access to the object store. If we move these back into the docker compose file then we are locking each validator instance to only accessing a single S3 store, and potentially blocking compatibility of this tool with the 5S-TES stack.
Can we discuss this change with our collaborators, to make sure we are moving in the correct direction here?
| ) | ||
|
|
||
| # Always-on blueprint: | ||
| post_routes_bp = APIBlueprint("post_routes", __name__) |
There was a problem hiding this comment.
It's annoying me that the swagger docs have 2 post routes (a small problem I know, but I think it makes the swagger docs look less coherent).
We could combine these by adding the same tag to each of the APIBlueprint calls. e.g.:
post_routes_bp = APIBlueprint("post_routes", __name__, tag="Post_Routes")
...
minio_post_routes_bp = APIBlueprint("minio_post_routes", __name__, tag="Post_Routes")
TLDR
This is a big refactor that strengthens the validation service by adding:
minio-pySDKPR
I've broken down the changes in more detail below:
Storage and RO-Crates
StorageBackendabstraction-layer with a boto3S3Backend(this should work with AWS / MinIO / RustFS / etc.) and an in-memory fake for tests.minio-pySDK is removed;/or..). This fixes the bugs/fragility with.zip-in-ID, by treating IDs as opaque;Validation
ValidationOutcometype (valid/invalid/error) that replaces the oldValidationResult | str; one boundary torocrate-validator;result.get());rocrate-validatorto 0.10.0 for (opt-in) offline validation. Cache warming at build, and--extra-profiles-path.APIs
DevOps Stuff
/healthzand/readyzadded to check the health and readiness for the storage + Redis broker ;pyproject.toml+ pip-tools locks, ruff lint/format, CI updated;cratey.py→wsgi.pyfor convention;ro-crate-validation-service;app/.Breaking Changes
minio_config/root_path.crates/, results undervalidation-results/.rocrate-validator >= 0.10.0Closes #170