Production-readiness refactor: drop MinIO, make Crate IDs more robust, add async validation and offline cache by alexhambley · Pull Request #186 · eScienceLab/RO-Crate-Validation-Service

alexhambley · 2026-06-18T09:47:27Z

TLDR

This is a big refactor that strengthens the validation service by adding:

An agnostic S3 storage layer;
Reworking the "Crate ID" model;
Adding consistent result/error types;
Adding a faster metadata path;
Fixing some major production concerns (retries, structured logging, health checks, offline validation);
Removing reliance on minio-py SDK

PR

I've broken down the changes in more detail below:

Storage and RO-Crates

Added a "vendor-neutral" StorageBackend abstraction-layer with a boto3 S3Backend (this should work with AWS / MinIO / RustFS / etc.) and an in-memory fake for tests. minio-py SDK is removed;
Added stricter/safer Crate-ID validation (charset/length, no / or ..). This fixes the bugs/fragility with .zip-in-ID, by treating IDs as opaque;
Added deterministic crate resolution by direct key checks and explicit zip vs directory, ambiguous (409) and not-found (404) raised;
Set a canonical layout with separate crate and result prefixes.

Validation

There is now a single ValidationOutcome type (valid/invalid/error) that replaces the old ValidationResult | str; one boundary to rocrate-validator;
Metadata-only validation is now synchronous (removes the blocking result.get());
Bumped rocrate-validator to 0.10.0 for (opt-in) offline validation. Cache warming at build, and --extra-profiles-path.

APIs

ID endpoints take server-side credentials, so there are no storage secrets in request bodies/logs!
Resolve-before-queue;
The celery worker runs a staged fetch → validate → persist → webhook with retries and persisted error outcomes;
Webhook delivery retries with backoff and surfaces any failures.

DevOps Stuff

JSON logging with request IDs and secret redaction;
/healthz and /readyz added to check the health and readiness for the storage + Redis broker ;
I've tidied up dependencies. pyproject.toml + pip-tools locks, ruff lint/format, CI updated;
There is now a single parameterised Dockerfile;
Flask entrypoint is renamed cratey.py → wsgi.py for convention;
Image and labels renamed to ro-crate-validation-service;
The README.md has been rewritten and has some architecture/flow mermaid diagrams to help future external collaborators with understanding how the validator works;
Tests are now organised properly to mirror app/.

Breaking Changes

Environment variables are renamed;
ID endpoints no longer accept minio_config / root_path.
Status codes are clearer
Storage layout is explicit: crates under crates/, results under validation-results/.
Requires rocrate-validator >= 0.10.0

Closes #170

- Replace the import-time Config classes with a validated Settings dataclass built in create_app(). - Fails early with an aggregated error when storage is enabled but the required S3/Celery vars are missing. - Gate storage routes on settings.storage_enabled; Settings injected in tests instead of monkeypatching. - Add app/storage as a "vendor-neutral" abstraction (so, works with MinIO, RustFS, or any other S3 service) - Closes #171, #172

Implement S3Backend using boto3. Supports MinIO SDK features such as pagination. Includes tests Refs #172

Add app/crates/ids.py: validate_crate_id / is_valid_crate_id and an InvalidCrateId exception. IDs are constrained to a safe charset (start alphanumeric, [A-Za-z0-9._-], max 128, no '/' or '..'). Fixes the ".zip in crate ID" fragility. IDs are no longer parsed. Closes #174

Add app/crates/layout.py (separate crate/result key prefixes) and resolver.py. `resolve_crate()` locates a crate by direct stat checks on "canonical keys" instead of prefix-listing. - Distinguishes zip vs directory, - confirms ro-crate-metadata.json for directories, and - reports "AmbiguousCrate" / "CrateNotFound" - Fixes sibling false-match, .zip-in-id, and some missing-metadata gaps. Closes #175

- Add `app/validation/results.py` (`ValidationOutcome` with valid/invalid/error status, detail/error, serialisation) and `runner.py` wrapping rocrate_validator so both entry points always return an outcome; - Replaces the ValidationResult|str / isinstance pattern. Closes #176

- replaces metadata path with `run_metadata_validation()`, which validates inline via the runner and returns a `ValidationOutcome` (200 valid/invalid, 422 unvalidatable). - remove the metadata Celery task and `perform_metadata_validation`, which addresses the result.get() blocking and the return-in-finally bug. closes #169

- Move POST/task/GET onto StorageBackend + resolver + runner with server-side credentials. POST resolves before queueing (400/404/409/503 via app error handlers); - the Celery task (`run_validation_job`) runs fetch->validate->persist->webhook with retries and persists error outcomes; GET reads the results prefix. - Add s3_crate_prefix/s3_results_prefix to Settings. Delete minio_utils and test_minio, remove minio from requirements and recompile the lock. - Covers the store-backed validation flow Closes #177, #178, #179

- Refactor `send_webhook_notification` to retry failures with exponential backoff, add a request timeout, and raise `WebhookDeliveryError` on terminal failure instead of accepting it. - Results are persisted so a delivery failure can be surfaced without losing the result. Closes #180

Rewrite logging_service with a JsonFormatter, a RedactionFilter that masks configured S3 credentials, and a RequestIdFilter closes #177

- Adds endpoints to check if the S3 is reachable and Celery / broker connected; - Returns 503 with a per-check breakdown when a dependency is down. - When storage is disabled both checks report "disabled". Closes #181

- Consolidate packaging and tooling into pyproject.toml and standardise linting and formatting with ruff. - Move metadata and deps into pyproject.toml; compile requirements.txt (runtime) and requirements-dev.txt (runtime + dev: pytest, pytest-mock, moto, ruff) from it via pip-tools; retire requirements.in. - Move pytest config into [tool.pytest.ini_options] and remove pytest.ini. - Add [tool.ruff] config; apply ruff check --fix and ruff format across code. - CI: unit_tests installs requirements-dev.txt; add a lint workflow running ruff check and ruff format --check on PRs to develop. Closes #183, #173

- Replace the MinIO dev container with a vendor-neutral "objectstore" service running rustfs/rustfs. - Update .env/example.env to S3_* app config plus RUSTFS_* container credentials; the app/worker stay vendor-neutral (S3_* only). - Start with `docker compose --profile objectstore up`.

- Move unit tests into app-mirroring sub-packages (storage/, crates/, etc.) and switch pytest to importlib import mode with pythonpath=".". - Rewrite the integration test for the new stack - Update the integration workflow deps Closes #184

Closes #185

… and image A few final things to bundle together: - bump `roc-validator` 0.9.0 -> 0.10.0 to use offline/cache/extra-profiles - rename the WSGI entrypoint and align image/label naming. - Add `VALIDATION_OFFLINE`, `CACHE_PATH` and `EXTRA_PROFILES_PATH` settings - Offline validation is opt-in (default off). - `extra_profiles_path` ADDS to the bundled profiles (roc-validator 0.10.0 fix) - Warm validator HTTP cache at image build (rocrate-validator cache warm) - Replace the site-packages profile bake with extraction to /app/extra-profiles loaded via `EXTRA_PROFILES_PATH` - This is passed as a build arg from the profiles workflow. - Rename entrypoint `cratey.py` -> `wsgi.py` to fit convention - Rename the published image and the five-safes label to ro-crate-validation-service; point image.source at the renamed repo. Closes #162, #163, #164

elichad · 2026-06-18T09:54:20Z

@alexhambley to make this slightly more manageable to review, are you able to pop the last three commits (particularly the last one because it touches many files) off into a separate PR that depends on this one?

AnnieZQH

I was able to bring the dev stack up locally and successfully tested the API endpoints. I've also read through the changes and the tests. The coverage report looks good.

The only issue I've found: the renamed image is not on eScienceLab's container registry yet, so using docker-compose.yml would throw an error.

AnnieZQH · 2026-06-30T15:12:47Z

  flask:
    platform: linux/x86_64
-    image: "ghcr.io/esciencelab/cratey-validator:0.1"
+    image: "ghcr.io/esciencelab/ro-crate-validation-service:0.1"


This package does not exist yet.

douglowe

I've a few build comments, and some small swagger api comments.

I do wish that this PR was a series of separate PR's for each feature. I've not reviewed the async validation and offline cache changes yet, because I've been checking other changes.

I like the health checks - I think these are a good addition. And the making crate ID's more robust seems sensible. As does moving to more generic S3 libraries.

I am concerned about the removal of the S3 secrets from the API call. These were added at the explicit request of our project partners, because they access S3 services using ephemeral secrets. We should have a design conversation about this, to decide if this is something that needs to be supported or not.

Regarding the logging checks - I agree we should try to avoid recording secrets in the logs. This will be more difficult if we are allowing ephemeral secrets to be used, as we won't be able to directly grep for known secrets. But we could add field name redaction, or entropy / length checks, so that most secrets which might be passed are screened out.

douglowe · 2026-07-02T14:54:44Z

+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4


Within the RSE team we're moving to specifying the exact hash of the action we want to use, rather than only the version number (c.f.: https://github.com/UoMResearchIT/Actions/blob/20365fa75ab643043ec188ef0b67fd5e537d326c/reuse/action.yml#L67). Should we do the same here?

douglowe · 2026-07-02T14:56:26Z

        run: |
          python -m pip install --upgrade pip
-          pip install pytest requests minio docker
+          pip install pytest requests boto3


Should we pin our pytest, requests, boto3 versions here? The most recent tests for this PR ran with pytest-9.1.0, requests-2.33.1, and boto3-1.43.29

douglowe · 2026-07-02T16:11:38Z

+            --strip-components=3 \
+            "rocrate-validator-${FIVE_SAFES_PROFILE_VERSION}/rocrate_validator/profiles/five-safes-crate" && \
+        rm /tmp/profiles.tar.gz ; \
+    fi


We can, and should, organise this RUN command better:

RUN <<EOF_PROFILES if [ -n "$PROFILES_ARCHIVE_URL" ]; then mkdir -p /app/extra-profiles wget -O /tmp/profiles.tar.gz "$PROFILES_ARCHIVE_URL" tar -xzf /tmp/profiles.tar.gz -C /app/extra-profiles \ --strip-components=3 \ "rocrate-validator-${FIVE_SAFES_PROFILE_VERSION}/rocrate_validator/profiles/five-safes-crate" rm /tmp/profiles.tar.gz fi EOF_PROFILES

douglowe · 2026-07-02T16:13:26Z

+            --extra-profiles-path "$EXTRA_PROFILES_PATH" --cache-path "$CACHE_PATH" ; \
+    else \
+        rocrate-validator cache warm --all-profiles --cache-path "$CACHE_PATH" ; \
+    fi


Same as above, let's do this better:

RUN <<<EOF_PROFILES_PATH if [ -n "$EXTRA_PROFILES_PATH" ]; then rocrate-validator cache warm --all-profiles \ --extra-profiles-path "$EXTRA_PROFILES_PATH" --cache-path "$CACHE_PATH" else rocrate-validator cache warm --all-profiles --cache-path "$CACHE_PATH" fi EOF_PROFILES_PATH

douglowe · 2026-07-02T16:52:21Z

        if: always()
-        run: docker compose down
+        run: >
+          docker compose -f docker-compose-develop.yml -p cratey_integration


Do we need the project name string here? (-p cratey_integration)

douglowe · 2026-07-02T17:05:55Z

-    accesskey = String(required=True)
-    secret = String(required=True)
-    ssl = Boolean(required=True)
-    bucket = String(required=True)


This class was originally added because of a request from the Nottingham team. They manage access to MinIO using temporary access and secret keys - so requested that we have such a schema in the API for access to the object store. If we move these back into the docker compose file then we are locking each validator instance to only accessing a single S3 store, and potentially blocking compatibility of this tool with the 5S-TES stack.

Can we discuss this change with our collaborators, to make sure we are moving in the correct direction here?

douglowe · 2026-07-02T19:27:38Z

 )

 # Always-on blueprint:
 post_routes_bp = APIBlueprint("post_routes", __name__)


It's annoying me that the swagger docs have 2 post routes (a small problem I know, but I think it makes the swagger docs look less coherent).

We could combine these by adding the same tag to each of the APIBlueprint calls. e.g.:

post_routes_bp = APIBlueprint("post_routes", __name__, tag="Post_Routes") ... minio_post_routes_bp = APIBlueprint("minio_post_routes", __name__, tag="Post_Routes")

alexhambley added 16 commits June 12, 2026 12:12

feat(storage): add boto3 S3Backend

a3aa2e7

Implement S3Backend using boto3. Supports MinIO SDK features such as pagination. Includes tests Refs #172

feat(logging): structured JSON logs, request IDs, secret redaction

ee62364

Rewrite logging_service with a JsonFormatter, a RedactionFilter that masks configured S3 credentials, and a RequestIdFilter closes #177

feat: add /healthz and /readyz endpoints

f900ae2

- Adds endpoints to check if the S3 is reachable and Celery / broker connected; - Returns 503 with a per-check breakdown when a dependency is down. - When storage is disabled both checks report "disabled". Closes #181

docs: rewrite README for the new architecture with diagrams

3289a9a

Closes #185

chore(docker): consolidate Dockerfiles and tidy up compose

0b340ce

alexhambley requested review from AnnieZQH and douglowe June 18, 2026 09:47

AnnieZQH reviewed Jun 30, 2026

View reviewed changes

douglowe requested changes Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Production-readiness refactor: drop MinIO, make Crate IDs more robust, add async validation and offline cache#186

Production-readiness refactor: drop MinIO, make Crate IDs more robust, add async validation and offline cache#186
alexhambley wants to merge 16 commits into
developfrom
refactor-production-readiness

alexhambley commented Jun 18, 2026

Uh oh!

elichad commented Jun 18, 2026

Uh oh!

AnnieZQH left a comment

Uh oh!

AnnieZQH Jun 30, 2026

Uh oh!

douglowe left a comment

Uh oh!

douglowe Jul 2, 2026

Uh oh!

douglowe Jul 2, 2026

Uh oh!

douglowe Jul 2, 2026

Uh oh!

douglowe Jul 2, 2026

Uh oh!

douglowe Jul 2, 2026

Uh oh!

douglowe Jul 2, 2026

Uh oh!

douglowe Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

alexhambley commented Jun 18, 2026

TLDR

PR

Storage and RO-Crates

Validation

APIs

DevOps Stuff

Breaking Changes

Uh oh!

elichad commented Jun 18, 2026

Uh oh!

AnnieZQH left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

douglowe left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants