Skip to content

Production-readiness refactor: drop MinIO, make Crate IDs more robust, add async validation and offline cache#186

Open
alexhambley wants to merge 16 commits into
developfrom
refactor-production-readiness
Open

Production-readiness refactor: drop MinIO, make Crate IDs more robust, add async validation and offline cache#186
alexhambley wants to merge 16 commits into
developfrom
refactor-production-readiness

Conversation

@alexhambley

Copy link
Copy Markdown
Contributor

TLDR

This is a big refactor that strengthens the validation service by adding:

  • An agnostic S3 storage layer;
  • Reworking the "Crate ID" model;
  • Adding consistent result/error types;
  • Adding a faster metadata path;
  • Fixing some major production concerns (retries, structured logging, health checks, offline validation);
  • Removing reliance on minio-py SDK

PR

I've broken down the changes in more detail below:

Storage and RO-Crates

  • Added a "vendor-neutral" StorageBackend abstraction-layer with a boto3 S3Backend (this should work with AWS / MinIO / RustFS / etc.) and an in-memory fake for tests. minio-py SDK is removed;
  • Added stricter/safer Crate-ID validation (charset/length, no / or ..). This fixes the bugs/fragility with .zip-in-ID, by treating IDs as opaque;
  • Added deterministic crate resolution by direct key checks and explicit zip vs directory, ambiguous (409) and not-found (404) raised;
  • Set a canonical layout with separate crate and result prefixes.

Validation

  • There is now a single ValidationOutcome type (valid/invalid/error) that replaces the old ValidationResult | str; one boundary to rocrate-validator;
  • Metadata-only validation is now synchronous (removes the blocking result.get());
  • Bumped rocrate-validator to 0.10.0 for (opt-in) offline validation. Cache warming at build, and --extra-profiles-path.

APIs

  • ID endpoints take server-side credentials, so there are no storage secrets in request bodies/logs!
  • Resolve-before-queue;
  • The celery worker runs a staged fetch → validate → persist → webhook with retries and persisted error outcomes;
  • Webhook delivery retries with backoff and surfaces any failures.

DevOps Stuff

  • JSON logging with request IDs and secret redaction;
  • /healthz and /readyz added to check the health and readiness for the storage + Redis broker ;
  • I've tidied up dependencies. pyproject.toml + pip-tools locks, ruff lint/format, CI updated;
  • There is now a single parameterised Dockerfile;
  • Flask entrypoint is renamed cratey.pywsgi.py for convention;
  • Image and labels renamed to ro-crate-validation-service;
  • The README.md has been rewritten and has some architecture/flow mermaid diagrams to help future external collaborators with understanding how the validator works;
  • Tests are now organised properly to mirror app/.

Breaking Changes

  • Environment variables are renamed;
  • ID endpoints no longer accept minio_config / root_path.
  • Status codes are clearer
  • Storage layout is explicit: crates under crates/, results under validation-results/.
  • Requires rocrate-validator >= 0.10.0

Closes #170

- Replace the import-time Config classes with a validated Settings dataclass built in create_app().
- Fails early with an aggregated error when storage is enabled but the required S3/Celery vars are missing.
- Gate storage routes on settings.storage_enabled; Settings injected in tests instead of monkeypatching.
- Add app/storage as a "vendor-neutral" abstraction (so, works with MinIO, RustFS, or any other S3 service)
- Closes #171, #172
Implement S3Backend using boto3. Supports MinIO SDK features such as pagination.
Includes tests
Refs #172
Add app/crates/ids.py: validate_crate_id / is_valid_crate_id and an InvalidCrateId exception. IDs are constrained to a safe charset (start alphanumeric, [A-Za-z0-9._-], max 128, no '/' or '..'). Fixes the ".zip in crate ID" fragility. IDs are no longer parsed.
Closes #174
Add app/crates/layout.py (separate crate/result key prefixes) and resolver.py. `resolve_crate()` locates a crate by direct stat checks on "canonical keys" instead of prefix-listing.
- Distinguishes zip vs directory,
- confirms ro-crate-metadata.json for directories, and
- reports "AmbiguousCrate" / "CrateNotFound"
- Fixes sibling false-match, .zip-in-id, and some missing-metadata gaps.
Closes #175
- Add `app/validation/results.py` (`ValidationOutcome` with valid/invalid/error status, detail/error, serialisation) and `runner.py` wrapping rocrate_validator so both entry points always return an outcome;
- Replaces the ValidationResult|str / isinstance pattern.
Closes #176
- replaces metadata path with `run_metadata_validation()`, which validates inline via the runner and returns a `ValidationOutcome` (200 valid/invalid, 422 unvalidatable).
- remove the metadata Celery task and `perform_metadata_validation`, which addresses the result.get() blocking and the return-in-finally bug.

closes #169
- Move POST/task/GET onto StorageBackend + resolver + runner with server-side credentials. POST resolves before queueing (400/404/409/503 via app error handlers);
- the Celery task (`run_validation_job`) runs fetch->validate->persist->webhook with retries and persists error outcomes; GET reads the results prefix.
- Add s3_crate_prefix/s3_results_prefix to Settings. Delete minio_utils and test_minio, remove minio from requirements and recompile the lock.
- Covers the store-backed validation flow

Closes #177, #178, #179
- Refactor `send_webhook_notification` to retry failures with exponential backoff, add a request timeout, and raise `WebhookDeliveryError` on terminal failure instead of accepting it. - Results are persisted so a delivery failure can be surfaced without losing the result.

Closes #180
Rewrite logging_service with a JsonFormatter, a RedactionFilter that masks configured S3 credentials, and a RequestIdFilter
closes #177
- Adds endpoints to check if the S3 is reachable and Celery / broker connected;
- Returns 503 with a per-check breakdown when a dependency is down.
- When storage is disabled both checks report "disabled".

Closes #181
- Consolidate packaging and tooling into pyproject.toml and standardise linting and formatting with ruff.
- Move metadata and deps into pyproject.toml; compile requirements.txt (runtime) and requirements-dev.txt (runtime + dev: pytest, pytest-mock, moto, ruff) from it via pip-tools; retire requirements.in.
- Move pytest config into [tool.pytest.ini_options] and remove pytest.ini.
- Add [tool.ruff] config; apply ruff check --fix and ruff format across code.
- CI: unit_tests installs requirements-dev.txt; add a lint workflow running ruff check and ruff format --check on PRs to develop.

Closes #183, #173
- Replace the MinIO dev container with a vendor-neutral "objectstore" service running rustfs/rustfs.
- Update .env/example.env to S3_* app config plus RUSTFS_* container credentials; the app/worker stay vendor-neutral (S3_* only).
- Start with `docker compose --profile objectstore up`.
- Move unit tests into app-mirroring sub-packages (storage/, crates/, etc.) and switch pytest to importlib import mode with pythonpath=".".
- Rewrite the integration test for the new stack
- Update the integration workflow deps

Closes #184
… and image

A few final things to bundle together:
- bump `roc-validator` 0.9.0 -> 0.10.0 to use offline/cache/extra-profiles
- rename the WSGI entrypoint and align image/label naming.
- Add `VALIDATION_OFFLINE`, `CACHE_PATH` and `EXTRA_PROFILES_PATH` settings
- Offline validation is opt-in (default off).
- `extra_profiles_path` ADDS to the bundled profiles (roc-validator 0.10.0 fix)
- Warm validator HTTP cache at image build (rocrate-validator cache warm)
- Replace the site-packages profile bake with extraction to /app/extra-profiles loaded via `EXTRA_PROFILES_PATH`
- This is passed as a build arg from the profiles workflow.
- Rename entrypoint `cratey.py` -> `wsgi.py` to fit convention
- Rename the published image and the five-safes label to ro-crate-validation-service; point image.source at the renamed repo.

Closes #162, #163, #164
@alexhambley alexhambley requested review from AnnieZQH and douglowe June 18, 2026 09:47
@elichad

elichad commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

@alexhambley to make this slightly more manageable to review, are you able to pop the last three commits (particularly the last one because it touches many files) off into a separate PR that depends on this one?

@AnnieZQH AnnieZQH left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to bring the dev stack up locally and successfully tested the API endpoints. I've also read through the changes and the tests. The coverage report looks good.

The only issue I've found: the renamed image is not on eScienceLab's container registry yet, so using docker-compose.yml would throw an error.

Comment thread docker-compose.yml
flask:
platform: linux/x86_64
image: "ghcr.io/esciencelab/cratey-validator:0.1"
image: "ghcr.io/esciencelab/ro-crate-validation-service:0.1"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This package does not exist yet.

@douglowe douglowe left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've a few build comments, and some small swagger api comments.

I do wish that this PR was a series of separate PR's for each feature. I've not reviewed the async validation and offline cache changes yet, because I've been checking other changes.

I like the health checks - I think these are a good addition. And the making crate ID's more robust seems sensible. As does moving to more generic S3 libraries.

I am concerned about the removal of the S3 secrets from the API call. These were added at the explicit request of our project partners, because they access S3 services using ephemeral secrets. We should have a design conversation about this, to decide if this is something that needs to be supported or not.

Regarding the logging checks - I agree we should try to avoid recording secrets in the logs. This will be more difficult if we are allowing ephemeral secrets to be used, as we won't be able to directly grep for known secrets. But we could add field name redaction, or entropy / length checks, so that most secrets which might be passed are screened out.


steps:
- name: Checkout code
uses: actions/checkout@v4

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within the RSE team we're moving to specifying the exact hash of the action we want to use, rather than only the version number (c.f.: https://github.com/UoMResearchIT/Actions/blob/20365fa75ab643043ec188ef0b67fd5e537d326c/reuse/action.yml#L67). Should we do the same here?

run: |
python -m pip install --upgrade pip
pip install pytest requests minio docker
pip install pytest requests boto3

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pin our pytest, requests, boto3 versions here? The most recent tests for this PR ran with pytest-9.1.0, requests-2.33.1, and boto3-1.43.29

Comment thread Dockerfile
--strip-components=3 \
"rocrate-validator-${FIVE_SAFES_PROFILE_VERSION}/rocrate_validator/profiles/five-safes-crate" && \
rm /tmp/profiles.tar.gz ; \
fi

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, and should, organise this RUN command better:

RUN <<EOF_PROFILES
if [ -n "$PROFILES_ARCHIVE_URL" ]; then
   mkdir -p /app/extra-profiles
   wget -O /tmp/profiles.tar.gz "$PROFILES_ARCHIVE_URL"
   tar -xzf /tmp/profiles.tar.gz
            -C /app/extra-profiles \
            --strip-components=3 \
            "rocrate-validator-${FIVE_SAFES_PROFILE_VERSION}/rocrate_validator/profiles/five-safes-crate"
   rm /tmp/profiles.tar.gz
fi
EOF_PROFILES

Comment thread Dockerfile
--extra-profiles-path "$EXTRA_PROFILES_PATH" --cache-path "$CACHE_PATH" ; \
else \
rocrate-validator cache warm --all-profiles --cache-path "$CACHE_PATH" ; \
fi

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, let's do this better:

RUN <<<EOF_PROFILES_PATH
if [ -n "$EXTRA_PROFILES_PATH" ]; then
    rocrate-validator cache warm --all-profiles \
            --extra-profiles-path "$EXTRA_PROFILES_PATH" --cache-path "$CACHE_PATH"
else
    rocrate-validator cache warm --all-profiles --cache-path "$CACHE_PATH"
fi
EOF_PROFILES_PATH

if: always()
run: docker compose down
run: >
docker compose -f docker-compose-develop.yml -p cratey_integration

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the project name string here? (-p cratey_integration)

accesskey = String(required=True)
secret = String(required=True)
ssl = Boolean(required=True)
bucket = String(required=True)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class was originally added because of a request from the Nottingham team. They manage access to MinIO using temporary access and secret keys - so requested that we have such a schema in the API for access to the object store. If we move these back into the docker compose file then we are locking each validator instance to only accessing a single S3 store, and potentially blocking compatibility of this tool with the 5S-TES stack.

Can we discuss this change with our collaborators, to make sure we are moving in the correct direction here?

)

# Always-on blueprint:
post_routes_bp = APIBlueprint("post_routes", __name__)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's annoying me that the swagger docs have 2 post routes (a small problem I know, but I think it makes the swagger docs look less coherent).

We could combine these by adding the same tag to each of the APIBlueprint calls. e.g.:

post_routes_bp = APIBlueprint("post_routes", __name__, tag="Post_Routes")
...
minio_post_routes_bp = APIBlueprint("minio_post_routes", __name__, tag="Post_Routes")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Production-readiness refactor

4 participants