Open
20 commits
1243e26
Add MDF v2 backend with streaming, curation, and DOI minting
blaiszik Feb 1, 2026
542907c
feat: complete v2 publication pipeline with DOI minting, search inges…
blaiszik Feb 7, 2026
906d5be
fix: correct DataCite test credentials (Globus.TEST, prefix 10.23677)
blaiszik Feb 7, 2026
5f5d03b
test: add publish pipeline tests and fix async jobs test
blaiszik Feb 7, 2026
a6ae919
fix: wire up real Globus Search ingest on staging
blaiszik Feb 7, 2026
78af2bb
feat: add dataset versioning with DOI inheritance and version-specifi…
blaiszik Feb 8, 2026
4aa0e3c
feat: add domains and external import provenance to v2 backend
blaiszik Feb 8, 2026
648dd9b
chore: configure prod deployment with test credentials and update git…
blaiszik Feb 18, 2026
aadfd58
docs: rewrite README with v2 architecture and full deploy path
blaiszik Feb 18, 2026
e65ac82
feat: fix Globus transfer lifecycle and HTTPS upload reliability
blaiszik Feb 27, 2026
ded9a8b
feat: add versioning fields, download_url, and v1 organizations migra…
blaiszik Feb 27, 2026
8515cbb
feat: add CLI UX demo script for direct publish and curation workflow
blaiszik Feb 27, 2026
65baccc
feat: enforce curator and submitter group membership
blaiszik Feb 27, 2026
adddfb2
feat: major/minor versioning — new data bumps major, metadata-only bu…
blaiszik Feb 27, 2026
5fa89fe
Harden v2 backend search and expand backend tooling
blaiszik Feb 28, 2026
9a65b03
feat: add archive_size field to DatasetMetadata
blaiszik Feb 28, 2026
12b8ab1
feat: faceted search, minor versioning fixes, and card improvements
blaiszik Mar 4, 2026
2770a58
feat: email notifications for curation lifecycle events
blaiszik Mar 4, 2026
49a5e4e
fix: harden email_utils HTML escaping and fix rejection link
blaiszik Mar 4, 2026
ddd9633
feat: cross-publish support — ExternalSource model, DataCite relation…
blaiszik Mar 9, 2026
37 changes: 37 additions & 0 deletions .github/workflows/test-v2.yml
@@ -0,0 +1,37 @@
name: V2 Backend Tests

on:
  pull_request:
    branches:
      - prod
      - v2-backend-curation

jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 15

    env:
      STORE_BACKEND: sqlite
      AUTH_MODE: dev
      USE_MOCK_DATACITE: "true"
      USE_MOCK_SEARCH: "true"

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.12'

      - name: Install dependencies
        working-directory: aws
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt 2>/dev/null || true
          pip install fastapi uvicorn mangum pydantic httpx globus-sdk pytest boto3

      - name: Run tests
        working-directory: aws
        run: |
          PYTHONPATH=. pytest v2/test_v2_*.py -v
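
The workflow passes its mock toggles as the strings `"true"`, so whatever reads them has to parse the text; `bool("false")` is `True` in Python. A minimal sketch of such parsing follows. The `env_flag` helper is hypothetical and does not appear in this diff, so the backend's actual settings code may differ.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag.

    Hypothetical helper for illustration; not part of the PR.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Mirror the values set in the workflow's env block.
os.environ["USE_MOCK_DATACITE"] = "true"
os.environ["USE_MOCK_SEARCH"] = "true"

use_mock_datacite = env_flag("USE_MOCK_DATACITE")
use_mock_search = env_flag("USE_MOCK_SEARCH")
store_backend = os.environ.get("STORE_BACKEND", "sqlite")
```

Parsing against an explicit allow-list keeps a value such as `"false"` or `"0"` from silently enabling a mock.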
2 changes: 2 additions & 0 deletions .gitignore
@@ -26,6 +26,8 @@ travis.tar
.mdfsecrets
.mdfsecrets.*
aws/python
aws/.aws-sam/
.idea/
secrets.env
.DS_Store

24 changes: 24 additions & 0 deletions AGENTS.md
@@ -0,0 +1,24 @@
# Repository Guidelines

## Project Structure & Module Organization
Service logic resides in `aws/`, where each Lambda-backed endpoint has a Python module (`submit.py`, `status.py`, `submissions.py`) and shared helpers live in `utils.py`. Automated flow definitions and deployment helpers sit in `automate/` (notably `minimus_mdf_flow.py` and `deploy_mdf_flow.py`). Operational scripts for tokens, submissions, and schema sync live in `scripts/`, while infrastructure templates and IAM policies are grouped in `infra/`. Test suites and BDD feature files are colocated in `aws/tests/` with payload fixtures in `aws/tests/schemas/`, and high-level background material remains in `docs/`.

## Build, Test, and Development Commands
Target Python 3.7.10 to mirror production. Recommended setup:
- `python3 -m venv .venv && source .venv/bin/activate` — create an isolated environment.
- `pip install -r aws/requirements.txt` — Lambda runtime dependencies.
- `pip install -r aws/tests/requirements-test.txt` — pytest, pytest-bdd, and boto mocks.
- `PYTHONPATH=aws python -m pytest aws/tests --ignore=aws/tests/schemas` — run the suite locally.
For flow updates, install `automate/requirements.txt` before invoking `python automate/deploy_mdf_flow.py --env dev` to stage the definition.

## Coding Style & Naming Conventions
Follow PEP 8 with four-space indentation and concise module docstrings describing each handler’s contract. Keep functions and variables in `snake_case`, reserve `CamelCase` for classes, and uppercase constants. Mirror API routes with entry-point names, isolate AWS or Globus clients behind manager classes, and prefer explicit imports to ease packaging for Lambda layers.

## Testing Guidelines
Pair changes with unit tests in `test_*.py` and behavior coverage in the relevant `*.feature` files when workflows shift. Use `pytest -k <pattern>` for focused runs but complete a full `pytest` pass before requesting review. Maintain deterministic fixtures in `conftest.py`, mock network calls, and update JSON schemas when payload contracts change.
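
As an illustration of the "mock network calls" guidance, here is a minimal sketch of a unit test that substitutes a `unittest.mock.Mock` for a remote client. The `fetch_status` function, the `/status/...` path, and the response shape are assumptions made for the example, not code from this repository.

```python
from unittest import mock


def fetch_status(client, source_id):
    """Hypothetical helper: ask a remote service for a submission's status."""
    response = client.get(f"/status/{source_id}")
    return response["status"]


def test_fetch_status_uses_mocked_client():
    # The client is a stand-in, so no network traffic occurs.
    client = mock.Mock()
    client.get.return_value = {"status": "pending_curation"}

    assert fetch_status(client, "example_dataset_v1") == "pending_curation"
    client.get.assert_called_once_with("/status/example_dataset_v1")
```

Because the client is injected, the test stays deterministic and can be selected on its own with `pytest -k fetch_status`.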

## Commit & Pull Request Guidelines
Branch from `dev`, keep commits single-purpose, and phrase messages in the imperative mood (e.g., `adjust submissions pagination`). Open PRs against `dev`, include a brief change summary, test artifacts, and references to linked issues or Globus tickets. Secure peer review before merging; once validated in the dev environment, raise a `dev`→`main` PR for production promotion.

## Security & Configuration Tips
Store Globus credentials in environment variables or `.mdfsecrets`; never commit secrets. Use helper utilities such as `scripts/get_mdf_token.py` and `scripts/status_versions.py` when troubleshooting to avoid manual token handling. Coordinate any IAM or policy modifications under `infra/` with the platform team and verify logs via CloudWatch using environment-scoped credentials.
486 changes: 381 additions & 105 deletions README.md

Large diffs are not rendered by default.

325 changes: 325 additions & 0 deletions automate/simplified_mdf_flow.py
@@ -0,0 +1,325 @@
"""Simplified MDF Ingest Flow - v2.

This flow handles file transfer only. Curation is now handled by the MDF server API.

Flow steps:
1. Email admin about new submission
2. Transfer files from user endpoint to MDF repository
3. Notify user of transfer completion

Curation and DOI minting are handled separately via:
- GET /curation/pending - List pending submissions
- POST /curation/:id/approve - Approve + mint DOI
- POST /curation/:id/reject - Reject with reason
"""

import action_providers
from globus_automate_flow import GlobusAutomateFlowDef


def email_submission_to_admin(sender_email, admin_email):
    """Notify admin of new submission."""
    return {
        "EmailSubmission": {
            "Type": "Action",
            "ActionUrl": "https://actions.globus.org/notification/notify",
            "ExceptionOnActionFailure": False,  # Continue even if email fails
            "Parameters": {
                "body_mimetype": "text/html",
                "sender": sender_email,
                "destination": admin_email,
                "subject": "New MDF Dataset Submission",
                "body_template": """
                <html><h1>New Dataset Submitted</h1>
                <p>A new dataset has been submitted to the Materials Data Facility.</p>
                <table>
                <tr><td>Title</td><td>$title</td></tr>
                <tr><td>Source ID</td><td>$source_id</td></tr>
                <tr><td>Submitter</td><td>$submitting_user_email</td></tr>
                <tr><td>Organization</td><td>$organization</td></tr>
                </table>
                <p>Review pending submissions at: <a href="$curation_url">$curation_url</a></p>
                </html>
                """,
                "body_variables": {
                    "title.$": "$.dataset_mdata.dc.titles[0].title",
                    "source_id.$": "$.dataset_mdata.mdf.source_id",
                    "submitting_user_email.$": "$.submitting_user_email",
                    "organization.$": "$.dataset_mdata.mdf.organization",
                    "curation_url.$": "$.curation_url",
                },
                "notification_method": "any",
                "notification_priority": "high",
                "send_credentials": [
                    {
                        "credential_method": "email",
                        "credential_type": "ses",
                        "credential_value.$": "$._private_email_credentials",
                    }
                ],
                "__Private_Parameters": ["send_credentials"],
            },
            "ResultPath": "$.EmailSubmissionResult",
            "Next": "CheckMetadataOnly",
        },
    }


def check_metadata_only():
    """Check if this is a metadata-only update (no file transfer needed)."""
    return {
        "CheckMetadataOnly": {
            "Comment": "Skip file transfer if this is a metadata-only update",
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.update_metadata_only",
                    "BooleanEquals": True,
                    "Next": "TransferComplete",
                }
            ],
            "Default": "CreateDatasetDir",
        }
    }


def file_transfer_steps():
    """Transfer files from user endpoint to MDF repository."""
    return {
        "CreateDatasetDir": {
            "Comment": "Create the dataset directory",
            "Type": "Action",
            "ActionUrl": "https://transfer.actions.globus.org/mkdir",
            "ExceptionOnActionFailure": False,
            "Parameters": {
                "endpoint_id.$": "$.user_transfer_inputs.destination_endpoint_id",
                "path.$": "$.user_transfer_inputs.dataset_path",
            },
            "ResultPath": "$.CreateDatasetDirResult",
            "Next": "CreateVersionDir",
        },
        "CreateVersionDir": {
            "Comment": "Create the version subdirectory",
            "Type": "Action",
            "ActionUrl": "https://transfer.actions.globus.org/mkdir",
            "ExceptionOnActionFailure": True,
            "Parameters": {
                "endpoint_id.$": "$.user_transfer_inputs.destination_endpoint_id",
                "path.$": "$.user_transfer_inputs.transfer_items[0].destination_path",
            },
            "ResultPath": "$.CreateVersionDirResult",
            "Catch": [
                {
                    "ErrorEquals": ["ActionFailedException", "States.Runtime", "EndpointError"],
                    "ResultPath": "$.CreateVersionDirResult",
                    "Next": "TransferFailed",
                }
            ],
            "Next": "AddUserPermissions",
        },
        "AddUserPermissions": {
            "Comment": "Temporarily add write permissions for the submitting user",
            "Type": "Action",
            "ActionUrl": "https://transfer.actions.globus.org/manage_permission",
            "ExceptionOnActionFailure": False,
            "Parameters": {
                "operation": "CREATE",
                "endpoint_id.$": "$.user_transfer_inputs.destination_endpoint_id",
                "path.$": "$.user_transfer_inputs.transfer_items[0].destination_path",
                "principal_type": "identity",
                "principal.$": "$.user_transfer_inputs.submitting-user-id",
                "permissions": "rw",
            },
            "ResultPath": "$.UserPermissionResult",
            "Catch": [
                {
                    "ErrorEquals": ["ActionFailedException", "States.Runtime", "EndpointError"],
                    "ResultPath": "$.UserPermissionResult",
                    "Next": "TransferFailed",
                }
            ],
            "Next": "ExecuteTransfer",
        },
        "ExecuteTransfer": {
            "Comment": "Transfer data from user endpoint to MDF repository",
            "Type": "Action",
            "ActionUrl": "https://transfer.actions.globus.org/transfer",
            "WaitTime": 86400,  # 24 hours max
            "RunAs": "SubmittingUserV2",
            "Parameters": {
                "source_endpoint.$": "$.user_transfer_inputs.source_endpoint_id",
                "destination_endpoint.$": "$.user_transfer_inputs.destination_endpoint_id",
                "label.$": "$.user_transfer_inputs.label",
                "DATA.$": "$.user_transfer_inputs.transfer_items",
            },
            "ResultPath": "$.TransferResult",
            "Next": "RemoveUserPermissions",
        },
        "RemoveUserPermissions": {
            "Comment": "Remove temporary write permissions",
            "Type": "Action",
            "ActionUrl": "https://transfer.actions.globus.org/manage_permission",
            "ExceptionOnActionFailure": False,
            "Parameters": {
                "operation": "DELETE",
                "endpoint_id.$": "$.user_transfer_inputs.destination_endpoint_id",
                "rule_id.$": "$.UserPermissionResult.details.access_id",
            },
            "ResultPath": "$.RemovePermissionResult",
            "Next": "CheckTransferStatus",
        },
        "CheckTransferStatus": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.TransferResult.status",
                    "StringEquals": "SUCCEEDED",
                    "Next": "TransferComplete",
                }
            ],
            "Default": "TransferFailed",
        },
    }


def completion_states(sender_email):
    """Handle transfer completion or failure."""
    return {
        "TransferComplete": {
            "Type": "ExpressionEval",
            "Parameters": {
                "status": "transfer_complete",
                "message.=": "'File transfer complete for ' + `$.dataset_mdata.mdf.source_id` + '. Submission is now pending curation.'",
            },
            "ResultPath": "$.FinalState",
            "Next": "NotifyUserSuccess",
        },
        "TransferFailed": {
            "Type": "ExpressionEval",
            "Parameters": {
                "status": "transfer_failed",
                "message.=": "'File transfer failed for ' + `$.dataset_mdata.mdf.source_id` + '. Please check the flow logs.'",
            },
            "ResultPath": "$.FinalState",
            "Next": "NotifyUserFailure",
        },
        "NotifyUserSuccess": {
            "Type": "Action",
            "ActionUrl": "https://actions.globus.org/notification/notify",
            "ExceptionOnActionFailure": False,
            "Parameters": {
                "body_mimetype": "text/html",
                "sender": sender_email,
                "destination.$": "$.submitting_user_email",
                "subject": "MDF Submission - Transfer Complete",
                "body_template": """
                <html>
                <h1>Transfer Complete</h1>
                <p>Your dataset <strong>$source_id</strong> has been transferred to the MDF repository.</p>
                <p>Your submission is now pending curation. You will receive another email when it has been reviewed.</p>
                <p>Thank you for contributing to the Materials Data Facility!</p>
                </html>
                """,
                "body_variables": {
                    "source_id.$": "$.dataset_mdata.mdf.source_id",
                },
                "notification_method": "any",
                "send_credentials": [
                    {
                        "credential_method": "email",
                        "credential_type": "ses",
                        "credential_value.$": "$._private_email_credentials",
                    }
                ],
                "__Private_Parameters": ["send_credentials"],
            },
            "ResultPath": "$.NotifySuccessResult",
            "WaitTime": 300,
            "Next": "EndFlow",
        },
        "NotifyUserFailure": {
            "Type": "Action",
            "ActionUrl": "https://actions.globus.org/notification/notify",
            "ExceptionOnActionFailure": False,
            "Parameters": {
                "body_mimetype": "text/html",
                "sender": sender_email,
                "destination.$": "$.submitting_user_email",
                "subject": "MDF Submission - Transfer Failed",
                "body_template": """
                <html>
                <h1>Transfer Failed</h1>
                <p>Your dataset <strong>$source_id</strong> failed to transfer.</p>
                <p>Please check your Globus endpoint permissions and try again.</p>
                <p>View the <a href="https://app.globus.org/runs/$run_id/logs">flow logs</a> for details.</p>
                </html>
                """,
                "body_variables": {
                    "source_id.$": "$.dataset_mdata.mdf.source_id",
                    "run_id.$": "$._context.run_id",
                },
                "notification_method": "any",
                "send_credentials": [
                    {
                        "credential_method": "email",
                        "credential_type": "ses",
                        "credential_value.$": "$._private_email_credentials",
                    }
                ],
                "__Private_Parameters": ["send_credentials"],
            },
            "ResultPath": "$.NotifyFailureResult",
            "WaitTime": 300,
            "Next": "EndFlow",
        },
        "EndFlow": {
            "Type": "Pass",
            "End": True,
        },
    }


def flow_def(
    sender_email,
    admin_email,
    flow_permissions,
    administered_by,
    description="Simplified MDF Ingest Flow - handles file transfer only. Curation via API.",
):
    """Build the simplified flow definition."""
    return GlobusAutomateFlowDef(
        title="MDF Ingest Flow v2 (Simplified)",
        subtitle="Transfer files to MDF repository",
        description=description,
        visible_to=flow_permissions,
        runnable_by=flow_permissions,
        administered_by=administered_by,
        input_schema={},
        flow_definition={
            "StartAt": "EmailSubmission",
            "States": {
                **email_submission_to_admin(sender_email, admin_email),
                **check_metadata_only(),
                **file_transfer_steps(),
                **completion_states(sender_email),
            },
        },
    )


# What was removed from the original flow:
#
# 1. CurateSubmission - Now handled via POST /curation/:id/approve or /reject
# 2. SendCurationEmail - Admin can use the curation dashboard instead
# 3. ChooseAcceptance - Curation decisions are made via API
# 4. FailCuration - Rejection is handled via API
# 5. NeedDOI / MintDOI - DOI minting happens on approval via API
# 6. AddDoiToSearchRecord - DOI is stored in submission record
# 7. SearchIngest - Can be triggered separately after approval
#
# Benefits:
# - Simpler flow with fewer states
# - Curators can use a web dashboard instead of email links
# - DOI minting happens synchronously on approval
# - Better visibility into curation status
# - Easier to add curation workflow features (comments, history, etc.)
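
Because the states are assembled from four separate helper functions, a typo in a `Next`, `Default`, `Choices`, or `Catch` target would only surface at deploy or run time. The sketch below shows one way to check the wiring offline. `successors`, `check_flow`, and the trimmed `FLOW` dict are illustrative names introduced here, not part of the PR; the transitions are copied from the file above with `Parameters` omitted.

```python
def successors(state):
    """Return every state name that `state` can transition to."""
    targets = []
    if "Next" in state:
        targets.append(state["Next"])
    if "Default" in state:
        targets.append(state["Default"])
    for branch in state.get("Choices", []) + state.get("Catch", []):
        targets.append(branch["Next"])
    return targets


def check_flow(definition):
    """Return (missing_targets, unreachable_states) for a flow definition."""
    states = definition["States"]
    missing = set()
    seen = set()
    stack = [definition["StartAt"]]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        if name not in states:
            missing.add(name)  # referenced but never defined
            continue
        seen.add(name)
        stack.extend(successors(states[name]))
    return missing, set(states) - seen


# Transitions copied from simplified_mdf_flow.py; Parameters are omitted.
FLOW = {
    "StartAt": "EmailSubmission",
    "States": {
        "EmailSubmission": {"Next": "CheckMetadataOnly"},
        "CheckMetadataOnly": {
            "Choices": [{"Next": "TransferComplete"}],
            "Default": "CreateDatasetDir",
        },
        "CreateDatasetDir": {"Next": "CreateVersionDir"},
        "CreateVersionDir": {
            "Catch": [{"Next": "TransferFailed"}],
            "Next": "AddUserPermissions",
        },
        "AddUserPermissions": {
            "Catch": [{"Next": "TransferFailed"}],
            "Next": "ExecuteTransfer",
        },
        "ExecuteTransfer": {"Next": "RemoveUserPermissions"},
        "RemoveUserPermissions": {"Next": "CheckTransferStatus"},
        "CheckTransferStatus": {
            "Choices": [{"Next": "TransferComplete"}],
            "Default": "TransferFailed",
        },
        "TransferComplete": {"Next": "NotifyUserSuccess"},
        "TransferFailed": {"Next": "NotifyUserFailure"},
        "NotifyUserSuccess": {"Next": "EndFlow"},
        "NotifyUserFailure": {"Next": "EndFlow"},
        "EndFlow": {"End": True},
    },
}

missing, unreachable = check_flow(FLOW)
```

For the transitions listed, both sets come back empty. The same function could be pointed at the full `flow_definition` dict that `flow_def` passes to `GlobusAutomateFlowDef`.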