diff --git a/.github/workflows/test-v2.yml b/.github/workflows/test-v2.yml new file mode 100644 index 0000000..026ee95 --- /dev/null +++ b/.github/workflows/test-v2.yml @@ -0,0 +1,37 @@ +name: V2 Backend Tests + +on: + pull_request: + branches: + - prod + - v2-backend-curation + +jobs: + test: + runs-on: ubuntu-latest + timeout-minutes: 15 + + env: + STORE_BACKEND: sqlite + AUTH_MODE: dev + USE_MOCK_DATACITE: "true" + USE_MOCK_SEARCH: "true" + + steps: + - uses: actions/checkout@v4 + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: '3.12' + + - name: Install dependencies + working-directory: aws + run: | + python -m pip install --upgrade pip + pip install -r requirements.txt 2>/dev/null || true + pip install fastapi uvicorn mangum pydantic httpx globus-sdk pytest boto3 + + - name: Run tests + working-directory: aws + run: | + PYTHONPATH=. pytest v2/test_v2_*.py -v diff --git a/.gitignore b/.gitignore index 7ccb005..46d65bd 100644 --- a/.gitignore +++ b/.gitignore @@ -26,6 +26,8 @@ travis.tar .mdfsecrets .mdfsecrets.* aws/python +aws/.aws-sam/ .idea/ secrets.env +.DS_Store diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..0176d40 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,24 @@ +# Repository Guidelines + +## Project Structure & Module Organization +Service logic resides in `aws/`, where each Lambda-backed endpoint has a Python module (`submit.py`, `status.py`, `submissions.py`) and shared helpers live in `utils.py`. Automated flow definitions and deployment helpers sit in `automate/` (notably `minimus_mdf_flow.py` and `deploy_mdf_flow.py`). Operational scripts for tokens, submissions, and schema sync live in `scripts/`, while infrastructure templates and IAM policies are grouped in `infra/`. Test suites and BDD feature files are colocated in `aws/tests/` with payload fixtures in `aws/tests/schemas/`, and high-level background material remains in `docs/`. 
+ +## Build, Test, and Development Commands +Target Python 3.7.10 to mirror production. Recommended setup: +- `python3 -m venv .venv && source .venv/bin/activate` — create an isolated environment. +- `pip install -r aws/requirements.txt` — Lambda runtime dependencies. +- `pip install -r aws/tests/requirements-test.txt` — pytest, pytest-bdd, and boto mocks. +- `PYTHONPATH=aws python -m pytest aws/tests --ignore=aws/tests/schemas` — run the suite locally. +For flow updates, install `automate/requirements.txt` before invoking `python automate/deploy_mdf_flow.py --env dev` to stage the definition. + +## Coding Style & Naming Conventions +Follow PEP 8 with four-space indentation and concise module docstrings describing each handler’s contract. Keep functions and variables in `snake_case`, reserve `CamelCase` for classes, and uppercase constants. Mirror API routes with entry-point names, isolate AWS or Globus clients behind manager classes, and prefer explicit imports to ease packaging for Lambda layers. + +## Testing Guidelines +Pair changes with unit tests in `test_*.py` and behavior coverage in the relevant `*.feature` files when workflows shift. Use `pytest -k ` for focused runs but complete a full `pytest` pass before requesting review. Maintain deterministic fixtures in `conftest.py`, mock network calls, and update JSON schemas when payload contracts change. + +## Commit & Pull Request Guidelines +Branch from `dev`, keep commits single-purpose, and phrase messages in the imperative mood (e.g., `adjust submissions pagination`). Open PRs against `dev`, include a brief change summary, test artifacts, and references to linked issues or Globus tickets. Secure peer review before merging; once validated in the dev environment, raise a `dev`→`main` PR for production promotion. + +## Security & Configuration Tips +Store Globus credentials in environment variables or `.mdfsecrets`; never commit secrets. 
Use helper utilities such as `scripts/get_mdf_token.py` and `scripts/status_versions.py` when troubleshooting to avoid manual token handling. Coordinate any IAM or policy modifications under `infra/` with the platform team and verify logs via CloudWatch using environment-scoped credentials. diff --git a/README.md b/README.md index eee1e10..5796e74 100644 --- a/README.md +++ b/README.md @@ -1,107 +1,383 @@ # MDF Connect -The Materials Data Facility Connect service is the ETL flow to deeply index datasets into MDF Search. It is not intended to be run by end-users. To submit data to the MDF, visit the [Materials Data Facility](https://materialsdatafacility.org). - -# Architecture -The MDF Connect service is a serverless REST service that is deployed on AWS. -It consists of an AWS API Gateway that uses a lambda function to authenticate -requests against GlobusAuth. If authorised, the endpoints trigger AWS lambda -functions. Each endpoint is implemented as a lambda function contained in a -python file in the [aws/](aws/) directory. The lambda functions are deployed -via GitHub actions as described in a later section. - -The API Endpoints are: -* [POST /submit](aws/submit.py): Submits a dataset to the MDF Connect service. This triggers a Globus Automate flow -* [GET /status](aws/status.py): Returns the status of a dataset submission -* [POST /submissions](aws/submissions.py): Forms a query and returns a list of submissions - -# Globus Automate Flow -The Globus Automate flow is a series of steps that are triggered by the POST -/submit endpoint. The flow is defined using a python dsl that can be found -in [automate/minimus_mdf_flow.py](automate/minimus_mdf_flow.py). At a high -level the flow: -1. Notifies the admin that a dataset has been submitted -2. Checks to see if the data files have been updated or if this is a metadata only submission -3. If there is a dataset, it starts a globus transfer -4. 
Once the transfer is complete it may trigger a curation step if the organization is configured to do so -5. A DOI is minted if the organization is configured to do so -6. The dataset is indexed in MDF Search -7. The user is notified of the completion of the submission - - -# Development Workflow -Changes should be made in a feature branch based off of the dev branch. Create -PR and get a friend to review your changes. Once the PR is approved, merge it -into the dev branch. The dev branch is automatically deployed to the dev -environment. Once the changes have been tested in the dev environment, create a -PR from dev to main. Once the PR is approved, merge it into main. The main -branch is automatically deployed to the prod environment. - -# Deployment -The MDF Connect service is deployed on AWS into development and production -environments. The automate flow is deployed into the Globus Automate service via -a second GitHub action. - -## Deploy the Automate Flow -Changes to the automate flow are deployed via a GitHub action, triggered by the -push of a new GitHub release. If the release is tagged as "pre-release" it will -be deployed to the dev environment, otherwise it will be deployed to the prod -environment. - -The flow IDs for dev and prod are stored in -[automate/mdf_dev_flow_info.json](automate/mdf_dev_flow_info.json) and -[automate/mdf_prod_flow_info.json](automate/mdf_prod_flow_info.json) -respectively. The flow ID is stored in the `flow_id` key. - -### Deploy a Dev Release of the Flow -1. Merge your changes into the `dev` branch -2. On the GitHub website, click on the _Release_ link on the repo home page. -3. Click on the _Draft a new release_ button -4. Fill in the tag version as `X.Y.Z-alpha.1` where X.Y.Z is the version number. You can use subsequent alpha tags if you need to make further changes. -5. Fill in the release title and description -6. Select `dev` as the Target branch -7. Check the _Set as a pre-release_ checkbox -8. 
Click the _Publish release_ button - -### Deploy a Prod Release of the Flow -1. Merge your changes into the `main` branch -2. On the GitHub website, click on the _Release_ link on the repo home page. -3. Click on the _Draft a new release_ button -4. Fill in the tag version as `X.Y.Z` where X.Y.Z is the version number. -5. Fill in the release title and description -6. Select `main` as the Target branch -7. Check the _Set as the latest release_ checkbox -8. Click the _Publish release_ button - -You can verify deployment of the flows in the -[Globus Automate Console](https://app.globus.org/flows/library). - - -## Deploy the MDF Connect Service -The MDF Connect service is deployed via a GitHub action. The action is triggered -by a push to the dev or main branch. The action will deploy the service to the -dev or prod environment respectively. - -## Updating Schemas -Schemas and the MDF organization database are managed in the automate branch -of the [Data Schemas Repo](https://github.com/materials-data-facility/data-schemas/tree/automate). - -The schema is deployed into the docker images used to serve up the lambda -functions. - -## Reviewing Logs -- [Dev Logs](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252FMDF-Connect2-submit-dev/log-events/) -- [Prod Logs](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252FMDF-Connect2-submit-prod/log-events/) - - -# Running Tests -To run the tests first make sure that you are running python 3.7.10. Then install the dependencies: - - $ cd aws/tests - $ pip3 install -r requirements-test.txt - -Now you can run the tests using the command: - - $ PYTHONPATH=.. python -m pytest --ignore schemas - -# Support + +The Materials Data Facility Connect service is the backend for submitting, curating, and publishing datasets to MDF Search. 
For the Python client, see [connect_client](https://github.com/materials-data-facility/connect_client). + +## v2 Backend + +The v2 backend lives in `aws/v2/` and is a complete rewrite: a single FastAPI application deployed to AWS Lambda via Mangum + SAM. It replaces the old per-endpoint Lambda functions and Terraform deployment. + +**Stack**: Python 3.12, FastAPI, Pydantic v2, DynamoDB, Globus HTTPS storage, DataCite DOI minting, Globus Search, AWS SAM. + +### What it does + +- **Dataset submission**: submit → pending_curation → approved (DOI minted) → published (indexed to Globus Search) +- **Versioning**: update existing datasets with automatic version incrementing, version history via `GET /versions/{source_id}` +- **Streaming**: create stream → upload files to Globus HTTPS → snapshot to dataset → close with DOI +- **Curation**: pending list, approve/reject, curator guards +- **Discovery**: search, dataset cards, citations (BibTeX/APA/RIS), file preview +- **Auth**: Globus token validation (prod) or `X-User-Id` headers (dev) + +### API endpoints (29 total) + +| Group | Endpoints | +|-------|-----------| +| **Submissions** | `POST /submit`, `GET /versions/{id}`, `GET /status/{id}`, `POST /status/update`, `GET /submissions` | +| **Streams** | `POST /stream/create`, `POST ../append`, `GET /stream/{id}`, `POST ../close`, `POST ../snapshot` | +| **Files** | `POST ../upload`, `POST ../upload-url`, `POST ../upload-confirm`, `POST ../download-url`, `GET ../files` | +| **Curation** | `GET /curation/pending`, `GET /curation/{id}`, `POST ../approve`, `POST ../reject` | +| **Search** | `GET /search` | +| **Cards** | `GET /card/{id}`, `GET /citation/{id}` | +| **Preview** | `GET /stream/../preview`, `GET /preview/{id}`, `GET ../files`, `GET ../files/{path}`, `GET ../sample` | +| **Health** | `GET /health` | + +### Deployment architecture + +``` +Client (mdf_agent CLI / SDK) AWS us-east-1 +───────────────────────────── ──────────────── +Bearer: auth.globus.org token 
──────────> API Gateway (HttpApi) +X-Globus-Token: data token │ + ▼ + ┌─── ApiFunction (Lambda) ───┐ + │ FastAPI + Mangum │ + │ Auth, Submit, Stream, │ + │ Search, Curation, Cards │ + └──────┬─────────┬────────────┘ + │ │ + ┌──────┴───┐ ┌──┴──────────────┐ + │ DynamoDB │ │ SQS │ + │ submissions│ │ async-jobs │ + │ streams │ └──┬──────────────┘ + └───────────┘ │ + ▼ + ┌─── AsyncWorkerFunction (Lambda) ──┐ + │ DOI minting (DataCite) │ + │ Globus Search ingest │ + │ Dataset profiling │ + └────────────────────────────────────┘ +``` + +### Internal service architecture + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ FastAPI Application (v2/app/__init__.py) │ +│ │ +│ Middleware: CORS, request logging │ +│ │ +│ ┌─── Routers (v2/app/routers/) ─────────────────────────────────────────┐ │ +│ │ │ │ +│ │ submissions.py streams.py files.py │ │ +│ │ ├ POST /submit ├ POST /stream/create ├ POST ../upload │ │ +│ │ ├ GET /versions/{id} ├ POST ../append ├ POST ../upload-url │ │ +│ │ ├ GET /status/{id} ├ GET /stream/{id} ├ POST ../upload-confirm│ │ +│ │ ├ POST /status/update ├ POST ../close ├ POST ../download-url │ │ +│ │ └ GET /submissions └ POST ../snapshot └ GET ../files │ │ +│ │ │ │ +│ │ curation.py search.py cards.py preview.py │ │ +│ │ ├ GET /curation/pending├ GET /search├ GET /card/{id}├ GET stream.. │ │ +│ │ ├ GET /curation/{id} │ └ GET /cite/{id}├ GET dataset.. 
│ │ +│ │ ├ POST ../approve │ └ GET ../sample │ │ +│ │ └ POST ../reject │ │ │ +│ └───────────────────────────────────────────────────────────────────────┘ │ +│ │ │ │ │ +│ ▼ ▼ ▼ │ +│ ┌── Auth ──────┐ ┌─ Dependencies ─┐ ┌── Helpers ──────────────────────┐ │ +│ │ (auth.py) │ │ (deps.py) │ │ metadata.py DatasetMetadata │ │ +│ │ │ │ │ │ citation.py BibTeX/APA/RIS │ │ +│ │ dev mode: │ │ Singletons: │ │ search.py full-text search │ │ +│ │ X-User-Id │ │ submission │ │ dataset_card.py card builder │ │ +│ │ │ │ store │ │ preview.py file previews │ │ +│ │ production: │ │ stream store │ │ profiler.py dataset profiles │ │ +│ │ Globus │ │ storage │ │ curation.py DOI + approve │ │ +│ │ userinfo() │ │ backend │ │ datacite.py DOI minting │ │ +│ └──────────────┘ └──────┬─────────┘ └─────────────────────────────────┘ │ +│ │ │ +└──────────────────────────┼──────────────────────────────────────────────────┘ + │ + ┌────────────────┼────────────────┐ + ▼ ▼ ▼ +┌─── Store Layer ──┐ ┌─ Stream Store ─┐ ┌── Storage Layer ──────────────────┐ +│ (store.py) │ │(stream_store.py)│ │ (storage/) │ +│ │ │ │ │ │ +│ SubmissionStore │ │ StreamStore │ │ StorageBackend │ +│ (abstract) │ │ (abstract) │ │ (abstract) │ +│ │ │ │ │ │ │ │ │ +│ ├─ Dynamo │ │ ├─ Dynamo │ │ ├─ GlobusHTTPSStorage (prod) │ +│ │ SubmStore │ │ │ StrmStore│ │ │ PUT/GET to data.mdf.org │ +│ │ │ │ │ │ │ │ │ +│ └─ Sqlite │ │ └─ Sqlite │ │ ├─ S3Storage │ +│ SubmStore │ │ StrmStore│ │ │ AWS S3 bucket │ +│ │ │ │ │ │ │ +│ Operations: │ │ Operations: │ │ └─ LocalStorage (dev) │ +│ put, get, list │ │ create, get │ │ Local filesystem │ +│ upsert, update │ │ append, close │ │ │ +│ list_by_user │ │ update_meta │ │ Operations: │ +│ list_by_org │ │ list_all │ │ store_file, get_file │ +│ list_by_status │ │ │ │ get_upload_url (presigned) │ +│ update_profile │ │ │ │ get_download_url │ +│ scan_transfers │ │ │ │ list_files │ +└──────────────────┘ └────────────────┘ └──────────────────────────────────┘ + + ┌──────────────────────────────────────────────┐ + 
│ Async Job Dispatch (async_jobs.py) │ + │ │ + │ Job types: │ + │ ├ profile_submission (scan data files) │ + │ ├ mint_submission_doi (DataCite API) │ + │ ├ mint_stream_doi (DataCite API) │ + │ ├ publish_submission (search index + DOI) │ + │ ├ transfer_data (Globus Transfer) │ + │ └ cleanup_transfers (ACL cleanup) │ + │ │ + │ Dispatchers: │ + │ ├ InlineJobDispatcher (dev: sync) │ + │ ├ SQSJobDispatcher (prod: async) │ + │ └ SqliteJobDispatcher (test: queued) │ + └──────────────────────────────────────────────┘ +``` + +### Data flow: dataset submission to publication + +``` +Researcher CLI/SDK Backend External +────────── ─────── ─────── ──────── + │ │ │ │ + ├─ mdf publish ─────────────>│ │ │ + │ ├─ upload files (HTTPS PUT)│─────────────────────────>│ Globus + │ │ │ │ Storage + │ ├─ POST /submit ──────────>│ │ + │ │ ├─ validate metadata │ + │ │ ├─ generate source_id │ + │ │ ├─ version (1.0 or +0.1) │ + │ │ ├─ store (pending_curation) │ + │ │ ├─ enqueue profile job ────>│ Async + │ │<── {source_id, version} ─┤ │ Worker + │ │ │ │ + │ │ │ Profile job runs: │ + │ │ │ scan files, build │ + │ │ │ schema + stats │ + │ │ │ │ +Curator │ │ │ +────── │ │ │ + ├─ mdf approve ─────────────>│ │ │ + │ ├─ POST /curation/{id}/approve ─>│ │ + │ │ ├─ update status: approved │ + │ │ ├─ enqueue publish job ────>│ Async + │ │<── {success, doi} ───────┤ │ Worker + │ │ │ │ + │ │ │ Publish job runs: │ + │ │ │ mint DOI (DataCite) │ + │ │ │ index (Globus Search)│ + │ │ │ status → published │ +``` + +## Prerequisites + +```bash +# AWS SAM CLI +brew install aws-sam-cli # macOS +# or: pip install aws-sam-cli + +# AWS credentials +aws configure + +# Verify +aws sts get-caller-identity +``` + +## Environments + +| Environment | Stack | Auth | DataCite | Search | Curators | +|-------------|-------|------|----------|--------|----------| +| **dev** | `mdf-connect-v2-dev` | `X-User-Id` headers | Mock | Mock | All users | +| **staging** | `mdf-connect-v2-staging` | Globus tokens | Test API 
(`Globus.TEST`) | Test index | All users | +| **prod** | `mdf-connect-v2-prod` | Globus tokens | Test API (switch to real later) | Test index (switch to real later) | All users (switch to group-based later) | + +All environments are fully separate CloudFormation stacks with their own DynamoDB tables, Lambda functions, API Gateway, and SQS queues. + +## Deploying + +### Dev (no external dependencies) + +```bash +cd aws +sam build && ./deploy.sh dev +``` + +This creates a self-contained stack. No Globus credentials needed — auth uses `X-User-Id` headers, storage is local, DataCite is mocked. + +### Staging + +Requires Globus credentials stored in AWS SSM Parameter Store: + +```bash +# One-time: store Globus credentials (already done for staging) +aws ssm put-parameter --name /mdf/globus-client-id \ + --value "YOUR_CLIENT_ID" --type String --region us-east-1 +aws ssm put-parameter --name /mdf/globus-client-secret \ + --value "YOUR_CLIENT_SECRET" --type SecureString --region us-east-1 +``` + +DataCite and Search credentials are in `samconfig.toml` for staging. Then: + +```bash +cd aws +sam build && ./deploy.sh staging +``` + +### Production + +Same SSM prerequisites as staging. The prod config in `samconfig.toml` currently uses **test credentials** (DataCite test API, test search index) so the stack can be deployed and validated before switching to real credentials. 
+ +```bash +cd aws +sam build && ./deploy.sh prod +``` + +#### Switching prod to real credentials + +When ready to go live, update `samconfig.toml` `[prod]` section: + +```toml +[prod.deploy.parameters] +parameter_overrides = "Environment=prod AuthMode=production AllowAllCurators=false DataCiteUsername=REAL_USERNAME DataCitePassword=REAL_PASSWORD DataCiteApiUrl=https://api.datacite.org DataCitePrefix=10.18126 UseMockDatacite=false SearchIndexUUID=REAL_INDEX_UUID TestSearchIndexUUID=TEST_INDEX_UUID" +``` + +Or store DataCite credentials in SSM (deploy.sh will pick them up automatically): + +```bash +aws ssm put-parameter --name /mdf/datacite-username \ + --value "REAL_USERNAME" --type String --region us-east-1 +aws ssm put-parameter --name /mdf/datacite-password \ + --value "REAL_PASSWORD" --type SecureString --region us-east-1 +aws ssm put-parameter --name /mdf/datacite-api-url \ + --value "https://api.datacite.org" --type String --region us-east-1 +aws ssm put-parameter --name /mdf/datacite-prefix \ + --value "10.18126" --type String --region us-east-1 +``` + +Then redeploy: `sam build && ./deploy.sh prod` + +### Quick deploy (code only, skips CloudFormation) + +For Lambda code changes that don't touch infrastructure: + +```bash +cd aws +./deploy.sh quick staging # or: quick prod +``` + +### Local development + +```bash +cd aws +./deploy.sh local +# Server starts at http://127.0.0.1:8080 +# Uses SQLite, local storage, mock DataCite, dev auth +``` + +## After deploying + +```bash +# Get the API URL +./deploy.sh status staging + +# Tail Lambda logs +./deploy.sh logs staging + +# Health check +curl https://YOUR_API_URL/health + +# Full teardown (removes stack, keeps DynamoDB tables) +./deploy.sh teardown dev +``` + +## SSM Parameters + +| Parameter | Required for | Description | +|-----------|-------------|-------------| +| `/mdf/globus-client-id` | staging, prod | Globus confidential app client ID | +| `/mdf/globus-client-secret` | staging, prod | Globus 
confidential app client secret | +| `/mdf/datacite-username` | prod (optional) | DataCite repository ID — overrides samconfig | +| `/mdf/datacite-password` | prod (optional) | DataCite repository password | +| `/mdf/datacite-api-url` | prod (optional) | `https://api.datacite.org` for real DOIs | +| `/mdf/datacite-prefix` | prod (optional) | DOI prefix (e.g., `10.18126`) | + +## Running tests + +Backend tests run in CI (GitHub Actions) on every pull request to `prod` or `v2-backend-curation`, as configured in `.github/workflows/test-v2.yml`. No AWS credentials are needed — all tests use SQLite, local storage, and mock services. + +```bash +cd aws + +# All v2 tests +python -m pytest v2/test_v2_*.py -v + +# Individual suites +python -m pytest v2/test_v2_publish_pipeline.py -v # Full publish pipeline +python -m pytest v2/test_v2_hardening.py -v # Security hardening +python -m pytest v2/test_v2_integration.py -v # Integration tests +python -m pytest v2/test_v2_versioning.py -v # Dataset versioning +python -m pytest v2/test_v2_async_jobs.py -v # Async job dispatch +``` + +## Environment variables + +For local development, set these in your shell or a `.env` file (requires `pip install python-dotenv`). The `.env` file is loaded automatically when running `python -m v2.app.main`. 
+ +| Variable | Default | Description | +|----------|---------|-------------| +| **Core** | | | +| `STORE_BACKEND` | `dynamo` | `sqlite` (dev) or `dynamo` (prod) | +| `SQLITE_PATH` | `/tmp/mdf_connect_v2.db` | Database path when using SQLite | +| `AUTH_MODE` | `dev` | `dev` (X-User-Id headers) or `production` (Globus tokens) | +| `ASYNC_DISPATCH_MODE` | `inline` | `inline` (sync), `sqs` (prod), or `sqlite` (test) | +| `LOG_LEVEL` | `INFO` | Logging level | +| **Storage** | | | +| `STORAGE_BACKEND` | `local` | `local`, `s3`, or `globus` | +| `FILE_STORE_PATH` | `/tmp/mdf_files` | Filesystem path when `STORAGE_BACKEND=local` | +| `S3_BUCKET` | — | S3 bucket when `STORAGE_BACKEND=s3` | +| `GLOBUS_ENDPOINT_ID` | NCSA UUID | Globus endpoint when `STORAGE_BACKEND=globus` | +| `GLOBUS_HTTPS_SERVER` | `data.materialsdatafacility.org` | HTTPS hostname for Globus storage | +| **Auth** | | | +| `LOCAL_USER_ID` | `local-user` | Dev-mode user identity | +| `LOCAL_USER_EMAIL` | `local@example.com` | Dev-mode user email | +| `GLOBUS_CLIENT_ID` | — | Globus confidential app client ID (prod) | +| `GLOBUS_CLIENT_SECRET` | — | Globus confidential app client secret (prod) | +| `ALLOW_ALL_CURATORS` | `false` | `true` lets all users curate (dev only) | +| `CURATOR_USER_IDS` | — | Comma-separated Globus user IDs | +| `CURATOR_GROUP_IDS` | — | Comma-separated Globus group UUIDs | +| **DataCite** | | | +| `USE_MOCK_DATACITE` | `false` | `true` for mock DOI minting (dev/test) | +| `DATACITE_USERNAME` | — | DataCite repository ID | +| `DATACITE_PASSWORD` | — | DataCite repository password | +| `DATACITE_PREFIX` | `10.23677` | DOI prefix | +| `DATACITE_TEST_MODE` | `true` | Use DataCite test API | +| **Search** | | | +| `USE_MOCK_SEARCH` | `false` | `true` for mock search (dev/test) | +| `SEARCH_INDEX_UUID` | — | Production Globus Search index | +| `TEST_SEARCH_INDEX_UUID` | — | Test Globus Search index | +| **Limits** | | | +| `MAX_SUBMIT_METADATA_BYTES` | `262144` | Max metadata payload 
size | +| `MAX_SUBMIT_DATA_SOURCES` | `2000` | Max data sources per submission | +| `MAX_SUBMIT_AUTHORS` | `1000` | Max authors per submission | +| `MAX_STREAM_APPEND_COUNT` | `10000` | Max records per stream append | +| `CORS_ALLOWED_ORIGINS` | `*` | CORS allowed origins | + +## Key configuration files + +| File | Purpose | +|------|---------| +| `aws/template.yaml` | SAM/CloudFormation template — Lambda, API Gateway, DynamoDB, SQS, S3 | +| `aws/samconfig.toml` | Per-environment deploy config (dev, staging, prod) | +| `aws/deploy.sh` | Deploy script — `dev`, `staging`, `prod`, `quick`, `local`, `teardown`, `logs`, `status` | +| `aws/requirements.txt` | Python dependencies bundled into Lambda | + +## v1 (legacy) + +The v1 system (`aws/submit.py`, `aws/status.py`, `aws/automate_manager.py`, `infra/`) uses per-endpoint Lambda functions deployed via Terraform and GitHub Actions, orchestrated by Globus Automate Flows. It remains operational on the `prod` branch. The v2 backend runs on completely separate infrastructure (different stack name, tables, API Gateway) and can be deployed in parallel. + +## Support + This work was performed under financial assistance award 70NANB14H012 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the [Center for Hierarchical Material Design (CHiMaD)](http://chimad.northwestern.edu). This work was performed under the following financial assistance award 70NANB19H005 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD). This work was also supported by the National Science Foundation as part of the [Midwest Big Data Hub](http://midwestbigdatahub.org) under NSF Award Number: 1636950 "BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate". 
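The "Environment variables" table in the README above lists each variable with a default and its accepted values. As a rough illustration of how such a table could be read at startup, here is a minimal sketch. The `Settings` class, `from_env`, and `_as_bool` are hypothetical names chosen for this example and are not taken from the v2 code; only the variable names, defaults, and allowed values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Allowed values as listed in the README table (assumed to be exhaustive).
_ALLOWED = {
    "STORE_BACKEND": {"sqlite", "dynamo"},
    "AUTH_MODE": {"dev", "production"},
    "ASYNC_DISPATCH_MODE": {"inline", "sqs", "sqlite"},
    "STORAGE_BACKEND": {"local", "s3", "globus"},
}


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the documented variables."""

    store_backend: str
    auth_mode: str
    async_dispatch_mode: str
    storage_backend: str
    use_mock_datacite: bool
    max_submit_metadata_bytes: int

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        """Build settings from a mapping, falling back to the README defaults."""
        source = os.environ if env is None else env
        defaults = {
            "STORE_BACKEND": "dynamo",
            "AUTH_MODE": "dev",
            "ASYNC_DISPATCH_MODE": "inline",
            "STORAGE_BACKEND": "local",
        }
        chosen = {}
        for name, default in defaults.items():
            value = source.get(name, default)
            if value not in _ALLOWED[name]:
                raise ValueError(f"{name}={value!r} is not one of {sorted(_ALLOWED[name])}")
            chosen[name] = value
        return cls(
            store_backend=chosen["STORE_BACKEND"],
            auth_mode=chosen["AUTH_MODE"],
            async_dispatch_mode=chosen["ASYNC_DISPATCH_MODE"],
            storage_backend=chosen["STORAGE_BACKEND"],
            use_mock_datacite=_as_bool(source.get("USE_MOCK_DATACITE", "false")),
            max_submit_metadata_bytes=int(source.get("MAX_SUBMIT_METADATA_BYTES", "262144")),
        )
```

Passing an explicit mapping, for example `Settings.from_env({"STORE_BACKEND": "sqlite", "USE_MOCK_DATACITE": "true"})`, mirrors the values the CI workflow sets and keeps tests independent of the process environment.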
diff --git a/automate/simplified_mdf_flow.py b/automate/simplified_mdf_flow.py new file mode 100644 index 0000000..b68ec33 --- /dev/null +++ b/automate/simplified_mdf_flow.py @@ -0,0 +1,325 @@ +"""Simplified MDF Ingest Flow - v2. + +This flow handles file transfer only. Curation is now handled by the MDF server API. + +Flow steps: +1. Email admin about new submission +2. Transfer files from user endpoint to MDF repository +3. Notify user of transfer completion + +Curation and DOI minting are handled separately via: +- GET /curation/pending - List pending submissions +- POST /curation/:id/approve - Approve + mint DOI +- POST /curation/:id/reject - Reject with reason +""" + +import action_providers +from globus_automate_flow import GlobusAutomateFlowDef + + +def email_submission_to_admin(sender_email, admin_email): + """Notify admin of new submission.""" + return { + "EmailSubmission": { + "Type": "Action", + "ActionUrl": "https://actions.globus.org/notification/notify", + "ExceptionOnActionFailure": False, # Continue even if email fails + "Parameters": { + "body_mimetype": "text/html", + "sender": sender_email, + "destination": admin_email, + "subject": "New MDF Dataset Submission", + "body_template": """ +

New Dataset Submitted

+

A new dataset has been submitted to the Materials Data Facility.

+ + + + + +
Title: $title
Source ID: $source_id
Submitter: $submitting_user_email
Organization: $organization
+

Review pending submissions at: $curation_url

+ + """, + "body_variables": { + "title.$": "$.dataset_mdata.dc.titles[0].title", + "source_id.$": "$.dataset_mdata.mdf.source_id", + "submitting_user_email.$": "$.submitting_user_email", + "organization.$": "$.dataset_mdata.mdf.organization", + "curation_url.$": "$.curation_url", + }, + "notification_method": "any", + "notification_priority": "high", + "send_credentials": [ + { + "credential_method": "email", + "credential_type": "ses", + "credential_value.$": "$._private_email_credentials", + } + ], + "__Private_Parameters": ["send_credentials"], + }, + "ResultPath": "$.EmailSubmissionResult", + "Next": "CheckMetadataOnly", + }, + } + + +def check_metadata_only(): + """Check if this is a metadata-only update (no file transfer needed).""" + return { + "CheckMetadataOnly": { + "Comment": "Skip file transfer if this is a metadata-only update", + "Type": "Choice", + "Choices": [ + { + "Variable": "$.update_metadata_only", + "BooleanEquals": True, + "Next": "TransferComplete", + } + ], + "Default": "CreateDatasetDir", + } + } + + +def file_transfer_steps(): + """Transfer files from user endpoint to MDF repository.""" + return { + "CreateDatasetDir": { + "Comment": "Create the dataset directory", + "Type": "Action", + "ActionUrl": "https://transfer.actions.globus.org/mkdir", + "ExceptionOnActionFailure": False, + "Parameters": { + "endpoint_id.$": "$.user_transfer_inputs.destination_endpoint_id", + "path.$": "$.user_transfer_inputs.dataset_path", + }, + "ResultPath": "$.CreateDatasetDirResult", + "Next": "CreateVersionDir", + }, + "CreateVersionDir": { + "Comment": "Create the version subdirectory", + "Type": "Action", + "ActionUrl": "https://transfer.actions.globus.org/mkdir", + "ExceptionOnActionFailure": True, + "Parameters": { + "endpoint_id.$": "$.user_transfer_inputs.destination_endpoint_id", + "path.$": "$.user_transfer_inputs.transfer_items[0].destination_path", + }, + "ResultPath": "$.CreateVersionDirResult", + "Catch": [ + { + "ErrorEquals": 
["ActionFailedException", "States.Runtime", "EndpointError"], + "ResultPath": "$.CreateVersionDirResult", + "Next": "TransferFailed", + } + ], + "Next": "AddUserPermissions", + }, + "AddUserPermissions": { + "Comment": "Temporarily add write permissions for the submitting user", + "Type": "Action", + "ActionUrl": "https://transfer.actions.globus.org/manage_permission", + "ExceptionOnActionFailure": False, + "Parameters": { + "operation": "CREATE", + "endpoint_id.$": "$.user_transfer_inputs.destination_endpoint_id", + "path.$": "$.user_transfer_inputs.transfer_items[0].destination_path", + "principal_type": "identity", + "principal.$": "$.user_transfer_inputs.submitting-user-id", + "permissions": "rw", + }, + "ResultPath": "$.UserPermissionResult", + "Catch": [ + { + "ErrorEquals": ["ActionFailedException", "States.Runtime", "EndpointError"], + "ResultPath": "$.UserPermissionResult", + "Next": "TransferFailed", + } + ], + "Next": "ExecuteTransfer", + }, + "ExecuteTransfer": { + "Comment": "Transfer data from user endpoint to MDF repository", + "Type": "Action", + "ActionUrl": "https://transfer.actions.globus.org/transfer", + "WaitTime": 86400, # 24 hours max + "RunAs": "SubmittingUserV2", + "Parameters": { + "source_endpoint.$": "$.user_transfer_inputs.source_endpoint_id", + "destination_endpoint.$": "$.user_transfer_inputs.destination_endpoint_id", + "label.$": "$.user_transfer_inputs.label", + "DATA.$": "$.user_transfer_inputs.transfer_items", + }, + "ResultPath": "$.TransferResult", + "Next": "RemoveUserPermissions", + }, + "RemoveUserPermissions": { + "Comment": "Remove temporary write permissions", + "Type": "Action", + "ActionUrl": "https://transfer.actions.globus.org/manage_permission", + "ExceptionOnActionFailure": False, + "Parameters": { + "operation": "DELETE", + "endpoint_id.$": "$.user_transfer_inputs.destination_endpoint_id", + "rule_id.$": "$.UserPermissionResult.details.access_id", + }, + "ResultPath": "$.RemovePermissionResult", + "Next": 
"CheckTransferStatus", + }, + "CheckTransferStatus": { + "Type": "Choice", + "Choices": [ + { + "Variable": "$.TransferResult.status", + "StringEquals": "SUCCEEDED", + "Next": "TransferComplete", + } + ], + "Default": "TransferFailed", + }, + } + + +def completion_states(sender_email): + """Handle transfer completion or failure.""" + return { + "TransferComplete": { + "Type": "ExpressionEval", + "Parameters": { + "status": "transfer_complete", + "message.=": "'File transfer complete for ' + `$.dataset_mdata.mdf.source_id` + '. Submission is now pending curation.'", + }, + "ResultPath": "$.FinalState", + "Next": "NotifyUserSuccess", + }, + "TransferFailed": { + "Type": "ExpressionEval", + "Parameters": { + "status": "transfer_failed", + "message.=": "'File transfer failed for ' + `$.dataset_mdata.mdf.source_id` + '. Please check the flow logs.'", + }, + "ResultPath": "$.FinalState", + "Next": "NotifyUserFailure", + }, + "NotifyUserSuccess": { + "Type": "Action", + "ActionUrl": "https://actions.globus.org/notification/notify", + "ExceptionOnActionFailure": False, + "Parameters": { + "body_mimetype": "text/html", + "sender": sender_email, + "destination.$": "$.submitting_user_email", + "subject": "MDF Submission - Transfer Complete", + "body_template": """ + +

Transfer Complete
+ Your dataset $source_id has been transferred to the MDF repository.
+ Your submission is now pending curation. You will receive another email when it has been reviewed.
+ Thank you for contributing to the Materials Data Facility!
+ + """, + "body_variables": { + "source_id.$": "$.dataset_mdata.mdf.source_id", + }, + "notification_method": "any", + "send_credentials": [ + { + "credential_method": "email", + "credential_type": "ses", + "credential_value.$": "$._private_email_credentials", + } + ], + "__Private_Parameters": ["send_credentials"], + }, + "ResultPath": "$.NotifySuccessResult", + "WaitTime": 300, + "Next": "EndFlow", + }, + "NotifyUserFailure": { + "Type": "Action", + "ActionUrl": "https://actions.globus.org/notification/notify", + "ExceptionOnActionFailure": False, + "Parameters": { + "body_mimetype": "text/html", + "sender": sender_email, + "destination.$": "$.submitting_user_email", + "subject": "MDF Submission - Transfer Failed", + "body_template": """ + +

Transfer Failed
+ Your dataset $source_id failed to transfer.
+ Please check your Globus endpoint permissions and try again.
+ View the flow logs for details.
+ + """, + "body_variables": { + "source_id.$": "$.dataset_mdata.mdf.source_id", + "run_id.$": "$._context.run_id", + }, + "notification_method": "any", + "send_credentials": [ + { + "credential_method": "email", + "credential_type": "ses", + "credential_value.$": "$._private_email_credentials", + } + ], + "__Private_Parameters": ["send_credentials"], + }, + "ResultPath": "$.NotifyFailureResult", + "WaitTime": 300, + "Next": "EndFlow", + }, + "EndFlow": { + "Type": "Pass", + "End": True, + }, + } + + +def flow_def( + sender_email, + admin_email, + flow_permissions, + administered_by, + description="Simplified MDF Ingest Flow - handles file transfer only. Curation via API.", +): + """Build the simplified flow definition.""" + return GlobusAutomateFlowDef( + title="MDF Ingest Flow v2 (Simplified)", + subtitle="Transfer files to MDF repository", + description=description, + visible_to=flow_permissions, + runnable_by=flow_permissions, + administered_by=administered_by, + input_schema={}, + flow_definition={ + "StartAt": "EmailSubmission", + "States": { + **email_submission_to_admin(sender_email, admin_email), + **check_metadata_only(), + **file_transfer_steps(), + **completion_states(sender_email), + }, + }, + ) + + +# What was removed from the original flow: +# +# 1. CurateSubmission - Now handled via POST /curation/:id/approve or /reject +# 2. SendCurationEmail - Admin can use the curation dashboard instead +# 3. ChooseAcceptance - Curation decisions are made via API +# 4. FailCuration - Rejection is handled via API +# 5. NeedDOI / MintDOI - DOI minting happens on approval via API +# 6. AddDoiToSearchRecord - DOI is stored in submission record +# 7. 
SearchIngest - Can be triggered separately after approval +# +# Benefits: +# - Simpler flow with fewer states +# - Curators can use a web dashboard instead of email links +# - DOI minting happens synchronously on approval +# - Better visibility into curation status +# - Easier to add curation workflow features (comments, history, etc.) diff --git a/aws/DEPLOYMENT.md b/aws/DEPLOYMENT.md new file mode 100644 index 0000000..44c618d --- /dev/null +++ b/aws/DEPLOYMENT.md @@ -0,0 +1,195 @@ +# MDF Connect v2 - Deployment Guide + +## Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ API Gateway │ +│ api.materialsdatafacility.org │ +└─────────────────────────────┬───────────────────────────────────────┘ + │ + ┌───────────────────┼───────────────────┐ + │ │ │ + ▼ ▼ ▼ + ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ + │ Submit │ │ Stream │ │ Search │ + │ Card │ │ Upload │ │ Citation │ + │ Status │ │ Close │ │ │ + └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ + │ │ │ + ▼ ▼ ▼ + ┌─────────────────────────────────────────────────────┐ + │ DynamoDB │ + │ submissions │ streams │ + └─────────────────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────┐ + │ S3 │ + │ Stream file storage │ + └─────────────────────────────────────────────────────┘ +``` + +## Prerequisites + +```bash +# Install AWS SAM CLI +brew install aws-sam-cli # macOS +# or: pip install aws-sam-cli + +# Configure AWS credentials +aws configure +``` + +## Quick Start + +```bash +# Local development +make local # Start server on http://127.0.0.1:8080 +make demo # Run demo script + +# Deploy to AWS +make deploy-dev # Deploy to dev (fast, no confirmation) +make deploy-prod # Deploy to production (requires confirmation) +``` + +## Deployment Environments + +| Environment | Stack Name | Usage | +|-------------|------------|-------| +| `dev` | mdf-connect-v2-dev | Development, testing | +| `staging` | 
mdf-connect-v2-staging | Pre-production validation | +| `prod` | mdf-connect-v2-prod | Production | + +## First-Time Setup + +1. **Create S3 buckets for SAM deployments:** + ```bash + aws s3 mb s3://mdf-sam-deployments-dev --region us-east-1 + aws s3 mb s3://mdf-sam-deployments-prod --region us-east-1 + ``` + +2. **Store Globus credentials in SSM Parameter Store:** + ```bash + aws ssm put-parameter \ + --name /mdf/globus-client-id \ + --value "YOUR_GLOBUS_CLIENT_ID" \ + --type SecureString + ``` + +3. **Initial deployment:** + ```bash + sam build + sam deploy --guided # Interactive setup + ``` + +## Making Changes + +The beauty of serverless: **code changes are just deploys**. + +```bash +# 1. Edit code locally +vim v2/submit.py + +# 2. Test locally +make local +curl http://localhost:8080/submit ... + +# 3. Deploy +make deploy-dev + +# 4. Verify in AWS +make logs-dev +``` + +## Cost Estimation + +For ~10 datasets/month + streaming: + +| Service | Estimated Cost | +|---------|---------------| +| Lambda | ~$0 (free tier: 1M requests) | +| API Gateway | ~$0 (free tier: 1M calls) | +| DynamoDB | ~$0 (free tier: 25GB, 25 WCU/RCU) | +| S3 | ~$0.02/GB stored | +| **Total** | **< $5/month** | + +## Monitoring + +```bash +# Tail logs +make logs-dev + +# Check stack status +make status-dev + +# CloudWatch metrics +FUNC_NAME=$(aws cloudformation describe-stack-resources \ + --stack-name mdf-connect-v2-dev \ + --query 'StackResources[?ResourceType==`AWS::Lambda::Function`].PhysicalResourceId' \ + --output text) +aws cloudwatch get-metric-statistics \ + --namespace AWS/Lambda \ + --metric-name Invocations \ + --dimensions Name=FunctionName,Value="$FUNC_NAME" \ + --start-time $(date -v-1d +%Y-%m-%dT%H:%M:%S) \ + --end-time $(date +%Y-%m-%dT%H:%M:%S) \ + --period 3600 \ + --statistics Sum +``` + +## Rollback + +```bash +# Automatic rollback on failure (default) +# Manual rollback to previous version: +aws cloudformation rollback-stack --stack-name mdf-connect-v2-dev +``` + +## 
Adding New Endpoints + +1. **Add router endpoint** in `v2/app/routers/*.py`: + ```python + from fastapi import APIRouter + + router = APIRouter() + + @router.get("/my-endpoint") + async def my_endpoint(): + return {"success": True} + ``` + +2. **Register the router** in `v2/app/__init__.py`: + ```python + from v2.app.routers import my_router + app.include_router(my_router.router) + ``` + +3. **Deploy**: + ```bash + make deploy-dev + ``` + +## Troubleshooting + +### Lambda timeout +Increase in `template.yaml`: +```yaml +Timeout: 60 # seconds +``` + +### DynamoDB throttling +Already using PAY_PER_REQUEST (auto-scaling). If issues persist, check CloudWatch metrics. + +### API Gateway 502 +Check Lambda logs: +```bash +sam logs --stack-name mdf-connect-v2-dev --tail +``` + +## Security Notes + +- All endpoints except `/card`, `/citation`, `/search`, `/status` require Globus Auth +- DynamoDB tables have `DeletionPolicy: Retain` (won't delete on stack removal) +- S3 bucket is encrypted at rest and blocks public access +- Secrets stored in SSM Parameter Store (encrypted) diff --git a/aws/GLOBUS_CONTEXT.md b/aws/GLOBUS_CONTEXT.md new file mode 100644 index 0000000..737003c --- /dev/null +++ b/aws/GLOBUS_CONTEXT.md @@ -0,0 +1,203 @@ +# Globus HTTPS Endpoints: Technical Context for MDF v2 + +This document provides context for AI agents and developers working with Globus HTTPS endpoints in the MDF v2 backend. + +## Overview + +MDF uses Globus HTTPS endpoints for file storage. 
These endpoints provide: +- Direct HTTPS GET/PUT/DELETE operations (no Globus Transfer needed for small files) +- Bearer token authentication via Globus Auth +- 1PB of free storage on the NCSA MDF endpoint + +## Key Identifiers + +``` +NCSA MDF Endpoint UUID: 82f1b5c6-6e9b-11e5-ba47-22000b92c6ec +HTTPS Server: data.materialsdatafacility.org +MDF Native App Client ID: 984464e2-90ab-433d-8145-ac0215d26c8e +``` + +## Authentication + +### Scopes + +To access an endpoint via HTTPS, you need a token with the HTTPS scope: + +``` +https://auth.globus.org/scopes/{endpoint_uuid}/https +``` + +For the NCSA MDF endpoint: +``` +https://auth.globus.org/scopes/82f1b5c6-6e9b-11e5-ba47-22000b92c6ec/https +``` + +### Token Acquisition + +**Interactive (Native App Flow):** +```python +from globus_sdk import NativeAppAuthClient + +client = NativeAppAuthClient("984464e2-90ab-433d-8145-ac0215d26c8e") +client.oauth2_start_flow( + requested_scopes=["https://auth.globus.org/scopes/82f1b5c6.../https"], + refresh_tokens=True, +) +url = client.oauth2_get_authorize_url() +# User visits URL, gets auth code +tokens = client.oauth2_exchange_code_for_tokens(auth_code) +access_token = tokens.by_resource_server["82f1b5c6..."]["access_token"] +``` + +**Service Account (Client Credentials):** +```python +from globus_sdk import ConfidentialAppAuthClient + +client = ConfidentialAppAuthClient(client_id, client_secret) +tokens = client.oauth2_client_credentials_tokens( + requested_scopes="https://auth.globus.org/scopes/{endpoint}/https" +) +``` + +### Token Storage + +Cached tokens are stored at: `~/.mdf/v2_https_tokens.json` + +Format: +```json +{ + "access_token": "AgdNNrB9Y...", + "refresh_token": "AgP9r...", + "expires_at_seconds": 1738456789, + "resource_server": "82f1b5c6-6e9b-11e5-ba47-22000b92c6ec" +} +``` + +## HTTPS Operations + +### Upload (PUT) + +```bash +curl -X PUT "https://data.materialsdatafacility.org/path/to/file.txt" \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: 
text/plain" \ + -d "file contents" +``` + +**Critical:** Parent directories must exist. PUT does not auto-create directories. + +### Download (GET) + +```bash +curl "https://data.materialsdatafacility.org/path/to/file.txt" \ + -H "Authorization: Bearer $TOKEN" +``` + +Public files (if endpoint allows) can be accessed without auth. + +### Delete (DELETE) + +```bash +curl -X DELETE "https://data.materialsdatafacility.org/path/to/file.txt" \ + -H "Authorization: Bearer $TOKEN" +``` + +### Create Directory (MKCOL) + +```bash +curl -X MKCOL "https://data.materialsdatafacility.org/path/to/dir/" \ + -H "Authorization: Bearer $TOKEN" +``` + +**Note:** MKCOL may require additional permissions. Some endpoints return 403. + +## Path Strategy + +Because directories cannot be auto-created, MDF v2 uses a **flat path structure**: + +``` +{base_path}/{stream_id}_{date}_{filename} +``` + +Example: +``` +/tmp/testing/stream-abc123_20260201_data.csv +``` + +This avoids needing to create `streams/abc123/2026-02-01/` directories. + +For production with directory support: +``` +/mdf/streams/{stream_id}/{date}/{filename} +``` + +## User Token Pass-Through + +For proper authorization and audit trails, user operations should use the user's token: + +```python +# In API handler +user_token = request.headers.get("X-Globus-Token") + +# Pass to storage backend +storage.store_file(stream_id, filename, content, user_token=user_token) +``` + +This ensures: +1. **Access control:** Users can only access paths they're authorized for +2. **Audit trail:** Globus logs show the actual user who performed the action +3. **Security:** Server doesn't need blanket write access + +## Common Issues + +### 307 Redirect +HTTPS endpoints may return 307 redirects. Use `follow_redirects=True` in HTTP clients. + +### 404 on PUT +Parent directory doesn't exist. Either: +1. Create directories first with MKCOL +2. 
Use flat path structure (recommended) + +### 403 on MKCOL +Endpoint doesn't allow directory creation via HTTPS. Use Globus Transfer API or flat paths. + +### Token Expired +Access tokens expire (typically 48 hours). Use refresh tokens to get new access tokens. + +## Environment Variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `GLOBUS_ENDPOINT_ID` | Endpoint UUID | NCSA MDF endpoint | +| `GLOBUS_BASE_PATH` | Base path on endpoint | `/tmp/testing` | +| `GLOBUS_HTTPS_SERVER` | HTTPS server hostname | `data.materialsdatafacility.org` | +| `GLOBUS_ACCESS_TOKEN` | Static access token | (none) | +| `GLOBUS_CLIENT_ID` | Client ID for creds flow | (none) | +| `GLOBUS_CLIENT_SECRET` | Client secret | (none) | +| `STORAGE_BACKEND` | Storage type | `local` | + +## Code References + +- `v2/storage/globus_https.py` - Globus storage backend implementation +- `v2/storage/base.py` - Abstract storage interface +- `test_globus_upload.py` - Interactive auth and upload test +- `~/.mdf/v2_https_tokens.json` - Token cache location + +## Integration with mdf_toolbox + +The existing `mdf_toolbox.login()` function can be used to get tokens with all required scopes: + +```python +import mdf_toolbox + +auths = mdf_toolbox.login( + services=[ + "https://auth.globus.org/scopes/82f1b5c6-6e9b-11e5-ba47-22000b92c6ec/https", + # ... other scopes + ], + app_name="MDF", + make_clients=True, +) +``` + +See Foundry's `PubAuths` class for a pattern of managing multiple endpoint tokens. 
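The flat path scheme described under Path Strategy can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the code in `v2/storage/globus_https.py`: the function name `flat_object_path`, the `YYYYMMDD` date format (inferred from the `20260201` example above), and the replacement of path separators in filenames are all hypothetical choices made for illustration.

```python
from datetime import date, datetime, timezone
from typing import Optional


def flat_object_path(base_path: str, stream_id: str, filename: str,
                     on: Optional[date] = None) -> str:
    """Build {base_path}/{stream_id}_{date}_{filename} with no subdirectories.

    Hypothetical helper: the name and the sanitising rule are assumptions.
    """
    # Default to today's date in UTC when no date is supplied.
    day = on or datetime.now(timezone.utc).date()
    # Replace path separators so the object name never implies a directory,
    # because PUT does not create parent directories.
    safe_name = filename.replace("/", "_").replace("\\", "_")
    return f"{base_path.rstrip('/')}/{stream_id}_{day:%Y%m%d}_{safe_name}"


print(flat_object_path("/tmp/testing", "stream-abc123", "data.csv", date(2026, 2, 1)))
# /tmp/testing/stream-abc123_20260201_data.csv
```

Keeping the date inside the object name preserves a rough chronological grouping when listing `{base_path}`, without requiring MKCOL permissions on the endpoint.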
diff --git a/aws/Makefile b/aws/Makefile new file mode 100644 index 0000000..2c985ca --- /dev/null +++ b/aws/Makefile @@ -0,0 +1,108 @@ +# MDF Connect v2 - Development & Deployment +# +# Quick Start: +# make local # Start local development server +# make test # Run tests +# make deploy-dev # Deploy to AWS dev environment + +.PHONY: help local demo test lint build validate deploy-dev deploy-staging deploy-prod logs-dev logs-prod status-dev clean layer-deps + +# Default target +help: + @echo "MDF Connect v2 Backend" + @echo "" + @echo "Development:" + @echo " make local Start local development server (port 8080)" + @echo " make demo Run the demo script" + @echo " make test Run unit tests" + @echo " make lint Run linters" + @echo "" + @echo "Deployment:" + @echo " make build Build SAM application" + @echo " make deploy-dev Deploy to dev environment" + @echo " make deploy-staging Deploy to staging environment" + @echo " make deploy-prod Deploy to production" + @echo "" + @echo "Utilities:" + @echo " make logs-dev Tail logs from dev" + @echo " make clean Clean build artifacts" + +# =========================================================================== +# Development +# =========================================================================== + +local: + @echo "Starting local server on http://127.0.0.1:8080" + @STORE_BACKEND=sqlite AUTH_MODE=dev python -m v2.app.main + +demo: + @python v2/demo_full_workflow.py + +test: + @pytest tests/ -v --tb=short + +lint: + @ruff check v2/ + @mypy v2/ --ignore-missing-imports + +# =========================================================================== +# Deployment +# =========================================================================== + +build: + @sam build + +validate: + @sam validate --lint + +deploy-dev: build + @sam deploy --config-env dev --no-confirm-changeset + @echo "" + @echo "Dev deployment complete! 
API URL:" + @aws cloudformation describe-stacks \ + --stack-name mdf-connect-v2-dev \ + --query 'Stacks[0].Outputs[?OutputKey==`ApiUrl`].OutputValue' \ + --output text + +deploy-staging: build + @sam deploy --config-env staging + @echo "" + @echo "Staging deployment complete!" + +deploy-prod: build + @echo "Deploying to PRODUCTION. This will prompt for confirmation." + @sam deploy --config-env prod + +# =========================================================================== +# Monitoring +# =========================================================================== + +logs-dev: + @sam logs --stack-name mdf-connect-v2-dev --tail + +logs-prod: + @sam logs --stack-name mdf-connect-v2-prod --tail + +status-dev: + @aws cloudformation describe-stacks \ + --stack-name mdf-connect-v2-dev \ + --query 'Stacks[0].{Status:StackStatus,Outputs:Outputs}' \ + --output table + +# =========================================================================== +# Cleanup +# =========================================================================== + +clean: + @rm -rf .aws-sam/ + @find . -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null || true + @echo "Cleaned build artifacts" + +# =========================================================================== +# Layer Management +# =========================================================================== + +layer-deps: + @mkdir -p layers/dependencies/python + @pip install -r requirements.txt -t layers/dependencies/python/ + @echo "Dependencies installed to layers/dependencies/" diff --git a/aws/demo_mdf_v2.py b/aws/demo_mdf_v2.py new file mode 100644 index 0000000..7b0db0c --- /dev/null +++ b/aws/demo_mdf_v2.py @@ -0,0 +1,652 @@ +#!/usr/bin/env python3 +""" +MDF v2 Backend Demo +=================== + +Comprehensive demonstration of the MDF v2 backend capabilities: +1. Stream creation and management (SQLite metadata store) +2. File uploads to local storage +3. File uploads to Globus HTTPS endpoint (1PB free storage) +4. 
Dataset cards and citations +5. Search functionality + +Run with: python demo_mdf_v2.py +""" + +import base64 +import json +import os +import sys +import time +from datetime import datetime, timezone +from pathlib import Path + +# Add the aws directory to path +sys.path.insert(0, str(Path(__file__).parent)) + +# Check for required packages +try: + from rich.console import Console + from rich.panel import Panel + from rich.table import Table + from rich.progress import Progress, SpinnerColumn, TextColumn + from rich.syntax import Syntax + from rich.markdown import Markdown +except ImportError: + print("Please install rich: pip install rich") + sys.exit(1) + +console = Console() + +# ============================================================================ +# Demo Configuration +# ============================================================================ + +DEMO_FILES = [ + { + "filename": "experiment_001.csv", + "content": b"sample_id,temperature_k,pressure_mpa,yield_percent\n1,300,0.1,85.2\n2,350,0.5,91.7\n3,400,1.0,78.3\n4,450,2.0,95.1\n", + "content_type": "text/csv", + "metadata": {"experiment_type": "synthesis", "instrument": "reactor-a"}, + }, + { + "filename": "parameters.json", + "content": b'{"catalyst": "Pt/Al2O3", "flow_rate_sccm": 50, "duration_hours": 4}', + "content_type": "application/json", + "metadata": {"schema_version": "1.0"}, + }, + { + "filename": "notes.txt", + "content": b"Experiment conducted on 2026-01-31.\nObserved unexpected phase transition at 380K.\nRequires further investigation.", + "content_type": "text/plain", + "metadata": {"author": "Dr. 
Jane Smith"}, + }, +] + + +def setup_environment(): + """Configure environment for demo.""" + os.environ.setdefault("STORE_BACKEND", "sqlite") + os.environ.setdefault("SQLITE_PATH", "/tmp/mdf_demo.db") + os.environ.setdefault("USE_MOCK_FLOW", "true") + + # Clean up old demo database + db_path = os.environ.get("SQLITE_PATH") + if db_path and os.path.exists(db_path): + os.remove(db_path) + + +def create_demo_submission(): + """Create a demo submission record for dataset card/citation demo.""" + from v2.store import get_store + import json + + store = get_store() + now = datetime.now(timezone.utc).isoformat() + + # DataCite-style metadata structure + dataset_mdata = { + "dc": { + "titles": [{"title": "High-Throughput Perovskite Synthesis Dataset"}], + "creators": [ + {"givenName": "Jane", "familyName": "Smith", "affiliation": "Argonne National Laboratory"}, + {"givenName": "John", "familyName": "Doe", "affiliation": "University of Chicago"}, + ], + "descriptions": [{ + "description": "A comprehensive dataset of perovskite synthesis experiments conducted using autonomous laboratory workflows. 
Contains XRD patterns, synthesis parameters, and yield measurements for 1,247 samples.", + "descriptionType": "Abstract" + }], + "subjects": [ + {"subject": "perovskite"}, + {"subject": "synthesis"}, + {"subject": "autonomous"}, + {"subject": "materials science"}, + ], + "publisher": "Materials Data Facility", + "publicationYear": "2026", + "resourceType": {"resourceTypeGeneral": "Dataset"}, + "rightsList": [{"rights": "CC-BY-4.0"}], + }, + "mdf": { + "organization": "argonne", + "doi": "10.18126/demo-12345", + } + } + + record = { + "source_id": "demo_perovskite_synthesis_v1", + "version": "1.0", + "dataset_mdata": json.dumps(dataset_mdata), + "status": "published", + "created_at": now, + "updated_at": now, + "user_id": "demo-user", + "organization": "argonne", + "file_count": 1247, + "total_bytes": 2_500_000_000, + "data_sources": ["xrd_patterns.zip", "synthesis_params.csv", "yields.json"], + } + + store.put_submission(record) + return record + + +# ============================================================================ +# Demo Sections +# ============================================================================ + +def demo_header(): + """Display demo header.""" + console.print() + console.print(Panel.fit( + "[bold blue]MDF v2 Backend Demo[/bold blue]\n" + "[dim]Materials Data Facility - Next Generation Data Infrastructure[/dim]", + border_style="blue", + )) + console.print() + + +def demo_stream_lifecycle(): + """Demonstrate stream creation and management.""" + from v2.stream_store import get_stream_store + + console.print(Panel("[bold cyan]1. 
Stream Lifecycle Management[/bold cyan]", expand=False)) + console.print() + + store = get_stream_store() + now = datetime.now(timezone.utc).isoformat() + + # Create stream + import uuid + stream_id = f"stream-{uuid.uuid4().hex[:12]}" + + record = { + "stream_id": stream_id, + "title": "Autonomous Synthesis Campaign - Lab 42", + "lab_id": "argonne-lab-42", + "status": "open", + "file_count": 0, + "total_bytes": 0, + "created_at": now, + "updated_at": now, + "user_id": "demo-user", + "organization": "argonne", + "metadata": {"instrument": "robotic-synthesizer", "campaign": "perovskite-q1-2026"}, + } + + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + transient=True, + ) as progress: + progress.add_task("Creating stream...", total=None) + store.create_stream(record) + time.sleep(0.5) + + console.print(f" [green]✓[/green] Created stream: [bold]{stream_id}[/bold]") + console.print(f" Title: {record['title']}") + console.print(f" Lab ID: {record['lab_id']}") + console.print(f" Status: [green]{record['status']}[/green]") + console.print() + + return stream_id + + +def demo_local_storage(stream_id: str): + """Demonstrate local storage backend.""" + from v2.storage import get_storage_backend, reset_storage_backend + + console.print(Panel("[bold cyan]2. 
Local Storage Backend[/bold cyan]", expand=False)) + console.print() + + # Force local storage + reset_storage_backend() + os.environ["STORAGE_BACKEND"] = "local" + storage = get_storage_backend() + + console.print(f" Backend: [bold]{storage.backend_name}[/bold]") + console.print(f" Base path: {storage.base_path}") + console.print() + + table = Table(title="Uploaded Files (Local Storage)") + table.add_column("Filename", style="cyan") + table.add_column("Size", justify="right") + table.add_column("Checksum (MD5)", style="dim") + table.add_column("Path") + + for file_data in DEMO_FILES: + meta = storage.store_file( + stream_id=stream_id, + filename=file_data["filename"], + content=file_data["content"], + content_type=file_data["content_type"], + metadata=file_data["metadata"], + ) + table.add_row( + meta.filename, + f"{meta.size_bytes} bytes", + meta.checksum_md5[:12] + "...", + meta.path, + ) + + console.print(table) + console.print() + + # List files + files = storage.list_files(stream_id) + console.print(f" [green]✓[/green] {len(files)} files stored locally") + console.print() + + +def demo_globus_storage(stream_id: str): + """Demonstrate Globus HTTPS storage backend.""" + from v2.storage import get_storage_backend, reset_storage_backend + from v2.storage.globus_https import load_cached_token + + console.print(Panel("[bold cyan]3. Globus HTTPS Storage Backend[/bold cyan]", expand=False)) + console.print() + + # Check for Globus token + token = load_cached_token() + if not token: + console.print(" [yellow]⚠[/yellow] No Globus token found. 
Run test_globus_upload.py first to authenticate.") + console.print(" Skipping Globus upload demo.") + console.print() + return None + + # Force Globus storage + reset_storage_backend() + os.environ["STORAGE_BACKEND"] = "globus" + + try: + storage = get_storage_backend() + except Exception as e: + console.print(f" [red]✗[/red] Failed to initialize Globus storage: {e}") + console.print() + return None + + console.print(f" Backend: [bold]{storage.backend_name}[/bold]") + console.print(f" Endpoint: [dim]{storage.endpoint_id}[/dim]") + console.print(f" Base URL: {storage.base_url}") + console.print() + + table = Table(title="Uploaded Files (Globus HTTPS)") + table.add_column("Filename", style="cyan") + table.add_column("Size", justify="right") + table.add_column("Globus URL") + + uploaded_files = [] + + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + transient=True, + ) as progress: + for file_data in DEMO_FILES: + task = progress.add_task(f"Uploading {file_data['filename']}...", total=None) + try: + meta = storage.store_file( + stream_id=stream_id, + filename=file_data["filename"], + content=file_data["content"], + content_type=file_data["content_type"], + metadata=file_data["metadata"], + ) + uploaded_files.append(meta) + table.add_row( + meta.filename, + f"{meta.size_bytes} bytes", + meta.download_url, + ) + except Exception as e: + console.print(f" [red]✗[/red] Failed to upload {file_data['filename']}: {e}") + progress.remove_task(task) + + if uploaded_files: + console.print(table) + console.print() + console.print(f" [green]✓[/green] {len(uploaded_files)} files uploaded to Globus endpoint") + console.print() + console.print(" [dim]Files are now accessible via HTTPS (with auth) at the URLs above.[/dim]") + + console.print() + return uploaded_files + + +def demo_stream_status(stream_id: str, file_count: int, total_bytes: int): + """Demonstrate stream status tracking.""" + from v2.stream_store import 
get_stream_store + + console.print(Panel("[bold cyan]4. Stream Status Tracking[/bold cyan]", expand=False)) + console.print() + + store = get_stream_store() + + # Update stream with file stats + store.append_stream( + stream_id=stream_id, + file_count=file_count, + total_bytes=total_bytes, + ) + + # Get updated status + stream = store.get_stream(stream_id) + + table = Table(title="Stream Status", show_header=False) + table.add_column("Property", style="bold") + table.add_column("Value") + + table.add_row("Stream ID", stream["stream_id"]) + table.add_row("Title", stream["title"]) + table.add_row("Status", f"[green]{stream['status']}[/green]") + table.add_row("File Count", str(stream["file_count"])) + table.add_row("Total Bytes", f"{stream['total_bytes']:,}") + table.add_row("Created", stream["created_at"]) + table.add_row("Last Append", stream.get("last_append_at") or "N/A") + + console.print(table) + console.print() + + +def demo_dataset_card(): + """Demonstrate dataset preview cards.""" + from v2.dataset_card import build_dataset_card + + console.print(Panel("[bold cyan]5. Dataset Preview Card[/bold cyan]", expand=False)) + console.print() + + # Create a demo submission first + record = create_demo_submission() + card = build_dataset_card(record) + + console.print(f" [bold]{card['title']}[/bold]") + console.print(f" [dim]Source ID: {card['source_id']} | Version: {card['version']}[/dim]") + console.print() + + # Authors + authors = ", ".join(card["authors"]) if card["authors"] else "Unknown" + console.print(f" [bold]Authors:[/bold] {authors}") + console.print() + + # Description + desc = card.get("description", "") + if desc: + console.print(f" [bold]Description:[/bold]") + console.print(f" {desc[:200]}{'...' 
if len(desc) > 200 else ''}") + console.print() + + # Keywords + keywords = card.get("keywords", []) + if keywords: + console.print(f" [bold]Keywords:[/bold] {', '.join(keywords)}") + console.print() + + # Stats + stats = card.get("stats", {}) + console.print(f" [bold]Statistics:[/bold]") + console.print(f" Files: {stats.get('file_count', 0):,}") + console.print(f" Size: {stats.get('size_human', 'Unknown')}") + console.print(f" File types: {', '.join(stats.get('file_types', []))}") + console.print() + + # Links + console.print(f" [bold]Links:[/bold]") + for name, url in card.get("links", {}).items(): + if url: + console.print(f" {name}: {url}") + console.print() + + +def demo_citations(): + """Demonstrate citation export.""" + from v2.citation import generate_bibtex, generate_apa, generate_ris + from v2.store import get_store + + console.print(Panel("[bold cyan]6. Citation Export[/bold cyan]", expand=False)) + console.print() + + store = get_store() + record = store.get_submission("demo_perovskite_synthesis_v1", "1.0") + + # BibTeX + console.print(" [bold]BibTeX:[/bold]") + bibtex = generate_bibtex(record) + console.print(Syntax(bibtex, "bibtex", theme="monokai", line_numbers=False, word_wrap=True)) + console.print() + + # APA + console.print(" [bold]APA:[/bold]") + apa = generate_apa(record) + console.print(f" {apa}") + console.print() + + +def demo_search(): + """Demonstrate search functionality.""" + from v2.search import search_datasets, search_streams + from v2.stream_store import get_stream_store + + console.print(Panel("[bold cyan]7. 
Search Functionality[/bold cyan]", expand=False)) + console.print() + + # Create a few more streams for search demo + store = get_stream_store() + now = datetime.now(timezone.utc).isoformat() + + demo_streams = [ + {"title": "Iron Oxide Nanoparticle Synthesis", "organization": "mit"}, + {"title": "Perovskite Solar Cell Characterization", "organization": "stanford"}, + {"title": "Battery Electrolyte Screening", "organization": "argonne"}, + ] + + import uuid + for s in demo_streams: + store.create_stream({ + "stream_id": f"stream-{uuid.uuid4().hex[:8]}", + "title": s["title"], + "status": "open", + "file_count": 0, + "total_bytes": 0, + "created_at": now, + "updated_at": now, + "user_id": "demo-user", + "organization": s["organization"], + }) + + # Search streams + console.print(" [bold]Search: 'perovskite'[/bold]") + results = search_streams("perovskite") + + if results: + table = Table() + table.add_column("Stream ID", style="cyan") + table.add_column("Title") + table.add_column("Organization") + + for r in results[:5]: + table.add_row( + r["stream_id"][:20] + "...", + r["title"][:40], + r.get("organization", "N/A"), + ) + + console.print(table) + else: + console.print(" [dim]No results found[/dim]") + + console.print() + + # Search datasets + console.print(" [bold]Search datasets: 'synthesis'[/bold]") + results = search_datasets("synthesis") + + if results: + for r in results[:3]: + console.print(f" - {r['title'][:50]}...") + else: + console.print(" [dim]No results found[/dim]") + + console.print() + + +def demo_preview_clone(stream_id: str): + """Demonstrate preview and clone functionality.""" + from v2.preview import generate_preview + from v2.storage import get_storage_backend + + console.print(Panel("[bold cyan]8. 
Data Preview[/bold cyan]", expand=False)) + console.print() + + storage = get_storage_backend() + files = storage.list_files(stream_id) + + if not files: + console.print(" [dim]No files to preview (stream may be empty)[/dim]") + console.print() + return + + console.print(f" Previewing files from stream: {stream_id}") + console.print() + + for f in files[:2]: # Preview first 2 files + content = storage.get_file(f.path) + if not content: + continue + + preview = generate_preview(content, f.filename) + console.print(f" [bold]{f.filename}[/bold] ({f.size_bytes} bytes)") + + if preview["type"] == "csv": + console.print(f" Type: CSV, {preview.get('total_rows', 0)} rows") + headers = preview.get("headers", [])[:5] + console.print(f" Columns: {', '.join(headers)}") + for row in preview.get("rows", [])[:2]: + console.print(f" {row[:5]}...") + + elif preview["type"] == "json": + console.print(f" Type: JSON") + keys = preview.get("top_level_keys", [])[:5] + console.print(f" Keys: {keys}") + + elif preview["type"] == "text": + console.print(f" Type: Text, {preview.get('total_lines', 0)} lines") + for line in preview.get("lines", [])[:2]: + console.print(f" {line[:60]}...") + + console.print() + + console.print(" [dim]Use client.stream_preview(stream_id) for full previews[/dim]") + console.print(" [dim]Use client.stream_clone(stream_id, dest_dir) to download files[/dim]") + console.print() + + +def demo_api_example(): + """Show example API usage.""" + console.print(Panel("[bold cyan]9. 
API Usage Examples[/bold cyan]", expand=False)) + console.print() + + example_code = ''' +# Python client example +from mdf_agent import BackendClient + +client = BackendClient.from_env() + +# Create a stream +stream = client.stream_create( + title="My Experiment Data", + organization="my-lab" +) +print(f"Created: {stream['stream_id']}") + +# Upload files +with open("data.csv", "rb") as f: + result = client.stream_upload( + stream_id=stream["stream_id"], + filename="data.csv", + content=f.read(), + ) + +# Get stream status +status = client.stream_status(stream["stream_id"]) +print(f"Files: {status['file_count']}") + +# Close and publish +client.stream_close(stream["stream_id"]) +''' + + console.print(Syntax(example_code, "python", theme="monokai", line_numbers=True)) + console.print() + + +def demo_summary(): + """Display demo summary.""" + console.print(Panel.fit( + "[bold green]Demo Complete![/bold green]\n\n" + "The MDF v2 backend provides:\n" + " • Stream-based data ingestion for automated labs\n" + " • Multiple storage backends (local, Globus, S3)\n" + " • 1PB free storage on Globus HTTPS endpoints\n" + " • Dataset cards and citation export\n" + " • Full-text search across datasets and streams\n" + " • Data preview (CSV stats, JSON structure, text)\n" + " • Clone/download from Globus to local\n\n" + "[dim]Start the local server: cd cs/aws && ./deploy.sh local[/dim]", + border_style="green", + )) + console.print() + + +# ============================================================================ +# Main +# ============================================================================ + +def main(): + setup_environment() + + demo_header() + + # Stream lifecycle + stream_id = demo_stream_lifecycle() + + # Local storage + demo_local_storage(stream_id) + + # Globus storage (uses same stream_id for comparison) + globus_files = demo_globus_storage(stream_id) + + # Calculate totals + file_count = len(DEMO_FILES) + total_bytes = sum(len(f["content"]) for f in 
DEMO_FILES) + + # If Globus worked, double the count (uploaded to both) + if globus_files: + file_count *= 2 + total_bytes *= 2 + + # Stream status + demo_stream_status(stream_id, file_count, total_bytes) + + # Dataset card + demo_dataset_card() + + # Citations + demo_citations() + + # Search + demo_search() + + # Preview and clone (use Globus stream if available) + if globus_files: + demo_preview_clone(stream_id) + + # API examples + demo_api_example() + + # Summary + demo_summary() + + +if __name__ == "__main__": + main() diff --git a/aws/deploy.sh b/aws/deploy.sh new file mode 100755 index 0000000..e75bf2d --- /dev/null +++ b/aws/deploy.sh @@ -0,0 +1,445 @@ +#!/bin/bash +# MDF v2 Backend Deployment Script +# +# Usage: +# ./deploy.sh dev # Deploy to dev (self-contained, no external deps) +# ./deploy.sh prod # Deploy to production (requires Globus SSM params) +# ./deploy.sh local # Run local server +# ./deploy.sh teardown dev # Completely remove a dev deployment +# +# What "dev" creates (scoped entirely to this stack): +# - CloudFormation stack: mdf-connect-v2-dev +# - S3 bucket: mdf-sam-deployments-dev (deployment artifacts) +# - DynamoDB tables: mdf-submissions-dev, mdf-streams-dev +# - Lambda function: ApiFunction (inside the stack) +# - HTTP API Gateway: (inside the stack) +# +# First-time setup: +# pip install aws-sam-cli +# aws configure # Set up AWS credentials + +set -e + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +REGION="us-east-1" + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +CYAN='\033[0;36m' +NC='\033[0m' + +log() { echo -e "${GREEN}[MDF]${NC} $1"; } +warn() { echo -e "${YELLOW}[MDF]${NC} $1"; } +info() { echo -e "${CYAN}[MDF]${NC} $1"; } +error(){ echo -e "${RED}[MDF]${NC} $1"; exit 1; } + +# --------------------------------------------------------------------------- +# Dependency checks +# --------------------------------------------------------------------------- +check_deps() { + command 
-v sam &>/dev/null || error "AWS SAM CLI not found. Install: pip install aws-sam-cli" + command -v aws &>/dev/null || error "AWS CLI not found. Install: pip install awscli" +} + +check_aws_identity() { + log "Checking AWS credentials..." + local identity + identity=$(aws sts get-caller-identity --output json 2>/dev/null) \ + || error "AWS credentials not configured. Run: aws configure" + local account=$(echo "$identity" | python3 -c "import sys,json; print(json.load(sys.stdin)['Account'])") + local arn=$(echo "$identity" | python3 -c "import sys,json; print(json.load(sys.stdin)['Arn'])") + info "Account: $account" + info "Identity: $arn" +} + +# --------------------------------------------------------------------------- +# S3 bucket for SAM deployment artifacts +# --------------------------------------------------------------------------- +ensure_s3_bucket() { + local bucket=$1 + if aws s3api head-bucket --bucket "$bucket" 2>/dev/null; then + log "S3 bucket $bucket exists" + else + log "Creating S3 bucket $bucket..." + if [[ "$REGION" == "us-east-1" ]]; then + aws s3api create-bucket --bucket "$bucket" --region "$REGION" + else + aws s3api create-bucket --bucket "$bucket" --region "$REGION" \ + --create-bucket-configuration LocationConstraint="$REGION" + fi + # Block public access + aws s3api put-public-access-block --bucket "$bucket" \ + --public-access-block-configuration \ + "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true" + log "Created S3 bucket $bucket (public access blocked)" + fi +} + +# --------------------------------------------------------------------------- +# Build +# --------------------------------------------------------------------------- +build() { + log "Building SAM application..." 
+ sam build +} + +# --------------------------------------------------------------------------- +# Deploy to dev (fully self-contained) +# --------------------------------------------------------------------------- +deploy_dev() { + local stack_name="mdf-connect-v2-dev" + local s3_bucket="mdf-sam-deployments-dev" + + echo "" + info "═══════════════════════════════════════════════════" + info " MDF Connect v2 — Dev Deployment" + info "═══════════════════════════════════════════════════" + info " Stack: $stack_name" + info " Region: $REGION" + info " Auth: dev (X-User-Id headers, no Globus)" + info " Storage: local (no Globus transfers)" + info " DataCite: mock (no real DOIs)" + info " Tables: mdf-submissions-dev, mdf-streams-dev" + info "═══════════════════════════════════════════════════" + echo "" + + check_aws_identity + ensure_s3_bucket "$s3_bucket" + build + + log "Deploying stack $stack_name..." + sam deploy --config-env dev --no-fail-on-empty-changeset + + echo "" + log "Deployment complete!" 
+ echo "" + + # Print the API URL + local api_url + api_url=$(aws cloudformation describe-stacks \ + --stack-name "$stack_name" \ + --region "$REGION" \ + --query 'Stacks[0].Outputs[?OutputKey==`ApiUrl`].OutputValue' \ + --output text 2>/dev/null || echo "") + + if [[ -n "$api_url" && "$api_url" != "None" ]]; then + info "API URL: $api_url" + echo "" + echo " Test it:" + echo " curl $api_url/health" + echo "" + echo " Submit a dataset:" + echo " curl -X POST $api_url/submit \\" + echo " -H 'Content-Type: application/json' \\" + echo " -H 'X-User-Id: test-user' \\" + echo " -d '{\"title\": \"Test Dataset\", \"authors\": [{\"name\": \"Test\"}], \"data_sources\": [\"https://example.com/data.csv\"]}'" + echo "" + echo " Tear down when done:" + echo " ./deploy.sh teardown dev" + fi +} + +# --------------------------------------------------------------------------- +# Deploy to staging/prod (requires Globus credentials) +# --------------------------------------------------------------------------- +deploy_prod() { + local env=$1 + local stack_name="mdf-connect-v2-$env" + + echo "" + info "═══════════════════════════════════════════════════" + local env_label + env_label=$(echo "$env" | awk '{print toupper(substr($0,1,1)) substr($0,2)}') + info " MDF Connect v2 — $env_label Deployment" + info "═══════════════════════════════════════════════════" + echo "" + + check_aws_identity + + # Resolve Globus credentials from SSM + log "Resolving Globus credentials from SSM..." + local globus_id globus_secret + globus_id=$(aws ssm get-parameter \ + --name "/mdf/globus-client-id" \ + --region "$REGION" \ + --query 'Parameter.Value' --output text 2>/dev/null) \ + || error "SSM parameter /mdf/globus-client-id not found. Create it first." + globus_secret=$(aws ssm get-parameter \ + --name "/mdf/globus-client-secret" \ + --region "$REGION" \ + --with-decryption \ + --query 'Parameter.Value' --output text 2>/dev/null) \ + || error "SSM parameter /mdf/globus-client-secret not found. 
Create it first." + log "Globus credentials resolved from SSM" + + # Resolve DataCite credentials from SSM (optional — falls back to env/defaults) + log "Resolving DataCite credentials from SSM..." + local datacite_user datacite_pass datacite_url datacite_prefix + datacite_user=$(aws ssm get-parameter \ + --name "/mdf/datacite-username" \ + --region "$REGION" \ + --query 'Parameter.Value' --output text 2>/dev/null || echo "") + datacite_pass=$(aws ssm get-parameter \ + --name "/mdf/datacite-password" \ + --region "$REGION" \ + --with-decryption \ + --query 'Parameter.Value' --output text 2>/dev/null || echo "") + datacite_url=$(aws ssm get-parameter \ + --name "/mdf/datacite-api-url" \ + --region "$REGION" \ + --query 'Parameter.Value' --output text 2>/dev/null || echo "") + datacite_prefix=$(aws ssm get-parameter \ + --name "/mdf/datacite-prefix" \ + --region "$REGION" \ + --query 'Parameter.Value' --output text 2>/dev/null || echo "") + if [[ -n "$datacite_user" ]]; then + log "DataCite credentials resolved from SSM" + else + warn "DataCite SSM parameters not found — using defaults from samconfig.toml" + fi + + ensure_s3_bucket "mdf-sam-deployments-$env" + build + + # Read base parameter_overrides from samconfig.toml and append credentials. + # CLI --parameter-overrides fully replaces samconfig values, so we must + # pass the complete set here. 
+ local base_params + base_params=$(python3 -c " +try: + import tomllib +except ImportError: + import tomli as tomllib +import sys +with open('samconfig.toml', 'rb') as f: + cfg = tomllib.load(f) +print(cfg.get('${env}', {}).get('deploy', {}).get('parameters', {}).get('parameter_overrides', '')) +" 2>/dev/null || echo "Environment=$env AuthMode=production") + + local all_params="$base_params GlobusClientId=$globus_id GlobusClientSecret=$globus_secret" + + # Append DataCite params if resolved from SSM + [[ -n "$datacite_user" ]] && all_params="$all_params DataCiteUsername=$datacite_user" + [[ -n "$datacite_pass" ]] && all_params="$all_params DataCitePassword=$datacite_pass" + [[ -n "$datacite_url" ]] && all_params="$all_params DataCiteApiUrl=$datacite_url" + [[ -n "$datacite_prefix" ]] && all_params="$all_params DataCitePrefix=$datacite_prefix" + + log "Deploying stack $stack_name..." + sam deploy \ + --config-env "$env" \ + --no-fail-on-empty-changeset \ + --parameter-overrides "$all_params" + + log "Deployment complete!" +} + +# --------------------------------------------------------------------------- +# Teardown — completely remove a deployment +# --------------------------------------------------------------------------- +teardown() { + local env=$1 + [[ -z "$env" ]] && error "Environment required: ./deploy.sh teardown dev" + + local stack_name="mdf-connect-v2-$env" + local s3_bucket="mdf-sam-deployments-$env" + + echo "" + warn "═══════════════════════════════════════════════════" + warn " TEARDOWN: $stack_name" + warn "═══════════════════════════════════════════════════" + warn " This will delete:" + warn " - CloudFormation stack: $stack_name" + warn " - Lambda function and API Gateway" + warn " - S3 bucket: $s3_bucket" + warn "" + warn " DynamoDB tables have DeletionPolicy=Retain and" + warn " will NOT be auto-deleted. 
Delete manually if needed:" + warn " aws dynamodb delete-table --table-name mdf-submissions-$env" + warn " aws dynamodb delete-table --table-name mdf-streams-$env" + warn "═══════════════════════════════════════════════════" + echo "" + + read -p "Type '$env' to confirm teardown: " confirm + [[ "$confirm" != "$env" ]] && error "Aborted." + + log "Deleting CloudFormation stack $stack_name..." + sam delete \ + --stack-name "$stack_name" \ + --region "$REGION" \ + --no-prompts \ + 2>/dev/null || warn "Stack deletion may have partial failures (retained resources)" + + log "Emptying S3 bucket $s3_bucket..." + aws s3 rm "s3://$s3_bucket" --recursive 2>/dev/null || true + log "Deleting S3 bucket $s3_bucket..." + aws s3api delete-bucket --bucket "$s3_bucket" --region "$REGION" 2>/dev/null || true + + echo "" + log "Teardown complete." + info "Retained DynamoDB tables (delete manually if desired):" + info " aws dynamodb delete-table --table-name mdf-submissions-$env --region $REGION" + info " aws dynamodb delete-table --table-name mdf-streams-$env --region $REGION" +} + +# --------------------------------------------------------------------------- +# Quick deploy (code-only, no infrastructure changes) +# --------------------------------------------------------------------------- +quick_deploy() { + local env=$1 + [[ -z "$env" ]] && error "Environment required: ./deploy.sh quick dev" + + local stack_name="mdf-connect-v2-$env" + + log "Quick deploying to $env (Lambda code only)..." + + # Package the code + cd "$SCRIPT_DIR" + zip -r /tmp/mdf-lambda.zip v2/ requirements.txt \ + -x "**/__pycache__/*" "**/*.pyc" "**/.pytest_cache/*" + + # Get Lambda function name from stack + local func + func=$(aws cloudformation describe-stack-resources \ + --stack-name "$stack_name" \ + --region "$REGION" \ + --query 'StackResources[?ResourceType==`AWS::Lambda::Function`].PhysicalResourceId' \ + --output text 2>/dev/null) \ + || error "Stack $stack_name not found. 
Deploy first with: ./deploy.sh $env" + + for f in $func; do + log "Updating Lambda $f..." + aws lambda update-function-code \ + --function-name "$f" \ + --region "$REGION" \ + --zip-file fileb:///tmp/mdf-lambda.zip \ + --no-cli-pager > /dev/null + done + + rm /tmp/mdf-lambda.zip + log "Quick deploy complete!" +} + +# --------------------------------------------------------------------------- +# Local development server +# --------------------------------------------------------------------------- +local_server() { + log "Starting local development server on http://127.0.0.1:8080" + + export STORE_BACKEND=sqlite + export SQLITE_PATH=/tmp/mdf_connect_v2.db + export STORAGE_BACKEND=local + export USE_MOCK_DATACITE=true + export AUTH_MODE=dev + export ALLOW_ALL_CURATORS=true + export CURATOR_GROUP_IDS= + export REQUIRED_GROUP_MEMBERSHIP= + + python3 -m v2.app.main +} + +# --------------------------------------------------------------------------- +# Status +# --------------------------------------------------------------------------- +status() { + local env=${1:-dev} + local stack_name="mdf-connect-v2-$env" + + aws cloudformation describe-stacks \ + --stack-name "$stack_name" \ + --region "$REGION" \ + --query 'Stacks[0].{Status:StackStatus,Created:CreationTime,Updated:LastUpdatedTime,Outputs:Outputs[*].{Key:OutputKey,Value:OutputValue}}' \ + --output table 2>/dev/null \ + || warn "Stack $stack_name not found" +} + +# --------------------------------------------------------------------------- +# Logs +# --------------------------------------------------------------------------- +logs() { + local env=${1:-dev} + log "Tailing logs for mdf-connect-v2-$env..." 
+ sam logs --stack-name "mdf-connect-v2-$env" --region "$REGION" --tail +} + +# --------------------------------------------------------------------------- +# Help +# --------------------------------------------------------------------------- +help() { + echo "MDF Connect v2 — Deployment" + echo "" + echo "Usage: ./deploy.sh <command> [environment]" + echo "" + echo "Commands:" + echo " dev Deploy to dev (self-contained, no Globus needed)" + echo " staging Deploy to staging (requires Globus SSM params)" + echo " prod Deploy to production (requires Globus SSM params)" + echo " quick <env> Quick deploy (code only, skips CloudFormation)" + echo " local Run local development server" + echo " status [env] Show stack status (default: dev)" + echo " logs [env] Tail Lambda logs (default: dev)" + echo " teardown <env> Remove a deployment (DynamoDB tables are retained)" + echo " build Build SAM application" + echo " help Show this help" + echo "" + echo "Dev deployment creates these AWS resources:" + echo " Stack: mdf-connect-v2-dev" + echo " Tables: mdf-submissions-dev, mdf-streams-dev" + echo " Bucket: mdf-sam-deployments-dev" + echo "" + echo "Examples:" + echo " ./deploy.sh dev # First deploy to AWS" + echo " ./deploy.sh quick dev # Push code changes only" + echo " ./deploy.sh status dev # Check deployment status" + echo " ./deploy.sh logs dev # Watch Lambda logs" + echo " ./deploy.sh teardown dev # Remove stack and bucket (tables retained)" + echo "" +} + +# --------------------------------------------------------------------------- +# Main +# --------------------------------------------------------------------------- +case "${1:-help}" in + local) + local_server + ;; + dev) + check_deps + deploy_dev + ;; + staging) + check_deps + deploy_prod staging + ;; + prod) + check_deps + deploy_prod prod + ;; + quick) + check_deps + quick_deploy "$2" + ;; + build) + check_deps + build + ;; + status) + status "$2" + ;; + logs) + logs "$2" + ;; + teardown) + check_deps + teardown "$2" + ;; + help|--help|-h) + help + ;; + *) + error "Unknown 
command: $1. Run './deploy.sh help' for usage." + ;; +esac diff --git a/aws/requirements.txt b/aws/requirements.txt index 0192ca5..b10f796 100644 --- a/aws/requirements.txt +++ b/aws/requirements.txt @@ -1,7 +1,9 @@ -jsonschema<4.0.0,>=3.2.0 -mdf-toolbox==0.6.0 -globus_automate_client==0.17.1 +fastapi>=0.100.0 +mangum>=0.17.0 +uvicorn[standard]>=0.23.0 +pydantic>=2.0 +httpx +globus-sdk>=3.0 click requests -urllib3<2 - +mdf-toolbox==0.6.0 diff --git a/aws/samconfig.toml b/aws/samconfig.toml new file mode 100644 index 0000000..53095c4 --- /dev/null +++ b/aws/samconfig.toml @@ -0,0 +1,65 @@ +# AWS SAM Configuration +# +# Usage: +# sam build && sam deploy --config-env dev # Deploy to dev +# sam build && sam deploy --config-env prod # Deploy to production +# +# First-time setup: +# sam deploy --guided # Interactive setup + +version = 0.1 + +[default.build.parameters] +use_container = false +parallel = true + +[default.validate.parameters] +lint = true + +[default.deploy.parameters] +confirm_changeset = true +capabilities = "CAPABILITY_IAM CAPABILITY_AUTO_EXPAND" +disable_rollback = false + +# =========================================================================== +# Development Environment +# =========================================================================== + +[dev] +[dev.deploy.parameters] +stack_name = "mdf-connect-v2-dev" +s3_bucket = "mdf-sam-deployments-dev" +s3_prefix = "mdf-connect-v2" +region = "us-east-1" +capabilities = "CAPABILITY_IAM CAPABILITY_AUTO_EXPAND" +parameter_overrides = "Environment=dev AuthMode=dev" +confirm_changeset = false +tags = "Project=mdf-connect Environment=dev Team=MDF" + +# =========================================================================== +# Staging Environment +# =========================================================================== + +[staging] +[staging.deploy.parameters] +stack_name = "mdf-connect-v2-staging" +s3_bucket = "mdf-sam-deployments-staging" +s3_prefix = "mdf-connect-v2" +region = "us-east-1" 
+capabilities = "CAPABILITY_IAM CAPABILITY_AUTO_EXPAND" +parameter_overrides = "Environment=staging AuthMode=production AllowAllCurators=true DataCiteUsername=Globus.TEST DataCitePassword=NTroFAzElE DataCiteApiUrl=https://api.test.datacite.org DataCitePrefix=10.23677 UseMockDatacite=false SearchIndexUUID=ab19b80b-0887-4337-b9f8-b8cc7feb1fdc TestSearchIndexUUID=ab19b80b-0887-4337-b9f8-b8cc7feb1fdc" +tags = "Project=mdf-connect Environment=staging Team=MDF" + +# =========================================================================== +# Production Environment +# =========================================================================== + +[prod] +[prod.deploy.parameters] +stack_name = "mdf-connect-v2-prod" +s3_bucket = "mdf-sam-deployments-prod" +s3_prefix = "mdf-connect-v2" +region = "us-east-1" +capabilities = "CAPABILITY_IAM CAPABILITY_AUTO_EXPAND" +parameter_overrides = "Environment=prod AuthMode=production AllowAllCurators=true DataCiteUsername=Globus.TEST DataCitePassword=NTroFAzElE DataCiteApiUrl=https://api.test.datacite.org DataCitePrefix=10.23677 UseMockDatacite=false SearchIndexUUID=ab19b80b-0887-4337-b9f8-b8cc7feb1fdc TestSearchIndexUUID=ab19b80b-0887-4337-b9f8-b8cc7feb1fdc" +tags = "Project=mdf-connect Environment=prod Team=MDF" diff --git a/aws/template.yaml b/aws/template.yaml new file mode 100644 index 0000000..6256cf7 --- /dev/null +++ b/aws/template.yaml @@ -0,0 +1,522 @@ +AWSTemplateFormatVersion: '2010-09-09' +Transform: AWS::Serverless-2016-10-31 +Description: MDF Connect v2 Backend - Single FastAPI Lambda via Mangum + +Parameters: + Environment: + Type: String + Default: dev + AllowedValues: [dev, staging, prod] + Description: Deployment environment + AuthMode: + Type: String + Default: dev + AllowedValues: [dev, production] + Description: "dev = X-User-Id headers (no Globus). production = Globus token introspection." + GlobusClientId: + Type: String + Default: "not-configured" + Description: "Globus OAuth client ID. 
Only needed when AuthMode=production." + GlobusClientSecret: + Type: String + Default: "not-configured" + NoEcho: true + Description: "Globus OAuth client secret. Only needed when AuthMode=production." + DataCiteUsername: + Type: String + Default: "not-configured" + Description: "DataCite repository ID for DOI minting." + DataCitePassword: + Type: String + Default: "not-configured" + NoEcho: true + Description: "DataCite repository password." + DataCiteApiUrl: + Type: String + Default: "https://api.test.datacite.org" + Description: "DataCite API endpoint." + DataCitePrefix: + Type: String + Default: "10.23677" + Description: "DOI prefix for DataCite minting." + SearchIndexUUID: + Type: String + Default: "not-configured" + Description: "Globus Search index UUID for v2 production." + TestSearchIndexUUID: + Type: String + Default: "not-configured" + Description: "Globus Search index UUID for v2 test/staging." + HttpApiThrottleRate: + Type: Number + Default: 100 + Description: "API Gateway steady-state request rate per second." + HttpApiThrottleBurst: + Type: Number + Default: 200 + Description: "API Gateway burst request limit." + ApiTimeoutSeconds: + Type: Number + Default: 30 + MinValue: 3 + MaxValue: 30 + Description: "API Lambda timeout in seconds (bounded to API Gateway integration timeout)." + ApiMemorySizeMb: + Type: Number + Default: 256 + MinValue: 128 + MaxValue: 10240 + Description: "API Lambda memory size in MB." + AsyncWorkerTimeoutSeconds: + Type: Number + Default: 120 + MinValue: 3 + MaxValue: 900 + Description: "Async worker Lambda timeout in seconds." + AsyncWorkerMemorySizeMb: + Type: Number + Default: 256 + MinValue: 128 + MaxValue: 10240 + Description: "Async worker Lambda memory size in MB." + ApiReservedConcurrency: + Type: Number + Default: 10 + MinValue: 1 + Description: "Reserved concurrency cap for API Lambda to control runaway spend." 
+ AsyncWorkerReservedConcurrency: + Type: Number + Default: 3 + MinValue: 1 + Description: "Reserved concurrency cap for async worker Lambda." + AsyncWorkerBatchSize: + Type: Number + Default: 10 + MinValue: 1 + MaxValue: 10 + Description: "SQS batch size for async worker event source." + AsyncWorkerBatchWindowSeconds: + Type: Number + Default: 5 + MinValue: 0 + MaxValue: 300 + Description: "SQS maximum batching window to reduce Lambda invoke count." + AsyncQueueMessageRetentionSeconds: + Type: Number + Default: 345600 + MinValue: 60 + MaxValue: 1209600 + Description: "SQS async queue message retention in seconds." + LogRetentionDays: + Type: Number + Default: 14 + AllowedValues: [1, 3, 5, 7, 14, 30, 60, 90, 120, 180, 365, 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653] + Description: "CloudWatch Logs retention period for Lambda log groups." + StorageBackend: + Type: String + Default: "auto" + AllowedValues: ["auto", "local", "s3", "globus"] + Description: "File storage backend. 'auto' selects local for dev, globus for prod." + UseMockDatacite: + Type: String + Default: "auto" + AllowedValues: ["auto", "true", "false"] + Description: "Mock DataCite for DOI minting. 'auto' = true for dev, false for prod." + AllowAllCurators: + Type: String + Default: "auto" + AllowedValues: ["auto", "true", "false"] + Description: "Allow all users to curate. 'auto' = true for dev, false for prod." + CuratorGroupIds: + Type: String + Default: "3ce2c53e-3752-11e8-891c-0e00fd09bf20" + Description: "Comma-separated Globus group UUIDs whose members may curate." + RequiredGroupMembership: + Type: String + Default: "cc192dca-3751-11e8-90c1-0a7c735d220a" + Description: "Globus group UUID required to submit datasets." + SesFromEmail: + Type: String + Default: "" + Description: "SES-verified sender address for email notifications. Leave blank to disable emails." 
+ CuratorEmails: + Type: String + Default: "" + Description: "Comma-separated curator email addresses for new-submission alerts." + PortalUrl: + Type: String + Default: "https://www.materialsdatafacility.org" + Description: "Public dataset portal base URL." + CurationPortalUrl: + Type: String + Default: "https://www.materialsdatafacility.org/curation" + Description: "Curation review page URL (curators are redirected here)." + EnableEmails: + Type: String + Default: "false" + AllowedValues: ["true", "false"] + Description: "Set to 'true' to enable SES email notifications. Keep 'false' on staging until SES is configured in the production account." + +Conditions: + IsProd: !Equals [!Ref Environment, prod] + IsDevAuth: !Equals [!Ref AuthMode, dev] + UseAutoStorage: !Equals [!Ref StorageBackend, "auto"] + UseAutoMockDatacite: !Equals [!Ref UseMockDatacite, "auto"] + UseAutoAllowCurators: !Equals [!Ref AllowAllCurators, "auto"] + CreateStreamFilesBucket: !Not [!Equals [!Ref AuthMode, dev]] + EmailsEnabled: !Equals [!Ref EnableEmails, "true"] + +Globals: + Function: + Runtime: python3.12 + Timeout: 30 + MemorySize: 256 + +Resources: + # =========================================================================== + # Single Lambda Function (FastAPI + Mangum) + # =========================================================================== + + ApiFunction: + Type: AWS::Serverless::Function + Properties: + Handler: v2.app.main.handler + CodeUri: . 
+ Description: MDF Connect v2 API (FastAPI) + Timeout: !Ref ApiTimeoutSeconds + MemorySize: !Ref ApiMemorySizeMb + ReservedConcurrentExecutions: !Ref ApiReservedConcurrency + Environment: + Variables: + ENVIRONMENT: !Ref Environment + STORE_BACKEND: dynamo + ASYNC_DISPATCH_MODE: !If [IsDevAuth, "inline", "sqs"] + ASYNC_QUEUE_URL: !Ref AsyncJobsQueue + AUTH_MODE: !Ref AuthMode + DYNAMO_SUBMISSIONS_TABLE: !Ref SubmissionsTable + DYNAMO_STREAMS_TABLE: !Ref StreamsTable + STORAGE_BACKEND: !If + - UseAutoStorage + - !If [IsDevAuth, "local", "globus"] + - !Ref StorageBackend + GLOBUS_BASE_PATH: !If [IsDevAuth, "/tmp/testing", !Sub "/tmp/${Environment}"] + S3_BUCKET: !If [IsDevAuth, "", !Sub "mdf-stream-files-${Environment}"] + S3_PREFIX: "streams/" + GLOBUS_CLIENT_ID: !Ref GlobusClientId + GLOBUS_CLIENT_SECRET: !Ref GlobusClientSecret + USE_MOCK_DATACITE: !If + - UseAutoMockDatacite + - !If [IsDevAuth, "true", "false"] + - !Ref UseMockDatacite + DATACITE_USERNAME: !Ref DataCiteUsername + DATACITE_PASSWORD: !Ref DataCitePassword + DATACITE_API_URL: !Ref DataCiteApiUrl + DATACITE_PREFIX: !Ref DataCitePrefix + SEARCH_INDEX_UUID: !Ref SearchIndexUUID + TEST_SEARCH_INDEX_UUID: !Ref TestSearchIndexUUID + USE_MOCK_SEARCH: !If [IsDevAuth, "true", "false"] + ALLOW_ALL_CURATORS: !If + - UseAutoAllowCurators + - !If [IsDevAuth, "true", "false"] + - !Ref AllowAllCurators + CURATOR_GROUP_IDS: !If [IsDevAuth, "", !Ref CuratorGroupIds] + REQUIRED_GROUP_MEMBERSHIP: !If [IsDevAuth, "", !Ref RequiredGroupMembership] + CORS_ALLOWED_ORIGINS: !If + - IsProd + - "https://materialsdatafacility.org,https://app.materialsdatafacility.org" + - "*" + LOG_LEVEL: !If [IsProd, INFO, DEBUG] + SEARCH_MAX_RESULTS: "50" + SEARCH_MAX_DATASET_SCAN: "1000" + SEARCH_MAX_STREAM_SCAN: "2000" + SES_FROM_EMAIL: !If [EmailsEnabled, !Ref SesFromEmail, ""] + CURATOR_EMAILS: !If [EmailsEnabled, !Ref CuratorEmails, ""] + PORTAL_URL: !Ref PortalUrl + CURATION_PORTAL_URL: !Ref CurationPortalUrl + SES_REGION: !Ref 
AWS::Region + Events: + CatchAll: + Type: HttpApi + Properties: + ApiId: !Ref HttpApi + Path: /{proxy+} + Method: ANY + Root: + Type: HttpApi + Properties: + ApiId: !Ref HttpApi + Path: / + Method: ANY + Policies: + - DynamoDBCrudPolicy: + TableName: !Ref SubmissionsTable + - DynamoDBCrudPolicy: + TableName: !Ref StreamsTable + - S3CrudPolicy: + BucketName: !Sub "mdf-stream-files-${Environment}" + - Statement: + Effect: Allow + Action: + - sqs:SendMessage + Resource: !GetAtt AsyncJobsQueue.Arn + - Statement: + Effect: Allow + Action: + - ses:SendEmail + Resource: "*" + + AsyncWorkerFunction: + Type: AWS::Serverless::Function + Properties: + Handler: v2.async_worker.lambda_handler + CodeUri: . + Description: MDF Connect v2 async job worker + Timeout: !Ref AsyncWorkerTimeoutSeconds + MemorySize: !Ref AsyncWorkerMemorySizeMb + ReservedConcurrentExecutions: !Ref AsyncWorkerReservedConcurrency + Environment: + Variables: + ENVIRONMENT: !Ref Environment + STORE_BACKEND: dynamo + ASYNC_DISPATCH_MODE: inline + AUTH_MODE: !Ref AuthMode + DYNAMO_SUBMISSIONS_TABLE: !Ref SubmissionsTable + DYNAMO_STREAMS_TABLE: !Ref StreamsTable + STORAGE_BACKEND: !If + - UseAutoStorage + - !If [IsDevAuth, "local", "globus"] + - !Ref StorageBackend + GLOBUS_BASE_PATH: !If [IsDevAuth, "/tmp/testing", !Sub "/tmp/${Environment}"] + S3_BUCKET: !If [IsDevAuth, "", !Sub "mdf-stream-files-${Environment}"] + S3_PREFIX: "streams/" + GLOBUS_CLIENT_ID: !Ref GlobusClientId + GLOBUS_CLIENT_SECRET: !Ref GlobusClientSecret + USE_MOCK_DATACITE: !If + - UseAutoMockDatacite + - !If [IsDevAuth, "true", "false"] + - !Ref UseMockDatacite + DATACITE_USERNAME: !Ref DataCiteUsername + DATACITE_PASSWORD: !Ref DataCitePassword + DATACITE_API_URL: !Ref DataCiteApiUrl + DATACITE_PREFIX: !Ref DataCitePrefix + SEARCH_INDEX_UUID: !Ref SearchIndexUUID + TEST_SEARCH_INDEX_UUID: !Ref TestSearchIndexUUID + USE_MOCK_SEARCH: !If [IsDevAuth, "true", "false"] + ALLOW_ALL_CURATORS: !If + - UseAutoAllowCurators + - !If [IsDevAuth, 
"true", "false"] + - !Ref AllowAllCurators + CURATOR_GROUP_IDS: !If [IsDevAuth, "", !Ref CuratorGroupIds] + REQUIRED_GROUP_MEMBERSHIP: !If [IsDevAuth, "", !Ref RequiredGroupMembership] + LOG_LEVEL: !If [IsProd, INFO, DEBUG] + SES_FROM_EMAIL: !If [EmailsEnabled, !Ref SesFromEmail, ""] + CURATOR_EMAILS: !If [EmailsEnabled, !Ref CuratorEmails, ""] + PORTAL_URL: !Ref PortalUrl + CURATION_PORTAL_URL: !Ref CurationPortalUrl + SES_REGION: !Ref AWS::Region + Events: + AsyncQueueEvent: + Type: SQS + Properties: + Queue: !GetAtt AsyncJobsQueue.Arn + BatchSize: !Ref AsyncWorkerBatchSize + MaximumBatchingWindowInSeconds: !Ref AsyncWorkerBatchWindowSeconds + TransferCleanupSchedule: + Type: Schedule + Properties: + Schedule: rate(6 hours) + Description: "Periodic cleanup of stale Globus transfers and ACL rules" + Enabled: true + Policies: + - DynamoDBCrudPolicy: + TableName: !Ref SubmissionsTable + - DynamoDBCrudPolicy: + TableName: !Ref StreamsTable + - S3CrudPolicy: + BucketName: !Sub "mdf-stream-files-${Environment}" + - SQSPollerPolicy: + QueueName: !GetAtt AsyncJobsQueue.QueueName + - Statement: + Effect: Allow + Action: + - ses:SendEmail + Resource: "*" + + # =========================================================================== + # HTTP API (v2 - cheaper than REST API v1) + # =========================================================================== + + HttpApi: + Type: AWS::Serverless::HttpApi + Properties: + StageName: !Ref Environment + Description: MDF Connect v2 API + DefaultRouteSettings: + ThrottlingRateLimit: !Ref HttpApiThrottleRate + ThrottlingBurstLimit: !Ref HttpApiThrottleBurst + CorsConfiguration: + AllowOrigins: !If + - IsProd + - - "https://materialsdatafacility.org" + - "https://app.materialsdatafacility.org" + - - "*" + AllowMethods: + - GET + - POST + - OPTIONS + AllowHeaders: + - Content-Type + - Authorization + - X-User-Id + - X-User-Email + - X-User-Name + - X-Globus-Token + AllowCredentials: !If [IsProd, true, false] + + # 
=========================================================================== + # S3 Bucket for Stream File Storage + # =========================================================================== + + StreamFilesBucket: + Type: AWS::S3::Bucket + Condition: CreateStreamFilesBucket + Properties: + BucketName: !Sub "mdf-stream-files-${Environment}" + PublicAccessBlockConfiguration: + BlockPublicAcls: true + IgnorePublicAcls: true + BlockPublicPolicy: true + RestrictPublicBuckets: true + + # =========================================================================== + # SQS Queues, Log Groups, and DynamoDB Tables + # =========================================================================== + + AsyncJobsDLQ: + Type: AWS::SQS::Queue + Properties: + QueueName: !Sub "mdf-async-jobs-dlq-${Environment}" + MessageRetentionPeriod: 1209600 + + AsyncJobsQueue: + Type: AWS::SQS::Queue + Properties: + QueueName: !Sub "mdf-async-jobs-${Environment}" + VisibilityTimeout: 180 + MessageRetentionPeriod: !Ref AsyncQueueMessageRetentionSeconds + RedrivePolicy: + deadLetterTargetArn: !GetAtt AsyncJobsDLQ.Arn + maxReceiveCount: 3 + + ApiFunctionLogGroup: + Type: AWS::Logs::LogGroup + Properties: + LogGroupName: !Sub "/aws/lambda/${ApiFunction}" + RetentionInDays: !Ref LogRetentionDays + + AsyncWorkerFunctionLogGroup: + Type: AWS::Logs::LogGroup + Properties: + LogGroupName: !Sub "/aws/lambda/${AsyncWorkerFunction}" + RetentionInDays: !Ref LogRetentionDays + + SubmissionsTable: + Type: AWS::DynamoDB::Table + DeletionPolicy: Retain + UpdateReplacePolicy: Retain + Properties: + TableName: !Sub "mdf-submissions-${Environment}" + BillingMode: PAY_PER_REQUEST + AttributeDefinitions: + - AttributeName: source_id + AttributeType: S + - AttributeName: version + AttributeType: S + - AttributeName: user_id + AttributeType: S + - AttributeName: organization + AttributeType: S + - AttributeName: updated_at + AttributeType: S + - AttributeName: status + AttributeType: S + KeySchema: + - AttributeName: source_id + KeyType: HASH + - 
AttributeName: version + KeyType: RANGE + GlobalSecondaryIndexes: + - IndexName: user-submissions + KeySchema: + - AttributeName: user_id + KeyType: HASH + - AttributeName: updated_at + KeyType: RANGE + Projection: + ProjectionType: ALL + - IndexName: org-submissions + KeySchema: + - AttributeName: organization + KeyType: HASH + - AttributeName: source_id + KeyType: RANGE + Projection: + ProjectionType: ALL + - IndexName: status-submissions + KeySchema: + - AttributeName: status + KeyType: HASH + - AttributeName: updated_at + KeyType: RANGE + Projection: + ProjectionType: ALL + + StreamsTable: + Type: AWS::DynamoDB::Table + DeletionPolicy: Retain + UpdateReplacePolicy: Retain + Properties: + TableName: !Sub "mdf-streams-${Environment}" + BillingMode: PAY_PER_REQUEST + AttributeDefinitions: + - AttributeName: stream_id + AttributeType: S + - AttributeName: user_id + AttributeType: S + - AttributeName: updated_at + AttributeType: S + KeySchema: + - AttributeName: stream_id + KeyType: HASH + GlobalSecondaryIndexes: + - IndexName: user-streams + KeySchema: + - AttributeName: user_id + KeyType: HASH + - AttributeName: updated_at + KeyType: RANGE + Projection: + ProjectionType: ALL + +# =========================================================================== +# Outputs +# =========================================================================== + +Outputs: + ApiUrl: + Description: HTTP API URL + Value: !Sub "https://${HttpApi}.execute-api.${AWS::Region}.amazonaws.com/${Environment}" + Export: + Name: !Sub "${AWS::StackName}-ApiUrl" + + SubmissionsTableName: + Description: DynamoDB table for submissions + Value: !Ref SubmissionsTable + + StreamsTableName: + Description: DynamoDB table for streams + Value: !Ref StreamsTable + + AsyncJobsQueueUrl: + Description: SQS queue URL for async jobs + Value: !Ref AsyncJobsQueue diff --git a/aws/test_curation.py b/aws/test_curation.py new file mode 100644 index 0000000..d64ab89 --- /dev/null +++ b/aws/test_curation.py @@ -0,0 
+1,211 @@ +#!/usr/bin/env python3 +"""Test the curation workflow for MDF v2. + +This script tests: +1. Creating a submission pending curation +2. Listing pending submissions +3. Viewing submission details +4. Approving with DOI minting +5. Rejecting with reason +""" + +import json +import os +import sys +from datetime import datetime, timezone + +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + +import requests +from rich.console import Console +from rich.panel import Panel +from rich.table import Table + +console = Console() + +BASE_URL = "http://localhost:8080" + + +def create_test_submission(source_id: str, title: str, status: str = "pending_curation"): + """Create a test submission directly in the database.""" + from v2.store import get_store + + store = get_store() + now = datetime.now(timezone.utc).isoformat() + + record = { + "source_id": source_id, + "version": "1.0", + "versioned_source_id": f"{source_id}_v1.0", + "user_id": "test-user-123", + "user_email": "researcher@example.edu", + "organization": "Test University", + "status": status, + "dataset_mdata": json.dumps({ + "dc": { + "titles": [{"title": title}], + "creators": [ + {"name": "Jane Researcher", "affiliation": "Test University"}, + {"name": "John Scientist", "affiliation": "Research Lab"} + ], + "publisher": "Materials Data Facility", + "publicationYear": 2026, + "descriptions": [{"description": "A test dataset for curation workflow", "descriptionType": "Abstract"}], + "subjects": [{"subject": "materials science"}, {"subject": "DFT"}] + }, + "mdf": { + "source_id": source_id, + "versioned_source_id": f"{source_id}_v1.0", + "organization": "Test University" + } + }), + "test": 1, + "created_at": now, + "updated_at": now, + } + + store.put_submission(record) + return record + + +def test_curation_workflow(): + """Test the full curation workflow.""" + + console.print(Panel.fit("[bold cyan]MDF v2 Curation Workflow Test[/bold cyan]")) + + # 1. 
Create test submissions + console.print("\n[bold]1. Creating test submissions...[/bold]") + + # Default to the local SQLite store unless already configured + os.environ.setdefault("STORE_BACKEND", "sqlite") + os.environ.setdefault("SQLITE_PATH", "/tmp/mdf_connect_v2.db") + + submissions = [ + create_test_submission("test_fe_al_dft", "Iron-Aluminum DFT Calculations"), + create_test_submission("test_perovskite_xrd", "Perovskite XRD Measurements"), + create_test_submission("test_polymer_md", "Polymer MD Simulations"), + ] + + for sub in submissions: + console.print(f" Created: [green]{sub['source_id']}[/green] - {sub['status']}") + + # 2. List pending submissions + console.print("\n[bold]2. Listing pending submissions...[/bold]") + response = requests.get(f"{BASE_URL}/curation/pending") + + if response.status_code != 200: + console.print(f"[red]Failed: {response.text}[/red]") + return + + data = response.json() + console.print(f" Found [cyan]{data.get('pending_count', 0)}[/cyan] pending submissions") + + if data.get("submissions"): + table = Table(title="Pending Curation") + table.add_column("Source ID", style="cyan") + table.add_column("Title") + table.add_column("Organization") + table.add_column("Submitted") + + for sub in data["submissions"]: + table.add_row( + sub.get("source_id", ""), + sub.get("title", "")[:40], + sub.get("organization", ""), + sub.get("submitted_at", "")[:19] + ) + console.print(table) + + # 3. Get details for one submission + console.print("\n[bold]3. Getting submission details...[/bold]") + response = requests.get(f"{BASE_URL}/curation/test_fe_al_dft") + + if response.status_code == 200: + data = response.json() + sub = data.get("submission", {}) + console.print(f" Source ID: [cyan]{sub.get('source_id')}[/cyan]") + console.print(f" Status: [yellow]{sub.get('status')}[/yellow]") + console.print(f" Can Approve: {data.get('can_approve')}") + console.print(f" Can Reject: {data.get('can_reject')}") + else: + console.print(f"[red]Failed: {response.text}[/red]") + + # 4. 
Approve one submission + console.print("\n[bold]4. Approving submission with DOI...[/bold]") + response = requests.post(f"{BASE_URL}/curation/test_fe_al_dft/approve", json={ + "notes": "Looks good! Metadata is complete.", + "mint_doi": True, + }) + + if response.status_code == 200: + data = response.json() + console.print(f" Status: [green]{data.get('status')}[/green]") + console.print(f" Approved by: {data.get('approved_by')}") + + if data.get("doi"): + doi_info = data["doi"] + console.print(f" DOI Success: {doi_info.get('success')}") + console.print(f" DOI: [cyan]{doi_info.get('doi')}[/cyan]") + else: + console.print(f"[red]Failed: {response.text}[/red]") + + # 5. Reject another submission + console.print("\n[bold]5. Rejecting submission...[/bold]") + response = requests.post(f"{BASE_URL}/curation/test_polymer_md/reject", json={ + "reason": "Missing required metadata: authors need ORCID identifiers", + "suggestions": "Please add ORCID IDs for all authors and resubmit", + }) + + if response.status_code == 200: + data = response.json() + console.print(f" Status: [red]{data.get('status')}[/red]") + console.print(f" Rejected by: {data.get('rejected_by')}") + console.print(f" Reason: {data.get('reason')}") + else: + console.print(f"[red]Failed: {response.text}[/red]") + + # 6. List pending again (should be fewer) + console.print("\n[bold]6. Checking remaining pending submissions...[/bold]") + response = requests.get(f"{BASE_URL}/curation/pending") + + if response.status_code == 200: + data = response.json() + console.print(f" Remaining pending: [cyan]{data.get('pending_count', 0)}[/cyan]") + else: + console.print(f"[red]Failed: {response.text}[/red]") + + # 7. Verify approved submission has DOI + console.print("\n[bold]7. 
Verifying approved submission...[/bold]") + response = requests.get(f"{BASE_URL}/curation/test_fe_al_dft") + + if response.status_code == 200: + data = response.json() + sub = data.get("submission", {}) + console.print(f" Status: [green]{sub.get('status')}[/green]") + console.print(f" DOI: [cyan]{sub.get('doi', 'N/A')}[/cyan]") + console.print(f" Published At: {sub.get('published_at', 'N/A')}") + + history = data.get("curation_history", []) + if history: + console.print(f" Curation History: {len(history)} action(s)") + for h in history: + console.print(f" - {h.get('action')} by {h.get('curator_id')} at {h.get('timestamp', '')[:19]}") + else: + console.print(f"[red]Failed: {response.text}[/red]") + + console.print("\n[bold green]Curation workflow test complete![/bold green]") + + +if __name__ == "__main__": + # Check if server is running + try: + requests.get(f"{BASE_URL}/health", timeout=2) + except requests.exceptions.ConnectionError: + console.print("[yellow]Local server not running.[/yellow]") + console.print("Run: [cyan]./deploy.sh local[/cyan]") + sys.exit(1) + + # Allow all users to curate (affects this process only; the local server needs ALLOW_ALL_CURATORS=true too) + os.environ["ALLOW_ALL_CURATORS"] = "true" + + test_curation_workflow() diff --git a/aws/test_doi_minting.py b/aws/test_doi_minting.py new file mode 100644 index 0000000..5029e0b --- /dev/null +++ b/aws/test_doi_minting.py @@ -0,0 +1,165 @@ +#!/usr/bin/env python3 +"""Test DOI minting with the MDF v2 backend. + +This script tests: +1. Creating a stream +2. Appending files +3. 
Closing with DOI minting +""" + +import json +import os +import sys +import time + +# Add project root to path +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + +import requests +from rich.console import Console +from rich.panel import Panel +from rich.table import Table + +console = Console() + +BASE_URL = "http://localhost:8080" + + +def test_doi_minting(): + """Test the full stream lifecycle with DOI minting.""" + + console.print(Panel.fit("[bold cyan]MDF v2 DOI Minting Test[/bold cyan]")) + + # 1. Create a stream + console.print("\n[bold]1. Creating stream...[/bold]") + response = requests.post(f"{BASE_URL}/stream/create", json={ + "title": "Test Dataset for DOI Minting", + "lab_id": "test-lab", + "metadata": { + "description": "A test dataset to verify DOI minting works correctly", + "authors": [ + {"given_name": "Jane", "family_name": "Doe", "affiliation": "Test University"}, + {"given_name": "John", "family_name": "Smith"} + ], + "keywords": ["test", "materials", "simulation"] + } + }) + + if response.status_code != 200: + console.print(f"[red]Failed to create stream: {response.text}[/red]") + return + + data = response.json() + stream_id = data["stream_id"] + console.print(f" Stream created: [green]{stream_id}[/green]") + + # 2. Upload some test files + console.print("\n[bold]2. 
Uploading test files...[/bold]") + + import base64 + + test_files = [ + ("data.csv", "element,energy,bandgap\nFe,100.5,2.1\nAl,50.2,1.8\nCu,75.3,0.0", "text/csv"), + ("parameters.json", json.dumps({"method": "DFT", "basis": "PBE", "cutoff": 500}), "application/json"), + ("notes.txt", "Experimental notes for the test dataset.\nAll calculations converged successfully.", "text/plain"), + ] + + for filename, content, content_type in test_files: + response = requests.post( + f"{BASE_URL}/stream/{stream_id}/upload", + json={ + "filename": filename, + "content_base64": base64.b64encode(content.encode()).decode(), + "content_type": content_type, + }, + ) + if response.status_code == 200: + console.print(f" Uploaded: [green]{filename}[/green]") + else: + console.print(f" [red]Failed to upload {filename}: {response.text}[/red]") + + # 3. Check stream status + console.print("\n[bold]3. Stream status before close...[/bold]") + response = requests.get(f"{BASE_URL}/stream/{stream_id}") + if response.status_code == 200: + data = response.json() + stream = data.get("stream", data) # Handle both wrapped and unwrapped responses + table = Table(show_header=False, box=None) + table.add_column("Field", style="cyan") + table.add_column("Value") + table.add_row("Status", stream.get("status", "unknown")) + table.add_row("File Count", str(stream.get("file_count", 0))) + table.add_row("Total Bytes", str(stream.get("total_bytes", 0))) + console.print(table) + + # 4. Close with DOI minting + console.print("\n[bold]4. 
Closing stream with DOI minting...[/bold]") + response = requests.post(f"{BASE_URL}/stream/{stream_id}/close", json={ + "mint_doi": True, + "title": "Test Materials Dataset v1.0", + "description": "DFT calculations for Fe, Al, and Cu with band gap analysis", + "authors": [ + {"given_name": "Jane", "family_name": "Doe", "affiliation": "Test University"}, + {"given_name": "John", "family_name": "Smith", "affiliation": "Research Lab"} + ], + "keywords": ["DFT", "band gap", "materials science"], + "license": "CC-BY-4.0" + }) + + if response.status_code != 200: + console.print(f"[red]Failed to close stream: {response.text}[/red]") + return + + result = response.json() + + console.print("\n[bold green]Stream closed successfully![/bold green]") + + # Display DOI result + if "doi" in result: + doi_info = result["doi"] + console.print(Panel.fit( + f"[bold]DOI Minting Result[/bold]\n\n" + f"Success: [green]{doi_info.get('success', False)}[/green]\n" + f"DOI: [cyan]{doi_info.get('doi', 'N/A')}[/cyan]\n" + f"URL: {doi_info.get('url', 'N/A')}\n" + f"State: {doi_info.get('state', 'N/A')}\n" + f"Mock: {doi_info.get('mock', False)}", + title="DOI Info" + )) + + # 5. Verify stream metadata was updated + console.print("\n[bold]5. 
Verifying stream metadata update...[/bold]") + response = requests.get(f"{BASE_URL}/stream/{stream_id}") + if response.status_code == 200: + data = response.json() + stream = data.get("stream", data) # Handle both wrapped and unwrapped responses + metadata = stream.get("metadata", {}) + if isinstance(metadata, str): + try: + metadata = json.loads(metadata) + except Exception: + metadata = {} + + table = Table(show_header=False, box=None) + table.add_column("Field", style="cyan") + table.add_column("Value") + table.add_row("Status", stream.get("status", "unknown")) + table.add_row("DOI", str(metadata.get("doi", "N/A"))) + table.add_row("Published At", str(metadata.get("published_at", "N/A"))) + console.print(table) + + console.print("\n[bold green]DOI minting test complete![/bold green]") + return result + + +if __name__ == "__main__": + # Check if server is running + try: + requests.get(f"{BASE_URL}/health", timeout=2) + except requests.exceptions.ConnectionError: + console.print("[yellow]Local server not running.[/yellow]") + console.print("Run: [cyan]./deploy.sh local[/cyan]") + console.print("Then run this test again.") + sys.exit(1) + + test_doi_minting() diff --git a/aws/test_globus_upload.py b/aws/test_globus_upload.py new file mode 100644 index 0000000..b153927 --- /dev/null +++ b/aws/test_globus_upload.py @@ -0,0 +1,160 @@ +#!/usr/bin/env python3 +"""Test Globus HTTPS upload to MDF endpoint. + +Uses Globus native app auth to get a token for the NCSA HTTPS endpoint, +then uploads a test file. 
+ +Run interactively: + python test_globus_upload.py +""" + +import json +import os +import sys +from datetime import datetime + +import httpx + +try: + from globus_sdk import NativeAppAuthClient +except ImportError: + print("Please install globus-sdk: pip install globus-sdk") + sys.exit(1) + + +# MDF's registered native app client ID +MDF_CLIENT_ID = "984464e2-90ab-433d-8145-ac0215d26c8e" + +# NCSA endpoint UUID (from Foundry code) +NCSA_ENDPOINT_UUID = "82f1b5c6-6e9b-11e5-ba47-22000b92c6ec" + +# Scope for HTTPS access to NCSA endpoint +NCSA_HTTPS_SCOPE = f"https://auth.globus.org/scopes/{NCSA_ENDPOINT_UUID}/https" + +# Token storage location +TOKEN_FILE = os.path.expanduser("~/.mdf/v2_https_tokens.json") + + +def load_tokens(): + """Load cached tokens if they exist.""" + if os.path.exists(TOKEN_FILE): + with open(TOKEN_FILE) as f: + return json.load(f) + return None + + +def save_tokens(tokens): + """Save tokens to cache file.""" + os.makedirs(os.path.dirname(TOKEN_FILE), exist_ok=True) + with open(TOKEN_FILE, "w") as f: + json.dump(tokens, f, indent=2) + print(f"Tokens saved to {TOKEN_FILE}") + + +def get_tokens(): + """Return a Globus access token for the NCSA HTTPS endpoint, reusing a cached token (without an expiry check) if present.""" + + # Try cached tokens first + cached = load_tokens() + if cached and cached.get("access_token"): + print(f"Using cached tokens from {TOKEN_FILE}") + return cached["access_token"] + + # Need to authenticate + print("\nStarting Globus authentication...") + print(f"Scope: {NCSA_HTTPS_SCOPE}\n") + + auth_client = NativeAppAuthClient(MDF_CLIENT_ID) + auth_client.oauth2_start_flow( + requested_scopes=[NCSA_HTTPS_SCOPE], + refresh_tokens=True, + ) + + authorize_url = auth_client.oauth2_get_authorize_url() + print(f"Please visit this URL:\n{authorize_url}\n") + + auth_code = input("Enter the authorization code: ").strip() + + # Exchange code for tokens + token_response = auth_client.oauth2_exchange_code_for_tokens(auth_code) + + # Get the HTTPS token for our endpoint + https_tokens = 
token_response.by_resource_server.get(NCSA_ENDPOINT_UUID) + if not https_tokens: + print(f"Available resource servers: {list(token_response.by_resource_server.keys())}") + raise RuntimeError(f"No token received for {NCSA_ENDPOINT_UUID}") + + # Cache tokens + save_tokens(dict(https_tokens)) + + return https_tokens["access_token"] + + +def upload_test_file(access_token: str): + """Upload a test file to MDF endpoint via HTTPS.""" + + # Test file content + timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S") + content = f"Hello from MDF v2 backend test - {timestamp}\n" + + # Upload URL - note: endpoint uses the short hostname format + base_url = "https://data.materialsdatafacility.org" + upload_path = "/tmp/testing/mdf_v2_test.txt" + url = f"{base_url}{upload_path}" + + print(f"\nUploading to: {url}") + print(f"Content: {content.strip()}") + + headers = { + "Authorization": f"Bearer {access_token}", + "Content-Type": "text/plain", + } + + # Use httpx with redirect following + with httpx.Client(follow_redirects=True, timeout=30.0) as client: + response = client.put(url, content=content.encode(), headers=headers) + + print(f"\nResponse status: {response.status_code}") + print(f"Response headers: {dict(response.headers)}") + + if response.status_code in (200, 201, 204): + print("\n✓ Upload successful!") + + # Try to read it back + print("\nReading file back...") + get_response = client.get(url, headers=headers) + print(f"GET status: {get_response.status_code}") + if get_response.status_code == 200: + print(f"Content: {get_response.text}") + else: + print(f"\n✗ Upload failed: {response.text}") + return False + + return True + + +def main(): + print("=" * 60) + print("MDF v2 Globus HTTPS Upload Test") + print("=" * 60) + + # Get authentication token + access_token = get_tokens() + print(f"\nGot access token: {access_token[:20]}...") + + # Upload test file + success = upload_test_file(access_token) + + if success: + print("\n" + "=" * 60) + print("Test completed 
successfully!") + print("=" * 60) + else: + print("\n" + "=" * 60) + print("Test failed - see errors above") + print("=" * 60) + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/aws/test_preview_clone.py b/aws/test_preview_clone.py new file mode 100644 index 0000000..ae240f1 --- /dev/null +++ b/aws/test_preview_clone.py @@ -0,0 +1,271 @@ +#!/usr/bin/env python3 +""" +Test dataset preview and cloning from Globus. + +This script: +1. Creates a stream with test files on Globus +2. Tests preview functionality +3. Clones the files back from Globus to local + +Run with: python test_preview_clone.py +""" + +import json +import os +import sys +import tempfile +import uuid +from datetime import datetime, timezone +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) + +os.environ.setdefault("STORE_BACKEND", "sqlite") +os.environ.setdefault("SQLITE_PATH", "/tmp/mdf_preview_test.db") +os.environ.setdefault("STORAGE_BACKEND", "globus") + +from rich.console import Console +from rich.panel import Panel +from rich.table import Table +from rich.syntax import Syntax + +console = Console() + + +def main(): + console.print(Panel.fit( + "[bold blue]MDF v2 Preview & Clone Test[/bold blue]", + border_style="blue", + )) + console.print() + + # Check for Globus token + from v2.storage.globus_https import load_cached_token + token = load_cached_token() + if not token: + console.print("[red]No Globus token found. Run test_globus_upload.py first.[/red]") + return + + # ========================================================================= + # Step 1: Create a stream and upload test files + # ========================================================================= + console.print("[bold cyan]1. 
Creating stream with test files on Globus...[/bold cyan]") + + from v2.stream_store import get_stream_store + from v2.storage import get_storage_backend, reset_storage_backend + + # Force Globus backend + reset_storage_backend() + storage = get_storage_backend() + + store = get_stream_store() + now = datetime.now(timezone.utc).isoformat() + stream_id = f"preview-test-{uuid.uuid4().hex[:8]}" + + store.create_stream({ + "stream_id": stream_id, + "title": "Preview & Clone Test Stream", + "status": "open", + "file_count": 0, + "total_bytes": 0, + "created_at": now, + "updated_at": now, + "user_id": "test-user", + }) + + # Create test files with various types + test_files = [ + { + "filename": "experiment_data.csv", + "content": b"""sample_id,temperature_k,pressure_mpa,yield_percent,notes +1,300,0.1,85.2,baseline +2,350,0.5,91.7,optimal +3,400,1.0,78.3,degradation observed +4,450,2.0,95.1,high pressure test +5,500,2.5,62.4,thermal decomposition +""", + "content_type": "text/csv", + }, + { + "filename": "synthesis_params.json", + "content": json.dumps({ + "experiment_id": "synth-2026-001", + "catalyst": {"type": "Pt/Al2O3", "loading_wt_percent": 5.0}, + "conditions": { + "temperature_range": [300, 500], + "pressure_range": [0.1, 2.5], + "duration_hours": 4 + }, + "results": [ + {"sample": 1, "phase": "cubic", "crystallinity": 0.92}, + {"sample": 2, "phase": "tetragonal", "crystallinity": 0.88}, + ] + }, indent=2).encode(), + "content_type": "application/json", + }, + { + "filename": "notes.txt", + "content": b"""Experiment Log - 2026-01-31 + +Sample preparation began at 09:00. +Reactor stabilized by 09:30. 
+ +Key observations: +- Sample 2 showed unexpected color change at 350K +- Pressure fluctuation detected at 14:22 +- All samples collected successfully + +Next steps: +- XRD analysis pending +- Send samples for TEM imaging +""", + "content_type": "text/plain", + }, + ] + + uploaded_files = [] + for f in test_files: + meta = storage.store_file( + stream_id=stream_id, + filename=f["filename"], + content=f["content"], + content_type=f["content_type"], + ) + uploaded_files.append(meta) + console.print(f" Uploaded: {meta.filename} ({meta.size_bytes} bytes)") + console.print(f" URL: {meta.download_url}") + + # Update stream stats + store.append_stream( + stream_id=stream_id, + file_count=len(test_files), + total_bytes=sum(len(f["content"]) for f in test_files), + ) + + console.print(f"\n [green]Stream created: {stream_id}[/green]") + console.print() + + # ========================================================================= + # Step 2: Test preview functionality + # ========================================================================= + console.print("[bold cyan]2. 
Testing file previews...[/bold cyan]") + + from v2.preview import generate_preview + + for f in test_files: + console.print(f"\n [bold]{f['filename']}[/bold]") + preview = generate_preview(f["content"], f["filename"]) + + if preview["type"] == "csv": + console.print(f" Type: CSV ({preview['total_rows']} rows)") + console.print(f" Columns: {', '.join(preview['headers'])}") + + # Show column stats + for col in preview["columns"][:3]: + if col["type"] == "numeric": + console.print(f" {col['name']}: numeric, range [{col.get('min', 'N/A')}, {col.get('max', 'N/A')}]") + else: + console.print(f" {col['name']}: string, {col.get('unique_count', 'N/A')} unique values") + + # Show preview rows + console.print(" Preview:") + for row in preview["rows"][:3]: + console.print(f" {row}") + + elif preview["type"] == "json": + console.print(f" Type: JSON") + console.print(f" Top-level keys: {preview.get('top_level_keys', [])}") + console.print(f" Structure preview:") + console.print(Syntax(json.dumps(preview["structure"], indent=2)[:500], "json", theme="monokai")) + + elif preview["type"] == "text": + console.print(f" Type: Text ({preview['total_lines']} lines)") + console.print(" Preview:") + for line in preview["lines"][:5]: + console.print(f" {line}") + + console.print() + + # ========================================================================= + # Step 3: Test cloning from Globus + # ========================================================================= + console.print("[bold cyan]3. 
Cloning stream from Globus to local...[/bold cyan]") + + from v2.clone import clone_stream + + # Create a temp directory for cloning + with tempfile.TemporaryDirectory() as tmpdir: + console.print(f" Cloning to: {tmpdir}") + + result = clone_stream( + stream_id=stream_id, + dest_dir=tmpdir, + verbose=False, + ) + + console.print(f"\n [green]Clone complete![/green]") + console.print(f" Downloaded: {result['downloaded']} files") + console.print(f" Total bytes: {result['total_bytes']:,}") + + # Verify files + console.print("\n Verifying cloned files:") + clone_dir = Path(tmpdir) / stream_id + for f in clone_dir.iterdir(): + size = f.stat().st_size + console.print(f" {f.name}: {size} bytes") + + # Read one back to verify content + csv_path = clone_dir / "experiment_data.csv" + if csv_path.exists(): + content = csv_path.read_text() + lines = content.strip().split("\n") + console.print(f"\n Content verification (first 3 lines of CSV):") + for line in lines[:3]: + console.print(f" {line}") + + console.print() + + # ========================================================================= + # Step 4: Test cloning a single URL + # ========================================================================= + console.print("[bold cyan]4. 
Testing single file clone from URL...[/bold cyan]") + + from v2.clone import clone_url + + # Get one of the file URLs + single_url = uploaded_files[0].download_url + console.print(f" URL: {single_url}") + + with tempfile.TemporaryDirectory() as tmpdir: + result = clone_url( + url=single_url, + dest_dir=tmpdir, + verbose=False, + ) + + console.print(f" [green]Downloaded: {result['filename']}[/green]") + console.print(f" Size: {result['size_bytes']} bytes") + console.print(f" Path: {result['path']}") + + console.print() + + # ========================================================================= + # Summary + # ========================================================================= + console.print(Panel.fit( + f"[bold green]Test Complete![/bold green]\n\n" + f"Stream: {stream_id}\n" + f"Files on Globus: {len(uploaded_files)}\n\n" + f"Features tested:\n" + f" [green]✓[/green] File upload to Globus\n" + f" [green]✓[/green] CSV preview with statistics\n" + f" [green]✓[/green] JSON structure preview\n" + f" [green]✓[/green] Text file preview\n" + f" [green]✓[/green] Stream cloning from Globus\n" + f" [green]✓[/green] Single file clone from URL", + border_style="green", + )) + + +if __name__ == "__main__": + main() diff --git a/aws/v2/.env.example b/aws/v2/.env.example new file mode 100644 index 0000000..b9bddb8 --- /dev/null +++ b/aws/v2/.env.example @@ -0,0 +1,31 @@ +# MDF Connect v2 Backend — Local Development Environment Variables +# Copy to .env and modify as needed. python-dotenv loads these automatically. 
+ +# Store backend: sqlite (local dev) or dynamodb (prod) +STORE_BACKEND=sqlite +SQLITE_PATH=/tmp/mdf_connect_v2.db + +# Storage backend: local (local dev) or s3 (prod) +STORAGE_BACKEND=local +LOCAL_STORAGE_PATH=/tmp/mdf_storage + +# Auth mode: dev (X-User-Id headers) or production (Globus token introspection) +AUTH_MODE=dev + +# Async job dispatch: inline (local dev) or sqs (prod) +ASYNC_DISPATCH_MODE=inline + +# Curation settings +ALLOW_ALL_CURATORS=true +CURATOR_GROUP_IDS= +REQUIRED_GROUP_MEMBERSHIP= + +# Mock external services for local dev +USE_MOCK_DATACITE=true +USE_MOCK_SEARCH=true + +# Default organization +# DEFAULT_ORGANIZATION=MDF + +# API Gateway stage prefix (Lambda only) +# ENVIRONMENT=staging diff --git a/aws/v2/.local_server.pid b/aws/v2/.local_server.pid new file mode 100644 index 0000000..af4f5f6 --- /dev/null +++ b/aws/v2/.local_server.pid @@ -0,0 +1 @@ +89537 \ No newline at end of file diff --git a/aws/v2/FRONTEND_API.md b/aws/v2/FRONTEND_API.md new file mode 100644 index 0000000..f8d4afc --- /dev/null +++ b/aws/v2/FRONTEND_API.md @@ -0,0 +1,978 @@ +# MDF v2 Frontend API Reference + +Complete API contract for building a frontend against the MDF Connect v2 backend. Every endpoint, request shape, and response shape is documented here. + +**Base URL (staging):** `https://hjccjf3eqg.execute-api.us-east-1.amazonaws.com/staging` + +--- + +## Authentication + +The backend supports two auth modes. The frontend should always send the user's Globus token as a Bearer token. Unauthenticated requests are fine for public endpoints. 
+ +**Authenticated request:** +``` +Authorization: Bearer +``` + +**Dev mode only** (local development, no Globus): +``` +X-User-Id: +X-User-Email: +X-User-Name: +``` + +### Auth context + +When authenticated, the backend resolves: +- `user_id` — Globus identity UUID +- `user_email` — email address +- `name` — display name +- Group memberships (for curator/submitter authorization) + +### Roles + +| Role | How determined | Can do | +|------|---------------|--------| +| **Anonymous** | No auth header | Search, view published cards/citations/stats/versions | +| **Authenticated user** | Valid Globus token | Submit datasets, view own submissions | +| **Submitter** | Member of MDF submitter group | Submit new datasets | +| **Owner** | `user_id` matches submission's `user_id` | Edit own metadata, withdraw own submissions | +| **Curator** | Member of curator Globus group | Approve/reject, delete, view all submissions, admin stats | + +--- + +## URL Compatibility + +The frontend currently uses URLs like: +``` +/detail/81d55710-5bec-4e71-91b0-6f269e8da85a-1.0 +/detail/levine_abo2179_database_v2.1-1.0 +``` + +The slug format is `{source_id}-{version}`. The backend provides `GET /detail/{slug}` which parses this automatically. Source IDs come in two styles: +- **UUID-style:** `81d55710-5bec-4e71-91b0-6f269e8da85a` +- **Name-style:** `levine_abo2179_database_v2.1` + +Slugs without a `-X.Y` version suffix (e.g. `/detail/levine_abo2179_database_v2.1`) resolve to the latest published version. + +--- + +## Response Conventions + +All responses include `"success": true|false`. 
Error responses use standard HTTP status codes with: +```json +{"detail": "Error message"} +``` + +Unhandled errors return: +```json +{"detail": "Internal server error", "request_id": "a1b2c3d4e5f6"} +``` + +--- + +## Endpoints + +### Health + +``` +GET /health +``` +```json +{"status": "ok", "service": "mdf-v2"} +``` + +--- + +### Search + +``` +GET /search?q={query}&limit={20}&offset={0}&type={all|datasets|streams} +``` + +No auth required. Returns published datasets only. Supports faceted filtering. + +**Query params:** + +| Param | Type | Description | +|-------|------|-------------| +| `q` (or `query`) | string | Search query (required) | +| `limit` | int | Page size (default 20, max 50) | +| `offset` | int | Pagination offset (default 0) | +| `type` | string | `all`, `datasets`, or `streams` (default `all`) | +| `year` | string | Filter by publication year, comma-separated (e.g. `2024,2025`) | +| `organization` | string | Filter by organization, comma-separated (e.g. `MDF Open`) | +| `author` | string | Filter by author name, comma-separated (e.g. `Wolverton`) | +| `keyword` | string | Filter by keyword/subject, comma-separated (e.g. `perovskite,DFT`) | +| `domain` | string | Filter by scientific domain, comma-separated (e.g.
`batteries`) | + +**Response:** +```json +{ + "query": "perovskite", + "total": 23, + "offset": 0, + "results": [ + { + "type": "dataset", + "source_id": "abx3_perovs_alloys_v1.1", + "version": "1.0", + "title": "ABX3 Perovskite Alloys Dataset", + "authors": ["Chibueze Amanchukwu", "Chris Wolverton"], + "keywords": ["perovskite", "DFT", "alloys"], + "description": "A dataset of ABX3 perovskite alloy calculations...", + "publication_year": 2023, + "organization": "MDF Open", + "domains": ["materials science"], + "doi": "10.18126/abc123", + "license": "CC-BY-4.0", + "size_bytes": 10485760, + "file_count": 42, + "status": "published", + "score": 4.5, + "latest": true, + "root_version": "abx3_perovs_alloys_v1.1-1.0", + "download_url": "https://data.materialsdatafacility.org/..." + } + ], + "facets": { + "Year": [{"value": "2024", "count": 12}, {"value": "2023", "count": 8}], + "Organization": [{"value": "MDF Open", "count": 15}, {"value": "Foundry", "count": 5}], + "Authors": [{"value": "Wolverton, Chris", "count": 7}], + "Keywords": [{"value": "DFT", "count": 10}, {"value": "perovskite", "count": 8}], + "Domains": [{"value": "batteries", "count": 4}] + } +} +``` + +**Notes:** +- `score` is relevance ranking (higher = better match) +- `latest` — whether this is the most recent version +- `download_url` — direct download link (may be absent) +- `root_version` — the first version's versioned ID (for version chain navigation) +- `facets` — counts for each facet bucket, reflecting current query and filters. Render in a sidebar; on click, re-query with the corresponding filter param. +- When filters are applied, both `results` and `facets` reflect the filtered view + +**Usage pattern:** +``` +1. User searches "perovskite" + → GET /search?q=perovskite + ← results + facets {Year: [...], Organization: [...], ...} + +2. Frontend renders facet sidebar with counts + +3. 
User clicks "2024" under Year + → GET /search?q=perovskite&year=2024 + ← filtered results + updated facets + +4. User also clicks "MDF Open" under Organization + → GET /search?q=perovskite&year=2024&organization=MDF+Open + ← further filtered results + updated facets +``` + +--- + +### Dataset Card + +Two endpoints return identical data — use whichever matches your routing: + +``` +GET /card/{source_id}?version={optional} +GET /detail/{slug} +``` + +Both accept optional auth. When authenticated, the response includes permissions. + +**Response:** +```json +{ + "success": true, + "source_id": "levine_abo2179_database_v2.1", + "version": "1.0", + "card": { + "source_id": "levine_abo2179_database_v2.1", + "version": "1.0", + "title": "ABO2179 Electroadhesives Database", + "authors": ["Daniel Levine", "Arjun Bhorkar", "..."], + "description": "Database of electroadhesives...", + "keywords": ["electroadhesion", "soft robotics"], + "publisher": "Materials Data Facility", + "publication_year": 2023, + "organization": "MDF Open", + "methods": [], + "facility": null, + "status": "published", + "created_at": "2023-05-15T12:00:00Z", + "updated_at": "2023-05-15T12:00:00Z", + "stats": { + "file_types": ["csv", "json"], + "data_sources_count": 1, + "file_count": 0, + "total_bytes": 0, + "size_human": "0 B" + }, + "links": { + "self": "/status/levine_abo2179_database_v2.1", + "citation": "/citation/levine_abo2179_database_v2.1", + "doi": "https://doi.org/10.18126/jx14-t0v8" + }, + "download_url": "https://data.materialsdatafacility.org/...", + "archive_size": 1048576, + "data_sources": ["https://data.materialsdatafacility.org/..."], + "doi": "10.18126/jx14-t0v8", + "license": "CC-BY-4.0", + "ml": { + "data_format": "csv", + "task_type": "regression", + "n_items": 5000, + "splits": [{"type": "train", "n_items": 4000}, {"type": "test", "n_items": 1000}], + "input_keys": ["composition", "temperature"], + "target_keys": ["bandgap"], + "short_name": "perovskite_bg" + }, + 
"profile_summary": { + "total_files": 3, + "total_bytes": 2048000, + "formats": {"csv": 2, "json": 1}, + "tabular_summary": { + "filename": "data.csv", + "n_rows": 5000, + "columns": [{"name": "composition", "dtype": "object"}, {"name": "bandgap", "dtype": "float64"}] + }, + "sample_rows": [{"composition": "CsPbI3", "bandgap": 1.73}] + } + }, + "permissions": { + "can_edit": true, + "can_delete": false, + "can_curate": false + } +} +``` + +**`card` fields — always present:** +- `source_id`, `version`, `title`, `authors`, `description`, `keywords` +- `publisher`, `organization`, `status`, `created_at`, `updated_at` +- `stats` — file type hints, counts +- `links` — relative API links +- `data_sources` — list of data source URIs + +**`card` fields — present when available:** +- `doi` — DOI string (without `https://doi.org/` prefix) +- `download_url` — direct download URL for the dataset archive +- `archive_size` — size of the zip archive in bytes +- `license` — license name string +- `ml` — ML metadata summary (only for ML-ready datasets) +- `profile_summary` — file-level profiling data (only for profiled datasets) +- `methods`, `facility` — experimental context + +**`permissions` object:** + +Always present. All false when unauthenticated. + +| Field | Type | Meaning | +|-------|------|---------| +| `can_edit` | bool | User can edit metadata (owner or curator, status is `pending_curation`/`rejected`/`published`) | +| `can_delete` | bool | User can soft-delete (curator only) | +| `can_curate` | bool | User can approve/reject (curator and status is `pending_curation`) | + +--- + +### Citation + +``` +GET /citation/{source_id}?version={optional}&format={all|bibtex|ris|apa|datacite} +``` + +No auth required. 
+ +**Response (format=all):** +```json +{ + "success": true, + "source_id": "levine_abo2179_database_v2.1", + "version": "1.0", + "bibtex": "@misc{levine_abo2179_database_v2.1,\n title = {...},\n ...\n}", + "ris": "TY - DATA\nTI - ...\nER - ", + "apa": "Levine, D., ... (2023). ABO2179 Electroadhesives Database. Materials Data Facility. https://doi.org/...", + "datacite": "..." +} +``` + +When a specific format is requested, only that key is included, plus `content_type` (e.g. `"application/x-bibtex"`). + +--- + +### Dataset Access Stats + +``` +GET /stats/{source_id} +``` + +No auth required. Returns aggregate metrics across all published versions. + +**Response:** +```json +{ + "success": true, + "source_id": "levine_abo2179_database_v2.1", + "view_count": 142, + "download_count": 37, + "version_count": 3, + "first_published": "2023-05-15T12:00:00Z", + "last_updated": "2026-02-28T09:30:00Z" +} +``` + +**Counter sources:** +- `view_count` — incremented on every `GET /card`, `GET /detail`, `GET /citation`, `GET /preview` hit +- `download_count` — incremented on `POST /stream/{id}/download-url` + +--- + +### Versions + +``` +GET /versions/{source_id}?limit={50}&offset={0} +``` + +Optional auth. Without auth (or for non-owner/non-curator), only published versions are shown. + +**Response:** +```json +{ + "success": true, + "source_id": "Dataset_hea_hardness", + "versions": [ + { + "version": "1.0", + "title": "HEA Hardness Dataset", + "status": "published", + "doi": "10.18126/abc123", + "created_at": "2023-01-01T00:00:00Z", + "updated_at": "2023-01-01T00:00:00Z" + }, + { + "version": "1.1", + "title": "HEA Hardness Dataset (updated)", + "status": "published", + "doi": "10.18126/abc123", + "created_at": "2023-06-01T00:00:00Z", + "updated_at": "2023-06-01T00:00:00Z" + } + ], + "total_count": 2, + "dataset_doi": "10.18126/abc123" +} +``` + +--- + +### Version Diff + +``` +GET /versions/{source_id}/diff?from={version}&to={version} +``` + +No auth required. 
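One practical wrinkle when calling this endpoint from Python: `from` is a reserved word, so it cannot be passed as a keyword argument and is easiest to supply through a dict. A small sketch, where the helper name `version_diff_path` is hypothetical:

```python
from urllib.parse import quote, urlencode


def version_diff_path(source_id: str, from_version: str, to_version: str) -> str:
    """Build the relative path for the version diff endpoint (hypothetical helper)."""
    # "from" is a Python keyword, so the query is built from a dict instead of kwargs.
    query = urlencode({"from": from_version, "to": to_version})
    # Percent-encode the source ID so unusual characters cannot break the path.
    return f"/versions/{quote(source_id, safe='')}/diff?{query}"
```

The result is relative and would be appended to the base URL of the target environment.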
+ +**Response:** +```json +{ + "success": true, + "source_id": "Dataset_hea_hardness", + "from_version": {"version": "1.0", "status": "published", "created_at": "..."}, + "to_version": {"version": "1.1", "status": "published", "created_at": "..."}, + "diff": { + "added": {"new_keyword": ["materials"]}, + "removed": {}, + "changed": {"title": {"from": "Old Title", "to": "New Title"}}, + "unchanged": ["authors", "description", "keywords"] + } +} +``` + +--- + +### Dataset Preview / Profile + +``` +GET /preview/{source_id} → Full DatasetProfile +GET /preview/{source_id}/files → File list with metadata +GET /preview/{source_id}/files/{path} → Single file detail +GET /preview/{source_id}/sample → Sample rows from first tabular file +``` + +No auth required. Published datasets only. + +**`GET /preview/{source_id}` response:** +```json +{ + "success": true, + "profile": { + "source_id": "...", + "profiled_at": "...", + "total_files": 3, + "total_bytes": 2048000, + "formats": {"csv": 2, "json": 1}, + "files": [ + { + "path": "data.csv", + "filename": "data.csv", + "size_bytes": 1024000, + "content_type": "text/csv", + "format": "csv", + "n_rows": 5000, + "columns": [ + {"name": "composition", "dtype": "object", "count": 5000, "nulls": 0, "unique": 4800} + ], + "sample_rows": [{"composition": "CsPbI3", "bandgap": 1.73}] + } + ] + } +} +``` + +**`GET /preview/{source_id}/sample` response:** +```json +{ + "success": true, + "source_id": "...", + "filename": "data.csv", + "format": "csv", + "columns": [{"name": "composition", "dtype": "object"}], + "n_rows": 5000, + "sample_rows": [{"composition": "CsPbI3", "bandgap": 1.73}] +} +``` + +--- + +### Submission Status + +``` +GET /status/{source_id}?version={optional} +GET /status?source_id={source_id}&version={optional} +``` + +Optional auth. Unauthenticated callers only see published submissions. Owner/curator sees all statuses. 
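The visibility rule above can be sketched as a small predicate. This is an illustration of the documented behavior, not the backend's actual code; the function name and the `is_curator` flag are assumptions.

```python
from typing import Any, Dict, Optional


def can_view_submission(
    submission: Dict[str, Any],
    user_id: Optional[str] = None,
    is_curator: bool = False,
) -> bool:
    """Return True if the caller may see this submission's status (illustrative only)."""
    # Published submissions are visible to everyone, including anonymous callers.
    if submission.get("status") == "published":
        return True
    # Anything else requires an authenticated caller.
    if user_id is None:
        return False
    # Owners and curators see every status.
    return is_curator or submission.get("user_id") == user_id
```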
+ +**Response:** +```json +{ + "success": true, + "submission": { + "source_id": "mdf-abc123", + "version": "1.0", + "versioned_source_id": "mdf-abc123-1.0", + "user_id": "globus-uuid", + "user_email": "user@example.com", + "organization": "MDF Open", + "status": "published", + "dataset_mdata": { ... }, + "schema_version": "2", + "test": false, + "created_at": "2026-01-15T12:00:00Z", + "updated_at": "2026-01-15T12:00:00Z", + "published_at": "2026-01-15T12:00:00Z", + "doi": "10.18126/...", + "dataset_doi": "10.18126/...", + "view_count": 42, + "download_count": 7 + } +} +``` + +`dataset_mdata` is the full metadata object (parsed from JSON, not a string). + +--- + +### Submit a Dataset + +``` +POST /submit +Authorization: Bearer +Content-Type: application/json +``` + +Requires submitter group membership in production. + +**Request body — flat v2 metadata:** +```json +{ + "title": "My Dataset", + "authors": [ + {"name": "Jane Smith", "orcid": "0000-0002-1234-5678", "affiliations": ["MIT"]} + ], + "description": "A dataset about...", + "keywords": ["materials", "DFT"], + "data_sources": ["globus://82f1b5c6-6e9b-11e5-ba47-22000b92c6ec/path/to/data"], + "organization": "MDF Open", + "license": {"name": "CC-BY-4.0", "url": "https://creativecommons.org/licenses/by/4.0/"}, + "funding": [{"funder_name": "NSF", "award_number": "DMR-1234567"}], + "related_works": [{"identifier": "10.1234/paper", "identifier_type": "DOI", "relation_type": "IsDescribedBy"}], + "methods": ["DFT", "molecular dynamics"], + "facility": "ALCF", + "fields_of_science": ["Materials Science"], + "domains": ["batteries", "energy storage"], + "ml": { + "data_format": "csv", + "task_type": ["regression"], + "n_items": 5000, + "keys": [ + {"name": "composition", "role": "input"}, + {"name": "bandgap", "role": "target", "units": "eV"} + ], + "splits": [ + {"type": "train", "path": "train.csv", "n_items": 4000}, + {"type": "test", "path": "test.csv", "n_items": 1000} + ] + }, + "tags": ["featured"], + 
"extensions": {"custom_field": "value"}, + "test": false +} +``` + +**Required fields:** `title`, `authors` (at least one with `name`), `data_sources` (unless `update: true` metadata-only) + +**`data_sources` formats:** +- `globus://{collection_uuid}/path/to/data` — Globus transfer +- `https://...` — HTTP download +- `stream://{stream_id}` — internal stream reference + +**For updates** (new version of existing dataset), include: +```json +{ + "update": true, + "extensions": {"mdf_source_id": "existing_source_id"}, + ... +} +``` + +**Response:** +```json +{ + "success": true, + "source_id": "mdf-abc123", + "version": "1.0", + "versioned_source_id": "mdf-abc123-1.0", + "organization": "MDF Open" +} +``` + +New submissions land with `status: "pending_curation"`. + +--- + +### Edit Metadata + +``` +POST /submissions/{source_id}/metadata +Authorization: Bearer +``` + +Owner or curator. Only fields explicitly provided (non-null) are applied — omitted fields are left unchanged. + +**Editable fields:** + +| Field | Type | Notes | +|-------|------|-------| +| `title` | string | | +| `authors` | `[{name, orcid?, affiliations?}]` | Replaces author list | +| `description` | string | | +| `keywords` | `[string]` | Replaces keyword list | +| `license` | `{name, url?, identifier?}` | | +| `funding` | `[{funder_name, award_number?}]` | | +| `related_works` | `[{identifier, identifier_type, relation_type}]` | | +| `methods` | `[string]` | | +| `facility` | string | | +| `fields_of_science` | `[string]` | | +| `domains` | `[string]` | | +| `ml` | object | | +| `geo_locations` | `[{place}]` | | +| `tags` | `[string]` | | +| `extensions` | object | **Deep-merged** into existing extensions — existing keys not in the update are preserved | +| `version` | string | Targets a specific version (defaults to latest) | + +Not editable: `data_sources`, `organization`, `publisher`, `test`, `update`. 
+ +**Request body (any subset of the above):** +```json +{ + "title": "Updated Title", + "keywords": ["new", "keywords"], + "domains": ["batteries"], + "version": "1.0" +} +``` + +**Behavior by current status:** +| Status | What happens | +|--------|-------------| +| `pending_curation` | Updates metadata in-place | +| `rejected` | Updates metadata in-place (fix before resubmit) | +| `published` | Creates a new minor version (1.0 → 1.1), immediately `published`, no re-curation. DataCite metadata updated, **no new DOI minted** — the existing `dataset_doi` is inherited. | +| `withdrawn`, `approved` | Returns 400 | + +**Response (pending_curation / rejected):** +```json +{ + "success": true, + "source_id": "mdf-abc123", + "version": "1.0", + "updated_fields": ["title", "keywords"] +} +``` + +**Response (published → minor version bump):** +```json +{ + "success": true, + "source_id": "mdf-abc123", + "version": "1.0", + "new_version": "1.1", + "updated_fields": ["title", "keywords"] +} +``` + +`version` is the version that was edited. `new_version` is the newly created version — only present when editing a published dataset. After a minor bump, the new version is immediately live and `GET /card/{source_id}` resolves to it. + +--- + +### Withdraw + +``` +POST /submissions/{source_id}/withdraw +Authorization: Bearer +``` + +Owner or curator. Only allowed when `status == "pending_curation"`. + +**Request body:** +```json +{"reason": "Duplicate submission", "version": "1.0"} +``` + +**Response:** +```json +{"success": true, "source_id": "...", "version": "1.0", "status": "withdrawn"} +``` + +--- + +### Resubmit + +``` +POST /submissions/{source_id}/resubmit +Authorization: Bearer +``` + +Owner or curator. Only allowed when `status == "rejected"`. 
+ +**Request body:** +```json +{"notes": "Fixed the title and added authors", "version": "1.0"} +``` + +**Response:** +```json +{"success": true, "source_id": "...", "version": "1.0", "status": "pending_curation"} +``` + +--- + +### Soft-Delete + +``` +POST /submissions/{source_id}/delete +Authorization: Bearer +``` + +Curator only. Works on any status except already deleted. + +**Request body:** +```json +{"reason": "Spam submission", "version": "1.0"} +``` + +`reason` is required. `version` is optional (defaults to latest). + +**Response:** +```json +{"success": true, "source_id": "...", "version": "1.0", "status": "deleted"} +``` + +--- + +### List My Submissions + +``` +GET /submissions?status={filter}&include_counts={true}&limit={50}&start_key={...} +Authorization: Bearer +``` + +Returns the caller's own submissions. Add `?organization=X` to see all org submissions (curator only). + +**Query params:** +| Param | Type | Description | +|-------|------|-------------| +| `status` | string | Comma-separated filter, e.g. `published,pending_curation` | +| `include_counts` | bool | Include per-status counts over all submissions | +| `limit` | int | Page size (default 50) | +| `start_key` | string | Pagination cursor from previous response's `next_key` | +| `organization` | string | Org-wide view (curator only) | + +**Response:** +```json +{ + "success": true, + "submissions": [ + { + "source_id": "mdf-abc123", + "version": "1.0", + "status": "published", + "user_id": "...", + "organization": "MDF Open", + "dataset_mdata": { ... }, + "created_at": "...", + "updated_at": "..." + } + ], + "counts": {"pending_curation": 3, "published": 12, "rejected": 1}, + "total": 16, + "next_key": null +} +``` + +`counts` and `total` only present when `include_counts=true`. `next_key` is `null` when no more pages. 
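A sketch of walking every page with the `start_key` / `next_key` cursor. The `fetch_page` callable is a hypothetical stand-in for whatever performs `GET /submissions` and returns the parsed JSON body; only the `submissions` and `next_key` field names come from this document.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_submissions(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],
) -> Iterator[Dict[str, Any]]:
    """Yield submissions across all pages, following `next_key` until it is null."""
    start_key: Optional[str] = None  # the first request carries no cursor
    while True:
        page = fetch_page(start_key)
        yield from page.get("submissions", [])
        # The cursor for the next request is the previous response's `next_key`.
        start_key = page.get("next_key")
        if not start_key:
            break
```

Keeping the HTTP call behind `fetch_page` lets the loop be exercised with canned pages, without a running backend.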
+ +--- + +### Curation Queue (Curator) + +``` +GET /curation/pending?limit={50}&offset={0}&organization={optional} +Authorization: Bearer +``` + +**Response:** +```json +{ + "success": true, + "pending_count": 5, + "submissions": [ + { + "source_id": "mdf-abc123", + "version": "1.0", + "title": "Some Dataset", + "organization": "MDF Open", + "submitter": "globus-user-uuid", + "submitted_at": "2026-02-28T12:00:00Z", + "file_count": 3, + "total_bytes": 2048000 + } + ], + "limit": 50, + "offset": 0 +} +``` + +### Curation Review + +``` +GET /curation/{source_id}?version={optional} +Authorization: Bearer +``` + +**Response:** +```json +{ + "success": true, + "submission": { ... }, + "curation_history": [ + {"action": "rejected", "curator_id": "...", "timestamp": "...", "reason": "Missing authors"} + ], + "current_status": "pending_curation", + "can_approve": true, + "can_reject": true +} +``` + +### Approve + +``` +POST /curation/{source_id}/approve +Authorization: Bearer +``` + +**Request body:** +```json +{ + "notes": "Looks good", + "mint_doi": true, + "metadata_updates": {"keywords": ["added-by-curator"]}, + "version": "1.0" +} +``` + +All fields optional. `mint_doi` defaults to `true`. `metadata_updates` lets the curator fix metadata inline during approval. + +**Response:** +```json +{ + "success": true, + "source_id": "...", + "version": "1.0", + "status": "published", + "approved_by": "curator-uuid", + "approved_at": "...", + "doi": {"success": true, "doi": "10.18126/..."} +} +``` + +### Reject + +``` +POST /curation/{source_id}/reject +Authorization: Bearer +``` + +**Request body:** +```json +{ + "reason": "Missing data source URLs", + "suggestions": "Please add the Globus endpoint path", + "version": "1.0" +} +``` + +`reason` is required. 
+ +**Response:** +```json +{ + "success": true, + "source_id": "...", + "version": "1.0", + "status": "rejected", + "rejected_by": "curator-uuid", + "rejected_at": "...", + "reason": "Missing data source URLs" +} +``` + +--- + +### Admin Stats (Curator) + +``` +GET /admin/stats +Authorization: Bearer +``` + +**Response:** +```json +{ + "success": true, + "total": 922, + "by_status": { + "published": 904, + "pending_curation": 10, + "rejected": 5, + "deleted": 3 + }, + "access_totals": { + "view_count": 15230, + "download_count": 4521 + } +} +``` + +--- + +## Submission Lifecycle + +``` + ┌─────────────┐ + POST /submit │ pending │ + ───────────────>│ _curation │ + └──────┬───────┘ + │ + ┌──────────────────┬┴───────────────────┐ + │ │ │ + ▼ ▼ ▼ + ┌────────────┐ ┌─────────────┐ ┌──────────────┐ + │ approved │ │ rejected │ │ withdrawn │ + │ │ │ │ │ (by owner) │ + └──────┬──────┘ └──────┬───────┘ └──────────────┘ + │ │ + │ edit + resubmit ──> pending_curation + ▼ + ┌─────────────┐ + │ published │──── edit metadata ──> new minor version (stays published) + └─────────────┘ + + Any status ──── curator delete ──> deleted (soft-delete, recorded in history) +``` + +**Status values:** `pending_curation`, `approved`, `published`, `rejected`, `withdrawn`, `deleted` + +--- + +## CORS + +**Staging/dev:** `Access-Control-Allow-Origin: *` (no credentials) + +**Production:** Only `https://materialsdatafacility.org` and `https://app.materialsdatafacility.org` with credentials enabled. 
+ +**Allowed headers:** `Content-Type`, `Authorization`, `X-User-Id`, `X-User-Email`, `X-User-Name`, `X-Globus-Token` + +--- + +## Error Codes + +| Status | Meaning | +|--------|---------| +| 400 | Bad request (validation error, invalid status transition) | +| 401 | Missing or invalid authentication token | +| 403 | Not authorized (not owner, not curator, not submitter group member) | +| 404 | Resource not found (or not published for unauthenticated requests) | +| 413 | Payload too large (metadata > 256KB, too many data sources/authors) | +| 500 | Internal error (check `request_id` in response) | + +--- + +## Data Model Quick Reference + +### Source ID & Versioning + +- `source_id` — unique dataset identifier (UUID or human-readable name) +- `version` — semver-style string: `"1.0"`, `"1.1"`, `"2.0"` +- `versioned_source_id` — `"{source_id}-{version}"` +- **Major bump** (1.0 → 2.0): new data sources provided +- **Minor bump** (1.0 → 1.1): metadata-only edit on a published dataset + +### Key Metadata Fields + +| Field | Type | Description | +|-------|------|-------------| +| `title` | string | Dataset title (required) | +| `authors` | `[{name, orcid?, affiliations?}]` | At least one required | +| `description` | string | Free-text description | +| `keywords` | `[string]` | Subject keywords | +| `data_sources` | `[string]` | URIs to the data (globus://, https://, stream://) | +| `organization` | string | Publishing org (default: "Materials Data Facility") | +| `doi` | string | DOI (assigned on approval) | +| `download_url` | string | Direct download link | +| `license` | `{name, url?, identifier?}` | License info | +| `ml` | object | ML-readiness metadata (see card response) | +| `domains` | `[string]` | Scientific domains | +| `methods` | `[string]` | Experimental/computational methods | +| `facility` | string | Research facility | +| `fields_of_science` | `[string]` | Broad scientific fields | +| `funding` | `[{funder_name, award_number?}]` | Funding sources | +| 
`related_works` | `[{identifier, identifier_type, relation_type}]` | Related publications/datasets | +| `tags` | `[string]` | Platform tags | +| `extensions` | object | Arbitrary extra metadata | diff --git a/aws/v2/__init__.py b/aws/v2/__init__.py new file mode 100644 index 0000000..76ec3e7 --- /dev/null +++ b/aws/v2/__init__.py @@ -0,0 +1 @@ +"""MDF Connect v2 Lambda handlers and helpers.""" diff --git a/aws/v2/app/__init__.py b/aws/v2/app/__init__.py new file mode 100644 index 0000000..18c85e7 --- /dev/null +++ b/aws/v2/app/__init__.py @@ -0,0 +1,40 @@ +import logging +import os + +from fastapi import FastAPI +from fastapi.middleware.cors import CORSMiddleware + +from v2.app.middleware import configure_app_middleware + +_log_level = os.environ.get("LOG_LEVEL", "INFO").upper() +logging.basicConfig(level=getattr(logging, _log_level, logging.INFO)) + +app = FastAPI(title="MDF Connect v2") + +_cors_raw = os.environ.get("CORS_ALLOWED_ORIGINS", "*") +_cors_origins = [o.strip() for o in _cors_raw.split(",") if o.strip()] if _cors_raw != "*" else ["*"] + +app.add_middleware( + CORSMiddleware, + allow_origins=_cors_origins, + allow_methods=["GET", "POST", "OPTIONS"], + allow_headers=["Content-Type", "Authorization", "X-User-Id", "X-User-Email", "X-User-Name", "X-Globus-Token"], + allow_credentials=True, +) +configure_app_middleware(app) + + +@app.get("/health") +async def health(): + return {"status": "ok", "service": "mdf-v2"} + + +from v2.app.routers import submissions, streams, files, search, cards, curation, preview # noqa: E402 + +app.include_router(submissions.router) +app.include_router(streams.router) +app.include_router(files.router) +app.include_router(search.router) +app.include_router(cards.router) +app.include_router(curation.router) +app.include_router(preview.router) diff --git a/aws/v2/app/auth.py b/aws/v2/app/auth.py new file mode 100644 index 0000000..0be652e --- /dev/null +++ b/aws/v2/app/auth.py @@ -0,0 +1,189 @@ +import os +from typing import Any, 
Dict, Optional + +from fastapi import Depends, Header, HTTPException, Request + +from v2.app.models import AuthContext + + +def get_auth_mode() -> str: + mode = os.environ.get("AUTH_MODE", "dev") + normalized = (mode or "dev").strip().lower() + return normalized or "dev" + + +async def get_auth( + request: Request, + x_user_id: Optional[str] = Header(None), + x_user_email: Optional[str] = Header(None), + x_user_name: Optional[str] = Header(None), + authorization: Optional[str] = Header(None), +) -> AuthContext: + if get_auth_mode() == "dev": + user_id = x_user_id or os.environ.get("LOCAL_USER_ID", "local-user") + user_email = x_user_email or os.environ.get("LOCAL_USER_EMAIL", "local@example.com") + name = x_user_name or os.environ.get("LOCAL_USER_NAME", "Local User") + return AuthContext( + user_id=user_id, + name=name, + user_email=user_email, + identities=[], + group_info={}, + dependent_token={}, + ) + + # Production mode: Globus token introspection + if not authorization: + raise HTTPException(status_code=401, detail="Missing Authorization header") + + token = authorization.replace("Bearer ", "") + if not token: + raise HTTPException(status_code=401, detail="Missing Bearer token") + + try: + import globus_sdk + + # Validate the token by calling the Globus userinfo endpoint. + # This works with any valid Globus access token regardless of + # which resource server it was issued for (unlike introspect, + # which only reports active=True for the introspecting app's + # own resource server). 
+ try: + ac = globus_sdk.AuthClient( + authorizer=globus_sdk.AccessTokenAuthorizer(token) + ) + userinfo = ac.userinfo() + except globus_sdk.AuthAPIError: + raise HTTPException(status_code=401, detail="Invalid or expired token") + + user_id = userinfo.get("sub") + if not user_id: + raise HTTPException(status_code=401, detail="Invalid or expired token") + + # Optionally fetch group info via confidential app dependent tokens + client_id = os.environ.get("GLOBUS_CLIENT_ID") + client_secret = os.environ.get("GLOBUS_CLIENT_SECRET") + group_info = {} + dependent_token = {} + if client_id and client_secret: + try: + conf_client = globus_sdk.ConfidentialAppAuthClient(client_id, client_secret) + dependent_token = conf_client.oauth2_get_dependent_tokens(token).by_resource_server + groups_token = dependent_token.get("groups.api.globus.org", {}).get("access_token") + if groups_token: + groups_client = globus_sdk.GroupsClient( + authorizer=globus_sdk.AccessTokenAuthorizer(groups_token) + ) + groups = groups_client.get_my_groups() + group_info = { + group["id"]: {"name": group["name"], "description": group["description"]} + for group in groups + } + except Exception: + pass + + return AuthContext( + user_id=user_id, + name=userinfo.get("name"), + user_email=userinfo.get("email"), + identities=userinfo.get("identity_set", []), + group_info=group_info, + dependent_token=dependent_token, + ) + except HTTPException: + raise + except Exception as e: + raise HTTPException(status_code=401, detail=f"Authentication failed: {e}") + + +async def get_optional_auth( + request: Request, + x_user_id: Optional[str] = Header(None), + x_user_email: Optional[str] = Header(None), + x_user_name: Optional[str] = Header(None), + authorization: Optional[str] = Header(None), +) -> Optional[AuthContext]: + if not authorization and not x_user_id and get_auth_mode() != "dev": + return None + try: + return await get_auth(request, x_user_id, x_user_email, x_user_name, authorization) + except HTTPException: + 
return None + + +def is_curator(auth: AuthContext) -> bool: + curator_user_ids = set(os.environ.get("CURATOR_USER_IDS", "").split(",")) - {""} + curator_group_ids = set(os.environ.get("CURATOR_GROUP_IDS", "").split(",")) - {""} + + if auth.user_id in curator_user_ids: + return True + + user_groups = set((auth.group_info or {}).keys()) + if user_groups & curator_group_ids: + return True + + if os.environ.get("ALLOW_ALL_CURATORS", "").lower() in ("true", "1", "yes"): + return True + + return False + + +def _is_curator(auth: AuthContext) -> bool: + """Backward-compatible alias.""" + return is_curator(auth) + + +def ensure_submission_owner_or_curator(auth: AuthContext, submission: Dict[str, Any]) -> None: + """Only the submitter (or a curator) may mutate a submission.""" + owner_id = submission.get("user_id") + if owner_id and owner_id == auth.user_id: + return + if is_curator(auth): + return + raise HTTPException(status_code=403, detail="You do not have permission for this submission") + + +def ensure_stream_owner_or_curator(auth: AuthContext, stream: Dict[str, Any]) -> None: + """Only the stream owner (or a curator) may mutate/view a stream.""" + owner_id = stream.get("user_id") + if owner_id and owner_id == auth.user_id: + return + if is_curator(auth): + return + raise HTTPException(status_code=403, detail="You do not have permission for this stream") + + +async def require_curator( + auth: AuthContext = Depends(get_auth), +) -> AuthContext: + if not is_curator(auth): + raise HTTPException(status_code=403, detail="You do not have curator permissions") + return auth + + +def is_submitter(auth: AuthContext) -> bool: + """Check whether the user is allowed to submit datasets. + + In dev-auth mode (or when REQUIRED_GROUP_MEMBERSHIP is empty) everyone + is allowed. In production the user must belong to the submitter group. 
+ """ + if get_auth_mode() == "dev": + return True + + required = os.environ.get("REQUIRED_GROUP_MEMBERSHIP", "").strip() + if not required: + return True + + user_groups = set((auth.group_info or {}).keys()) + return required in user_groups + + +async def require_submitter( + auth: AuthContext = Depends(get_auth), +) -> AuthContext: + if not is_submitter(auth): + raise HTTPException( + status_code=403, + detail="You must be a member of the MDF submitters group to submit datasets", + ) + return auth diff --git a/aws/v2/app/deps.py b/aws/v2/app/deps.py new file mode 100644 index 0000000..5156fa1 --- /dev/null +++ b/aws/v2/app/deps.py @@ -0,0 +1,15 @@ +from v2.store import SubmissionStore, get_store +from v2.stream_store import StreamStore, get_stream_store +from v2.storage import StorageBackend, get_storage_backend + + +def get_submission_store() -> SubmissionStore: + return get_store() + + +def get_stream_store_dep() -> StreamStore: + return get_stream_store() + + +def get_storage() -> StorageBackend: + return get_storage_backend() diff --git a/aws/v2/app/main.py b/aws/v2/app/main.py new file mode 100644 index 0000000..0a50de2 --- /dev/null +++ b/aws/v2/app/main.py @@ -0,0 +1,31 @@ +import os + +from mangum import Mangum + +from v2.app import app + +# Strip the API Gateway stage prefix (e.g. 
"/dev") so FastAPI sees clean paths +_stage = os.environ.get("ENVIRONMENT", "") +handler = Mangum(app, lifespan="off", api_gateway_base_path=f"/{_stage}" if _stage else "/") + +if __name__ == "__main__": + import os + import uvicorn + + # Load .env file if python-dotenv is installed + try: + from dotenv import load_dotenv + load_dotenv() + except ImportError: + pass + + os.environ.setdefault("STORE_BACKEND", "sqlite") + os.environ.setdefault("SQLITE_PATH", "/tmp/mdf_connect_v2.db") + os.environ.setdefault("ASYNC_DISPATCH_MODE", "inline") + os.environ.setdefault("AUTH_MODE", "dev") + os.environ.setdefault("ALLOW_ALL_CURATORS", "true") + os.environ.setdefault("CURATOR_GROUP_IDS", "") + os.environ.setdefault("REQUIRED_GROUP_MEMBERSHIP", "") + os.environ.setdefault("USE_MOCK_DATACITE", "true") + + uvicorn.run("v2.app:app", host="127.0.0.1", port=8080, reload=True) diff --git a/aws/v2/app/middleware.py b/aws/v2/app/middleware.py new file mode 100644 index 0000000..5c45aad --- /dev/null +++ b/aws/v2/app/middleware.py @@ -0,0 +1,174 @@ +import hashlib +import json +import logging +import os +import threading +import time +import uuid +from collections import defaultdict, deque +from typing import Deque, Dict, Tuple + +from fastapi import FastAPI, Request +from starlette.responses import JSONResponse + +logger = logging.getLogger(__name__) + +_RATE_LIMIT_STATE: Dict[str, Deque[float]] = defaultdict(deque) +_RATE_LOCK = threading.Lock() + + +def reset_middleware_state() -> None: + """Reset in-memory middleware state for tests.""" + with _RATE_LOCK: + _RATE_LIMIT_STATE.clear() + + +def _env_int(name: str, default: int) -> int: + value = os.environ.get(name) + if value is None: + return default + try: + return int(value) + except (TypeError, ValueError): + return default + + +def _request_limit_for_path(path: str) -> int: + if path.startswith("/submit"): + return _env_int("RATE_LIMIT_SUBMIT_PER_MIN", 20) + if path.startswith("/stream/create"): + return 
_env_int("RATE_LIMIT_STREAM_CREATE_PER_MIN", 30) + if path.startswith("/stream/") and path.endswith("/upload"): + return _env_int("RATE_LIMIT_STREAM_UPLOAD_PER_MIN", 60) + if path.startswith("/stream/"): + return _env_int("RATE_LIMIT_STREAM_MUTATION_PER_MIN", 90) + return _env_int("RATE_LIMIT_DEFAULT_PER_MIN", 120) + + +def _request_size_limit_bytes() -> int: + return _env_int("MAX_REQUEST_BYTES", 1_048_576) + + +def _rate_limit_window_seconds() -> int: + return _env_int("RATE_LIMIT_WINDOW_SECONDS", 60) + + +def _is_exempt_path(path: str) -> bool: + return path in {"/health", "/docs", "/openapi.json", "/redoc", "/docs/oauth2-redirect"} + + +def _actor_key(request: Request) -> str: + user_id = (request.headers.get("x-user-id") or "").strip() + if user_id: + return f"user:{user_id}" + authz = (request.headers.get("authorization") or "").strip() + if authz: + digest = hashlib.sha256(authz.encode("utf-8")).hexdigest()[:12] + return f"auth:{digest}" + client_host = request.client.host if request.client else "unknown" + return f"ip:{client_host}" + + +def _check_rate_limit(key: str, limit: int, window_sec: int) -> Tuple[bool, int]: + if limit <= 0: + return True, 0 + now = time.monotonic() + cutoff = now - window_sec + with _RATE_LOCK: + bucket = _RATE_LIMIT_STATE[key] + while bucket and bucket[0] <= cutoff: + bucket.popleft() + if len(bucket) >= limit: + retry_after = max(1, int(window_sec - (now - bucket[0]))) + return False, retry_after + bucket.append(now) + return True, 0 + + +def configure_app_middleware(app: FastAPI) -> None: + @app.middleware("http") + async def security_and_logging_middleware(request: Request, call_next): + request_id = request.headers.get("x-request-id") or str(uuid.uuid4()) + request.state.request_id = request_id + method = request.method.upper() + path = request.url.path + start = time.monotonic() + + if not _is_exempt_path(path): + if method in {"POST", "PUT", "PATCH"}: + max_bytes = _request_size_limit_bytes() + content_length = 
request.headers.get("content-length") + if content_length: + try: + if int(content_length) > max_bytes: + return JSONResponse( + status_code=413, + content={ + "success": False, + "error": f"Request body exceeds {max_bytes} bytes", + "request_id": request_id, + }, + headers={"X-Request-Id": request_id}, + ) + except ValueError: + pass + body = await request.body() + if len(body) > max_bytes: + return JSONResponse( + status_code=413, + content={ + "success": False, + "error": f"Request body exceeds {max_bytes} bytes", + "request_id": request_id, + }, + headers={"X-Request-Id": request_id}, + ) + + actor = _actor_key(request) + limit = _request_limit_for_path(path) + allowed, retry_after = _check_rate_limit( + key=f"{actor}:{path.split('/', 2)[1] if path.startswith('/') else path}", + limit=limit, + window_sec=_rate_limit_window_seconds(), + ) + if not allowed: + return JSONResponse( + status_code=429, + content={ + "success": False, + "error": "Rate limit exceeded", + "request_id": request_id, + "retry_after_seconds": retry_after, + }, + headers={ + "Retry-After": str(retry_after), + "X-Request-Id": request_id, + }, + ) + + try: + response = await call_next(request) + except Exception: + elapsed_ms = int((time.monotonic() - start) * 1000) + log_data = { + "event": "request_error", + "request_id": request_id, + "method": method, + "path": path, + "duration_ms": elapsed_ms, + } + logger.exception(json.dumps(log_data, sort_keys=True)) + raise + + elapsed_ms = int((time.monotonic() - start) * 1000) + log_data = { + "event": "request_complete", + "request_id": request_id, + "method": method, + "path": path, + "status_code": response.status_code, + "duration_ms": elapsed_ms, + } + logger.info(json.dumps(log_data, sort_keys=True)) + response.headers["X-Request-Id"] = request_id + return response diff --git a/aws/v2/app/models.py b/aws/v2/app/models.py new file mode 100644 index 0000000..00cceb4 --- /dev/null +++ b/aws/v2/app/models.py @@ -0,0 +1,106 @@ +from typing import 
Any, Dict, List, Optional + +from pydantic import BaseModel, Field + + +class AuthContext(BaseModel): + user_id: str + name: Optional[str] = None + user_email: Optional[str] = None + identities: Optional[list] = None + group_info: Optional[dict] = None + dependent_token: Optional[Any] = None + + +class StatusUpdateRequest(BaseModel): + source_id: str + version: str + status: str + + +class StreamCreateRequest(BaseModel): + title: str + lab_id: Optional[str] = None + organization: Optional[str] = None + metadata: Optional[dict] = None + stream_id: Optional[str] = None + + +class StreamAppendFiles(BaseModel): + filename: Optional[str] = None + size: Optional[int] = Field(default=0, ge=0, le=10 * 1024 * 1024 * 1024) + + +class StreamAppendRequest(BaseModel): + stream_id: Optional[str] = None + files: Optional[List[StreamAppendFiles]] = Field(default=None, max_length=1000) + file_count: Optional[int] = Field(default=None, ge=0, le=10000) + total_bytes: Optional[int] = Field(default=None, ge=0, le=50 * 1024 * 1024 * 1024) + last_file: Optional[dict] = None + + +class FileUploadItem(BaseModel): + filename: str + content_base64: str + content_type: Optional[str] = "application/octet-stream" + metadata: Optional[dict] = None + + +class FileUploadRequest(BaseModel): + filename: Optional[str] = None + content_base64: Optional[str] = None + content_type: Optional[str] = "application/octet-stream" + metadata: Optional[dict] = None + files: Optional[List[FileUploadItem]] = Field(default=None, max_length=100) + + +class UploadUrlRequest(BaseModel): + filename: str + content_type: Optional[str] = "application/octet-stream" + size_bytes: Optional[int] = None + expires_in: Optional[int] = 3600 + + +class ConfirmUploadRequest(BaseModel): + path: str + size_bytes: Optional[int] = 0 + checksum_md5: Optional[str] = "" + metadata: Optional[dict] = None + + +class DownloadUrlRequest(BaseModel): + path: Optional[str] = None + + +class StreamCloseRequest(BaseModel): + stream_id: 
Optional[str] = None + mint_doi: Optional[bool] = False + title: Optional[str] = None + description: Optional[str] = None + authors: Optional[list] = None + keywords: Optional[list] = None + license: Optional[str] = None + + +class StreamSnapshotRequest(BaseModel): + stream_id: Optional[str] = None + title: Optional[str] = None + description: Optional[str] = None + author: Optional[str] = None + source_id: Optional[str] = None + update: Optional[bool] = False + data_sources: Optional[list] = None + test: Optional[bool] = False + + +class CurationApproveRequest(BaseModel): + notes: Optional[str] = "" + mint_doi: Optional[bool] = True + metadata_updates: Optional[dict] = None + version: Optional[str] = None + + +class CurationRejectRequest(BaseModel): + reason: str + suggestions: Optional[str] = "" + version: Optional[str] = None diff --git a/aws/v2/app/routers/__init__.py b/aws/v2/app/routers/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/aws/v2/app/routers/cards.py b/aws/v2/app/routers/cards.py new file mode 100644 index 0000000..887ae8e --- /dev/null +++ b/aws/v2/app/routers/cards.py @@ -0,0 +1,154 @@ +import logging +import re +from typing import Any, Dict, Optional, Tuple + +from fastapi import APIRouter, Depends, HTTPException, Query + +from v2.app.auth import get_optional_auth, is_curator +from v2.app.deps import get_submission_store +from v2.app.models import AuthContext +from v2.citation import generate_apa, generate_bibtex, generate_datacite_xml, generate_ris +from v2.dataset_card import build_dataset_card +from v2.store import SubmissionStore + +logger = logging.getLogger(__name__) + +router = APIRouter() + +_VERSION_SUFFIX_RE = re.compile(r"^(.+)-(\d+\.\d+)$") +_EDITABLE_STATUSES = {"pending_curation", "rejected", "published"} + + +def _build_permissions(auth: Optional[AuthContext], record: Dict[str, Any]) -> Dict[str, bool]: + """Compute user permissions for a dataset record.""" + if not auth: + return {"can_edit": False, 
"can_delete": False, "can_curate": False} + + owner = record.get("user_id") == auth.user_id + curator = is_curator(auth) + status = record.get("status", "") + + return { + "can_edit": (owner or curator) and status in _EDITABLE_STATUSES, + "can_delete": curator, + "can_curate": curator and status == "pending_curation", + } + + +@router.get("/card/{source_id}") +async def get_card( + source_id: str, + version: Optional[str] = Query(None), + auth: Optional[AuthContext] = Depends(get_optional_auth), + store: SubmissionStore = Depends(get_submission_store), +): + if version and version.lower() == "latest": + version = None + + record = store.get(source_id, version=version) + if not record or record.get("status") != "published": + raise HTTPException(404, "Dataset not found") + + card = build_dataset_card(record) + + # Fire-and-forget view count increment + try: + store.increment_counter(source_id, record["version"], "view_count") + except Exception: + logger.debug("Failed to increment view_count for %s", source_id, exc_info=True) + + return {"success": True, "card": card, "permissions": _build_permissions(auth, record)} + + +@router.get("/citation/{source_id}") +async def get_citation( + source_id: str, + version: Optional[str] = Query(None), + format: Optional[str] = Query("all"), + store: SubmissionStore = Depends(get_submission_store), +): + record = store.get(source_id, version=version) + if not record or record.get("status") != "published": + raise HTTPException(404, "Dataset not found") + + fmt = (format or "all").lower() + + result = { + "success": True, + "source_id": source_id, + "version": record.get("version"), + } + + if fmt == "bibtex": + result["bibtex"] = generate_bibtex(record) + result["content_type"] = "application/x-bibtex" + elif fmt == "ris": + result["ris"] = generate_ris(record) + result["content_type"] = "application/x-research-info-systems" + elif fmt == "apa": + result["apa"] = generate_apa(record) + result["content_type"] = "text/plain" + elif 
fmt == "datacite": + result["datacite"] = generate_datacite_xml(record) + result["content_type"] = "application/xml" + else: # all + result["bibtex"] = generate_bibtex(record) + result["ris"] = generate_ris(record) + result["apa"] = generate_apa(record) + result["datacite"] = generate_datacite_xml(record) + + return result + + +def _parse_detail_slug(slug: str) -> Tuple[str, Optional[str]]: + """Parse a frontend detail slug into (source_id, version). + + Handles both formats: + - "81d55710-5bec-4e71-91b0-6f269e8da85a-1.0" → (UUID, "1.0") + - "levine_abo2179_database_v2.1-1.0" → (name, "1.0") + - "levine_abo2179_database_v2.1" → (name, None) + """ + m = _VERSION_SUFFIX_RE.match(slug) + if m: + return m.group(1), m.group(2) + return slug, None + + +@router.get("/detail/{slug}") +async def get_card_by_slug( + slug: str, + version: Optional[str] = Query(None), + auth: Optional[AuthContext] = Depends(get_optional_auth), + store: SubmissionStore = Depends(get_submission_store), +): + """Resolve a frontend URL slug to a dataset card. + + Accepts two URL styles: + - /detail/{source_id}-{version} (slug format) + - /detail/{source_id}?version=X (query param format) + + The ?version query param takes precedence over a version embedded in the slug. 
+ """ + source_id, slug_version = _parse_detail_slug(slug) + version = version or slug_version + if version and version.lower() == "latest": + version = None + + record = store.get(source_id, version=version) + if not record or record.get("status") != "published": + raise HTTPException(404, "Dataset not found") + + card = build_dataset_card(record) + + try: + store.increment_counter(source_id, record["version"], "view_count") + except Exception: + logger.debug("Failed to increment view_count for %s", source_id, exc_info=True) + + return { + "success": True, + "source_id": source_id, + "version": record.get("version"), + "card": card, + "permissions": _build_permissions(auth, record), + } diff --git a/aws/v2/app/routers/curation.py b/aws/v2/app/routers/curation.py new file mode 100644 index 0000000..b1d8725 --- /dev/null +++ b/aws/v2/app/routers/curation.py @@ -0,0 +1,260 @@ +import json +import logging +from datetime import datetime, timezone +from typing import Any, Dict, Optional + +from fastapi import APIRouter, Depends, HTTPException, Query + +logger = logging.getLogger(__name__) + +from v2.async_jobs import enqueue_publish_job +from v2.app.auth import require_curator +from v2.app.deps import get_submission_store +from v2.app.models import AuthContext, CurationApproveRequest, CurationRejectRequest +from v2.email_utils import notify_submitter_rejected +from v2.metadata import parse_metadata +from v2.store import SubmissionStore +from v2.submission_utils import deep_merge, latest_version + +router = APIRouter() + + +def _resolve_submission_for_curation( + store: SubmissionStore, + source_id: str, + version: Optional[str], +) -> Dict[str, Any]: + if version: + submission = store.get_submission(source_id, version) + if not submission: + raise HTTPException(404, "Submission not found") + return submission + + versions = store.list_versions(source_id) + if not versions: + raise HTTPException(404, "Submission not found") + latest = latest_version(versions) + for item 
in versions: + if item.get("version") == latest: + return item + return versions[-1] + + +@router.get("/curation/pending") +async def list_pending( + limit: Optional[int] = Query(50), + offset: Optional[int] = Query(0), + organization: Optional[str] = Query(None), + auth: AuthContext = Depends(require_curator), + store: SubmissionStore = Depends(get_submission_store), +): + limit_val = min(int(limit or 50), 200) + offset_val = int(offset or 0) + + all_submissions = store.list_by_status(["pending_curation"], limit=1000) + + pending = [] + for sub in all_submissions: + if organization and sub.get("organization") != organization: + continue + try: + meta = parse_metadata(sub) + title = meta.title + except Exception: + title = "Untitled" + pending.append({ + "source_id": sub.get("source_id"), + "version": sub.get("version"), + "title": title, + "organization": sub.get("organization"), + "submitter": sub.get("user_id"), + "submitted_at": sub.get("created_at"), + "file_count": sub.get("file_count", 0), + "total_bytes": sub.get("total_bytes", 0), + }) + + pending.sort(key=lambda x: x.get("submitted_at", "")) + paginated = pending[offset_val:offset_val + limit_val] + + return { + "success": True, + "pending_count": len(pending), + "submissions": paginated, + "limit": limit_val, + "offset": offset_val, + } + + +@router.get("/curation/{source_id}") +async def get_curation( + source_id: str, + version: Optional[str] = Query(None), + auth: AuthContext = Depends(require_curator), + store: SubmissionStore = Depends(get_submission_store), +): + submission = _resolve_submission_for_curation(store, source_id, version) + + curation_history = submission.get("curation_history", []) + + return { + "success": True, + "submission": submission, + "curation_history": curation_history, + "current_status": submission.get("status"), + "can_approve": submission.get("status") == "pending_curation", + "can_reject": submission.get("status") == "pending_curation", + } + + 
+@router.post("/curation/{source_id}/approve") +async def approve( + source_id: str, + payload: CurationApproveRequest, + auth: AuthContext = Depends(require_curator), + store: SubmissionStore = Depends(get_submission_store), +): + submission = _resolve_submission_for_curation(store, source_id, payload.version) + version = submission.get("version") + + if submission.get("status") != "pending_curation": + raise HTTPException( + 400, + f"Submission is not pending curation (status: {submission.get('status')})", + ) + + curator_id = auth.user_id + now = datetime.now(timezone.utc).isoformat() + + curation_record = { + "action": "approved", + "curator_id": curator_id, + "timestamp": now, + "notes": payload.notes or "", + } + + curation_history = submission.get("curation_history") or [] + if isinstance(curation_history, str): + try: + curation_history = json.loads(curation_history) + except Exception: + curation_history = [] + if not isinstance(curation_history, list): + curation_history = [] + curation_history.append(curation_record) + + if payload.metadata_updates: + existing_metadata = submission.get("dataset_mdata", {}) + if isinstance(existing_metadata, str): + try: + existing_metadata = json.loads(existing_metadata) + except Exception: + existing_metadata = {} + # Deep merge metadata updates into existing flat metadata + deep_merge(existing_metadata, payload.metadata_updates) + submission["dataset_mdata"] = existing_metadata + + submission["status"] = "approved" + submission["curation_history"] = curation_history + submission["approved_at"] = now + submission["approved_by"] = curator_id + submission["updated_at"] = now + + store.upsert_submission(submission) + + logger.info("Submission approved source_id=%s version=%s by=%s", source_id, version, curator_id) + + result = { + "success": True, + "source_id": source_id, + "version": version, + "status": "approved", + "approved_by": curator_id, + "approved_at": now, + } + + # Always trigger publish pipeline (search 
ingest + status update); + # mint_doi flag only controls the DOI step + publish_job = enqueue_publish_job(source_id, version, mint_doi=payload.mint_doi) + result["publish_job"] = publish_job + if not publish_job.get("queued"): + publish_result = publish_job.get("result", {}) + if publish_result.get("doi", {}).get("success"): + result["doi"] = publish_result["doi"] + if publish_result.get("status") == "published": + result["status"] = "published" + # Refresh to get latest state after inline publish + refreshed = store.get_submission(source_id, version) + if refreshed: + result["status"] = refreshed.get("status", result["status"]) + + return result + + +@router.post("/curation/{source_id}/reject") +async def reject( + source_id: str, + payload: CurationRejectRequest, + auth: AuthContext = Depends(require_curator), + store: SubmissionStore = Depends(get_submission_store), +): + reason = payload.reason.strip() + if not reason: + raise HTTPException(400, "reason is required for rejection") + + submission = _resolve_submission_for_curation(store, source_id, payload.version) + version = submission.get("version") + + if submission.get("status") != "pending_curation": + raise HTTPException( + 400, + f"Submission is not pending curation (status: {submission.get('status')})", + ) + + curator_id = auth.user_id + now = datetime.now(timezone.utc).isoformat() + + curation_record = { + "action": "rejected", + "curator_id": curator_id, + "timestamp": now, + "reason": reason, + "suggestions": payload.suggestions or "", + } + + curation_history = submission.get("curation_history") or [] + if isinstance(curation_history, str): + try: + curation_history = json.loads(curation_history) + except Exception: + curation_history = [] + if not isinstance(curation_history, list): + curation_history = [] + curation_history.append(curation_record) + + submission["status"] = "rejected" + submission["curation_history"] = curation_history + submission["rejected_at"] = now + submission["rejected_by"] 
= curator_id + submission["rejection_reason"] = reason + submission["updated_at"] = now + + store.upsert_submission(submission) + + logger.info("Submission rejected source_id=%s version=%s by=%s reason=%s", source_id, version, curator_id, reason) + + try: + notify_submitter_rejected(submission, reason, payload.suggestions or "") + except Exception: + logger.warning("Failed to send rejection email for %s", source_id, exc_info=True) + + return { + "success": True, + "source_id": source_id, + "version": version, + "status": "rejected", + "rejected_by": curator_id, + "rejected_at": now, + "reason": reason, + } + + diff --git a/aws/v2/app/routers/files.py b/aws/v2/app/routers/files.py new file mode 100644 index 0000000..fec3fe0 --- /dev/null +++ b/aws/v2/app/routers/files.py @@ -0,0 +1,286 @@ +import base64 +import json +import logging +from datetime import datetime, timezone +from typing import Any, Dict, List, Optional + +from fastapi import APIRouter, Depends, Header, HTTPException + +logger = logging.getLogger(__name__) + +from v2.app.auth import ensure_stream_owner_or_curator, get_auth +from v2.app.deps import get_storage, get_stream_store_dep +from v2.app.models import ( + AuthContext, + ConfirmUploadRequest, + DownloadUrlRequest, + FileUploadRequest, + UploadUrlRequest, +) +from v2.storage import StorageBackend +from v2.stream_store import StreamStore + +router = APIRouter() + + +def _path_belongs_to_stream(storage: StorageBackend, stream_id: str, path: str) -> bool: + normalized = (path or "").replace("\\", "/").strip() + if not normalized: + return False + if ".." 
in normalized.split("/"): + return False + if storage.backend_name == "local": + return normalized.startswith(f"streams/{stream_id}/") + if storage.backend_name == "globus": + return normalized.startswith(f"{stream_id}_") + return True + + +@router.post("/stream/{stream_id}/upload") +async def upload_files( + stream_id: str, + payload: FileUploadRequest, + auth: AuthContext = Depends(get_auth), + stream_store: StreamStore = Depends(get_stream_store_dep), + storage: StorageBackend = Depends(get_storage), + x_globus_token: Optional[str] = Header(None), +): + # Check stream exists and is open + stream = stream_store.get_stream(stream_id) + if not stream: + raise HTTPException(400, "Stream not found") + ensure_stream_owner_or_curator(auth, stream) + if stream.get("status") != "open": + raise HTTPException(400, "Stream is not open for uploads") + + # Get user token for Globus backend + logger.info("upload_files: stream=%s, x_globus_token=%s", stream_id, "present" if x_globus_token else "NONE") + user_token = x_globus_token + if not user_token and auth.dependent_token: + dep = auth.dependent_token + if isinstance(dep, str) and dep.startswith("{"): + try: + tokens = json.loads(dep) + user_token = next(iter(tokens.values()), {}).get("access_token") if tokens else None + except Exception: + pass + elif isinstance(dep, dict) and dep: + user_token = next(iter(dep.values()), {}).get("access_token") if dep else None + + # Build file list + files_to_upload: List[Dict[str, Any]] = [] + if payload.files: + files_to_upload = [f.model_dump() for f in payload.files] + elif payload.filename: + files_to_upload = [{ + "filename": payload.filename, + "content_base64": payload.content_base64, + "content_type": payload.content_type, + "metadata": payload.metadata, + }] + else: + raise HTTPException(400, "Either 'filename' or 'files' required") + + if not files_to_upload: + raise HTTPException(400, "No files provided") + + uploaded = [] + total_bytes = 0 + errors = [] + + for file_data in 
files_to_upload: + filename = file_data.get("filename") + content_b64 = file_data.get("content_base64") + content_type = file_data.get("content_type", "application/octet-stream") + metadata = file_data.get("metadata", {}) + + if not filename: + errors.append({"error": "filename required"}) + continue + if not content_b64: + errors.append({"filename": filename, "error": "content_base64 required"}) + continue + + try: + content = base64.b64decode(content_b64) + except Exception as e: + errors.append({"filename": filename, "error": f"Invalid base64: {e}"}) + continue + + try: + store_kwargs = { + "stream_id": stream_id, + "filename": filename, + "content": content, + "content_type": content_type, + "metadata": metadata, + } + if user_token: + store_kwargs["user_token"] = user_token + + file_meta = storage.store_file(**store_kwargs) + uploaded.append(file_meta.to_dict()) + total_bytes += file_meta.size_bytes + except Exception as e: + errors.append({"filename": filename, "error": str(e)}) + + if not uploaded and errors: + raise HTTPException(400, f"All uploads failed: {errors}") + + # Update stream with new file count and bytes + last_file = uploaded[-1] if uploaded else None + stream_store.append_stream( + stream_id=stream_id, + file_count=len(uploaded), + total_bytes=total_bytes, + last_file=last_file, + ) + + return { + "success": True, + "stream_id": stream_id, + "storage_backend": storage.backend_name, + "uploaded": len(uploaded), + "total_bytes": total_bytes, + "files": uploaded, + "errors": errors if errors else None, + } + + +@router.post("/stream/{stream_id}/upload-url") +async def get_upload_url( + stream_id: str, + payload: UploadUrlRequest, + auth: AuthContext = Depends(get_auth), + stream_store: StreamStore = Depends(get_stream_store_dep), + storage: StorageBackend = Depends(get_storage), +): + stream = stream_store.get_stream(stream_id) + if not stream: + raise HTTPException(400, "Stream not found") + ensure_stream_owner_or_curator(auth, stream) + if 
stream.get("status") != "open": + raise HTTPException(400, "Stream is not open for uploads") + + expires_in = min(payload.expires_in or 3600, 86400) + + try: + upload_info = storage.get_upload_url( + stream_id=stream_id, + filename=payload.filename, + content_type=payload.content_type, + expires_in=expires_in, + ) + except ValueError as exc: + raise HTTPException(400, str(exc)) + + if not upload_info: + raise HTTPException(400, "This storage backend does not support direct uploads") + + return { + "success": True, + "stream_id": stream_id, + "storage_backend": storage.backend_name, + **upload_info, + } + + +@router.post("/stream/{stream_id}/upload-confirm") +async def confirm_upload( + stream_id: str, + payload: ConfirmUploadRequest, + auth: AuthContext = Depends(get_auth), + stream_store: StreamStore = Depends(get_stream_store_dep), + storage: StorageBackend = Depends(get_storage), +): + stream = stream_store.get_stream(stream_id) + if not stream: + raise HTTPException(400, "Stream not found") + ensure_stream_owner_or_curator(auth, stream) + if stream.get("status") != "open": + raise HTTPException(400, "Stream is not open for uploads") + if not _path_belongs_to_stream(storage, stream_id, payload.path): + raise HTTPException(400, "Upload path does not belong to this stream") + # Ensure the external upload is present before mutating stream accounting. 
+ if storage.get_file(payload.path) is None: + raise HTTPException(400, "Uploaded file path does not exist") + + filename = payload.path.split("/")[-1] + + file_info = { + "filename": filename, + "path": payload.path, + "size_bytes": payload.size_bytes, + "checksum_md5": payload.checksum_md5, + "stored_at": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"), + "storage_backend": storage.backend_name, + "metadata": payload.metadata, + } + + stream_store.append_stream( + stream_id=stream_id, + file_count=1, + total_bytes=payload.size_bytes or 0, + last_file=file_info, + ) + + return { + "success": True, + "stream_id": stream_id, + "file": file_info, + } + + +@router.post("/stream/{stream_id}/download-url") +async def get_download_url( + stream_id: str, + payload: DownloadUrlRequest, + auth: AuthContext = Depends(get_auth), + stream_store: StreamStore = Depends(get_stream_store_dep), + storage: StorageBackend = Depends(get_storage), +): + path = payload.path if payload else None + + if not path: + raise HTTPException(400, "Could not determine file path") + stream = stream_store.get_stream(stream_id) + if not stream: + raise HTTPException(400, "Stream not found") + ensure_stream_owner_or_curator(auth, stream) + if not _path_belongs_to_stream(storage, stream_id, path): + raise HTTPException(400, "File path does not belong to this stream") + + download_url = storage.get_download_url(path) + if not download_url: + raise HTTPException(400, "File not found or download not available") + + return { + "success": True, + "stream_id": stream_id, + "path": path, + "download_url": download_url, + "storage_backend": storage.backend_name, + } + + +@router.get("/stream/{stream_id}/files") +async def list_files( + stream_id: str, + auth: AuthContext = Depends(get_auth), + stream_store: StreamStore = Depends(get_stream_store_dep), + storage: StorageBackend = Depends(get_storage), +): + stream = stream_store.get_stream(stream_id) + if not stream: + raise HTTPException(400, 
"Stream not found") + ensure_stream_owner_or_curator(auth, stream) + + files = storage.list_files(stream_id) + + return { + "success": True, + "stream_id": stream_id, + "storage_backend": storage.backend_name, + "file_count": len(files), + "files": [f.to_dict() for f in files], + } diff --git a/aws/v2/app/routers/preview.py b/aws/v2/app/routers/preview.py new file mode 100644 index 0000000..f8fb514 --- /dev/null +++ b/aws/v2/app/routers/preview.py @@ -0,0 +1,212 @@ +import json +from typing import Optional + +from fastapi import APIRouter, Depends, HTTPException, Query + +from v2.app.auth import ensure_stream_owner_or_curator, get_auth +from v2.app.deps import get_storage, get_stream_store_dep, get_submission_store +from v2.app.models import AuthContext +from v2.preview import generate_preview +from v2.storage import StorageBackend +from v2.store import SubmissionStore +from v2.stream_store import StreamStore + +router = APIRouter() + + +# ── Stream-level preview (existing) ──────────────────────────────── + +@router.get("/stream/{stream_id}/preview") +async def preview_stream( + stream_id: str, + auth: AuthContext = Depends(get_auth), + stream_store: StreamStore = Depends(get_stream_store_dep), + storage: StorageBackend = Depends(get_storage), +): + stream = stream_store.get_stream(stream_id) + if not stream: + raise HTTPException(400, "Stream not found") + ensure_stream_owner_or_curator(auth, stream) + + files = storage.list_files(stream_id) + + previews = [] + for f in files: + content = storage.get_file(f.path) + if content: + preview = generate_preview(content, f.filename) + previews.append({ + "filename": f.filename, + "path": f.path, + "size_bytes": f.size_bytes, + "preview": preview, + }) + + return { + "success": True, + "stream_id": stream_id, + "file_count": len(previews), + "previews": previews, + } + + +@router.get("/stream/{stream_id}/files/{filename}/preview") +async def preview_file( + stream_id: str, + filename: str, + max_rows: Optional[int] = 
Query(20), + max_lines: Optional[int] = Query(50), + auth: AuthContext = Depends(get_auth), + stream_store: StreamStore = Depends(get_stream_store_dep), + storage: StorageBackend = Depends(get_storage), +): + stream = stream_store.get_stream(stream_id) + if not stream: + raise HTTPException(400, "Stream not found") + ensure_stream_owner_or_curator(auth, stream) + + # Try to find the file + files = storage.list_files(stream_id) + file_meta = None + for f in files: + if f.filename == filename or f.path.endswith(filename): + file_meta = f + break + + if not file_meta: + path = f"streams/{stream_id}/{filename}" + content = storage.get_file(path) + if content is None: + raise HTTPException(400, f"File not found: {filename}") + else: + content = storage.get_file(file_meta.path) + if content is None: + raise HTTPException(400, f"Could not read file: {filename}") + + preview = generate_preview( + content=content, + filename=filename, + max_rows=max_rows or 20, + max_lines=max_lines or 50, + ) + + return { + "success": True, + "stream_id": stream_id, + "filename": filename, + "preview": preview, + } + + +# ── Dataset-level preview (new) ──────────────────────────────────── + +def _get_profile(source_id: str, store: SubmissionStore) -> Optional[dict]: + """Load the stored DatasetProfile for a source_id.""" + record = store.get(source_id) + if not record: + return None + profile = record.get("dataset_profile") + if profile is None: + return None + if isinstance(profile, str): + try: + profile = json.loads(profile) + except Exception: + return None + return profile + + +@router.get("/preview/{source_id}") +async def dataset_preview( + source_id: str, + store: SubmissionStore = Depends(get_submission_store), +): + """Return the stored DatasetProfile for a dataset.""" + profile = _get_profile(source_id, store) + if not profile: + raise HTTPException(404, "No profile found for this dataset") + return {"success": True, "profile": profile} + + 
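The `_get_profile` helper in this hunk accepts a `dataset_profile` stored either as a dict or as a JSON string. The following is a minimal standalone sketch of that normalization, not the code from the PR: the name `normalize_profile` is hypothetical, and it takes a plain dict record instead of looking one up through `SubmissionStore`.

```python
import json
from typing import Any, Optional


def normalize_profile(record: Optional[dict]) -> Optional[dict]:
    """Return the dataset profile from a record, decoding JSON strings.

    Hypothetical helper for illustration only. It mirrors the checks in
    _get_profile but receives the record directly rather than a store.
    """
    if not record:
        return None
    profile: Any = record.get("dataset_profile")
    if profile is None:
        return None
    if isinstance(profile, str):
        try:
            profile = json.loads(profile)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return None
    return profile
```

In the router, a falsy result from this lookup is what maps to the 404 "No profile found for this dataset" response.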
+@router.get("/preview/{source_id}/files") +async def dataset_files( + source_id: str, + store: SubmissionStore = Depends(get_submission_store), +): + """List all files in the dataset with metadata.""" + profile = _get_profile(source_id, store) + if not profile: + raise HTTPException(404, "No profile found for this dataset") + + files = [] + for fp in profile.get("files", []): + files.append({ + "path": fp.get("path"), + "filename": fp.get("filename"), + "size_bytes": fp.get("size_bytes"), + "content_type": fp.get("content_type"), + "format": fp.get("format"), + }) + + return {"success": True, "source_id": source_id, "files": files} + + +@router.get("/preview/{source_id}/files/{path:path}") +async def dataset_file_detail( + source_id: str, + path: str, + store: SubmissionStore = Depends(get_submission_store), +): + """Get detailed profile of a specific file in the dataset.""" + profile = _get_profile(source_id, store) + if not profile: + raise HTTPException(404, "No profile found for this dataset") + + for fp in profile.get("files", []): + if fp.get("path") == path or fp.get("filename") == path: + return {"success": True, "source_id": source_id, "file": fp} + + raise HTTPException(404, f"File not found in profile: {path}") + + +@router.get("/preview/{source_id}/sample") +async def dataset_sample( + source_id: str, + store: SubmissionStore = Depends(get_submission_store), +): + """Quick sample data from the first tabular file in the dataset.""" + profile = _get_profile(source_id, store) + if not profile: + raise HTTPException(404, "No profile found for this dataset") + + # Find the first file with sample_rows + for fp in profile.get("files", []): + sample_rows = fp.get("sample_rows", []) + if sample_rows: + return { + "success": True, + "source_id": source_id, + "filename": fp.get("filename"), + "format": fp.get("format"), + "columns": fp.get("columns", []), + "n_rows": fp.get("n_rows"), + "sample_rows": sample_rows, + } + + # Fallback: return preview_lines from 
first text file + for fp in profile.get("files", []): + preview_lines = fp.get("preview_lines", []) + if preview_lines: + return { + "success": True, + "source_id": source_id, + "filename": fp.get("filename"), + "format": fp.get("format"), + "preview_lines": preview_lines, + } + + return { + "success": True, + "source_id": source_id, + "message": "No previewable data found", + } diff --git a/aws/v2/app/routers/search.py b/aws/v2/app/routers/search.py new file mode 100644 index 0000000..7254b08 --- /dev/null +++ b/aws/v2/app/routers/search.py @@ -0,0 +1,105 @@ +import logging +import os +from typing import Dict, List, Optional + +from fastapi import APIRouter, HTTPException, Query + +logger = logging.getLogger(__name__) + +from v2.search import search_all + +router = APIRouter() + +# Maps query param names to Globus Search field names +FILTER_FIELD_MAP = { + "year": "dc.year", + "organization": "mdf.organization", + "author": "dc.creators.name", + "keyword": "dc.subjects", + "domain": "mdf.domains", +} + + +def _env_int(name: str, default: int, minimum: int = 1) -> int: + value = os.environ.get(name) + if value is None: + return default + try: + parsed = int(value) + except (TypeError, ValueError): + return default + return max(minimum, parsed) + + +MAX_SEARCH_RESULTS = _env_int("SEARCH_MAX_RESULTS", 50) + + +def _parse_filters( + year: Optional[str], + organization: Optional[str], + author: Optional[str], + keyword: Optional[str], + domain: Optional[str], +) -> Optional[Dict[str, List[str]]]: + """Parse comma-separated filter query params into a filters dict.""" + raw = { + "year": year, + "organization": organization, + "author": author, + "keyword": keyword, + "domain": domain, + } + filters = {} + for param, value in raw.items(): + if value: + values = [v.strip() for v in value.split(",") if v.strip()] + if values: + filters[FILTER_FIELD_MAP[param]] = values + return filters or None + + +@router.get("/search") +async def search_endpoint( + q: Optional[str] = 
Query(None), + query: Optional[str] = Query(None), + type: Optional[str] = Query("all"), + limit: Optional[int] = Query(20), + offset: Optional[int] = Query(0), + year: Optional[str] = Query(None), + organization: Optional[str] = Query(None), + author: Optional[str] = Query(None), + keyword: Optional[str] = Query(None), + domain: Optional[str] = Query(None), +): + search_query = q or query + if not search_query: + raise HTTPException(400, "query (q) is required") + + search_type = type or "all" + include_datasets = search_type in ("all", "datasets", "dataset") + include_streams = search_type in ("all", "streams", "stream") + + try: + limit_val = int(limit) if limit else 20 + except ValueError: + limit_val = 20 + limit_val = max(1, min(limit_val, MAX_SEARCH_RESULTS)) + + try: + offset_val = int(offset) if offset else 0 + except ValueError: + offset_val = 0 + offset_val = max(0, offset_val) + + filters = _parse_filters(year, organization, author, keyword, domain) + + results = search_all( + query=search_query, + include_datasets=include_datasets, + include_streams=include_streams, + limit=limit_val, + offset=offset_val, + filters=filters, + ) + + return results diff --git a/aws/v2/app/routers/streams.py b/aws/v2/app/routers/streams.py new file mode 100644 index 0000000..903e2f4 --- /dev/null +++ b/aws/v2/app/routers/streams.py @@ -0,0 +1,248 @@ +import json +import os +import uuid +from datetime import datetime, timezone +from decimal import Decimal +from typing import Any, Dict, Optional + +from fastapi import APIRouter, Depends, HTTPException + +from v2.async_jobs import enqueue_profile_job, enqueue_stream_doi_job +from v2.app.auth import ensure_stream_owner_or_curator, get_auth +from v2.app.deps import get_stream_store_dep, get_submission_store +from v2.app.models import ( + AuthContext, + StreamAppendRequest, + StreamCloseRequest, + StreamCreateRequest, + StreamSnapshotRequest, +) +from v2.doi_utils import mint_doi_for_stream +from v2.store import SubmissionStore 
+from v2.stream_store import StreamStore +from v2.submission_utils import generate_source_id, increment_version, latest_version + +router = APIRouter() + +MAX_STREAM_APPEND_COUNT = int(os.environ.get("MAX_STREAM_APPEND_COUNT", "10000")) +MAX_STREAM_APPEND_BYTES = int(os.environ.get("MAX_STREAM_APPEND_BYTES", str(50 * 1024 * 1024 * 1024))) + + +@router.post("/stream/create") +async def stream_create( + payload: StreamCreateRequest, + auth: AuthContext = Depends(get_auth), + store: StreamStore = Depends(get_stream_store_dep), +): + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + stream_id = payload.stream_id or f"stream-{uuid.uuid4().hex}" + + record = { + "stream_id": stream_id, + "lab_id": payload.lab_id, + "title": payload.title, + "status": "open", + "file_count": 0, + "total_bytes": 0, + "last_append_at": None, + "created_at": now, + "updated_at": now, + "user_id": auth.user_id, + "organization": payload.organization, + "metadata": payload.metadata, + } + + store.create_stream(record) + return {"success": True, "stream_id": stream_id, "stream": record} + + +@router.post("/stream/{stream_id}/append") +async def stream_append( + stream_id: str, + payload: StreamAppendRequest, + auth: AuthContext = Depends(get_auth), + store: StreamStore = Depends(get_stream_store_dep), +): + stream = store.get_stream(stream_id) + if not stream: + raise HTTPException(400, "Stream not found") + ensure_stream_owner_or_curator(auth, stream) + + # Handle files list or direct file_count/total_bytes + if payload.files: + file_count = len(payload.files) + total_bytes = 0 + for entry in payload.files: + try: + total_bytes += int(entry.size or 0) + except Exception: + continue + last_file = payload.files[-1].model_dump() if payload.files else None + else: + try: + file_count = int(payload.file_count) if payload.file_count is not None else 0 + except Exception: + file_count = 0 + try: + total_bytes = int(payload.total_bytes) if payload.total_bytes is not None else 0 + 
except Exception: + total_bytes = 0 + last_file = payload.last_file + + if file_count <= 0 and total_bytes <= 0: + raise HTTPException(400, "Provide files list or file_count/total_bytes") + if file_count > MAX_STREAM_APPEND_COUNT: + raise HTTPException(413, f"file_count exceeds max {MAX_STREAM_APPEND_COUNT}") + if total_bytes > MAX_STREAM_APPEND_BYTES: + raise HTTPException(413, f"total_bytes exceeds max {MAX_STREAM_APPEND_BYTES}") + + updated = store.append_stream(stream_id, file_count, total_bytes, last_file=last_file) + if not updated: + raise HTTPException(400, "Stream not found") + + return {"success": True, "stream": updated} + + +@router.get("/stream/{stream_id}") +async def stream_status( + stream_id: str, + auth: AuthContext = Depends(get_auth), + store: StreamStore = Depends(get_stream_store_dep), +): + record = store.get_stream(stream_id) + if not record: + return {"success": False, "error": "Stream not found"} + ensure_stream_owner_or_curator(auth, record) + return {"success": True, "stream": record} + + +@router.post("/stream/{stream_id}/close") +async def stream_close( + stream_id: str, + payload: StreamCloseRequest, + auth: AuthContext = Depends(get_auth), + store: StreamStore = Depends(get_stream_store_dep), +): + stream = store.get_stream(stream_id) + if not stream: + raise HTTPException(400, "Stream not found") + ensure_stream_owner_or_curator(auth, stream) + if stream.get("status") == "closed": + raise HTTPException(400, "Stream is already closed") + + mint_doi = payload.mint_doi or False + + record = store.close_stream(stream_id) + if not record: + raise HTTPException(500, "Failed to close stream") + + result = { + "success": True, + "stream_id": stream_id, + "status": "closed", + "stream": record, + } + + if mint_doi: + doi_job = enqueue_stream_doi_job(stream_id, payload.model_dump()) + result["doi_job"] = doi_job + if not doi_job.get("queued"): + result["doi"] = doi_job.get("result") + refreshed_stream = store.get_stream(stream_id) + if 
refreshed_stream: + result["stream"] = refreshed_stream + + return result + + +@router.post("/stream/{stream_id}/snapshot") +async def stream_snapshot( + stream_id: str, + payload: StreamSnapshotRequest, + auth: AuthContext = Depends(get_auth), + stream_store: StreamStore = Depends(get_stream_store_dep), + sub_store: SubmissionStore = Depends(get_submission_store), +): + stream = stream_store.get_stream(stream_id) + if not stream: + raise HTTPException(400, "Stream not found") + ensure_stream_owner_or_curator(auth, stream) + + stream_meta = stream.get("metadata") or {} + if isinstance(stream_meta, str): + try: + stream_meta = json.loads(stream_meta) + except Exception: + stream_meta = {} + + user_id = stream.get("user_id") or auth.user_id + title = payload.title or stream.get("title") or f"Stream {stream_id}" + + source_id = payload.source_id or stream_id + update = bool(payload.update) + + existing_versions = sub_store.list_versions(source_id) + + if update and not existing_versions: + raise HTTPException(400, "Update requested but no prior submission found") + + if not update and existing_versions: + source_id = generate_source_id(prefix=source_id) + existing_versions = [] + + latest = latest_version(existing_versions) + version = increment_version(latest) if update else "1.0" + versioned_source_id = f"{source_id}-{version}" + + # Build flat v2 metadata + operator = stream_meta.get("operator") if isinstance(stream_meta, dict) else None + dataset = { + "title": title, + "authors": [{"name": payload.author or operator or user_id}], + "description": payload.description or f"Stream snapshot: {stream_id}", + "data_sources": payload.data_sources or [f"stream://{stream_id}"], + "organization": stream.get("organization"), + "tags": stream_meta.get("tags", []) if isinstance(stream_meta, dict) else [], + "test": payload.test or False, + "update": update, + "extensions": { + "stream": { + "stream_id": stream_id, + "file_count": stream.get("file_count"), + "total_bytes": 
stream.get("total_bytes"), + } + }, + } + + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + record = { + "source_id": source_id, + "version": version, + "versioned_source_id": versioned_source_id, + "user_id": user_id, + "user_email": None, + "organization": stream.get("organization"), + "status": "pending_curation", + "dataset_mdata": json.dumps(dataset, default=lambda o: int(o) if isinstance(o, Decimal) else str(o)), + "test": dataset.get("test", False), + "created_at": now, + "updated_at": now, + } + + sub_store.put_submission(record) + + profile_job = enqueue_profile_job(source_id, version, stream_id) + + return { + "success": True, + "source_id": source_id, + "version": version, + "versioned_source_id": versioned_source_id, + "stream_id": stream_id, + "profile_job": profile_job, + } + + +def _mint_doi_for_stream(stream: Dict, overrides: Dict) -> Dict: + # Kept for backwards compatibility with existing imports. + return mint_doi_for_stream(stream, overrides) diff --git a/aws/v2/app/routers/submissions.py b/aws/v2/app/routers/submissions.py new file mode 100644 index 0000000..7751347 --- /dev/null +++ b/aws/v2/app/routers/submissions.py @@ -0,0 +1,1036 @@ +import json +import logging +import os +import re +import uuid +from datetime import datetime, timezone +from typing import Any, Dict, List, Optional +from urllib.parse import urlparse + +from fastapi import APIRouter, Depends, HTTPException, Query + +from v2.async_jobs import enqueue_profile_job, enqueue_publish_job, enqueue_transfer_job +from v2.transfer import check_transfer_status, cleanup_transfer_acl +from v2.app.auth import ( + ensure_submission_owner_or_curator, + get_auth, + get_optional_auth, + is_curator, + require_curator, + require_submitter, +) +from v2.app.deps import get_submission_store +from v2.app.models import ( + AuthContext, + DeleteSubmissionRequest, + MetadataEditRequest, + ResubmitRequest, + StatusUpdateRequest, + WithdrawRequest, +) +from v2.config import 
DEFAULT_ORGANIZATION +from v2.email_utils import notify_curators_new_submission +from v2.metadata import DatasetMetadata, migrate_v1_payload +from v2.store import SubmissionStore, parse_pagination_key, serialize_pagination_key +from v2.submission_utils import deep_merge, generate_source_id, increment_version, latest_version + +logger = logging.getLogger(__name__) + +router = APIRouter() + +MAX_SUBMIT_METADATA_BYTES = int(os.environ.get("MAX_SUBMIT_METADATA_BYTES", "262144")) +MAX_SUBMIT_DATA_SOURCES = int(os.environ.get("MAX_SUBMIT_DATA_SOURCES", "2000")) +MAX_SUBMIT_AUTHORS = int(os.environ.get("MAX_SUBMIT_AUTHORS", "1000")) + + +_UUID_RE = re.compile( + r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.IGNORECASE +) + + +def _validate_data_sources(data_sources: List[str]) -> List[str]: + """Validate data source URL formats. Returns list of error messages.""" + errors: List[str] = [] + for i, src in enumerate(data_sources): + if src.startswith("globus://"): + # Must have UUID-like segment and non-empty path + rest = src[len("globus://"):] + slash_idx = rest.find("/") + if slash_idx <= 0: + errors.append(f"data_sources[{i}]: globus:// URI missing path: {src}") + continue + collection_id = rest[:slash_idx] + path = rest[slash_idx:] + if not _UUID_RE.match(collection_id): + errors.append(f"data_sources[{i}]: globus:// URI has invalid collection UUID: {collection_id}") + if not path or path == "/": + errors.append(f"data_sources[{i}]: globus:// URI has empty path") + elif src.startswith("https://") or src.startswith("http://"): + parsed = urlparse(src) + if not parsed.hostname: + errors.append(f"data_sources[{i}]: malformed URL (no hostname): {src}") + elif src.startswith("stream://"): + stream_id = src[len("stream://"):] + if not stream_id.strip(): + errors.append(f"data_sources[{i}]: stream:// URI has empty ID") + # Other formats (absolute paths, etc.) 
pass through + return errors + + +def _is_v1_payload(metadata: dict) -> bool: + """Detect old dc/mdf/custom format.""" + dc = metadata.get("dc") + return isinstance(dc, dict) and ("titles" in dc or "creators" in dc) + + +def _source_id_from_metadata(metadata: Dict[str, Any]) -> Optional[str]: + """Extract source_id from either v1 or v2 payload.""" + # v1 format + mdf = metadata.get("mdf", {}) + sid = mdf.get("source_id") or mdf.get("source_name") + if sid: + return sid + # v2 format: check extensions + ext = metadata.get("extensions", {}) + return ext.get("mdf_source_id") or ext.get("mdf_source_name") + + +def _normalize_record(record: Dict[str, Any]) -> Dict[str, Any]: + if not record: + return {} + if "dataset_mdata" in record and isinstance(record["dataset_mdata"], str): + try: + record["dataset_mdata"] = json.loads(record["dataset_mdata"]) + except Exception: + pass + return record + + +def _extract_dependent_transfer_token(auth) -> Optional[str]: + """Extract the user's Globus Transfer token from dependent tokens. + + The server performs a dependent token exchange on the user's auth token + to obtain tokens for downstream services (groups, transfer, etc.). 
+ """ + dep = auth.dependent_token + if not dep: + return None + + # dependent_token is a dict keyed by resource server + transfer_entry = dep.get("transfer.api.globus.org") + if not transfer_entry: + return None + + if isinstance(transfer_entry, dict): + return transfer_entry.get("access_token") + return getattr(transfer_entry, "access_token", None) + + +def _flip_latest_on_prior(store: SubmissionStore, prior_record: Dict[str, Any]) -> None: + """Set latest=false in the prior version's metadata.""" + mdata = prior_record.get("dataset_mdata") + if isinstance(mdata, str): + try: + mdata = json.loads(mdata) + except Exception: + return + if not isinstance(mdata, dict): + return + mdata["latest"] = False + prior_record["dataset_mdata"] = json.dumps(mdata) + prior_record["updated_at"] = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + store.upsert_submission(prior_record) + + +def _resolve_submission( + store: SubmissionStore, + source_id: str, + version: Optional[str], +) -> Dict[str, Any]: + """Resolve a submission by source_id and optional version (defaults to latest).""" + if version: + record = store.get_submission(source_id, version) + if not record: + raise HTTPException(404, "Submission not found") + return record + versions = store.list_versions(source_id) + if not versions: + raise HTTPException(404, "Submission not found") + latest_ver = latest_version(versions) + for item in versions: + if item.get("version") == latest_ver: + return item + return versions[-1] + + +def _parse_mdata(record: Dict[str, Any]) -> dict: + """Extract dataset_mdata as a new dict from a submission record. + + Always returns a deep copy so callers can mutate without affecting the + original record (important since SQLite _row_to_dict auto-parses JSON). 
+ """ + import copy + mdata = record.get("dataset_mdata", {}) + if isinstance(mdata, str): + try: + mdata = json.loads(mdata) + except Exception: + mdata = {} + if not isinstance(mdata, dict): + return {} + return copy.deepcopy(mdata) + + +@router.post("/submissions/{source_id}/metadata") +async def edit_metadata( + source_id: str, + payload: MetadataEditRequest, + auth: AuthContext = Depends(get_auth), + store: SubmissionStore = Depends(get_submission_store), +): + submission = _resolve_submission(store, source_id, payload.version) + ensure_submission_owner_or_curator(auth, submission) + + status = submission.get("status") + if status not in ("pending_curation", "rejected", "published"): + raise HTTPException(400, f"Cannot edit metadata when status is '{status}'") + + # Build updates from non-None payload fields (excluding version) + editable_fields = [ + "title", "authors", "description", "keywords", "license", "funding", + "related_works", "methods", "facility", "fields_of_science", "domains", + "ml", "geo_locations", "tags", "extensions", + ] + updates = {} + for field in editable_fields: + value = getattr(payload, field) + if value is not None: + updates[field] = value + + if not updates: + raise HTTPException(400, "No metadata fields provided") + + existing_mdata = _parse_mdata(submission) + deep_merge(existing_mdata, updates) + + # Re-validate through Pydantic to catch schema errors + try: + DatasetMetadata.model_validate(existing_mdata) + except Exception as exc: + raise HTTPException(400, f"Invalid metadata after edit: {exc}") + + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + version = submission.get("version") + + if status == "published": + # Auto-create a minor version bump + new_version = increment_version(version, major=False) + new_versioned_id = "{}-{}".format(source_id, new_version) + + existing_mdata["version"] = new_version + existing_mdata["latest"] = True + existing_mdata["previous_version"] = "{}-{}".format(source_id, 
version) + existing_mdata["root_version"] = existing_mdata.get("root_version") or "{}-{}".format(source_id, "1.0") + + new_record = { + "source_id": source_id, + "version": new_version, + "versioned_source_id": new_versioned_id, + "user_id": auth.user_id, + "user_email": auth.user_email, + "organization": submission.get("organization"), + "status": "published", + "dataset_mdata": json.dumps(existing_mdata), + "schema_version": submission.get("schema_version", "2"), + "test": submission.get("test", False), + "created_at": now, + "updated_at": now, + "published_at": now, + } + + # Inherit dataset_doi + dataset_doi = submission.get("dataset_doi") or submission.get("doi") + if dataset_doi: + new_record["dataset_doi"] = dataset_doi + + # Flip latest=False on the prior version + _flip_latest_on_prior(store, submission) + + store.put_submission(new_record) + + # Enqueue publish job to update DataCite metadata and re-index in search + try: + enqueue_publish_job(source_id, new_version, mint_doi=False) + except Exception: + logger.warning("Failed to enqueue publish job for metadata edit %s v%s", source_id, new_version, exc_info=True) + + return { + "success": True, + "source_id": source_id, + "version": version, + "new_version": new_version, + "updated_fields": list(updates.keys()), + } + + # pending_curation or rejected: update in-place + submission["dataset_mdata"] = json.dumps(existing_mdata) + submission["updated_at"] = now + store.upsert_submission(submission) + + return { + "success": True, + "source_id": source_id, + "version": version, + "updated_fields": list(updates.keys()), + } + + +@router.post("/submissions/{source_id}/withdraw") +async def withdraw( + source_id: str, + payload: WithdrawRequest, + auth: AuthContext = Depends(get_auth), + store: SubmissionStore = Depends(get_submission_store), +): + submission = _resolve_submission(store, source_id, payload.version) + ensure_submission_owner_or_curator(auth, submission) + + status = submission.get("status") + if 
status != "pending_curation": + raise HTTPException(400, f"Can only withdraw submissions with status 'pending_curation', current status is '{status}'") + + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + version = submission.get("version") + + curation_history = submission.get("curation_history") or [] + if isinstance(curation_history, str): + try: + curation_history = json.loads(curation_history) + except Exception: + curation_history = [] + if not isinstance(curation_history, list): + curation_history = [] + curation_history.append({ + "action": "withdrawn", + "user_id": auth.user_id, + "timestamp": now, + "reason": payload.reason or "", + }) + + submission["status"] = "withdrawn" + submission["curation_history"] = curation_history + submission["updated_at"] = now + store.upsert_submission(submission) + + # If this was latest=True, restore latest on prior version + mdata = _parse_mdata(submission) + if mdata.get("latest") and mdata.get("previous_version"): + prev_vsid = mdata["previous_version"] + # Extract version from "source_id-version" + parts = prev_vsid.rsplit("-", 1) + if len(parts) == 2: + prev_version = parts[1] + prev_record = store.get_submission(source_id, prev_version) + if prev_record: + prev_mdata = _parse_mdata(prev_record) + prev_mdata["latest"] = True + prev_record["dataset_mdata"] = json.dumps(prev_mdata) + prev_record["updated_at"] = now + store.upsert_submission(prev_record) + + return { + "success": True, + "source_id": source_id, + "version": version, + "status": "withdrawn", + } + + +@router.post("/submissions/{source_id}/resubmit") +async def resubmit( + source_id: str, + payload: ResubmitRequest, + auth: AuthContext = Depends(get_auth), + store: SubmissionStore = Depends(get_submission_store), +): + submission = _resolve_submission(store, source_id, payload.version) + ensure_submission_owner_or_curator(auth, submission) + + status = submission.get("status") + if status != "rejected": + raise HTTPException(400, f"Can 
only resubmit submissions with status 'rejected', current status is '{status}'") + + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + version = submission.get("version") + + curation_history = submission.get("curation_history") or [] + if isinstance(curation_history, str): + try: + curation_history = json.loads(curation_history) + except Exception: + curation_history = [] + if not isinstance(curation_history, list): + curation_history = [] + curation_history.append({ + "action": "resubmitted", + "user_id": auth.user_id, + "timestamp": now, + "notes": payload.notes or "", + }) + + submission["status"] = "pending_curation" + submission["curation_history"] = curation_history + submission["updated_at"] = now + store.upsert_submission(submission) + + return { + "success": True, + "source_id": source_id, + "version": version, + "status": "pending_curation", + } + + +@router.post("/submissions/{source_id}/delete") +async def delete_submission( + source_id: str, + payload: DeleteSubmissionRequest, + auth: AuthContext = Depends(require_curator), + store: SubmissionStore = Depends(get_submission_store), +): + submission = _resolve_submission(store, source_id, payload.version) + version = submission.get("version") + + if submission.get("status") == "deleted": + raise HTTPException(400, "Submission is already deleted") + + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + + curation_history = submission.get("curation_history") or [] + if isinstance(curation_history, str): + try: + curation_history = json.loads(curation_history) + except Exception: + curation_history = [] + if not isinstance(curation_history, list): + curation_history = [] + curation_history.append({ + "action": "deleted", + "user_id": auth.user_id, + "timestamp": now, + "reason": payload.reason, + }) + + submission["status"] = "deleted" + submission["deleted_at"] = now + submission["deleted_by"] = auth.user_id + submission["curation_history"] = curation_history + 
submission["updated_at"] = now + store.upsert_submission(submission) + + logger.info("Submission deleted source_id=%s version=%s by=%s reason=%s", + source_id, version, auth.user_id, payload.reason) + + return { + "success": True, + "source_id": source_id, + "version": version, + "status": "deleted", + } + + +@router.get("/versions/{source_id}/diff") +async def version_diff( + source_id: str, + from_version: str = Query(..., alias="from"), + to_version: str = Query(..., alias="to"), + store: SubmissionStore = Depends(get_submission_store), +): + from_record = store.get_submission(source_id, from_version) + if not from_record: + raise HTTPException(404, f"Version {from_version} not found") + to_record = store.get_submission(source_id, to_version) + if not to_record: + raise HTTPException(404, f"Version {to_version} not found") + + from_mdata = _parse_mdata(from_record) + to_mdata = _parse_mdata(to_record) + + # Skip system/versioning fields from diff + skip_fields = {"version", "latest", "previous_version", "root_version", "update", "test"} + + all_keys = (set(from_mdata.keys()) | set(to_mdata.keys())) - skip_fields + added = {} + removed = {} + changed = {} + unchanged = [] + + for key in sorted(all_keys): + in_from = key in from_mdata + in_to = key in to_mdata + if in_to and not in_from: + added[key] = to_mdata[key] + elif in_from and not in_to: + removed[key] = from_mdata[key] + elif from_mdata[key] != to_mdata[key]: + changed[key] = {"from": from_mdata[key], "to": to_mdata[key]} + else: + unchanged.append(key) + + return { + "success": True, + "source_id": source_id, + "from_version": { + "version": from_version, + "status": from_record.get("status"), + "created_at": from_record.get("created_at"), + }, + "to_version": { + "version": to_version, + "status": to_record.get("status"), + "created_at": to_record.get("created_at"), + }, + "diff": { + "added": added, + "removed": removed, + "changed": changed, + "unchanged": unchanged, + }, + } + + 
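The comparison performed by `version_diff` can be sketched as a pure function, which may be easier to unit test than the endpoint. This is a sketch under assumptions: `diff_metadata` and `SKIP_FIELDS` are hypothetical names, and the sample records are made up for illustration.

```python
from typing import Any, Dict, List

# Versioning fields that the endpoint excludes from the comparison.
SKIP_FIELDS = {"version", "latest", "previous_version", "root_version", "update", "test"}


def diff_metadata(from_mdata: Dict[str, Any], to_mdata: Dict[str, Any]) -> Dict[str, Any]:
    """Classify each top-level key as added, removed, changed, or unchanged."""
    added: Dict[str, Any] = {}
    removed: Dict[str, Any] = {}
    changed: Dict[str, Any] = {}
    unchanged: List[str] = []
    for key in sorted((set(from_mdata) | set(to_mdata)) - SKIP_FIELDS):
        if key not in from_mdata:
            added[key] = to_mdata[key]
        elif key not in to_mdata:
            removed[key] = from_mdata[key]
        elif from_mdata[key] != to_mdata[key]:
            changed[key] = {"from": from_mdata[key], "to": to_mdata[key]}
        else:
            unchanged.append(key)
    return {"added": added, "removed": removed, "changed": changed, "unchanged": unchanged}


# Made-up sample metadata for two versions of one dataset.
old = {"title": "Alloy data", "tags": ["steel"], "license": "CC-BY", "version": "1.0"}
new = {"title": "Alloy dataset", "tags": ["steel"], "description": "Tensile tests", "version": "1.1"}
result = diff_metadata(old, new)
```

The comparison is top-level only: a change nested inside `authors` or `extensions` is reported as the whole value changing, not as a field-by-field difference.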
+@router.post("/submit") +async def submit( + metadata: dict, + auth: AuthContext = Depends(require_submitter), + store: SubmissionStore = Depends(get_submission_store), +): + user_id = auth.user_id + user_email = auth.user_email + + if not metadata: + raise HTTPException(400, "POST data empty or not JSON") + + try: + serialized = json.dumps(metadata, allow_nan=False) + except Exception: + raise HTTPException(400, "Submission may not contain NaN or Infinity") + if len(serialized.encode("utf-8")) > MAX_SUBMIT_METADATA_BYTES: + raise HTTPException(413, f"Submission metadata exceeds {MAX_SUBMIT_METADATA_BYTES} bytes") + + # Auto-detect and migrate v1 format + if _is_v1_payload(metadata): + metadata = migrate_v1_payload(metadata) + + # Validate through Pydantic (fills defaults, validates types) + try: + validated = DatasetMetadata.model_validate(metadata) + except Exception as exc: + raise HTTPException(400, f"Invalid metadata: {exc}") + + flat = validated.model_dump() + if len(flat.get("data_sources", [])) > MAX_SUBMIT_DATA_SOURCES: + raise HTTPException(413, f"Too many data_sources (max {MAX_SUBMIT_DATA_SOURCES})") + if len(flat.get("authors", [])) > MAX_SUBMIT_AUTHORS: + raise HTTPException(413, f"Too many authors (max {MAX_SUBMIT_AUTHORS})") + + data_source_errors = _validate_data_sources(flat.get("data_sources", [])) + if data_source_errors: + raise HTTPException(400, f"Invalid data_sources: {'; '.join(data_source_errors)}") + + if not flat.get("data_sources") and not metadata.get("update_metadata_only") and not flat.get("update"): + raise HTTPException(400, "You must provide data_sources before submission") + + organization = flat.get("organization") or DEFAULT_ORGANIZATION + if isinstance(organization, list): + organization = organization[0] + + is_test = flat.get("test", False) + update = flat.get("update", False) + + source_id = _source_id_from_metadata(metadata) + existing_versions = [] + if source_id: + existing_versions = store.list_versions(source_id) + + 
if update and not source_id: + raise HTTPException(400, "Missing source_id for update submission") + + if not update and not source_id: + source_id = generate_source_id() + + if not update and source_id and existing_versions: + source_id = "{}-{}".format(source_id, uuid.uuid4().hex[:8]) + existing_versions = [] + + latest_ver = latest_version(existing_versions) + has_new_data = bool(flat.get("data_sources")) + + # Reject updates with no prior version before computing the next version number + if update and not latest_ver: + raise HTTPException(400, "Update requested but no prior submission found") + + version = increment_version(latest_ver, major=has_new_data) if update else "1.0" + + versioned_source_id = "{}-{}".format(source_id, version) + + # Propagate dataset_doi from prior published versions + inherited_dataset_doi = None + previous_version_id = None + root_version_id = None + prior_record = None + if update and existing_versions: + for v in existing_versions: + ddoi = v.get("dataset_doi") or v.get("doi") + if ddoi and v.get("status") == "published": + inherited_dataset_doi = ddoi + break + # Build version chain: previous_version = the latest existing version's id + previous_version_id = "{}-{}".format(source_id, latest_ver) + # Root version: inherit from prior, or use the earliest version + prior_record = next( + (v for v in existing_versions if v.get("version") == latest_ver), None + ) + if prior_record: + prior_mdata = prior_record.get("dataset_mdata") + if isinstance(prior_mdata, str): + try: + prior_mdata = json.loads(prior_mdata) + except Exception: + prior_mdata = {} + if isinstance(prior_mdata, dict): + root_version_id = prior_mdata.get("root_version") + if not root_version_id: + # Earliest version is the root; compare dotted components numerically + # so that "10.0" is not treated as older than "2.0" + earliest_ver = min( + existing_versions, + key=lambda v: tuple(int(p) if p.isdigit() else 0 for p in str(v.get("version", "0")).split(".")), + ) + root_version_id = "{}-{}".format(source_id, earliest_ver.get("version", "1.0")) + + # Inherit data_sources from prior version for metadata-only updates + if update and not has_new_data and prior_record: + prior_mdata = 
prior_record.get("dataset_mdata") + if isinstance(prior_mdata, str): + try: + prior_mdata = json.loads(prior_mdata) + except Exception: + prior_mdata = {} + if isinstance(prior_mdata, dict): + flat["data_sources"] = prior_mdata.get("data_sources", []) + + # Populate versioning fields in metadata + flat["version"] = version + flat["latest"] = True + if update: + flat["previous_version"] = previous_version_id + flat["root_version"] = root_version_id + else: + flat["root_version"] = versioned_source_id + + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + record = { + "source_id": source_id, + "version": version, + "versioned_source_id": versioned_source_id, + "user_id": user_id, + "user_email": user_email, + "organization": organization, + "status": "pending_curation", + "dataset_mdata": json.dumps(flat), + "schema_version": "2", + "test": is_test, + "created_at": now, + "updated_at": now, + } + if inherited_dataset_doi: + record["dataset_doi"] = inherited_dataset_doi + + # Flip latest=false on prior version's metadata + if update and prior_record: + try: + _flip_latest_on_prior(store, prior_record) + except Exception: + logger.warning("Failed to flip latest on prior version %s", latest_ver, exc_info=True) + + try: + store.put_submission(record) + except Exception as exc: + logger.exception("Failed to store submission") + raise HTTPException(500, "Internal error while storing submission") + + try: + notify_curators_new_submission(record) + except Exception: + logger.warning("Failed to send new-submission email for %s", source_id, exc_info=True) + + response = { + "success": True, + "source_id": source_id, + "version": version, + "versioned_source_id": versioned_source_id, + "organization": organization, + } + + # Profile jobs for stream-backed sources (inline or async queue depending on mode) + data_sources = flat.get("data_sources", []) + profile_jobs = [] + for source in data_sources: + if source.startswith("stream://"): + stream_id = 
source.replace("stream://", "", 1) + try: + profile_jobs.append(enqueue_profile_job(source_id, version, stream_id)) + except Exception: + logger.debug("Profile job dispatch failed for %s", source_id, exc_info=True) + if profile_jobs: + response["profile_jobs"] = profile_jobs + + # Transfer jobs for data on external Globus endpoints + from v2.transfer import NCSA_MDF_COLLECTION_UUID, extract_transfer_sources + + transfer_sources = extract_transfer_sources(data_sources) + if transfer_sources: + # Extract user's transfer token from dependent tokens + user_transfer_token = _extract_dependent_transfer_token(auth) + if user_transfer_token: + try: + transfer_result = enqueue_transfer_job( + source_id=source_id, + version=version, + data_sources=data_sources, + user_transfer_token=user_transfer_token, + user_identity_id=auth.user_id, + ) + response["transfer_job"] = transfer_result + except Exception: + logger.debug("Transfer job dispatch failed for %s", source_id, exc_info=True) + else: + response["transfer_warning"] = ( + "Data sources reference external Globus endpoints but no transfer token " + "was available. Run 'mdf login' to authenticate with transfer scope." 
+ ) + + return response + + +@router.get("/versions/{source_id}") +async def list_versions( + source_id: str, + limit: int = Query(50, ge=1, le=500), + offset: int = Query(0, ge=0), + auth: Optional[AuthContext] = Depends(get_optional_auth), + store: SubmissionStore = Depends(get_submission_store), +): + versions = store.list_versions(source_id) + if not versions: + return {"success": False, "error": "No versions found for this source_id"} + + # If unauthenticated (or not owner/curator), only show published versions + is_privileged = False + if auth: + owner_id = versions[0].get("user_id") + is_privileged = (owner_id and owner_id == auth.user_id) or is_curator(auth) + if not is_privileged: + versions = [v for v in versions if v.get("status") == "published"] + if not versions: + return {"success": False, "error": "No versions found for this source_id"} + + # Sort by numeric dotted components so that "10.0" follows "2.0" + sorted_versions = sorted( + versions, + key=lambda x: tuple(int(p) if p.isdigit() else 0 for p in str(x.get("version", "0")).split(".")), + ) + total_count = len(sorted_versions) + paginated = sorted_versions[offset:offset + limit] + + result_versions = [] + for v in paginated: + mdata = v.get("dataset_mdata") + if isinstance(mdata, str): + try: + mdata = json.loads(mdata) + except Exception: + mdata = {} + if not isinstance(mdata, dict): + mdata = {} + + result_versions.append({ + "version": v.get("version"), + "title": mdata.get("title", ""), + "status": v.get("status", ""), + "doi": v.get("dataset_doi") or v.get("doi") or mdata.get("doi"), + "created_at": v.get("created_at", ""), + "updated_at": v.get("updated_at", ""), + }) + + # Find the dataset-level DOI (from any published version) + dataset_doi = None + for v in versions: + ddoi = v.get("dataset_doi") or v.get("doi") + if ddoi and v.get("status") == "published": + dataset_doi = ddoi + break + + return { + "success": True, + "source_id": source_id, + "versions": result_versions, + "total_count": total_count, + "dataset_doi": dataset_doi, + } + + +@router.get("/stats/{source_id}") +async def 
dataset_stats( + source_id: str, + store: SubmissionStore = Depends(get_submission_store), +): + """Public access/download stats for a published dataset.""" + versions = store.list_versions(source_id) + published = [v for v in versions if v.get("status") == "published"] + if not published: + raise HTTPException(404, "No published dataset found") + + total_views = 0 + total_downloads = 0 + first_published = None + last_updated = None + + for v in published: + total_views += int(v.get("view_count") or 0) + total_downloads += int(v.get("download_count") or 0) + pa = v.get("published_at") or v.get("created_at") + ua = v.get("updated_at") + if pa and (first_published is None or pa < first_published): + first_published = pa + if ua and (last_updated is None or ua > last_updated): + last_updated = ua + + return { + "success": True, + "source_id": source_id, + "view_count": total_views, + "download_count": total_downloads, + "version_count": len(published), + "first_published": first_published, + "last_updated": last_updated, + } + + +@router.get("/status/{source_id}") +async def get_status( + source_id: str, + version: Optional[str] = Query(None), + auth: Optional[AuthContext] = Depends(get_optional_auth), + store: SubmissionStore = Depends(get_submission_store), +): + if version: + record = store.get_submission(source_id, version) + if not record: + return {"success": False, "error": "Submission not found"} + else: + versions = store.list_versions(source_id) + if not versions: + return {"success": False, "error": "Submission not found"} + latest_ver = latest_version(versions) + record = next((item for item in versions if item.get("version") == latest_ver), versions[-1]) + + # If unauthenticated (or not owner/curator), only allow published + is_privileged = False + if auth: + owner_id = record.get("user_id") + is_privileged = (owner_id and owner_id == auth.user_id) or is_curator(auth) + if not is_privileged and record.get("status") != "published": + return {"success": 
False, "error": "Submission not found"} + + # Inline transfer status check — single Globus API call (~200ms) + if record.get("transfer_status") == "active": + _inline_transfer_check(record, store) + + normalized = _normalize_record(record) + result = {"success": True, "submission": normalized} + + # Include transfer status in response when present + if record.get("transfer_status"): + result["transfer"] = { + "status": record.get("transfer_status"), + "bytes_transferred": record.get("transfer_bytes_transferred", 0), + "files_transferred": record.get("transfer_files_transferred", 0), + "destination": record.get("transfer_destination", ""), + } + + return result + + +def _inline_transfer_check(record: Dict[str, Any], store: SubmissionStore) -> None: + """Check Globus transfer status inline and update the submission record.""" + task_ids = record.get("transfer_task_ids", []) + if not task_ids: + return + + all_succeeded = True + any_failed = False + total_bytes = 0 + total_files = 0 + + for task_id in task_ids: + try: + status = check_transfer_status(task_id) + total_bytes += status.get("bytes_transferred", 0) + total_files += status.get("files_transferred", 0) + + if status["status"] == "SUCCEEDED": + continue + elif status["status"] in ("FAILED", "INACTIVE"): + any_failed = True + all_succeeded = False + else: + all_succeeded = False + except Exception: + logger.debug("Inline transfer check failed for task %s", task_id, exc_info=True) + all_succeeded = False + + record["transfer_bytes_transferred"] = total_bytes + record["transfer_files_transferred"] = total_files + + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + + if all_succeeded or any_failed: + record["transfer_status"] = "succeeded" if all_succeeded else "failed" + # Clean up ACL rules + for acl_id in record.get("transfer_acl_rule_ids", []): + try: + cleanup_transfer_acl(acl_id) + except Exception: + logger.debug("ACL cleanup failed for %s", acl_id, exc_info=True) + record["updated_at"] 
= now + store.upsert_submission(record) + else: + # Still active — persist progress + record["updated_at"] = now + store.upsert_submission(record) + + +@router.get("/status") +async def get_status_all( + source_id: Optional[str] = Query(None), + version: Optional[str] = Query(None), + auth: Optional[AuthContext] = Depends(get_optional_auth), + store: SubmissionStore = Depends(get_submission_store), +): + if not source_id: + raise HTTPException(400, "Missing source_id") + if version: + record = store.get_submission(source_id, version) + if not record: + return {"success": False, "error": "Submission not found"} + else: + versions = store.list_versions(source_id) + if not versions: + return {"success": False, "error": "Submission not found"} + latest_ver = latest_version(versions) + record = next((item for item in versions if item.get("version") == latest_ver), versions[-1]) + + # Apply same access control as GET /status/{source_id} + is_privileged = False + if auth: + owner_id = record.get("user_id") + is_privileged = (owner_id and owner_id == auth.user_id) or is_curator(auth) + if not is_privileged and record.get("status") != "published": + return {"success": False, "error": "Submission not found"} + + return {"success": True, "submission": _normalize_record(record)} + + +@router.post("/status/update") +async def update_status( + payload: StatusUpdateRequest, + auth: AuthContext = Depends(get_auth), + store: SubmissionStore = Depends(get_submission_store), +): + if not is_curator(auth): + raise HTTPException(403, "Only curators may update submission status") + + ALLOWED_STATUSES = { + "pending_curation", "approved", "published", "rejected", + } + + if payload.status not in ALLOWED_STATUSES: + raise HTTPException( + 400, + "status must be one of: {}".format(", ".join(sorted(ALLOWED_STATUSES))), + ) + + record = store.get_submission(payload.source_id, payload.version) + if not record: + raise HTTPException(404, "Submission not found") + 
ensure_submission_owner_or_curator(auth, record) + + store.update_status(payload.source_id, payload.version, payload.status) + + return { + "success": True, + "source_id": payload.source_id, + "version": payload.version, + "status": payload.status, + } + + +@router.get("/submissions") +async def list_submissions( + organization: Optional[str] = Query(None), + status: Optional[str] = Query(None), + include_counts: bool = Query(False), + limit: Optional[int] = Query(50), + start_key: Optional[str] = Query(None), + auth: AuthContext = Depends(get_auth), + store: SubmissionStore = Depends(get_submission_store), +): + user_id = auth.user_id + if not user_id: + raise HTTPException(400, "Missing user identity") + + try: + limit = int(limit) if limit else 50 + except Exception: + limit = 50 + + # Parse status filter + status_filter = set() + if status: + status_filter = {s.strip() for s in status.split(",") if s.strip()} + + # When include_counts is requested, fetch a larger batch to compute counts + fetch_limit = max(limit, 1000) if include_counts else limit + + parsed_key = parse_pagination_key(start_key) if not include_counts else None + + if organization: + if not is_curator(auth): + raise HTTPException(403, "Organization-wide listing requires curator permissions") + items, last_key = store.list_by_org(organization, limit=fetch_limit, start_key=parsed_key) + else: + items, last_key = store.list_by_user(user_id, limit=fetch_limit, start_key=parsed_key) + + for item in items: + if isinstance(item.get("dataset_mdata"), str): + try: + item["dataset_mdata"] = json.loads(item["dataset_mdata"]) + except Exception: + pass + + # Compute counts before filtering (over all fetched items) + response: Dict[str, Any] = {"success": True} + if include_counts: + counts: Dict[str, int] = {} + for item in items: + s = item.get("status", "unknown") + counts[s] = counts.get(s, 0) + 1 + response["counts"] = counts + response["total"] = len(items) + + # Apply status filter + if status_filter: 
+ items = [item for item in items if item.get("status") in status_filter] + + # Apply pagination limit after filtering + if include_counts: + items = items[:limit] + last_key = None + + response["submissions"] = items + response["next_key"] = serialize_pagination_key(last_key) + return response diff --git a/aws/v2/async_jobs.py b/aws/v2/async_jobs.py new file mode 100644 index 0000000..372236c --- /dev/null +++ b/aws/v2/async_jobs.py @@ -0,0 +1,556 @@ +import json +import logging +import os +import sqlite3 +from datetime import datetime, timezone +from typing import Any, Dict, List, Optional + +from v2.config import AWS_REGION + +logger = logging.getLogger(__name__) + +JOB_PROFILE_SUBMISSION = "profile_submission" +JOB_MINT_STREAM_DOI = "mint_stream_doi" +JOB_MINT_SUBMISSION_DOI = "mint_submission_doi" +JOB_PUBLISH_SUBMISSION = "publish_submission" +JOB_TRANSFER_DATA = "transfer_data" +JOB_CLEANUP_TRANSFERS = "cleanup_transfers" + + +def _utc_now() -> str: + return datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + + +def _sqlite_path() -> str: + return os.environ.get("ASYNC_SQLITE_PATH", os.environ.get("SQLITE_PATH", "/tmp/mdf_connect_v2.db")) + + +class JobDispatcher: + def dispatch(self, job_type: str, payload: Dict[str, Any]) -> Dict[str, Any]: + raise NotImplementedError + + +class InlineJobDispatcher(JobDispatcher): + def dispatch(self, job_type: str, payload: Dict[str, Any]) -> Dict[str, Any]: + result = process_job(job_type, payload) + return { + "mode": "inline", + "queued": False, + "job_type": job_type, + "result": result, + } + + +class SQSJobDispatcher(JobDispatcher): + def __init__(self, queue_url: str): + import boto3 + + self.queue_url = queue_url + self.client = boto3.client("sqs", region_name=AWS_REGION) + + def dispatch(self, job_type: str, payload: Dict[str, Any]) -> Dict[str, Any]: + message = { + "job_type": job_type, + "payload": payload, + "created_at": _utc_now(), + } + resp = self.client.send_message( + 
QueueUrl=self.queue_url, + MessageBody=json.dumps(message), + ) + return { + "mode": "sqs", + "queued": True, + "job_type": job_type, + "message_id": resp.get("MessageId"), + } + + +class SqliteJobDispatcher(JobDispatcher): + def __init__(self, db_path: Optional[str] = None): + self.path = db_path or _sqlite_path() + self._init_schema() + + def _connect(self) -> sqlite3.Connection: + conn = sqlite3.connect(self.path, check_same_thread=False) + conn.row_factory = sqlite3.Row + return conn + + def _init_schema(self) -> None: + conn = self._connect() + try: + with conn: + conn.execute( + """ + CREATE TABLE IF NOT EXISTS async_jobs ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + job_type TEXT NOT NULL, + payload TEXT NOT NULL, + status TEXT NOT NULL, + attempts INTEGER NOT NULL DEFAULT 0, + result TEXT, + error TEXT, + created_at TEXT NOT NULL, + updated_at TEXT NOT NULL + ) + """ + ) + conn.execute( + "CREATE INDEX IF NOT EXISTS idx_async_jobs_status ON async_jobs(status, id)" + ) + finally: + conn.close() + + def dispatch(self, job_type: str, payload: Dict[str, Any]) -> Dict[str, Any]: + now = _utc_now() + conn = self._connect() + try: + with conn: + cur = conn.execute( + """ + INSERT INTO async_jobs (job_type, payload, status, attempts, created_at, updated_at) + VALUES (?, ?, 'pending', 0, ?, ?) + """, + (job_type, json.dumps(payload), now, now), + ) + job_id = int(cur.lastrowid) + return { + "mode": "sqlite", + "queued": True, + "job_type": job_type, + "job_id": job_id, + } + finally: + conn.close() + + def claim_pending_jobs(self, limit: int = 20) -> List[Dict[str, Any]]: + conn = self._connect() + jobs: List[Dict[str, Any]] = [] + try: + with conn: + rows = conn.execute( + """ + SELECT * FROM async_jobs + WHERE status = 'pending' + ORDER BY id ASC + LIMIT ? + """, + (limit,), + ).fetchall() + for row in rows: + conn.execute( + """ + UPDATE async_jobs + SET status = 'processing', + attempts = attempts + 1, + updated_at = ? + WHERE id = ? 
AND status = 'pending' + """, + (_utc_now(), row["id"]), + ) + jobs.append(dict(row)) + return jobs + finally: + conn.close() + + def mark_completed(self, job_id: int, result: Dict[str, Any]) -> None: + conn = self._connect() + try: + with conn: + conn.execute( + """ + UPDATE async_jobs + SET status = 'completed', + result = ?, + updated_at = ? + WHERE id = ? + """, + (json.dumps(result), _utc_now(), job_id), + ) + finally: + conn.close() + + def mark_failed(self, job_id: int, error: str) -> None: + conn = self._connect() + try: + with conn: + conn.execute( + """ + UPDATE async_jobs + SET status = 'failed', + error = ?, + updated_at = ? + WHERE id = ? + """, + (error, _utc_now(), job_id), + ) + finally: + conn.close() + + +def get_job_dispatcher() -> JobDispatcher: + mode = os.environ.get("ASYNC_DISPATCH_MODE", "inline").lower() + if mode == "sqs": + queue_url = os.environ.get("ASYNC_QUEUE_URL") + if not queue_url: + raise ValueError("ASYNC_QUEUE_URL is required for ASYNC_DISPATCH_MODE=sqs") + return SQSJobDispatcher(queue_url=queue_url) + if mode == "sqlite": + return SqliteJobDispatcher() + return InlineJobDispatcher() + + +def enqueue_profile_job(source_id: str, version: str, stream_id: str) -> Dict[str, Any]: + payload = {"source_id": source_id, "version": version, "stream_id": stream_id} + return get_job_dispatcher().dispatch(JOB_PROFILE_SUBMISSION, payload) + + +def enqueue_stream_doi_job(stream_id: str, overrides: Dict[str, Any]) -> Dict[str, Any]: + payload = {"stream_id": stream_id, "overrides": overrides} + return get_job_dispatcher().dispatch(JOB_MINT_STREAM_DOI, payload) + + +def enqueue_submission_doi_job(source_id: str, version: str) -> Dict[str, Any]: + payload = {"source_id": source_id, "version": version} + return get_job_dispatcher().dispatch(JOB_MINT_SUBMISSION_DOI, payload) + + +def enqueue_publish_job(source_id: str, version: str, mint_doi: bool = True) -> Dict[str, Any]: + payload = {"source_id": source_id, "version": version, "mint_doi": 
mint_doi} + return get_job_dispatcher().dispatch(JOB_PUBLISH_SUBMISSION, payload) + + +def enqueue_transfer_job( + source_id: str, + version: str, + data_sources: List[str], + user_transfer_token: str, + user_identity_id: str, +) -> Dict[str, Any]: + payload = { + "source_id": source_id, + "version": version, + "data_sources": data_sources, + "user_transfer_token": user_transfer_token, + "user_identity_id": user_identity_id, + } + return get_job_dispatcher().dispatch(JOB_TRANSFER_DATA, payload) + + +def enqueue_cleanup_transfers_job() -> Dict[str, Any]: + return get_job_dispatcher().dispatch(JOB_CLEANUP_TRANSFERS, {}) + + +def process_job(job_type: str, payload: Dict[str, Any]) -> Dict[str, Any]: + if job_type == JOB_PROFILE_SUBMISSION: + return _process_profile_submission(payload) + if job_type == JOB_MINT_STREAM_DOI: + return _process_mint_stream_doi(payload) + if job_type == JOB_MINT_SUBMISSION_DOI: + return _process_mint_submission_doi(payload) + if job_type == JOB_PUBLISH_SUBMISSION: + return _process_publish_submission(payload) + if job_type == JOB_TRANSFER_DATA: + return _process_transfer_data(payload) + if job_type == JOB_CLEANUP_TRANSFERS: + return _process_cleanup_transfers(payload) + raise ValueError(f"Unknown job type: {job_type}") + + +def _process_profile_submission(payload: Dict[str, Any]) -> Dict[str, Any]: + from v2.profiler import build_dataset_profile + from v2.storage import get_storage_backend + from v2.store import get_store + + source_id = payload["source_id"] + version = payload["version"] + stream_id = payload["stream_id"] + + storage = get_storage_backend() + store = get_store() + profile = build_dataset_profile(stream_id, storage) + store.update_profile(source_id, version, profile.model_dump_json()) + return { + "success": True, + "source_id": source_id, + "version": version, + "stream_id": stream_id, + "total_files": profile.total_files, + "total_bytes": profile.total_bytes, + } + + +def _process_mint_stream_doi(payload: Dict[str, Any]) 
-> Dict[str, Any]: + from v2.doi_utils import mint_doi_for_stream + from v2.stream_store import get_stream_store + + stream_id = payload["stream_id"] + overrides = payload.get("overrides") or {} + + stream_store = get_stream_store() + stream = stream_store.get_stream(stream_id) + if not stream: + return {"success": False, "error": f"Stream not found: {stream_id}"} + + doi_result = mint_doi_for_stream(stream, overrides) + if doi_result.get("success"): + stream_store.update_stream_metadata( + stream_id, + { + "doi": doi_result.get("doi"), + "published_at": _utc_now(), + }, + ) + return doi_result + + +def _process_mint_submission_doi(payload: Dict[str, Any]) -> Dict[str, Any]: + from v2.curation import _mint_doi_for_submission + from v2.store import get_store + + source_id = payload["source_id"] + version = payload["version"] + + store = get_store() + submission = store.get_submission(source_id, version) + if not submission: + return {"success": False, "error": f"Submission not found: {source_id} v{version}"} + + all_versions = store.list_versions(source_id) + doi_result = _mint_doi_for_submission(submission, all_versions=all_versions, mint_doi=True) + if doi_result.get("success"): + if doi_result.get("doi"): + submission["doi"] = doi_result["doi"] + if doi_result.get("dataset_doi"): + submission["dataset_doi"] = doi_result["dataset_doi"] + submission["status"] = "published" + submission["published_at"] = _utc_now() + submission["updated_at"] = _utc_now() + store.upsert_submission(submission) + return doi_result + + +def _process_transfer_data(payload: Dict[str, Any]) -> Dict[str, Any]: + """Initiate Globus transfers for data sources on external endpoints.""" + from v2.store import get_store + from v2.transfer import extract_transfer_sources, initiate_transfer + + source_id = payload["source_id"] + version = payload["version"] + data_sources = payload["data_sources"] + user_transfer_token = payload["user_transfer_token"] + user_identity_id = 
payload["user_identity_id"] + + store = get_store() + submission = store.get_submission(source_id, version) + if not submission: + return {"success": False, "error": f"Submission not found: {source_id} v{version}"} + + transfer_sources = extract_transfer_sources(data_sources) + if not transfer_sources: + return {"success": True, "message": "No external transfers needed"} + + results = [] + for src in transfer_sources: + try: + result = initiate_transfer( + source_endpoint=src["source_endpoint"], + source_path=src["source_path"], + source_id=source_id, + version=version, + user_transfer_token=user_transfer_token, + user_identity_id=user_identity_id, + ) + results.append(result) + except Exception as exc: + logger.exception("Transfer initiation failed for %s", src["uri"]) + results.append({"error": str(exc), "uri": src["uri"]}) + + # Store transfer state in the submission record + successful = [r for r in results if "task_id" in r] + if successful: + submission["transfer_task_ids"] = [r["task_id"] for r in successful] + submission["transfer_acl_rule_ids"] = [r.get("acl_rule_id") for r in successful if r.get("acl_rule_id")] + submission["transfer_status"] = "active" + submission["transfer_destination"] = successful[0].get("destination_path", "") + submission["transfer_initiated_at"] = successful[0].get("initiated_at", _utc_now()) + submission["updated_at"] = _utc_now() + store.upsert_submission(submission) + + return { + "success": len(successful) > 0, + "source_id": source_id, + "version": version, + "transfers_initiated": len(successful), + "transfers_failed": len(results) - len(successful), + "results": results, + } + + +def _process_cleanup_transfers(payload: Dict[str, Any]) -> Dict[str, Any]: + """Scan for submissions with active transfers and clean up completed/stale ones.""" + from v2.store import get_store + from v2.transfer import cleanup_stale_transfers + + store = get_store() + + # Scan for all submissions with active transfers. 
+ # DynamoDB scan is fine here — runs 4x/day, table is small. + active_submissions = store.scan_by_transfer_status("active") + if not active_submissions: + return {"success": True, "checked": 0, "cleaned": 0} + + modified = cleanup_stale_transfers(active_submissions) + + # Persist any modified submissions + cleaned = 0 + for sub in modified: + sub["updated_at"] = _utc_now() + store.upsert_submission(sub) + if sub.get("transfer_status") != "active": + cleaned += 1 + + return { + "success": True, + "checked": len(active_submissions), + "cleaned": cleaned, + } + + +def _update_prior_versions_search(store, search_client, source_id: str, current_version: str, all_versions: list) -> None: + """Re-ingest prior versions into search with latest=false.""" + for v_record in all_versions: + v = v_record.get("version") + if v == current_version: + continue + if v_record.get("status") != "published": + continue + # Ensure the prior version's metadata has latest=false + mdata = v_record.get("dataset_mdata") + if isinstance(mdata, str): + try: + mdata = json.loads(mdata) + except Exception: + continue + if isinstance(mdata, dict) and mdata.get("latest") is not False: + mdata["latest"] = False + v_record["dataset_mdata"] = json.dumps(mdata) + v_record["updated_at"] = _utc_now() + store.upsert_submission(v_record) + # Re-ingest into search with latest=false + search_client.ingest(v_record, version_count=len(all_versions)) + logger.info("Updated prior version %s v%s search entry with latest=false", source_id, v) + + +def _process_publish_submission(payload: Dict[str, Any]) -> Dict[str, Any]: + """Publish a submission: mint DOI (optional), ingest into search, update status.""" + from v2.curation import _mint_doi_for_submission + from v2.search_client import get_search_client + from v2.store import get_store + + source_id = payload["source_id"] + version = payload["version"] + mint_doi = payload.get("mint_doi", True) + + store = get_store() + submission = 
store.get_submission(source_id, version) + if not submission: + return {"success": False, "error": f"Submission not found: {source_id} v{version}"} + + # Look up all versions for DOI versioning context + all_versions = store.list_versions(source_id) + + result: Dict[str, Any] = {"source_id": source_id, "version": version} + + # Step 1: DOI handling (mint new or update existing dataset DOI) + # Always call _mint_doi_for_submission: it handles both mint_doi=True + # (mint new DOI) and mint_doi=False (update dataset DOI metadata only) + try: + doi_result = _mint_doi_for_submission( + submission, all_versions=all_versions, mint_doi=mint_doi, + ) + result["doi"] = doi_result + if doi_result.get("success"): + if doi_result.get("doi"): + submission["doi"] = doi_result["doi"] + if doi_result.get("dataset_doi"): + submission["dataset_doi"] = doi_result["dataset_doi"] + else: + logger.warning("DOI handling failed for %s: %s", source_id, doi_result.get("error")) + except Exception: + logger.exception("DOI handling error for %s", source_id) + result["doi"] = {"success": False, "error": "DOI handling exception"} + + # Step 2: Ingest into Globus Search + # Pre-bind search_client so Step 2b never sees an unbound name if get_search_client() raises + search_client = None + try: + search_client = get_search_client() + search_result = search_client.ingest(submission, version_count=len(all_versions)) + result["search_ingest"] = search_result + if not search_result.get("success"): + logger.warning("Search ingest failed for %s: %s", source_id, search_result.get("error")) + except Exception: + logger.exception("Search ingest error for %s", source_id) + result["search_ingest"] = {"success": False, "error": "Search ingest exception"} + + # Step 2b: If this is a new version, re-ingest prior version with latest=false + # (skipped when the search client could not be created above) + if search_client is not None and len(all_versions) > 1: + try: + _update_prior_versions_search(store, search_client, source_id, version, all_versions) + except Exception: + logger.warning("Failed to update prior version search entries for %s", source_id, exc_info=True) + + # Step 3: Update status to published + now = 
_utc_now() + submission["status"] = "published" + submission["published_at"] = now + submission["updated_at"] = now + store.upsert_submission(submission) + + # Notify submitter their dataset is live + try: + from v2.email_utils import notify_submitter_approved + notify_submitter_approved(submission) + except Exception: + logger.warning("Failed to send approval email for %s", source_id, exc_info=True) + + result["success"] = True + result["status"] = "published" + result["published_at"] = now + return result + + +def run_sqlite_worker_once(limit: int = 20) -> Dict[str, Any]: + dispatcher = SqliteJobDispatcher() + jobs = dispatcher.claim_pending_jobs(limit=limit) + processed = 0 + failed = 0 + for job in jobs: + job_id = int(job["id"]) + try: + payload = json.loads(job["payload"]) + result = process_job(job["job_type"], payload) + dispatcher.mark_completed(job_id, result) + processed += 1 + except Exception as exc: + logger.exception("Async job failed: id=%s", job_id) + dispatcher.mark_failed(job_id, str(exc)) + failed += 1 + return { + "success": True, + "processed": processed, + "failed": failed, + "total_claimed": len(jobs), + } + + +def handle_sqs_event(event: Dict[str, Any]) -> Dict[str, Any]: + failures: List[Dict[str, str]] = [] + for record in event.get("Records", []): + message_id = record.get("messageId", "") + try: + body = json.loads(record.get("body") or "{}") + process_job(body["job_type"], body["payload"]) + except Exception: + logger.exception("Failed processing SQS async job message_id=%s", message_id) + failures.append({"itemIdentifier": message_id}) + return {"batchItemFailures": failures} diff --git a/aws/v2/async_worker.py b/aws/v2/async_worker.py new file mode 100644 index 0000000..574f14b --- /dev/null +++ b/aws/v2/async_worker.py @@ -0,0 +1,42 @@ +import argparse +import json +import logging +import time +from typing import Any, Dict + +from v2.async_jobs import handle_sqs_event, run_sqlite_worker_once + +logger = logging.getLogger(__name__) + 
+ +def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]: + # EventBridge scheduled events have "source": "aws.events" + if event.get("source") == "aws.events": + from v2.async_jobs import JOB_CLEANUP_TRANSFERS, process_job + + logger.info("Handling EventBridge scheduled event: %s", event.get("detail-type")) + return process_job(JOB_CLEANUP_TRANSFERS, {}) + + return handle_sqs_event(event) + + +def main() -> None: + parser = argparse.ArgumentParser(description="Run MDF async job worker") + parser.add_argument("--once", action="store_true", help="Run one pass and exit") + parser.add_argument("--limit", type=int, default=20, help="Max pending sqlite jobs to process per pass") + parser.add_argument("--interval", type=float, default=2.0, help="Sleep interval between passes") + args = parser.parse_args() + + if args.once: + print(json.dumps(run_sqlite_worker_once(limit=args.limit), indent=2)) + return + + while True: + result = run_sqlite_worker_once(limit=args.limit) + print(json.dumps(result)) + if result["total_claimed"] == 0: + time.sleep(args.interval) + + +if __name__ == "__main__": + main() diff --git a/aws/v2/citation.py b/aws/v2/citation.py new file mode 100644 index 0000000..6e34028 --- /dev/null +++ b/aws/v2/citation.py @@ -0,0 +1,267 @@ +"""Citation export for MDF v2 datasets. 
+ +Generates citations in multiple formats: +- BibTeX (for LaTeX papers) +- RIS (for EndNote, Zotero, Mendeley) +- APA (plain text) +- DataCite XML (for DOI registration) +""" + +import re +from datetime import datetime +from typing import Any, Dict, List, Optional +from xml.etree import ElementTree as ET + +from v2.metadata import DatasetMetadata, Author, parse_metadata +from v2.store import get_store + + +def _clean_bibtex_value(value: str) -> str: + """Escape special characters for BibTeX.""" + if not value: + return "" + replacements = [ + ("&", r"\&"), + ("%", r"\%"), + ("$", r"\$"), + ("#", r"\#"), + ("_", r"\_"), + ("{", r"\{"), + ("}", r"\}"), + ("~", r"\textasciitilde{}"), + ("^", r"\textasciicircum{}"), + ] + for old, new in replacements: + value = value.replace(old, new) + return value + + +def _make_bibtex_key(source_id: str, year: str) -> str: + """Generate a BibTeX citation key.""" + key = re.sub(r"[^a-zA-Z0-9]", "_", source_id) + return f"{key}_{year}" if year else key + + +def _author_family_given(author: Author) -> tuple: + """Extract (family, given) from an Author, auto-parsing if needed.""" + family = author.family_name or "" + given = author.given_name or "" + if not family and not given and author.name: + if "," in author.name: + parts = author.name.split(",", 1) + family = parts[0].strip() + given = parts[1].strip() + else: + parts = author.name.rsplit(" ", 1) + if len(parts) == 2: + given = parts[0].strip() + family = parts[1].strip() + else: + family = author.name + return family, given + + +def _format_authors_bibtex(authors: List[Author]) -> str: + """Format authors for BibTeX (Last, First and Last, First).""" + names = [] + for a in authors: + family, given = _author_family_given(a) + if family and given: + names.append(f"{family}, {given}") + else: + names.append(a.name) + return " and ".join(names) if names else "Unknown" + + +def _format_authors_ris(authors: List[Author]) -> List[str]: + """Format authors for RIS (one AU tag per 
author).""" + names = [] + for a in authors: + family, given = _author_family_given(a) + if family and given: + names.append(f"{family}, {given}") + else: + names.append(a.name) + return names if names else ["Unknown"] + + +def _format_authors_apa(authors: List[Author]) -> str: + """Format authors for APA style.""" + formatted = [] + for a in authors: + family, given = _author_family_given(a) + if family and given: + initials = ". ".join([n[0] for n in given.split() if n]) + "." + formatted.append(f"{family}, {initials}") + else: + formatted.append(a.name) + + if not formatted: + return "Unknown" + elif len(formatted) == 1: + return formatted[0] + elif len(formatted) == 2: + return f"{formatted[0]} & {formatted[1]}" + else: + return ", ".join(formatted[:-1]) + f", & {formatted[-1]}" + + +def generate_bibtex(record: Dict[str, Any]) -> str: + """Generate BibTeX citation.""" + meta = parse_metadata(record) + + year = str(meta.publication_year or datetime.now().year) + source_id = record.get("source_id", "unknown") + version = record.get("version", "1.0") + doi = record.get("doi") or "" + + key = _make_bibtex_key(source_id, year) + authors = _format_authors_bibtex(meta.authors) + + lines = [ + f"@dataset{{{key},", + f" author = {{{_clean_bibtex_value(authors)}}},", + f" title = {{{{{_clean_bibtex_value(meta.title)}}}}},", + f" year = {{{year}}},", + f" publisher = {{{_clean_bibtex_value(meta.publisher)}}},", + f" version = {{{version}}},", + ] + + if doi: + lines.append(f" doi = {{{doi}}},") + lines.append(f" url = {{https://doi.org/{doi}}},") + + lines.append(f" note = {{MDF Source ID: {source_id}}},") + lines.append("}") + + return "\n".join(lines) + + +def generate_ris(record: Dict[str, Any]) -> str: + """Generate RIS citation (for EndNote, Zotero, Mendeley).""" + meta = parse_metadata(record) + + year = str(meta.publication_year or datetime.now().year) + source_id = record.get("source_id", "unknown") + doi = record.get("doi") or "" + + lines = [ + "TY - DATA", + 
f"TI - {meta.title}", + ] + + for author in _format_authors_ris(meta.authors): + lines.append(f"AU - {author}") + + lines.extend([ + f"PY - {year}", + f"PB - {meta.publisher}", + ]) + + if doi: + lines.append(f"DO - {doi}") + lines.append(f"UR - https://doi.org/{doi}") + + if meta.description: + lines.append(f"AB - {meta.description}") + + for kw in meta.keywords: + lines.append(f"KW - {kw}") + + lines.append(f"N1 - MDF Source ID: {source_id}") + lines.append("ER - ") + + return "\n".join(lines) + + +def generate_apa(record: Dict[str, Any]) -> str: + """Generate APA style citation (plain text).""" + meta = parse_metadata(record) + + year = str(meta.publication_year or datetime.now().year) + version = record.get("version", "1.0") + doi = record.get("doi") + + authors = _format_authors_apa(meta.authors) + + citation = f"{authors} ({year}). {meta.title} (Version {version}) [Data set]. {meta.publisher}." + + if doi: + citation += f" https://doi.org/{doi}" + + return citation + + +def generate_datacite_xml(record: Dict[str, Any]) -> str: + """Generate DataCite XML for DOI registration.""" + meta = parse_metadata(record) + doi = record.get("doi") or "10.xxxxx/pending" + + root = ET.Element("resource") + root.set("xmlns", "http://datacite.org/schema/kernel-4") + root.set("xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance") + root.set("xsi:schemaLocation", "http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd") + + # Identifier + identifier = ET.SubElement(root, "identifier") + identifier.set("identifierType", "DOI") + identifier.text = doi + + # Creators + creators_elem = ET.SubElement(root, "creators") + for a in meta.authors: + creator = ET.SubElement(creators_elem, "creator") + family, given = _author_family_given(a) + name_elem = ET.SubElement(creator, "creatorName") + if family and given: + name_elem.text = f"{family}, {given}" + else: + name_elem.text = a.name + if given: + given_elem = ET.SubElement(creator, "givenName") + 
given_elem.text = given + if family: + family_elem = ET.SubElement(creator, "familyName") + family_elem.text = family + for aff in a.affiliations: + affil = ET.SubElement(creator, "affiliation") + affil.text = aff + + # Titles + titles_elem = ET.SubElement(root, "titles") + title = ET.SubElement(titles_elem, "title") + title.text = meta.title + + # Publisher + publisher = ET.SubElement(root, "publisher") + publisher.text = meta.publisher + + # Publication Year + pub_year = ET.SubElement(root, "publicationYear") + pub_year.text = str(meta.publication_year or datetime.now().year) + + # Resource Type + resource_type = ET.SubElement(root, "resourceType") + resource_type.set("resourceTypeGeneral", "Dataset") + resource_type.text = meta.resource_type or "Dataset" + + # Descriptions + if meta.description: + descriptions_elem = ET.SubElement(root, "descriptions") + desc = ET.SubElement(descriptions_elem, "description") + desc.set("descriptionType", "Abstract") + desc.text = meta.description + + # Subjects + if meta.keywords: + subjects_elem = ET.SubElement(root, "subjects") + for kw in meta.keywords: + subj = ET.SubElement(subjects_elem, "subject") + subj.text = kw + + # Version + version_elem = ET.SubElement(root, "version") + version_elem.text = str(record.get("version", "1.0")) + + ET.indent(root) + return ET.tostring(root, encoding="unicode", xml_declaration=True) diff --git a/aws/v2/clone.py b/aws/v2/clone.py new file mode 100644 index 0000000..4a825ee --- /dev/null +++ b/aws/v2/clone.py @@ -0,0 +1,290 @@ +"""Dataset/stream cloning for MDF v2. + +Allows researchers to clone datasets or streams to their local machine, +pulling files from Globus HTTPS endpoints as needed. 
+""" + +import os +import json +from pathlib import Path +from typing import Any, Dict, List, Optional +from datetime import datetime, timezone +from urllib.parse import urlparse + +import httpx + +from v2.storage import get_storage_backend +from v2.storage.globus_https import load_cached_token +from v2.stream_store import get_stream_store + + +class StreamCloner: + """Clone streams from MDF to local filesystem.""" + + def __init__( + self, + dest_dir: str, + token: Optional[str] = None, + verbose: bool = True, + ): + """Initialize cloner. + + Args: + dest_dir: Destination directory for cloned files + token: Globus access token (or will use cached) + verbose: Print progress messages + """ + self.dest_dir = Path(dest_dir) + self.token = token or load_cached_token() + self.verbose = verbose + self._client = httpx.Client(timeout=60.0, follow_redirects=True) + + def log(self, msg: str): + """Print message if verbose.""" + if self.verbose: + print(msg) + + def clone_stream( + self, + stream_id: str, + include_metadata: bool = True, + file_filter: Optional[str] = None, + ) -> Dict[str, Any]: + """Clone a stream to local directory. 
+ + Args: + stream_id: The stream ID to clone + include_metadata: Save stream metadata as JSON + file_filter: Optional glob pattern to filter files + + Returns: + Dict with clone results + """ + import fnmatch + + # Get stream info + stream_store = get_stream_store() + stream = stream_store.get_stream(stream_id) + + if not stream: + raise ValueError(f"Stream not found: {stream_id}") + + # Create destination directory + stream_dir = self.dest_dir / stream_id + stream_dir.mkdir(parents=True, exist_ok=True) + + self.log(f"Cloning stream: {stream_id}") + self.log(f" Title: {stream.get('title', 'Untitled')}") + self.log(f" Destination: {stream_dir}") + + # Save metadata + if include_metadata: + meta_path = stream_dir / "stream_metadata.json" + with open(meta_path, "w") as f: + json.dump(stream, f, indent=2, default=str) + self.log(f" Saved metadata: {meta_path}") + + # Get file list from storage + storage = get_storage_backend() + files = storage.list_files(stream_id) + + if file_filter: + files = [f for f in files if fnmatch.fnmatch(f.filename, file_filter)] + + self.log(f" Files to download: {len(files)}") + + # Download files + downloaded = [] + errors = [] + total_bytes = 0 + + for file_meta in files: + try: + local_path = stream_dir / file_meta.filename + self.log(f" Downloading: {file_meta.filename}") + + # Get content from storage + if storage.backend_name == "globus": + # Download directly via HTTPS + content = self._download_globus(file_meta.download_url) + else: + content = storage.get_file(file_meta.path) + + if content: + local_path.write_bytes(content) + downloaded.append({ + "filename": file_meta.filename, + "path": str(local_path), + "size_bytes": len(content), + }) + total_bytes += len(content) + else: + errors.append({ + "filename": file_meta.filename, + "error": "Could not download file", + }) + + except Exception as e: + errors.append({ + "filename": file_meta.filename, + "error": str(e), + }) + + self.log(f" Downloaded: {len(downloaded)} files 
({total_bytes:,} bytes)") + if errors: + self.log(f" Errors: {len(errors)}") + + return { + "success": True, + "stream_id": stream_id, + "destination": str(stream_dir), + "downloaded": len(downloaded), + "total_bytes": total_bytes, + "files": downloaded, + "errors": errors if errors else None, + } + + def _download_globus(self, url: str) -> Optional[bytes]: + """Download file from Globus HTTPS endpoint.""" + if not self.token: + raise ValueError("No Globus token available. Run test_globus_upload.py to authenticate.") + self._validate_globus_url(url) + + headers = {"Authorization": f"Bearer {self.token}"} + response = self._client.get(url, headers=headers) + response.raise_for_status() + return response.content + + def _validate_globus_url(self, url: str) -> None: + parsed = urlparse(url) + if parsed.scheme != "https": + raise ValueError("Only https URLs are allowed for clone operations") + + configured = os.environ.get("GLOBUS_HTTPS_SERVER", "data.materialsdatafacility.org") + allow_list = os.environ.get("GLOBUS_ALLOWED_HOSTS", "") + hosts = {configured} + hosts.update({h.strip() for h in allow_list.split(",") if h.strip()}) + if parsed.hostname not in hosts: + raise ValueError(f"Refusing to send token to untrusted host: {parsed.hostname}") + + def clone_from_url( + self, + url: str, + filename: Optional[str] = None, + ) -> Dict[str, Any]: + """Clone a single file from a Globus URL. 
+ + Args: + url: The Globus HTTPS URL + filename: Optional override for filename + + Returns: + Dict with download result + """ + if not filename: + filename = Path(urlparse(url).path).name + + self.dest_dir.mkdir(parents=True, exist_ok=True) + local_path = self.dest_dir / filename + + self.log(f"Downloading: {url}") + self.log(f" To: {local_path}") + + content = self._download_globus(url) + local_path.write_bytes(content) + + self.log(f" Size: {len(content):,} bytes") + + return { + "success": True, + "url": url, + "filename": filename, + "path": str(local_path), + "size_bytes": len(content), + } + + def close(self): + """Close HTTP client.""" + self._client.close() + + +def clone_stream( + stream_id: str, + dest_dir: str = ".", + include_metadata: bool = True, + file_filter: Optional[str] = None, + verbose: bool = True, +) -> Dict[str, Any]: + """Convenience function to clone a stream. + + Args: + stream_id: The stream ID to clone + dest_dir: Destination directory + include_metadata: Save stream metadata as JSON + file_filter: Optional glob pattern to filter files + verbose: Print progress messages + + Returns: + Dict with clone results + """ + cloner = StreamCloner(dest_dir=dest_dir, verbose=verbose) + try: + return cloner.clone_stream( + stream_id=stream_id, + include_metadata=include_metadata, + file_filter=file_filter, + ) + finally: + cloner.close() + + +def clone_url( + url: str, + dest_dir: str = ".", + filename: Optional[str] = None, + verbose: bool = True, +) -> Dict[str, Any]: + """Convenience function to clone a file from URL. 
+ + Args: + url: The Globus HTTPS URL + dest_dir: Destination directory + filename: Optional override for filename + verbose: Print progress messages + + Returns: + Dict with download result + """ + cloner = StreamCloner(dest_dir=dest_dir, verbose=verbose) + try: + return cloner.clone_from_url(url=url, filename=filename) + finally: + cloner.close() + + +# CLI interface +if __name__ == "__main__": + import argparse + + parser = argparse.ArgumentParser(description="Clone MDF streams to local directory") + parser.add_argument("stream_id", help="Stream ID to clone") + parser.add_argument("-d", "--dest", default=".", help="Destination directory") + parser.add_argument("-f", "--filter", help="File filter pattern (e.g., '*.csv')") + parser.add_argument("--no-metadata", action="store_true", help="Skip metadata file") + parser.add_argument("-q", "--quiet", action="store_true", help="Quiet mode") + + args = parser.parse_args() + + result = clone_stream( + stream_id=args.stream_id, + dest_dir=args.dest, + include_metadata=not args.no_metadata, + file_filter=args.filter, + verbose=not args.quiet, + ) + + if result["errors"]: + print(f"\nCompleted with {len(result['errors'])} errors") + raise SystemExit(1) + else: + print(f"\nClone complete: {result['downloaded']} files") diff --git a/aws/v2/config.py b/aws/v2/config.py new file mode 100644 index 0000000..a76aef1 --- /dev/null +++ b/aws/v2/config.py @@ -0,0 +1,16 @@ +import os + + +AWS_REGION = os.environ.get("AWS_REGION", "us-east-1") +DYNAMO_SUBMISSIONS_TABLE = os.environ.get("DYNAMO_SUBMISSIONS_TABLE", "mdf-connect-v2-submissions") +DYNAMO_STREAMS_TABLE = os.environ.get("DYNAMO_STREAMS_TABLE", "mdf-connect-v2-streams") +DYNAMO_ENDPOINT_URL = os.environ.get("DYNAMO_ENDPOINT_URL") + +SEARCH_INDEX_UUID = os.environ.get("SEARCH_INDEX_UUID") +TEST_SEARCH_INDEX_UUID = os.environ.get("TEST_SEARCH_INDEX_UUID") + + +GSI_USER_INDEX = os.environ.get("GSI_USER_INDEX", "user-submissions") +GSI_ORG_INDEX = os.environ.get("GSI_ORG_INDEX", 
"org-submissions") + +DEFAULT_ORGANIZATION = os.environ.get("DEFAULT_ORGANIZATION", "MDF Open") diff --git a/aws/v2/curation.py b/aws/v2/curation.py new file mode 100644 index 0000000..a4b873e --- /dev/null +++ b/aws/v2/curation.py @@ -0,0 +1,214 @@ +"""Curation handlers for MDF v2. + +Provides API endpoints for curators to review, approve, or reject submissions. +This replaces the Globus Automate weboption-based curation workflow. + +Curation states: +- pending_curation: Awaiting curator review +- approved: Curator approved, ready for DOI/indexing +- rejected: Curator rejected with reason +- published: DOI minted and indexed + +Endpoints: +- GET /curation/pending - List submissions awaiting curation +- GET /curation/{source_id} - Get submission details for curation +- POST /curation/{source_id}/approve - Approve a submission +- POST /curation/{source_id}/reject - Reject a submission +""" + +import json +import os +from datetime import datetime, timezone +from typing import Any, Dict, List, Optional + +from v2.metadata import parse_metadata, to_datacite +from v2.store import get_store as get_submission_store + + +# Curators can be defined by user ID or group +CURATOR_USER_IDS = set( + os.environ.get("CURATOR_USER_IDS", "").split(",") +) - {""} + +CURATOR_GROUP_IDS = set( + os.environ.get("CURATOR_GROUP_IDS", "").split(",") +) - {""} + + +def _is_curator(auth: Dict[str, Any]) -> bool: + """Check if the authenticated user is a curator.""" + user_id = auth.get("user_id", "") + + if user_id in CURATOR_USER_IDS: + return True + + group_info = auth.get("group_info", "{}") + if isinstance(group_info, str): + try: + group_info = json.loads(group_info) + except Exception: + group_info = {} + + user_groups = set(group_info.keys()) if isinstance(group_info, dict) else set() + if user_groups & CURATOR_GROUP_IDS: + return True + + if os.environ.get("ALLOW_ALL_CURATORS", "").lower() in ("true", "1", "yes"): + return True + + return False + + +def _extract_title(submission: 
Dict[str, Any]) -> str: + """Extract title from submission metadata.""" + meta = parse_metadata(submission) + return meta.title + + +def _extract_doi_metadata(submission: Dict[str, Any]) -> Dict[str, Any]: + """Extract DataCite-compatible metadata dict from a submission.""" + meta = parse_metadata(submission) + source_id = submission.get("source_id", "unknown") + + doi_payload = to_datacite( + meta, + doi="", + url=f"https://materialsdatafacility.org/detail/{source_id}", + source_id=source_id, + created_at=submission.get("created_at"), + published_at=submission.get("published_at"), + ) + + attrs = doi_payload["data"]["attributes"] + doi_metadata = { + "titles": attrs.get("titles", []), + "creators": attrs.get("creators", []), + "publisher": attrs.get("publisher", "Materials Data Facility"), + "publication_year": attrs.get("publicationYear", datetime.now().year), + } + if attrs.get("descriptions"): + doi_metadata["descriptions"] = attrs["descriptions"] + if attrs.get("subjects"): + doi_metadata["subjects"] = attrs["subjects"] + if attrs.get("rightsList"): + doi_metadata["rightsList"] = attrs["rightsList"] + if attrs.get("fundingReferences"): + doi_metadata["fundingReferences"] = attrs["fundingReferences"] + if attrs.get("relatedIdentifiers"): + doi_metadata["relatedIdentifiers"] = attrs["relatedIdentifiers"] + + return doi_metadata + + +def _find_dataset_doi(all_versions: List[Dict[str, Any]]) -> Optional[str]: + """Find the dataset (concept) DOI from prior published versions.""" + for v in all_versions: + ddoi = v.get("dataset_doi") + if ddoi: + return ddoi + # Fallback: look for doi on any published version + for v in all_versions: + if v.get("doi") and v.get("status") == "published": + return v["doi"] + return None + + +def _mint_doi_for_submission( + submission: Dict[str, Any], + all_versions: Optional[List[Dict[str, Any]]] = None, + mint_doi: bool = True, +) -> Dict[str, Any]: + """Mint or update a DOI for an approved submission. 
+ + Version-aware logic: + - First version (no prior DOI): mint dataset DOI, store as both doi and dataset_doi + - Subsequent + mint_doi=True: mint version-specific DOI with -v suffix, + add IsVersionOf relation, update dataset DOI metadata + HasVersion + - Subsequent + mint_doi=False: no new DOI, but update dataset DOI metadata + on DataCite to reflect this version + """ + from v2.datacite import get_datacite_client + + try: + client = get_datacite_client() + source_id = submission.get("source_id", "unknown") + version = submission.get("version", "1.0") + doi_metadata = _extract_doi_metadata(submission) + + dataset_doi = _find_dataset_doi(all_versions or []) + + if not dataset_doi: + if not mint_doi: + # First version with mint_doi=False: no DOI at all + client.close() + return {"success": True, "doi": None, "dataset_doi": None} + # First version: mint the dataset DOI + result = client.mint_doi( + source_id=source_id, + metadata=doi_metadata, + publish=True, + related_identifiers=doi_metadata.get("relatedIdentifiers"), + ) + if result.get("success"): + result["dataset_doi"] = result["doi"] + client.close() + return result + + # Subsequent version + if mint_doi: + # Mint a version-specific DOI with -v{version} suffix + version_suffix = client._generate_suffix(source_id) + f"-v{version}" + related = [ + { + "relatedIdentifier": dataset_doi, + "relatedIdentifierType": "DOI", + "relationType": "IsVersionOf", + } + ] + # Merge external relations (e.g. 
cross-publish provenance) + for ri in doi_metadata.get("relatedIdentifiers", []): + related.append(ri) + result = client.mint_doi( + source_id=source_id, + metadata=doi_metadata, + publish=True, + doi_suffix=version_suffix, + related_identifiers=related, + ) + + # Also update the dataset DOI metadata to reflect latest version + # and add HasVersion pointing to the new version DOI + if result.get("success"): + version_doi = result["doi"] + has_version = [ + { + "relatedIdentifier": version_doi, + "relatedIdentifierType": "DOI", + "relationType": "HasVersion", + } + ] + client.update_metadata( + doi=dataset_doi, + metadata=doi_metadata, + related_identifiers=has_version, + ) + result["dataset_doi"] = dataset_doi + + client.close() + return result + else: + # No new DOI — just update the dataset DOI metadata on DataCite + update_result = client.update_metadata( + doi=dataset_doi, + metadata=doi_metadata, + ) + client.close() + return { + "success": update_result.get("success", False), + "dataset_doi": dataset_doi, + "doi": None, + "metadata_updated": True, + } + + except Exception as e: + return {"success": False, "error": str(e)} diff --git a/aws/v2/datacite.py b/aws/v2/datacite.py new file mode 100644 index 0000000..c5a412f --- /dev/null +++ b/aws/v2/datacite.py @@ -0,0 +1,374 @@ +"""DataCite DOI minting for MDF v2. + +Handles DOI registration with DataCite for published datasets. 
+ +Configuration: + DATACITE_API_URL: DataCite API endpoint (default: test API) + DATACITE_USERNAME: Repository ID (e.g., "MDF.MDF") + DATACITE_PASSWORD: Repository password + DATACITE_PREFIX: DOI prefix (e.g., "10.18126") +""" + +import os +from datetime import datetime, timezone +from typing import Any, Dict, List, Optional +from urllib.parse import quote + +import httpx + + +# DataCite API endpoints +DATACITE_TEST_API = "https://api.test.datacite.org" +DATACITE_PROD_API = "https://api.datacite.org" + + +class DataCiteClient: + """Client for DataCite DOI registration.""" + + def __init__( + self, + username: Optional[str] = None, + password: Optional[str] = None, + prefix: Optional[str] = None, + api_url: Optional[str] = None, + test_mode: bool = True, + ): + """Initialize DataCite client. + + Args: + username: DataCite repository ID + password: DataCite password + prefix: DOI prefix (e.g., "10.18126") + api_url: API endpoint (defaults based on test_mode) + test_mode: Use test API (default True for safety) + """ + self.username = username or os.environ.get("DATACITE_USERNAME") + self.password = password or os.environ.get("DATACITE_PASSWORD") + self.prefix = prefix or os.environ.get("DATACITE_PREFIX", "10.23677") + + if api_url: + self.api_url = api_url + else: + self.api_url = os.environ.get( + "DATACITE_API_URL", + DATACITE_TEST_API if test_mode else DATACITE_PROD_API + ) + + self._client = httpx.Client( + timeout=30.0, + auth=(self.username, self.password) if self.username and self.password else None, + ) + + def mint_doi( + self, + source_id: str, + metadata: Dict[str, Any], + url: Optional[str] = None, + publish: bool = True, + doi_suffix: Optional[str] = None, + related_identifiers: Optional[List[Dict[str, str]]] = None, + ) -> Dict[str, Any]: + """Mint a new DOI for a dataset. + + Args: + source_id: MDF source ID (used to generate DOI suffix) + metadata: DataCite metadata (titles, creators, etc.) 
+ url: Landing page URL (defaults to MDF URL) + publish: Whether to publish immediately (vs draft) + doi_suffix: Override DOI suffix (e.g. for version-specific DOIs) + related_identifiers: DataCite relatedIdentifiers list + + Returns: + Dict with doi, url, state + """ + # Generate DOI + suffix = doi_suffix or self._generate_suffix(source_id) + doi = f"{self.prefix}/{suffix}" + + # Default landing page URL + if not url: + url = f"https://materialsdatafacility.org/detail/{source_id}" + + # Build DataCite payload + payload = self._build_payload(doi, url, metadata, publish, related_identifiers) + + # Check if DOI already exists + existing = self.get_doi(doi) + if existing and existing.get("data"): + # Update existing DOI + return self._update_doi(doi, payload) + else: + # Create new DOI + return self._create_doi(payload) + + def update_metadata( + self, + doi: str, + metadata: Dict[str, Any], + url: Optional[str] = None, + related_identifiers: Optional[List[Dict[str, str]]] = None, + ) -> Dict[str, Any]: + """Update metadata on an existing DOI without minting a new one. + + Used when a new version inherits the dataset DOI and we want to + update the DataCite record to reflect the latest version's metadata. 
+ """ + payload = self._build_payload( + doi, url or "", metadata, publish=True, related_identifiers=related_identifiers, + ) + # Remove url from payload if not provided (don't overwrite) + if not url: + payload["data"]["attributes"].pop("url", None) + return self._update_doi(doi, payload) + + def get_doi(self, doi: str) -> Optional[Dict[str, Any]]: + """Get DOI metadata.""" + try: + response = self._client.get(f"{self.api_url}/dois/{quote(doi, safe='')}") + if response.status_code == 200: + return response.json() + return None + except Exception: + return None + + def _create_doi(self, payload: Dict[str, Any]) -> Dict[str, Any]: + """Create a new DOI.""" + response = self._client.post( + f"{self.api_url}/dois", + json=payload, + headers={"Content-Type": "application/vnd.api+json"}, + ) + + if response.status_code in (200, 201): + data = response.json() + return { + "success": True, + "doi": data["data"]["id"], + "url": data["data"]["attributes"].get("url"), + "state": data["data"]["attributes"].get("state"), + } + else: + return { + "success": False, + "error": response.text, + "status_code": response.status_code, + } + + def _update_doi(self, doi: str, payload: Dict[str, Any]) -> Dict[str, Any]: + """Update an existing DOI.""" + response = self._client.put( + f"{self.api_url}/dois/{quote(doi, safe='')}", + json=payload, + headers={"Content-Type": "application/vnd.api+json"}, + ) + + if response.status_code in (200, 201): + data = response.json() + return { + "success": True, + "doi": data["data"]["id"], + "url": data["data"]["attributes"].get("url"), + "state": data["data"]["attributes"].get("state"), + "updated": True, + } + else: + return { + "success": False, + "error": response.text, + "status_code": response.status_code, + } + + def _generate_suffix(self, source_id: str) -> str: + """Generate DOI suffix from source ID.""" + # Clean source_id for DOI + suffix = source_id.replace("_", "-").lower() + # Ensure it's valid for DOI + valid_chars = 
"abcdefghijklmnopqrstuvwxyz0123456789-." + suffix = "".join(c if c in valid_chars else "-" for c in suffix) + return suffix + + def _build_payload( + self, + doi: str, + url: str, + metadata: Dict[str, Any], + publish: bool = True, + related_identifiers: Optional[List[Dict[str, str]]] = None, + ) -> Dict[str, Any]: + """Build DataCite API payload.""" + # Extract metadata fields + titles = metadata.get("titles", []) + if not titles: + title = metadata.get("title", "Untitled Dataset") + titles = [{"title": title}] + + creators = metadata.get("creators", []) + if not creators: + authors = metadata.get("authors", []) + for author in authors: + if isinstance(author, str): + creators.append({"name": author}) + elif isinstance(author, dict): + name = f"{author.get('given_name', '')} {author.get('family_name', '')}".strip() + if not name: + name = author.get("name", "Unknown") + creator = {"name": name} + if author.get("affiliation"): + creator["affiliation"] = [{"name": author["affiliation"]}] + creators.append(creator) + + if not creators: + creators = [{"name": "Materials Data Facility"}] + + # Build attributes + attributes = { + "doi": doi, + "url": url, + "titles": titles, + "creators": creators, + "publisher": metadata.get("publisher", "Materials Data Facility"), + "publicationYear": int(metadata.get("publication_year", datetime.now().year)), + "types": {"resourceTypeGeneral": "Dataset"}, + "schemaVersion": "http://datacite.org/schema/kernel-4", + } + + # Add optional fields + if metadata.get("descriptions"): + attributes["descriptions"] = metadata["descriptions"] + elif metadata.get("description"): + attributes["descriptions"] = [{"description": metadata["description"], "descriptionType": "Abstract"}] + + if metadata.get("subjects"): + attributes["subjects"] = metadata["subjects"] + elif metadata.get("keywords"): + attributes["subjects"] = [{"subject": kw} for kw in metadata["keywords"]] + + if metadata.get("version"): + attributes["version"] = 
str(metadata["version"]) + + if metadata.get("rightsList"): + attributes["rightsList"] = metadata["rightsList"] + elif metadata.get("license"): + attributes["rightsList"] = [{"rights": metadata["license"]}] + + if metadata.get("fundingReferences"): + attributes["fundingReferences"] = metadata["fundingReferences"] + + if related_identifiers: + attributes["relatedIdentifiers"] = related_identifiers + + # State: draft, registered, or findable + if publish: + attributes["event"] = "publish" + + return { + "data": { + "type": "dois", + "attributes": attributes, + } + } + + def test_connection(self) -> Dict[str, Any]: + """Test connectivity to DataCite API.""" + try: + response = self._client.get(f"{self.api_url}/heartbeat") + return {"success": response.status_code == 200, "status_code": response.status_code} + except Exception as exc: + return {"success": False, "error": str(exc)} + + def close(self): + """Close HTTP client.""" + self._client.close() + + +class MockDataCiteClient: + """Mock DataCite client for testing.""" + + def __init__(self, prefix: str = "10.99999"): + self.prefix = prefix + self._dois: Dict[str, Dict] = {} + + def _generate_suffix(self, source_id: str) -> str: + suffix = source_id.replace("_", "-").lower() + valid_chars = "abcdefghijklmnopqrstuvwxyz0123456789-." 
+ return "".join(c if c in valid_chars else "-" for c in suffix) + + def mint_doi( + self, + source_id: str, + metadata: Dict[str, Any], + url: Optional[str] = None, + publish: bool = True, + doi_suffix: Optional[str] = None, + related_identifiers: Optional[List[Dict[str, str]]] = None, + ) -> Dict[str, Any]: + # Sanitize with the same helper the mock defines, so unsupported characters never reach the DOI + suffix = doi_suffix or self._generate_suffix(source_id) + doi = f"{self.prefix}/{suffix}" + + if not url: + url = f"https://materialsdatafacility.org/detail/{source_id}" + + self._dois[doi] = { + "doi": doi, + "url": url, + "metadata": metadata, + "related_identifiers": related_identifiers or [], + "state": "findable" if publish else "draft", + "created_at": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"), + } + + return { + "success": True, + "doi": doi, + "url": url, + "state": "findable" if publish else "draft", + "mock": True, + } + + def update_metadata( + self, + doi: str, + metadata: Dict[str, Any], + url: Optional[str] = None, + related_identifiers: Optional[List[Dict[str, str]]] = None, + ) -> Dict[str, Any]: + existing = self._dois.get(doi, {}) + existing["metadata"] = metadata + if url: + existing["url"] = url + if related_identifiers: + existing["related_identifiers"] = related_identifiers + self._dois[doi] = existing + return { + "success": True, + "doi": doi, + "url": existing.get("url"), + "state": existing.get("state", "findable"), + "updated": True, + "mock": True, + } + + def get_doi(self, doi: str) -> Optional[Dict[str, Any]]: + return self._dois.get(doi) + + def close(self): + pass + + +def get_datacite_client(test_mode: Optional[bool] = None) -> DataCiteClient: + """Get configured DataCite client. + + Returns the mock client when USE_MOCK_DATACITE is "true" or credentials are not configured. 
+ """ + if test_mode is None: + test_mode = os.environ.get("DATACITE_TEST_MODE", "true").lower() == "true" + + username = os.environ.get("DATACITE_USERNAME") + password = os.environ.get("DATACITE_PASSWORD") + + # Use mock if no credentials + use_mock = os.environ.get("USE_MOCK_DATACITE", "").lower() == "true" + if use_mock or not (username and password): + return MockDataCiteClient() + + return DataCiteClient(test_mode=test_mode) diff --git a/aws/v2/dataset_card.py b/aws/v2/dataset_card.py new file mode 100644 index 0000000..07e9e9b --- /dev/null +++ b/aws/v2/dataset_card.py @@ -0,0 +1,192 @@ +"""Dataset preview cards for MDF v2. + +Provides compact, human-readable summaries of datasets for quick preview +without downloading the full dataset or metadata. +""" + +from typing import Any, Dict, List, Optional + +from v2.metadata import parse_metadata +from v2.store import get_store + + +def _parse_size(total_bytes: int) -> str: + """Format bytes as human-readable size.""" + if total_bytes is None: + return "Unknown" + for unit in ["B", "KB", "MB", "GB", "TB"]: + if total_bytes < 1024: + return f"{total_bytes:.1f} {unit}" if unit != "B" else f"{total_bytes} {unit}" + total_bytes /= 1024 + return f"{total_bytes:.1f} PB" + + +def _extract_file_types(data_sources: List[str]) -> List[str]: + """Extract file type hints from data sources.""" + extensions = set() + for source in (data_sources or []): + if "." in source: + ext = source.rsplit(".", 1)[-1].lower() + if len(ext) <= 5 and ext.isalnum(): + extensions.add(ext) + return sorted(extensions) if extensions else ["unknown"] + + +def build_dataset_card(record: Dict[str, Any]) -> Dict[str, Any]: + """Build a preview card from a submission record. + + Returns a compact summary suitable for display in search results, + dashboards, or quick previews. 
+ """ + meta = parse_metadata(record) + + description = meta.description or "" + + card = { + "source_id": record.get("source_id"), + "version": record.get("version"), + "title": meta.title, + "authors": [a.name for a in meta.authors], + "description": description, + "keywords": meta.keywords, + "publisher": meta.publisher, + "publication_year": meta.publication_year, + "organization": record.get("organization") or meta.organization, + "methods": meta.methods, + "facility": meta.facility, + "status": record.get("status"), + "created_at": record.get("created_at"), + "updated_at": record.get("updated_at"), + # Quick stats + "stats": { + "file_types": _extract_file_types(meta.data_sources), + "data_sources_count": len(meta.data_sources), + "file_count": record.get("file_count", 0), + "total_bytes": record.get("total_bytes", 0), + "size_human": _parse_size(record.get("total_bytes", 0)), + }, + # Links + "links": { + "self": f"/status/{record.get('source_id')}", + "citation": f"/citation/{record.get('source_id')}", + } + } + + # Download info (for clone/download) + if meta.download_url: + card["download_url"] = meta.download_url + if meta.archive_size: + card["archive_size"] = meta.archive_size + card["data_sources"] = list(meta.data_sources) + + # ML summary when present + if meta.ml: + ml_summary = { + "data_format": meta.ml.data_format, + "task_type": meta.ml.task_type, + "n_items": meta.ml.n_items, + } + if meta.ml.splits: + ml_summary["splits"] = [ + {"type": s.type, "n_items": s.n_items} for s in meta.ml.splits + ] + if meta.ml.keys: + inputs = [k.name for k in meta.ml.keys if k.role == "input"] + targets = [k.name for k in meta.ml.keys if k.role == "target"] + if inputs: + ml_summary["input_keys"] = inputs + if targets: + ml_summary["target_keys"] = targets + if meta.ml.short_name: + ml_summary["short_name"] = meta.ml.short_name + card["ml"] = ml_summary + + # License + if meta.license: + card["license"] = meta.license.name + + # DOI + doi = record.get("doi") + 
if doi: + card["doi"] = doi + card["links"]["doi"] = f"https://doi.org/{doi}" + + # External provenance (cross-published datasets) + if meta.external: + provenance = {"source": meta.external.source} + if meta.external.doi: + provenance["doi"] = meta.external.doi + provenance["doi_url"] = f"https://doi.org/{meta.external.doi}" + if meta.external.url: + provenance["url"] = meta.external.url + provenance["notice"] = f"Originally published at {meta.external.source}" + card["external_provenance"] = provenance + + # Profile summary when available + profile = record.get("dataset_profile") + if profile: + import json as _json + if isinstance(profile, str): + try: + profile = _json.loads(profile) + except Exception: + profile = None + if isinstance(profile, dict): + ps = { + "total_files": profile.get("total_files"), + "total_bytes": profile.get("total_bytes"), + "formats": profile.get("formats", {}), + } + # Tabular summary from first file with columns + for fp in profile.get("files", []): + cols = fp.get("columns", []) + if cols: + ps["tabular_summary"] = { + "filename": fp.get("filename"), + "n_rows": fp.get("n_rows"), + "columns": [{"name": c.get("name"), "dtype": c.get("dtype")} for c in cols], + } + ps["sample_rows"] = fp.get("sample_rows", [])[:3] + break + card["profile_summary"] = ps + + return card + + +def build_stream_card(stream: Dict[str, Any]) -> Dict[str, Any]: + """Build a preview card for a stream.""" + import json + + metadata = stream.get("metadata") or {} + if isinstance(metadata, str): + try: + metadata = json.loads(metadata) + except Exception: + metadata = {} + if metadata is None: + metadata = {} + + return { + "stream_id": stream.get("stream_id"), + "title": stream.get("title"), + "lab_id": stream.get("lab_id"), + "organization": stream.get("organization"), + "status": stream.get("status"), + "created_at": stream.get("created_at"), + "updated_at": stream.get("updated_at"), + "stats": { + "file_count": stream.get("file_count", 0), + "total_size": 
_parse_size(stream.get("total_bytes", 0)), + "total_bytes": stream.get("total_bytes", 0), + }, + "metadata": { + "instrument": metadata.get("instrument"), + "facility": metadata.get("facility"), + "operator": metadata.get("operator"), + "run_id": metadata.get("run_id"), + }, + "links": { + "self": f"/stream/{stream.get('stream_id')}", + "files": f"/stream/{stream.get('stream_id')}/files", + } + } diff --git a/aws/v2/demo_full_workflow.py b/aws/v2/demo_full_workflow.py new file mode 100755 index 0000000..6818fb4 --- /dev/null +++ b/aws/v2/demo_full_workflow.py @@ -0,0 +1,498 @@ +#!/usr/bin/env python3 +"""MDF v2 Backend Demo - Full Workflow Showcase. + +This script demonstrates all capabilities of the MDF v2 local backend: +1. Dataset publishing (git-style workflow) +2. Stream creation with file uploads +3. Unified search across datasets and streams + +Requirements: + pip install rich httpx + +Usage: + # Start the local server first (in another terminal): + cd cs/aws && python -m v2.local_server + + # Then run this demo: + python v2/demo_full_workflow.py +""" + +import base64 +import json +import os +import sys +import tempfile +import time +from pathlib import Path + +import httpx +from rich.console import Console +from rich.panel import Panel +from rich.progress import Progress, SpinnerColumn, TextColumn +from rich.table import Table +from rich.syntax import Syntax +from rich.tree import Tree +from rich import box + +console = Console() + +API_URL = os.environ.get("MDF_API_URL", "http://127.0.0.1:8080") + + +def banner(): + """Display the demo banner.""" + console.print() + console.print(Panel.fit( + "[bold blue]MDF v2 Backend Demo[/bold blue]\n" + "[dim]Materials Data Facility - Local Development Server[/dim]", + border_style="blue", + )) + console.print() + + +def section(title: str, description: str = ""): + """Display a section header.""" + console.print() + console.rule(f"[bold cyan]{title}[/bold cyan]") + if description: + 
console.print(f"[dim]{description}[/dim]") + console.print() + + +def api_call(method: str, path: str, data: dict = None, params: dict = None) -> dict: + """Make an API call to the local backend.""" + url = f"{API_URL}{path}" + with httpx.Client(timeout=30.0) as client: + if method == "GET": + response = client.get(url, params=params) + else: + response = client.post(url, json=data) + return response.json() + + +def demo_dataset_publishing(): + """Demonstrate the git-style dataset publishing workflow.""" + section( + "1. Dataset Publishing", + "Git-style workflow: define metadata → submit → track status" + ) + + # Show the payload we'll submit + payload = { + "dc": { + "titles": [{"title": "High-Throughput DFT Study of Perovskite Stability"}], + "creators": [ + {"creatorName": "Chen, Alice", "givenName": "Alice", "familyName": "Chen", + "affiliation": "Argonne National Laboratory"}, + {"creatorName": "Kumar, Raj", "givenName": "Raj", "familyName": "Kumar", + "affiliation": "University of Chicago"}, + ], + "publisher": "Materials Data Facility", + "publicationYear": "2026", + "descriptions": [{ + "description": "Density functional theory calculations examining the thermodynamic " + "stability of 500+ perovskite compositions for solar cell applications.", + "descriptionType": "Abstract" + }], + "subjects": [ + {"subject": "perovskite"}, + {"subject": "DFT"}, + {"subject": "solar cells"}, + {"subject": "stability"}, + ], + "resourceType": {"resourceTypeGeneral": "Dataset", "resourceType": "Dataset"}, + }, + "data_sources": ["globus://my-endpoint/perovskite_dft_2026/"], + "mdf": { + "source_name": "perovskite_stability_highthroughput", + "organization": "argonne_msd", + "lab_id": "chen-lab", + "facility": "ALCF", + "instruments": ["Theta", "Polaris"], + } + } + + console.print("[bold]Submission Payload:[/bold]") + syntax = Syntax(json.dumps(payload, indent=2), "json", theme="monokai", line_numbers=True) + console.print(Panel(syntax, title="POST /submit", 
border_style="green")) + + # Submit + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + ) as progress: + task = progress.add_task("Submitting dataset...", total=None) + result = api_call("POST", "/submit", data=payload) + progress.update(task, completed=True) + + if result.get("success"): + console.print(f"[green]✓[/green] Dataset submitted successfully!") + + table = Table(show_header=False, box=box.SIMPLE) + table.add_column("Field", style="dim") + table.add_column("Value", style="cyan") + table.add_row("Source ID", result.get("source_id", "")) + table.add_row("Version", result.get("version", "")) + table.add_row("Status", result.get("status", "submitted")) + console.print(table) + else: + console.print(f"[red]✗[/red] Error: {result.get('error')}") + + return result.get("source_id") + + +def demo_streaming_workflow(): + """Demonstrate the streaming data workflow with file uploads.""" + section( + "2. Streaming Data Workflow", + "For automated labs: create stream → upload files → track → close" + ) + + # Create a stream + console.print("[bold]Creating a new data stream...[/bold]") + + stream_payload = { + "title": "Autonomous XRD Synthesis Campaign", + "lab_id": "selfdriving-lab-01", + "organization": "argonne_asl", + "metadata": { + "instrument": "Bruker D8 Advance", + "facility": "Argonne Self-Driving Lab", + "operator": "AutoBot v2.1", + "run_id": "campaign-2026-01-31", + } + } + + result = api_call("POST", "/stream/create", data=stream_payload) + + if not result.get("success"): + console.print(f"[red]✗[/red] Failed to create stream: {result.get('error')}") + return None + + stream_id = result.get("stream_id") + console.print(f"[green]✓[/green] Stream created: [cyan]{stream_id}[/cyan]") + console.print() + + # Simulate uploading experimental data files + console.print("[bold]Uploading experimental data files...[/bold]") + + # Generate some fake XRD data + sample_files = [ + { + "filename": 
"sample_001_BaTiO3.xy", + "content": "# BaTiO3 XRD Pattern\n# 2theta intensity\n20.0 150\n22.5 890\n31.5 2100\n38.9 450\n45.0 1800\n", + "metadata": {"composition": "BaTiO3", "temperature_K": 300, "sample_id": "001"} + }, + { + "filename": "sample_002_SrTiO3.xy", + "content": "# SrTiO3 XRD Pattern\n# 2theta intensity\n22.8 920\n32.4 2300\n39.9 520\n46.5 1950\n57.8 780\n", + "metadata": {"composition": "SrTiO3", "temperature_K": 300, "sample_id": "002"} + }, + { + "filename": "sample_003_PbTiO3.xy", + "content": "# PbTiO3 XRD Pattern\n# 2theta intensity\n21.5 780\n31.2 1900\n38.1 380\n44.5 1650\n55.2 620\n", + "metadata": {"composition": "PbTiO3", "temperature_K": 300, "sample_id": "003"} + }, + { + "filename": "synthesis_log.json", + "content": json.dumps({ + "campaign": "perovskite-screen-01", + "start_time": "2026-01-31T10:00:00Z", + "samples_synthesized": 3, + "success_rate": 1.0, + "notes": "All samples crystallized successfully" + }, indent=2), + "metadata": {"file_type": "log", "format": "json"} + } + ] + + upload_table = Table(title="Uploaded Files", box=box.ROUNDED) + upload_table.add_column("Filename", style="cyan") + upload_table.add_column("Size", justify="right") + upload_table.add_column("Checksum", style="dim") + upload_table.add_column("Metadata", style="yellow") + + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + ) as progress: + for sample in sample_files: + task = progress.add_task(f"Uploading {sample['filename']}...", total=None) + + upload_payload = { + "filename": sample["filename"], + "content_base64": base64.b64encode(sample["content"].encode()).decode("ascii"), + "metadata": sample["metadata"], + } + + result = api_call("POST", f"/stream/{stream_id}/upload", data=upload_payload) + progress.update(task, completed=True) + + if result.get("success"): + for f in result.get("files", []): + meta_str = ", ".join(f"{k}={v}" for k, v in sample["metadata"].items()) + 
upload_table.add_row( + f["filename"], + f"{f['size_bytes']} bytes", + f["checksum_md5"][:12] + "...", + meta_str[:30] + ("..." if len(meta_str) > 30 else "") + ) + + time.sleep(0.2) # Small delay for visual effect + + console.print() + console.print(upload_table) + + # Show stream status + console.print() + console.print("[bold]Stream Status:[/bold]") + status = api_call("GET", f"/stream/{stream_id}") + + if status.get("success"): + stream = status.get("stream", {}) + status_table = Table(show_header=False, box=box.SIMPLE) + status_table.add_column("Field", style="dim") + status_table.add_column("Value", style="cyan") + status_table.add_row("Stream ID", stream.get("stream_id", "")) + status_table.add_row("Title", stream.get("title", "")) + status_table.add_row("Status", f"[green]{stream.get('status', '')}[/green]") + status_table.add_row("File Count", str(stream.get("file_count", 0))) + status_table.add_row("Total Bytes", str(stream.get("total_bytes", 0))) + status_table.add_row("Lab ID", stream.get("lab_id", "")) + console.print(status_table) + + # List files in stream + console.print() + console.print("[bold]Files in Stream:[/bold]") + files_result = api_call("GET", f"/stream/{stream_id}/files") + + if files_result.get("success"): + files = files_result.get("files", []) + tree = Tree(f"[cyan]{stream_id}[/cyan]") + for f in files: + meta = f.get("metadata", {}) + meta_str = f" [dim]({meta.get('composition', meta.get('file_type', ''))})[/dim]" if meta else "" + tree.add(f"[green]{f['filename']}[/green]{meta_str}") + console.print(tree) + + return stream_id + + +def demo_search(source_id: str = None, stream_id: str = None): + """Demonstrate unified search across datasets and streams.""" + section( + "3. 
Unified Search", + "Search across both published datasets and active streams" + ) + + searches = [ + ("perovskite", "all", "Searching for 'perovskite' across all content..."), + ("DFT", "datasets", "Searching datasets for 'DFT'..."), + ("XRD", "streams", "Searching streams for 'XRD'..."), + ("synthesis", "all", "Searching for 'synthesis'..."), + ] + + for query, search_type, description in searches: + console.print(f"[bold]{description}[/bold]") + + result = api_call("GET", "/search", params={"q": query, "type": search_type, "limit": "5"}) + + if result.get("total", 0) > 0: + results_table = Table(box=box.SIMPLE) + results_table.add_column("Type", width=8) + results_table.add_column("Title") + results_table.add_column("ID", style="cyan") + results_table.add_column("Score", justify="right", style="yellow") + + for r in result.get("results", []): + type_style = "[blue]dataset[/blue]" if r.get("type") == "dataset" else "[green]stream[/green]" + results_table.add_row( + type_style, + (r.get("title", "")[:35] + "...") if len(r.get("title", "")) > 35 else r.get("title", ""), + r.get("source_id", r.get("stream_id", "")), + f"{r.get('score', 0):.1f}", + ) + + console.print(results_table) + else: + console.print(f"[dim]No results found[/dim]") + + console.print() + + +def demo_cards_and_citations(source_id: str): + """Demonstrate dataset cards and citation export.""" + section( + "4. 
Dataset Cards & Citations", + "Quick previews and citation export for researchers" + ) + + if not source_id: + console.print("[dim]No dataset to show card for[/dim]") + return + + # Get dataset card + console.print("[bold]Dataset Preview Card:[/bold]") + result = api_call("GET", f"/card/{source_id}") + + if result.get("success"): + card = result.get("card", {}) + console.print(Panel( + f"[bold]{card.get('title', 'Untitled')}[/bold]\n\n" + f"[dim]{card.get('description', 'No description')}[/dim]", + title=f"[cyan]{card.get('source_id')}[/cyan] v{card.get('version', '1.0')}", + border_style="blue", + )) + + # Metadata + table = Table(show_header=False, box=box.SIMPLE) + table.add_column("Field", style="dim", width=12) + table.add_column("Value") + if card.get("authors"): + table.add_row("Authors", ", ".join(card["authors"])) + if card.get("keywords"): + table.add_row("Keywords", ", ".join(card["keywords"][:5])) + if card.get("methods"): + table.add_row("Methods", ", ".join(card["methods"])) + table.add_row("Status", f"[green]{card.get('status')}[/green]") + console.print(table) + else: + console.print(f"[red]Error:[/red] {result.get('error')}") + + console.print() + + # Get citations + console.print("[bold]Citation Export:[/bold]") + + # APA + result = api_call("GET", f"/citation/{source_id}", params={"format": "apa"}) + if result.get("success"): + console.print(Panel(result.get("apa", ""), title="APA Format", border_style="green")) + + # BibTeX + result = api_call("GET", f"/citation/{source_id}", params={"format": "bibtex"}) + if result.get("success"): + from rich.syntax import Syntax + bibtex = result.get("bibtex", "") + syntax = Syntax(bibtex, "bibtex", theme="monokai") + console.print(Panel(syntax, title="BibTeX", border_style="green")) + + +def demo_api_overview(): + """Show an overview of all available API endpoints.""" + section( + "5. 
API Reference", + "All endpoints available in the MDF v2 local backend" + ) + + endpoints = [ + ("Dataset Publishing", [ + ("POST", "/submit", "Submit a new dataset"), + ("GET", "/status/{source_id}", "Get dataset status"), + ("GET", "/submissions", "List all submissions"), + ("POST", "/status/update", "Update submission status"), + ]), + ("Streaming Data", [ + ("POST", "/stream/create", "Create a new stream"), + ("POST", "/stream/{id}/upload", "Upload files to stream"), + ("GET", "/stream/{id}/files", "List files in stream"), + ("POST", "/stream/{id}/append", "Append metadata to stream"), + ("POST", "/stream/{id}/snapshot", "Create searchable snapshot"), + ("POST", "/stream/{id}/close", "Close/finalize stream"), + ("GET", "/stream/{id}", "Get stream status"), + ]), + ("Discovery", [ + ("GET", "/search?q={query}", "Search datasets and streams"), + ("GET", "/card/{source_id}", "Get dataset preview card"), + ("GET", "/citation/{source_id}", "Export citation (bibtex, ris, apa)"), + ]), + ] + + for category, routes in endpoints: + table = Table(title=f"[bold]{category}[/bold]", box=box.ROUNDED, title_justify="left") + table.add_column("Method", style="magenta", width=6) + table.add_column("Endpoint", style="cyan") + table.add_column("Description", style="dim") + + for method, path, desc in routes: + table.add_row(method, path, desc) + + console.print(table) + console.print() + + +def demo_cli_commands(): + """Show the CLI commands available.""" + section( + "6. 
CLI Quick Reference", + "Use these commands with the mdf CLI tool" + ) + + cli_examples = """ +# Dataset publishing workflow +mdf init ./my_dataset --title "My Dataset" --author "Jane Doe" +mdf add *.csv data/*.json +mdf commit -m "Initial dataset" +mdf validate +mdf publish --local --submit + +# Streaming workflow +mdf stream create --title "Lab Experiment" --lab-id "lab-01" +mdf stream upload --stream-id <stream_id> data.csv results.json +mdf stream files --stream-id <stream_id> +mdf stream status --stream-id <stream_id> +mdf stream close --stream-id <stream_id> + +# Search +mdf search "perovskite" +mdf search "XRD" --type streams +mdf backend search "DFT calculations" --limit 10 +""" + + syntax = Syntax(cli_examples.strip(), "bash", theme="monokai", line_numbers=False) + console.print(Panel(syntax, title="CLI Examples", border_style="green")) + + +def main(): + """Run the full demo.""" + banner() + + # Check if server is running + console.print("[dim]Checking connection to local server...[/dim]") + try: + api_call("GET", "/submissions") + console.print(f"[green]✓[/green] Connected to {API_URL}") + except Exception: + console.print(f"[red]✗[/red] Cannot connect to {API_URL}") + console.print("[dim]Start the server with: cd cs/aws && python -m v2.local_server[/dim]") + sys.exit(1) + + # Run demos + source_id = demo_dataset_publishing() + stream_id = demo_streaming_workflow() + demo_search(source_id, stream_id) + demo_cards_and_citations(source_id) + demo_api_overview() + demo_cli_commands() + + # Final summary + section("Demo Complete!") + + summary = Table(show_header=False, box=box.SIMPLE) + summary.add_column("", style="green") + summary.add_column("") + summary.add_row("✓", "Dataset published and indexed") + summary.add_row("✓", "Stream created with file uploads") + summary.add_row("✓", "Unified search working") + summary.add_row("✓", "Full API available at " + API_URL) + console.print(summary) + + console.print() + console.print("[dim]Explore more at: 
https://github.com/materials-data-facility/mdf-connect[/dim]") + console.print() + + +if __name__ == "__main__": + main() diff --git a/aws/v2/demo_new_cli.sh b/aws/v2/demo_new_cli.sh new file mode 100644 index 0000000..fb220da --- /dev/null +++ b/aws/v2/demo_new_cli.sh @@ -0,0 +1,180 @@ +#!/usr/bin/env bash +set -euo pipefail + +############################################################################### +# demo_new_cli.sh — MDF Agent CLI UX demo +# +# Showcases: global config, direct publish with HTTPS upload to Globus, +# status from memory, repo mode, curation workflow, search. +# +# Prerequisites: +# 1. pip install -e . (so `mdf` is on PATH) +# 2. mdf login (Globus auth for staging) +# +# Usage: bash cs/aws/v2/demo_new_cli.sh +############################################################################### + +blue() { printf "\033[1;34m%s\033[0m\n" "$*"; } +green() { printf "\033[1;32m%s\033[0m\n" "$*"; } +dim() { printf "\033[2m%s\033[0m\n" "$*"; } +banner() { echo; printf "\033[1;36m══ %s ══\033[0m\n" "$*"; echo; } +pause() { dim "(enter to continue)"; read -r; } + +# ── Create sample data ─────────────────────────────────────────────────────── + +DATA_DIR=$(mktemp -d) +REPO_DIR=$(mktemp -d) +trap 'rm -rf "${DATA_DIR}" "${REPO_DIR}"' EXIT + +cat > "${DATA_DIR}/xrd_scan_001.csv" <<'CSV' +two_theta,intensity,d_spacing +10.5,120,8.42 +21.3,450,4.17 +31.7,890,2.82 +38.2,340,2.35 +44.5,670,2.03 +50.1,210,1.82 +CSV + +cat > "${DATA_DIR}/xrd_scan_002.csv" <<'CSV' +two_theta,intensity,d_spacing +10.5,115,8.42 +21.4,460,4.16 +31.6,910,2.83 +38.3,355,2.35 +44.4,680,2.04 +50.0,225,1.82 +CSV + +cat > "${DATA_DIR}/metadata.json" <<'JSON' +{ + "instrument": "Rigaku SmartLab", + "wavelength_angstrom": 1.5406, + "scan_type": "theta-2theta", + "sample": "Fe3Al intermetallic", + "temperature_K": 298 +} +JSON + +green "Sample data created:" +ls -lh "${DATA_DIR}" +echo + +############################################################################### +banner "1. 
GLOBAL CONFIG — set defaults once, use everywhere" +############################################################################### + +mdf config set defaults.service staging +mdf config set user.organization argonne +mdf config set user.publisher "Materials Data Facility" + +blue "Config:" +mdf config show + +pause + +############################################################################### +banner "2. DIRECT PUBLISH — files upload to Globus, no repo needed" +############################################################################### + +dim '$ mdf publish ./data/ \' +dim ' --title "Fe3Al XRD Characterization" \' +dim ' --author "Doe, Jane" --author "Smith, Bob" \' +dim ' --description "Theta-2theta XRD scans ..." \' +dim ' --test --submit' +echo + +mdf publish "${DATA_DIR}/" \ + --title "Fe3Al XRD Characterization" \ + --author "Doe, Jane" --author "Smith, Bob" \ + --description "Theta-2theta XRD scans of Fe3Al intermetallic" \ + --test --submit + +echo +blue "Saved to config:" +mdf config get last_publish + +pause + +############################################################################### +banner "3. STATUS — remembers what you just published" +############################################################################### + +dim '$ mdf status # no args — reads last_publish from config' +echo +mdf status + +pause + +############################################################################### +banner "4. 
REPO MODE — init / add / commit / publish" +############################################################################### + +mdf init "${REPO_DIR}" \ + --title "High-Entropy Alloy Tensile Data" \ + --author "Kumar, Raj" + +cat > "${REPO_DIR}/tensile_data.csv" <<'CSV' +strain_pct,stress_mpa,temp_k +0.1,210,298 +0.5,450,298 +1.0,680,298 +2.0,890,298 +5.0,1050,298 +CSV + +(cd "${REPO_DIR}" && mdf add tensile_data.csv) +(cd "${REPO_DIR}" && mdf commit -m "Initial experimental data") +echo +(cd "${REPO_DIR}" && mdf publish --test --submit) + +echo +blue "Config tracks the latest:" +mdf config get last_publish + +pause + +############################################################################### +banner "5. CURATION — list pending, then approve" +############################################################################### + +mdf pending + +echo +SOURCE_ID=$(mdf config get last_publish | python3 -c " +import json, sys +print(json.load(sys.stdin).get('source_id', '')) +" 2>/dev/null || true) + +if [[ -n "${SOURCE_ID}" ]]; then + blue "Approving: ${SOURCE_ID}" + mdf approve "${SOURCE_ID}" --notes "Data looks great" +fi + +pause + +############################################################################### +banner "6. SEARCH" +############################################################################### + +dim '$ mdf search "XRD"' +mdf search "XRD" || true + +echo +dim '$ mdf search "tensile"' +mdf search "tensile" || true + +############################################################################### +banner "DONE" +############################################################################### + +green "What we demonstrated:" +echo " 1. mdf config set/show — set defaults once" +echo " 2. mdf publish ./data/ — direct publish, real Globus upload" +echo " 3. mdf status — remembers last publish" +echo " 4. mdf init/add/commit/pub — repo mode still works" +echo " 5. mdf pending / approve — curation at top level" +echo " 6. 
mdf search — find datasets" +echo +dim "Config file:" +mdf config path diff --git a/aws/v2/demo_search.py b/aws/v2/demo_search.py new file mode 100755 index 0000000..a233fc6 --- /dev/null +++ b/aws/v2/demo_search.py @@ -0,0 +1,231 @@ +#!/usr/bin/env python3 +"""Demo: Full MDF workflow with search. + +This demonstrates the complete local MDF experience: +1. Create multiple datasets and streams +2. Search across all of them +3. Find what you're looking for! + +Run with: python cs/aws/v2/demo_search.py +""" + +import os +import sys +import time +from pathlib import Path + +# Setup paths +THIS_DIR = Path(__file__).resolve().parent +sys.path.insert(0, str(THIS_DIR.parent)) +sys.path.insert(0, str(THIS_DIR.parent.parent.parent / "src")) + +os.environ.setdefault("MDF_API_URL", "http://127.0.0.1:8080") +os.environ.setdefault("STORE_BACKEND", "sqlite") +os.environ.setdefault("SQLITE_PATH", "/tmp/mdf_connect_v2.db") + +from mdf_agent.core.backend_client import BackendClient + +# Rich console for pretty output +try: + from rich.console import Console + from rich.panel import Panel + from rich.table import Table + from rich import print as rprint + console = Console() + HAS_RICH = True +except ImportError: + HAS_RICH = False + console = None + def rprint(*args, **kwargs): + print(*args) + + +def print_header(text): + if HAS_RICH: + console.print(f"\n[bold cyan]{'='*60}[/bold cyan]") + console.print(f"[bold cyan]{text}[/bold cyan]") + console.print(f"[bold cyan]{'='*60}[/bold cyan]\n") + else: + print(f"\n{'='*60}") + print(text) + print(f"{'='*60}\n") + + +def print_step(num, text): + if HAS_RICH: + console.print(f"[bold yellow]Step {num}:[/bold yellow] {text}") + else: + print(f"Step {num}: {text}") + + +def main(): + client = BackendClient.from_env() + + # Intro + if HAS_RICH: + console.print(Panel.fit( + "[bold]MDF Local Demo: Datasets, Streams & Search[/bold]\n\n" + "This demo creates sample datasets and streams,\n" + "then shows how to search across all of them.", + 
title="Materials Data Facility", + border_style="blue", + )) + else: + print("=" * 60) + print("MDF Local Demo: Datasets, Streams & Search") + print("=" * 60) + + time.sleep(1) + + # Step 1: Create some datasets + print_header("Creating Sample Datasets") + + datasets = [ + { + "dc": { + "titles": [{"title": "Fe-Al Intermetallic Formation Energies"}], + "creators": [{"creatorName": "Doe, Jane"}, {"creatorName": "Smith, John"}], + "publisher": "Materials Data Facility", + "publicationYear": "2024", + "descriptions": [{"description": "DFT calculations of iron-aluminum intermetallic compounds"}], + "subjects": [{"subject": "DFT"}, {"subject": "intermetallics"}, {"subject": "iron"}, {"subject": "aluminum"}], + }, + "data_sources": ["globus://endpoint/fe-al-data/"], + "test": True, + }, + { + "dc": { + "titles": [{"title": "Perovskite Solar Cell Efficiency Database"}], + "creators": [{"creatorName": "Chen, Wei"}], + "publisher": "Materials Data Facility", + "publicationYear": "2024", + "descriptions": [{"description": "Experimental efficiency measurements for perovskite solar cells with XRD characterization"}], + "subjects": [{"subject": "perovskite"}, {"subject": "solar cells"}, {"subject": "XRD"}, {"subject": "efficiency"}], + }, + "data_sources": ["globus://endpoint/perovskite-solar/"], + "test": True, + }, + { + "dc": { + "titles": [{"title": "High-Entropy Alloy Mechanical Properties"}], + "creators": [{"creatorName": "Kumar, Raj"}, {"creatorName": "Williams, Emma"}], + "publisher": "Materials Data Facility", + "publicationYear": "2024", + "descriptions": [{"description": "Tensile testing and microstructure analysis of HEA compositions"}], + "subjects": [{"subject": "high-entropy alloys"}, {"subject": "mechanical properties"}, {"subject": "tensile testing"}], + }, + "data_sources": ["globus://endpoint/hea-data/"], + "test": True, + }, + ] + + for i, dataset in enumerate(datasets, 1): + print_step(i, f"Submitting: {dataset['dc']['titles'][0]['title']}") + result = 
client.submit(dataset) + if HAS_RICH: + console.print(f" [green]Created:[/green] {result.get('source_id')} v{result.get('version')}") + else: + print(f" Created: {result.get('source_id')} v{result.get('version')}") + time.sleep(0.3) + + # Step 2: Create some streams + print_header("Creating Lab Streams") + + streams = [ + {"title": "Argonne XRD Beamline - Jan 2024", "lab_id": "argonne-11-id-c"}, + {"title": "NIST Neutron Diffraction Run", "lab_id": "nist-ncnr"}, + {"title": "Stanford SLAC LCLS XRD Campaign", "lab_id": "slac-lcls"}, + ] + + stream_ids = [] + for i, stream in enumerate(streams, 1): + print_step(i, f"Creating stream: {stream['title']}") + result = client.stream_create(stream["title"], lab_id=stream["lab_id"]) + stream_id = result.get("stream_id") + stream_ids.append(stream_id) + if HAS_RICH: + console.print(f" [green]Created:[/green] {stream_id}") + else: + print(f" Created: {stream_id}") + + # Add some files + client.stream_append(stream_id, file_count=25, total_bytes=1024*1024*100) + time.sleep(0.2) + + # Step 3: Demo search! 
+ print_header("Searching the Repository") + + queries = [ + ("perovskite", "all"), + ("XRD", "all"), + ("iron", "datasets"), + ("argonne", "streams"), + ("DFT intermetallics", "all"), + ] + + for query, search_type in queries: + if HAS_RICH: + console.print(f"\n[bold]Searching:[/bold] [cyan]'{query}'[/cyan] (type: {search_type})") + else: + print(f"\nSearching: '{query}' (type: {search_type})") + + result = client.search(query, search_type=search_type) + + if result.get("results"): + if HAS_RICH: + table = Table(show_header=True, header_style="bold") + table.add_column("Type", width=8) + table.add_column("Title", width=40) + table.add_column("ID") + + for item in result["results"]: + if item.get("type") == "dataset": + table.add_row( + "[blue]dataset[/blue]", + item.get("title", "")[:40], + item.get("source_id", ""), + ) + else: + table.add_row( + "[green]stream[/green]", + item.get("title", "")[:40], + item.get("stream_id", ""), + ) + console.print(table) + else: + for item in result["results"]: + print(f" - [{item.get('type')}] {item.get('title', '')[:40]}") + else: + print(" (no results)") + + time.sleep(0.5) + + # Summary + print_header("Demo Complete!") + + if HAS_RICH: + console.print(Panel.fit( + "[bold green]What we demonstrated:[/bold green]\n\n" + "• Created 3 datasets with rich metadata\n" + "• Created 3 streaming lab data feeds\n" + "• Searched across all content by keyword\n" + "• Filtered by type (datasets vs streams)\n\n" + "[dim]Try it yourself:[/dim]\n" + " mdf search 'perovskite'\n" + " mdf search 'XRD' --type streams\n" + " mdf backend search 'iron oxide'", + title="Summary", + border_style="green", + )) + else: + print("What we demonstrated:") + print("• Created 3 datasets with rich metadata") + print("• Created 3 streaming lab data feeds") + print("• Searched across all content by keyword") + print("\nTry: mdf search 'perovskite'") + + client.close() + + +if __name__ == "__main__": + main() diff --git a/aws/v2/demo_workflows.py 
b/aws/v2/demo_workflows.py new file mode 100644 index 0000000..4380aa9 --- /dev/null +++ b/aws/v2/demo_workflows.py @@ -0,0 +1,951 @@ +#!/usr/bin/env python +""" +MDF Connect v2 — Interactive Workflow Demo +========================================== + +Four end-to-end workflows that exercise the new FastAPI backend, +rendered with Rich panels, tables, trees, and progress bars. + +Usage: + # Start the server first (in another terminal): + cd cs/aws && STORE_BACKEND=sqlite AUTH_MODE=dev python -m v2.app.main + + # Then run the demo: + python v2/demo_workflows.py [--base-url http://127.0.0.1:8080] +""" + +import argparse +import base64 +import json +import sys +import time +from datetime import datetime + +import requests +from rich.align import Align +from rich.columns import Columns +from rich.console import Console, Group +from rich.live import Live +from rich.markdown import Markdown +from rich.padding import Padding +from rich.panel import Panel +from rich.progress import ( + BarColumn, + Progress, + SpinnerColumn, + TextColumn, + TimeElapsedColumn, +) +from rich.rule import Rule +from rich.style import Style +from rich.syntax import Syntax +from rich.table import Table +from rich.text import Text +from rich.tree import Tree + +console = Console(width=100) + +# -- Theme colours ---------------------------------------------------------- +ACCENT = "bright_cyan" +OK = "bright_green" +WARN = "bright_yellow" +ERR = "bright_red" +DIM = "dim" +TITLE = "bold bright_white" + +HEADER = { + "Content-Type": "application/json", + "X-User-Id": "demo-researcher", + "X-User-Email": "researcher@mdf.org", +} + +CURATOR_HEADER = { + "Content-Type": "application/json", + "X-User-Id": "curator-admin", + "X-User-Email": "curator@mdf.org", +} + +# -- Helpers ---------------------------------------------------------------- + +def api(method, path, base, **kw): + """Fire an HTTP request and return (response_json, elapsed_ms).""" + url = f"{base.rstrip('/')}{path}" + t0 = 
time.perf_counter() + resp = getattr(requests, method)(url, **kw) + elapsed = (time.perf_counter() - t0) * 1000 + try: + body = resp.json() + except Exception: + body = {"raw": resp.text} + return body, elapsed + + +def status_dot(success): + return Text("●", style=OK) if success else Text("●", style=ERR) + + +def latency_text(ms): + colour = OK if ms < 200 else WARN if ms < 500 else ERR + return Text(f"{ms:.0f} ms", style=colour) + + +def step_header(num, total, label): + console.print() + console.print( + Text(f" Step {num}/{total} ", style="bold white on dark_green"), + Text(f" {label}", style="bold"), + ) + console.print() + + +def pause(seconds=1.0): + time.sleep(seconds) + + +def banner(): + art = Text.from_markup( + "\n" + "[bold bright_cyan] ╔═══════════════════════════════════════════════════════╗[/]\n" + "[bold bright_cyan] ║[/] [bold bright_white]MDF Connect v2[/] [dim]·[/] [bold bright_magenta]FastAPI Backend Demo[/] [bold bright_cyan]║[/]\n" + "[bold bright_cyan] ║[/] [dim]Flat metadata schema · ML-ready · DataCite 4.5[/] [bold bright_cyan]║[/]\n" + "[bold bright_cyan] ╚═══════════════════════════════════════════════════════╝[/]\n" + ) + console.print(Align.center(art)) + + +# =================================================================== +# WORKFLOW 1 — Dataset Submission & Discovery (new flat format) +# =================================================================== + +def workflow_1(base): + panel = Panel( + Text.from_markup( + "[bold]A researcher submits an ML-ready materials dataset using the\n" + "new flat metadata schema, then discovers it via keyword search,\n" + "ML task-type search, rich dataset card, and citation export.[/]" + ), + title="[bold bright_magenta]Workflow 1[/] [bold]Dataset Submission & ML Discovery[/]", + subtitle="[dim]POST /submit → search → ML search → card → ML detail → citation[/]", + border_style="bright_magenta", + padding=(1, 3), + ) + console.print(panel) + pause(0.5) + + STEPS = 7 + + # -- Step 1: Submit 
(flat format with ML metadata) ------------------ + step_header(1, STEPS, "Submit a dataset with ML metadata (flat format)") + + payload = { + "title": "Thermal Conductivity of High-Entropy Alloys", + "authors": [ + {"name": "Chen, Wei", "given_name": "Wei", "family_name": "Chen", + "affiliations": ["Argonne National Laboratory"], + "orcid": "0000-0001-2345-6789"}, + {"name": "Park, Joon", "given_name": "Joon", "family_name": "Park", + "affiliations": ["University of Chicago"]}, + ], + "description": "Measured thermal conductivity of 47 high-entropy alloy compositions using laser flash analysis at temperatures from 300K to 1200K.", + "keywords": ["high-entropy alloys", "thermal conductivity", "materials science"], + "publisher": "Materials Data Facility", + "publication_year": 2026, + "resource_type": "Dataset", + "organization": "Argonne National Laboratory", + "facility": "Advanced Photon Source", + "methods": ["Laser Flash Analyzer LFA 457"], + "fields_of_science": ["materials science", "condensed matter physics"], + "data_sources": [ + "https://data.materialsdatafacility.org/hea_thermal/data.csv", + "https://data.materialsdatafacility.org/hea_thermal/metadata.json", + ], + "license": {"name": "CC BY 4.0", "url": "https://creativecommons.org/licenses/by/4.0/", + "identifier": "CC-BY-4.0"}, + "funding": [ + {"funder_name": "National Science Foundation", "award_number": "DMR-2012345", + "award_title": "High-Entropy Alloy Thermal Properties"}, + {"funder_name": "DOE Office of Science", "award_number": "DE-AC02-06CH11357"}, + ], + "related_works": [ + {"identifier": "10.1038/s41524-025-01234-5", "identifier_type": "DOI", + "relation_type": "IsSupplementTo", "description": "Original publication"}, + ], + "ml": { + "data_format": "tabular", + "task_type": ["supervised", "regression"], + "domain": ["materials science", "thermodynamics"], + "n_items": 47, + "short_name": "hea_thermal_v1", + "splits": [ + {"type": "train", "path": "train.csv", "n_items": 38}, + {"type": 
"test", "path": "test.csv", "n_items": 9}, + ], + "keys": [ + {"name": "composition", "role": "input", "dtype": "string", + "description": "Chemical formula (e.g. CoCrFeMnNi)"}, + {"name": "temperature", "role": "input", "dtype": "float64", + "units": "K", "description": "Measurement temperature"}, + {"name": "crystal_structure", "role": "input", "dtype": "string", + "description": "Crystal structure type", + "classes": ["FCC", "BCC", "HCP", "multi-phase"]}, + {"name": "thermal_conductivity", "role": "target", "dtype": "float64", + "units": "W/(m*K)", "description": "Measured thermal conductivity"}, + ], + }, + } + + body, ms = api("post", "/submit", base, json=payload, headers=HEADER) + + source_id = body.get("source_id", "???") + tbl = Table(show_header=False, box=None, padding=(0, 2)) + tbl.add_column(style="bold") + tbl.add_column() + tbl.add_row("Status", status_dot(body.get("success"))) + tbl.add_row("Source ID", Text(source_id, style=ACCENT)) + tbl.add_row("Version", body.get("version", "?")) + tbl.add_row("Organization", body.get("organization", "?")) + tbl.add_row("Latency", latency_text(ms)) + console.print(Panel(tbl, title="[bold]Submit Response[/]", border_style=DIM)) + pause(0.8) + + # -- Step 2: Check status ------------------------------------------- + step_header(2, STEPS, "Check submission status") + body, ms = api("get", f"/status/{source_id}", base, headers=HEADER) + sub = body.get("submission", {}) + + tbl = Table(show_header=False, box=None, padding=(0, 2)) + tbl.add_column(style="bold") + tbl.add_column() + tbl.add_row("Status", Text(sub.get("status", "?"), style=OK)) + tbl.add_row("Schema Version", sub.get("schema_version", "?")) + tbl.add_row("Created", sub.get("created_at", "?")) + tbl.add_row("User", sub.get("user_id", "?")) + tbl.add_row("Latency", latency_text(ms)) + console.print(Panel(tbl, title="[bold]Submission Status[/]", border_style=DIM)) + pause(0.8) + + # -- Step 3: Search by keyword -------------------------------------- + 
step_header(3, STEPS, 'Search for "thermal conductivity"') + body, ms = api("get", "/search", base, params={"q": "thermal conductivity"}) + results = body.get("results", []) + + tbl = Table(title="Keyword Search Results", border_style=ACCENT, show_lines=True) + tbl.add_column("#", style="dim", width=3) + tbl.add_column("Type", width=8) + tbl.add_column("Title / ID", ratio=3) + tbl.add_column("Status", width=12) + tbl.add_column("Score", width=6, justify="right") + + for i, r in enumerate(results[:5], 1): + title = r.get("title") or r.get("source_id") or "—" + tbl.add_row( + str(i), + Text(r.get("type", "?"), style="bold"), + title if len(title) < 55 else title[:52] + "...", + Text(r.get("status", "?"), style=OK), + f"{r.get('score', 0):.1f}", + ) + + console.print(tbl) + console.print(Text(f" {body.get('total', 0)} results in {ms:.0f} ms", style=DIM)) + pause(0.8) + + # -- Step 4: Search by ML task type (NEW) --------------------------- + step_header(4, STEPS, 'Search by ML task type: "regression"') + body, ms = api("get", "/search", base, params={"q": "regression"}) + results = body.get("results", []) + + tbl = Table(title="ML Task-Type Search Results", border_style="bright_yellow", show_lines=True) + tbl.add_column("#", style="dim", width=3) + tbl.add_column("Title", ratio=3) + tbl.add_column("Authors", ratio=2) + tbl.add_column("Score", width=6, justify="right") + + for i, r in enumerate(results[:5], 1): + title = r.get("title") or "—" + authors_str = ", ".join(r.get("authors", [])) + tbl.add_row( + str(i), + title if len(title) < 50 else title[:47] + "...", + authors_str if len(authors_str) < 30 else authors_str[:27] + "...", + f"{r.get('score', 0):.1f}", + ) + + console.print(tbl) + console.print(Text.from_markup( + f" [bold]ML metadata is now searchable![/] " + f"[dim]\"regression\" matched via ml.task_type — {ms:.0f} ms[/]" + )) + pause(0.8) + + # -- Step 5: Dataset card with ML detail ---------------------------- + step_header(5, STEPS, "Fetch dataset 
preview card (with ML summary)") + body, ms = api("get", f"/card/{source_id}", base) + card = body.get("card", {}) + + tree = Tree(f"[bold]{card.get('title', 'Untitled')}[/]", style=ACCENT) + + # Core metadata + meta_node = tree.add("[bold]Metadata[/]") + meta_node.add(f"Authors: {', '.join(card.get('authors', ['—']))}") + meta_node.add(f"Publisher: {card.get('publisher', '—')}") + meta_node.add(f"Year: {card.get('publication_year', '—')}") + meta_node.add(f"Organization: {card.get('organization', '—')}") + if card.get("keywords"): + meta_node.add(f"Keywords: {', '.join(card['keywords'])}") + if card.get("description"): + desc = card["description"] + meta_node.add(f"Description: {desc[:80]}{'...' if len(desc) > 80 else ''}") + if card.get("license"): + meta_node.add(f"License: {card['license']}") + if card.get("facility"): + meta_node.add(f"Facility: {card['facility']}") + if card.get("methods"): + meta_node.add(f"Methods: {', '.join(card['methods'])}") + + # ML summary in the card + ml_info = card.get("ml") + if ml_info: + ml_node = tree.add("[bold bright_yellow]ML-Ready[/]") + ml_node.add(f"Format: [bold]{ml_info.get('data_format', '?')}[/]") + ml_node.add(f"Task: [bold]{', '.join(ml_info.get('task_type', []))}[/]") + ml_node.add(f"Total samples: [bold]{ml_info.get('n_items', '?')}[/]") + if ml_info.get("short_name"): + ml_node.add(f"Short name: [bold]{ml_info['short_name']}[/] (for Foundry loading)") + + if ml_info.get("splits"): + splits_node = ml_node.add("[dim]Splits[/]") + for sp in ml_info["splits"]: + n = sp.get("n_items") + splits_node.add(f"{sp['type']}: {n} samples" if n else sp["type"]) + + if ml_info.get("input_keys") or ml_info.get("target_keys"): + keys_node = ml_node.add("[dim]Feature Schema[/]") + for k in (ml_info.get("input_keys") or []): + keys_node.add(f"[green]input[/] {k}") + for k in (ml_info.get("target_keys") or []): + keys_node.add(f"[red]target[/] {k}") + + stats_node = tree.add("[bold]Stats[/]") + stats = card.get("stats", {}) + 
stats_node.add(f"Size: {stats.get('size_human', '?')}") + stats_node.add(f"Data sources: {stats.get('data_sources_count', 0)}") + stats_node.add(f"File types: {', '.join(stats.get('file_types', []))}") + + links_node = tree.add("[bold]Links[/]") + for k, v in card.get("links", {}).items(): + links_node.add(f"{k}: {v}") + + console.print(Panel(tree, title="[bold]Dataset Card[/]", border_style="bright_magenta")) + console.print(Text(f" Card generated in {ms:.0f} ms", style=DIM)) + pause(0.8) + + # -- Step 6: Detailed ML keys table --------------------------------- + step_header(6, STEPS, "Inspect ML feature schema (from stored metadata)") + body, ms = api("get", f"/status/{source_id}", base, headers=HEADER) + sub = body.get("submission", {}) + mdata = sub.get("dataset_mdata") or {} + ml_block = mdata.get("ml") or {} + + keys = ml_block.get("keys") or [] + splits = ml_block.get("splits") or [] + + if keys: + key_tbl = Table( + title="Feature / Target Schema", + border_style="bright_yellow", + show_lines=True, + ) + key_tbl.add_column("Name", style="bold", ratio=2) + key_tbl.add_column("Role", width=8) + key_tbl.add_column("Type", width=10) + key_tbl.add_column("Units", width=10) + key_tbl.add_column("Description", ratio=3) + key_tbl.add_column("Classes", ratio=2) + + for k in keys: + role_style = OK if k.get("role") == "input" else ERR if k.get("role") == "target" else "white" + classes_str = ", ".join(k["classes"]) if k.get("classes") else "—" + key_tbl.add_row( + k.get("name", "?"), + Text(k.get("role", "?"), style=role_style), + k.get("dtype", "—"), + k.get("units", "—"), + k.get("description", "—"), + classes_str, + ) + console.print(key_tbl) + + if splits: + split_tbl = Table( + title="Train / Test Splits", + border_style="bright_yellow", + show_lines=True, + ) + split_tbl.add_column("Split", style="bold", width=12) + split_tbl.add_column("Path", ratio=2) + split_tbl.add_column("Samples", width=10, justify="right") + + total = 0 + for s in splits: + n = 
s.get("n_items") or 0 + total += n + split_tbl.add_row( + s.get("type", "?"), + s.get("path", "?"), + str(n) if n else "—", + ) + split_tbl.add_row( + Text("TOTAL", style="bold"), + "", + Text(str(total), style="bold"), + ) + console.print(split_tbl) + + console.print(Text(f" ML metadata fetched in {ms:.0f} ms", style=DIM)) + pause(0.8) + + # -- Step 7: Citation ----------------------------------------------- + step_header(7, STEPS, "Export citation in all formats") + body, ms = api("get", f"/citation/{source_id}", base, params={"format": "all"}) + + if body.get("bibtex"): + console.print(Panel( + Syntax(body["bibtex"], "bibtex", theme="monokai", line_numbers=False), + title="[bold]BibTeX[/]", + border_style="bright_yellow", + )) + + if body.get("apa"): + console.print(Panel( + Text(body["apa"], style="italic"), + title="[bold]APA[/]", + border_style="bright_yellow", + )) + + console.print(Text(f" 4 citation formats generated in {ms:.0f} ms", style=DIM)) + pause(0.3) + + console.print() + console.print(Rule(style="bright_magenta")) + console.print( + Align.center(Text("Workflow 1 Complete", style="bold bright_magenta")) + ) + console.print(Rule(style="bright_magenta")) + + return source_id + + +# =================================================================== +# WORKFLOW 2 — Live Streaming Data Pipeline +# =================================================================== + +def workflow_2(base): + panel = Panel( + Text.from_markup( + "[bold]A lab instrument streams data files in real-time, then the\n" + "stream is snapshotted into a citable dataset submission.[/]" + ), + title="[bold bright_green]Workflow 2[/] [bold]Live Streaming Data Pipeline[/]", + subtitle="[dim]stream/create → upload files → append → snapshot → close[/]", + border_style="bright_green", + padding=(1, 3), + ) + console.print(panel) + pause(0.5) + + # -- Step 1: Create stream ------------------------------------------ + step_header(1, 5, "Create a live data stream") + body, ms = api("post", 
"/stream/create", base, json={ + "title": "APS Beamline 11-ID SAXS Run", + "lab_id": "APS-11ID-2026-Feb", + "organization": "Argonne National Laboratory", + "metadata": { + "instrument": "Pilatus 2M", + "facility": "Advanced Photon Source", + "operator": "Dr. Sarah Kim", + "run_id": "run-2026-02-05-001", + }, + }, headers=HEADER) + + stream_id = body.get("stream_id", "???") + stream = body.get("stream", {}) + + tbl = Table(show_header=False, box=None, padding=(0, 2)) + tbl.add_column(style="bold") + tbl.add_column() + tbl.add_row("Stream ID", Text(stream_id, style=ACCENT)) + tbl.add_row("Title", stream.get("title", "—")) + tbl.add_row("Status", Text(stream.get("status", "?"), style=OK)) + tbl.add_row("Instrument", (stream.get("metadata") or {}).get("instrument", "—")) + tbl.add_row("Operator", (stream.get("metadata") or {}).get("operator", "—")) + tbl.add_row("Latency", latency_text(ms)) + console.print(Panel(tbl, title="[bold]Stream Created[/]", border_style=DIM)) + pause(0.5) + + # -- Step 2: Simulate file uploads with progress -------------------- + step_header(2, 5, "Stream data files from the beamline") + + simulated_files = [ + ("frame_001.tiff", 2_400_000, "SAXS frame - q=0.01-0.5 A^-1"), + ("frame_002.tiff", 2_380_000, "SAXS frame - sample rotated 15deg"), + ("frame_003.tiff", 2_420_000, "SAXS frame - temperature 350K"), + ("dark_current.tiff", 2_100_000, "Dark current calibration"), + ("metadata.json", 4_200, "Run parameters and instrument config"), + ("frame_004.tiff", 2_390_000, "SAXS frame - temperature 400K"), + ("frame_005.tiff", 2_410_000, "SAXS frame - temperature 450K"), + ("reduction_log.txt", 12_800, "Azimuthal integration log"), + ] + + progress = Progress( + SpinnerColumn(style="bright_green"), + TextColumn("[bold]{task.description}[/]"), + BarColumn(bar_width=30, complete_style="bright_green", finished_style="bold bright_green"), + TextColumn("{task.completed}/{task.total} files"), + TimeElapsedColumn(), + console=console, + ) + + file_table = 
Table( + title="Uploaded Files", + border_style="bright_green", + show_lines=False, + padding=(0, 1), + ) + file_table.add_column("File", style="bold", ratio=2) + file_table.add_column("Size", justify="right", width=10) + file_table.add_column("Note", style="dim", ratio=2) + file_table.add_column("", width=3) + + with progress: + task = progress.add_task("Streaming files", total=len(simulated_files)) + + for fname, size, note in simulated_files: + fake_content = base64.b64encode(b"x" * min(size, 256)).decode() + upload_body, upload_ms = api("post", f"/stream/{stream_id}/upload", base, json={ + "filename": fname, + "content_base64": fake_content, + "content_type": "application/octet-stream", + }, headers=HEADER) + + ok = upload_body.get("success", False) + size_str = f"{size / 1024:.0f} KB" if size < 1_000_000 else f"{size / 1_000_000:.1f} MB" + file_table.add_row(fname, size_str, note, status_dot(ok)) + + progress.advance(task) + pause(0.3) + + console.print(file_table) + pause(0.5) + + # -- Step 3: Check stream status ------------------------------------ + step_header(3, 5, "Verify stream status") + body, ms = api("get", f"/stream/{stream_id}", base, headers=HEADER) + s = body.get("stream", {}) + + cols = [] + for label, value, style in [ + ("Files", str(s.get("file_count", 0)), "bold bright_white"), + ("Status", s.get("status", "?"), OK), + ("Latency", f"{ms:.0f} ms", ACCENT), + ]: + t = Table(show_header=False, box=None) + t.add_column(justify="center") + t.add_row(Text(value, style=style)) + t.add_row(Text(label, style=DIM)) + cols.append(t) + + console.print(Columns(cols, equal=True, expand=True)) + pause(0.5) + + # -- Step 4: Snapshot -> submission ---------------------------------- + step_header(4, 5, "Snapshot stream into a citable dataset") + body, ms = api("post", f"/stream/{stream_id}/snapshot", base, json={ + "title": "APS 11-ID SAXS Dataset — February 2026", + }, headers=HEADER) + + snap_source = body.get("source_id", "?") + tbl = 
Table(show_header=False, box=None, padding=(0, 2)) + tbl.add_column(style="bold") + tbl.add_column() + tbl.add_row("Status", status_dot(body.get("success"))) + tbl.add_row("Source ID", Text(snap_source, style=ACCENT)) + tbl.add_row("Version", body.get("version", "?")) + tbl.add_row("Latency", latency_text(ms)) + console.print(Panel(tbl, title="[bold]Snapshot Created[/]", border_style=DIM)) + pause(0.5) + + # -- Step 5: Close stream ------------------------------------------- + step_header(5, 5, "Close the stream") + body, ms = api("post", f"/stream/{stream_id}/close", base, + json={"mint_doi": False}, headers=HEADER) + + tbl = Table(show_header=False, box=None, padding=(0, 2)) + tbl.add_column(style="bold") + tbl.add_column() + tbl.add_row("Status", status_dot(body.get("success"))) + tbl.add_row("Stream Status", Text(body.get("status", "?"), style=WARN)) + tbl.add_row("Latency", latency_text(ms)) + console.print(Panel(tbl, title="[bold]Stream Closed[/]", border_style=DIM)) + pause(0.3) + + console.print() + console.print(Rule(style="bright_green")) + console.print( + Align.center(Text("Workflow 2 Complete", style="bold bright_green")) + ) + console.print(Rule(style="bright_green")) + + return snap_source + + +# =================================================================== +# WORKFLOW 3 — Curation & Approval Pipeline +# =================================================================== + +def workflow_3(base, source_id): + panel = Panel( + Text.from_markup( + "[bold]A curator reviews a pending submission, adds metadata,\n" + "approves it, and the system mints a DOI.[/]" + ), + title="[bold bright_yellow]Workflow 3[/] [bold]Curation & Approval Pipeline[/]", + subtitle="[dim]status/update → curation/pending → curation/{id} → approve[/]", + border_style="bright_yellow", + padding=(1, 3), + ) + console.print(panel) + pause(0.5) + + # -- Step 1: Move to pending_curation ------------------------------- + step_header(1, 5, "Transition submission to 
pending_curation") + body, ms = api("post", "/status/update", base, json={ + "source_id": source_id, + "version": "1.0", + "status": "pending_curation", + }, headers=HEADER) + + tbl = Table(show_header=False, box=None, padding=(0, 2)) + tbl.add_column(style="bold") + tbl.add_column() + tbl.add_row("Status", status_dot(body.get("success"))) + tbl.add_row("New Status", Text(body.get("status", "?"), style=WARN)) + tbl.add_row("Latency", latency_text(ms)) + console.print(Panel(tbl, title="[bold]Status Updated[/]", border_style=DIM)) + pause(0.5) + + # -- Step 2: Curator lists pending ---------------------------------- + step_header(2, 5, "Curator views the curation queue") + body, ms = api("get", "/curation/pending", base, headers=CURATOR_HEADER) + + submissions = body.get("submissions", []) + tbl = Table( + title=f"Curation Queue ({body.get('pending_count', 0)} pending)", + border_style="bright_yellow", + show_lines=True, + ) + tbl.add_column("#", style="dim", width=3) + tbl.add_column("Source ID", ratio=2) + tbl.add_column("Title", ratio=2) + tbl.add_column("Submitter", width=16) + tbl.add_column("Submitted", width=14) + + for i, sub in enumerate(submissions[:5], 1): + sid = sub.get("source_id", "?") + tbl.add_row( + str(i), + sid if len(sid) < 30 else sid[:27] + "...", + sub.get("title", "Untitled"), + sub.get("submitter", "?"), + (sub.get("submitted_at") or "?")[:10], + ) + + console.print(tbl) + console.print(Text(f" Fetched in {ms:.0f} ms", style=DIM)) + pause(0.5) + + # -- Step 3: Curator inspects submission ---------------------------- + step_header(3, 5, "Curator inspects the submission details") + body, ms = api("get", f"/curation/{source_id}", base, + params={"version": "1.0"}, headers=CURATOR_HEADER) + + sub = body.get("submission", {}) + history = body.get("curation_history", []) + + detail = Table(show_header=False, box=None, padding=(0, 2)) + detail.add_column(style="bold") + detail.add_column() + detail.add_row("Source ID", Text(sub.get("source_id", 
"?"), style=ACCENT)) + detail.add_row("Status", Text(body.get("current_status", "?"), style=WARN)) + detail.add_row("Can Approve", Text(str(body.get("can_approve")), style=OK if body.get("can_approve") else ERR)) + detail.add_row("Can Reject", Text(str(body.get("can_reject")), style=OK if body.get("can_reject") else ERR)) + detail.add_row("History", f"{len(history or [])} entries") + detail.add_row("Latency", latency_text(ms)) + console.print(Panel(detail, title="[bold]Curation Detail[/]", border_style=DIM)) + pause(0.5) + + # -- Step 4: Approve with metadata update --------------------------- + step_header(4, 5, "Curator approves and mints a DOI") + body, ms = api("post", f"/curation/{source_id}/approve", base, json={ + "version": "1.0", + "notes": "Excellent dataset. Metadata verified, data files accessible.", + "metadata_updates": { + "license": {"name": "CC BY 4.0", "url": "https://creativecommons.org/licenses/by/4.0/"}, + }, + "mint_doi": True, + }, headers=CURATOR_HEADER) + + tbl = Table(show_header=False, box=None, padding=(0, 2)) + tbl.add_column(style="bold") + tbl.add_column() + tbl.add_row("Status", status_dot(body.get("success"))) + tbl.add_row("New Status", Text(body.get("status", "?"), style=OK)) + tbl.add_row("Approved By", body.get("approved_by", "?")) + tbl.add_row("Approved At", (body.get("approved_at") or "?")[:19]) + + doi_info = body.get("doi", {}) + if doi_info: + doi_str = doi_info.get("doi", "—") + tbl.add_row("DOI", Text(doi_str, style="bold bright_yellow")) + tbl.add_row("DOI Status", Text("minted" if doi_info.get("success") else "failed", + style=OK if doi_info.get("success") else ERR)) + + tbl.add_row("Latency", latency_text(ms)) + console.print(Panel(tbl, title="[bold]Approval Result[/]", border_style="bright_yellow")) + pause(0.5) + + # -- Step 5: Verify final state ------------------------------------- + step_header(5, 5, "Verify the published dataset") + body, ms = api("get", f"/status/{source_id}", base, headers=HEADER) + sub = 
body.get("submission", {}) + + final = Table(show_header=False, box=None, padding=(0, 2)) + final.add_column(style="bold") + final.add_column() + final.add_row("Source ID", Text(sub.get("source_id", "?"), style=ACCENT)) + final.add_row("Final Status", Text(sub.get("status", "?"), style=OK)) + final.add_row("Approved By", sub.get("approved_by", "—")) + final.add_row("DOI", Text(sub.get("doi", "—") or "—", style="bold bright_yellow")) + final.add_row("Latency", latency_text(ms)) + console.print(Panel(final, title="[bold]Published Dataset[/]", border_style=OK)) + pause(0.3) + + console.print() + console.print(Rule(style="bright_yellow")) + console.print( + Align.center(Text("Workflow 3 Complete", style="bold bright_yellow")) + ) + console.print(Rule(style="bright_yellow")) + + +# =================================================================== +# WORKFLOW 4 — Backward Compatibility (v1 format auto-migration) +# =================================================================== + +def workflow_4(base): + panel = Panel( + Text.from_markup( + "[bold]A legacy client submits using the old dc/mdf/custom format.\n" + "The server auto-detects and migrates to the flat v2 schema.[/]" + ), + title="[bold bright_blue]Workflow 4[/] [bold]Backward Compatibility[/]", + subtitle="[dim]POST /submit (v1 format) → auto-migrate → verify flat storage[/]", + border_style="bright_blue", + padding=(1, 3), + ) + console.print(panel) + pause(0.5) + + step_header(1, 2, "Submit using old dc/mdf/custom format") + + old_payload = { + "data_sources": [ + "https://data.materialsdatafacility.org/legacy/data.hdf5", + ], + "dc": { + "titles": [{"title": "Legacy HDF5 Dataset"}], + "creators": [ + {"creatorName": "Smith, Alice", "givenName": "Alice", "familyName": "Smith", + "affiliation": "MIT"}, + ], + "publisher": "Materials Data Facility", + "publicationYear": "2025", + "descriptions": [ + {"description": "A dataset submitted with the old format.", + "descriptionType": "Abstract"} + ], + 
"subjects": [ + {"subject": "legacy"}, + {"subject": "backward compatibility"}, + ], + }, + "mdf": { + "organization": "MIT", + }, + } + + body, ms = api("post", "/submit", base, json=old_payload, headers=HEADER) + + source_id = body.get("source_id", "???") + tbl = Table(show_header=False, box=None, padding=(0, 2)) + tbl.add_column(style="bold") + tbl.add_column() + tbl.add_row("Status", status_dot(body.get("success"))) + tbl.add_row("Source ID", Text(source_id, style=ACCENT)) + tbl.add_row("Auto-migrated", Text("Yes", style=OK)) + tbl.add_row("Latency", latency_text(ms)) + console.print(Panel(tbl, title="[bold]Legacy Submit Response[/]", border_style=DIM)) + pause(0.5) + + step_header(2, 2, "Verify migrated metadata") + body, ms = api("get", f"/card/{source_id}", base) + card = body.get("card", {}) + + tbl = Table(show_header=False, box=None, padding=(0, 2)) + tbl.add_column(style="bold") + tbl.add_column() + tbl.add_row("Title", card.get("title", "?")) + tbl.add_row("Authors", ", ".join(card.get("authors", ["?"]))) + tbl.add_row("Publisher", card.get("publisher", "?")) + tbl.add_row("Year", str(card.get("publication_year", "?"))) + tbl.add_row("Keywords", ", ".join(card.get("keywords", []))) + tbl.add_row("Organization", card.get("organization", "?")) + console.print(Panel(tbl, title="[bold]Migrated Card[/]", border_style="bright_blue")) + pause(0.3) + + console.print() + console.print(Rule(style="bright_blue")) + console.print( + Align.center(Text("Workflow 4 Complete", style="bold bright_blue")) + ) + console.print(Rule(style="bright_blue")) + + +# =================================================================== +# Final Summary +# =================================================================== + +def summary(base): + console.print() + + body, ms1 = api("get", "/submissions", base, headers=HEADER) + subs = body.get("submissions", []) + + body, ms2 = api("get", "/search", base, params={"q": "*"}) + + body, ms3 = api("get", "/health", base) + + tbl = 
Table( + title="Final System State", + border_style="bright_cyan", + show_lines=True, + ) + tbl.add_column("Metric", style="bold", ratio=2) + tbl.add_column("Value", justify="center", ratio=1) + + tbl.add_row("Total Submissions", str(len(subs))) + tbl.add_row("API Health", Text(body.get("status", "?"), style=OK)) + tbl.add_row("Server", Text(body.get("service", "?"), style=ACCENT)) + + status_counts = {} + for s in subs: + st = s.get("status", "unknown") + status_counts[st] = status_counts.get(st, 0) + 1 + for st, ct in sorted(status_counts.items()): + colour = OK if st in ("published", "approved") else WARN if st == "pending_curation" else "white" + tbl.add_row(f" {st}", Text(str(ct), style=colour)) + + console.print(tbl) + + arch = Tree("[bold bright_cyan]MDF Connect v2 Architecture[/]") + client = arch.add("[bold]Client Layer[/]") + client.add("mdf-agent CLI") + client.add("BackendClient (requests)") + client.add("Rich demo script") + + api_node = arch.add("[bold]API Layer[/] (FastAPI + Mangum)") + api_node.add("[dim]7 routers, 24 endpoints[/]") + api_node.add("[dim]Flat DatasetMetadata schema (Pydantic)[/]") + api_node.add("[dim]Auto-migration from v1 dc/mdf/custom[/]") + api_node.add("[dim]Dependency injection (auth, stores)[/]") + + storage = arch.add("[bold]Storage Layer[/]") + storage.add("DynamoDB [dim](production)[/]") + storage.add("SQLite [dim](development)[/]") + storage.add("Globus HTTPS [dim](file storage)[/]") + + console.print() + console.print(Panel(arch, border_style=ACCENT)) + + console.print() + fin = Text.from_markup( + "\n" + " [bold bright_cyan]All 4 workflows completed successfully.[/]\n" + "\n" + " [dim]Triple-nested dc/mdf/custom → flat DatasetMetadata[/]\n" + " [dim]9 DataCite fields → 24 (full kernel-4 coverage)[/]\n" + " [dim]projects.foundry → first-class ml metadata[/]\n" + " [dim]Auto-migration for backward compatibility[/]\n" + "\n" + " [bold]Visit[/] [underline]http://127.0.0.1:8080/docs[/] [bold]for interactive API 
documentation.[/]\n" + ) + console.print(Panel(fin, title="[bold bright_cyan]Demo Complete[/]", border_style="bright_cyan")) + + +# =================================================================== +# Main +# =================================================================== + +def main(): + parser = argparse.ArgumentParser(description="MDF v2 backend workflow demo") + parser.add_argument("--base-url", default="http://127.0.0.1:8080", + help="Base URL of the MDF v2 API server") + args = parser.parse_args() + base = args.base_url + + # Check server is reachable + try: + resp = requests.get(f"{base}/health", timeout=3) + resp.raise_for_status() + except Exception: + console.print(Panel( + Text.from_markup( + f"[bold red]Cannot reach server at {base}[/]\n\n" + "Start the server first:\n" + " [bold]cd cs/aws[/]\n" + " [bold]STORE_BACKEND=sqlite AUTH_MODE=dev python -m v2.app.main[/]" + ), + title="[bold red]Server Not Running[/]", + border_style="red", + )) + sys.exit(1) + + banner() + console.print() + + # Workflow 1: Submit & Discover (new flat format) + source_id_1 = workflow_1(base) + console.print() + pause(1) + + # Workflow 2: Stream Pipeline + source_id_2 = workflow_2(base) + console.print() + pause(1) + + # Workflow 3: Curation (uses submission from workflow 1) + workflow_3(base, source_id_1) + console.print() + pause(0.5) + + # Workflow 4: Backward Compatibility (v1 format) + workflow_4(base) + console.print() + pause(0.5) + + # Summary + summary(base) + + +if __name__ == "__main__": + main() diff --git a/aws/v2/doi_utils.py b/aws/v2/doi_utils.py new file mode 100644 index 0000000..9a54c39 --- /dev/null +++ b/aws/v2/doi_utils.py @@ -0,0 +1,38 @@ +import json +from datetime import datetime, timezone +from typing import Dict + + +def mint_doi_for_stream(stream: Dict, overrides: Dict) -> Dict: + from v2.datacite import get_datacite_client + + try: + client = get_datacite_client() + + metadata_field = stream.get("metadata") or {} + if isinstance(metadata_field, 
str): + try: + metadata_field = json.loads(metadata_field) + except Exception: + metadata_field = {} + + metadata = { + "title": overrides.get("title") or stream.get("title") or "Untitled Dataset", + "description": overrides.get("description") or metadata_field.get("description", ""), + "authors": overrides.get("authors") or metadata_field.get("authors", []), + "keywords": overrides.get("keywords") or metadata_field.get("keywords", []), + "publisher": "Materials Data Facility", + "publication_year": datetime.now(timezone.utc).year, + "version": "1.0", + } + + if overrides.get("license"): + metadata["license"] = overrides["license"] + + source_id = stream["stream_id"].replace("stream-", "") + + result = client.mint_doi(source_id=source_id, metadata=metadata, publish=True) + client.close() + return result + except Exception as e: + return {"success": False, "error": str(e)} diff --git a/aws/v2/email_utils.py b/aws/v2/email_utils.py new file mode 100644 index 0000000..d372cf5 --- /dev/null +++ b/aws/v2/email_utils.py @@ -0,0 +1,359 @@ +"""Email notifications for MDF v2 via AWS SES. + +Sends transactional emails for key curation lifecycle events: + - New submission awaiting curation → curators + - Submission approved / published → submitter + - Submission rejected → submitter + +Configuration (environment variables): + SES_FROM_EMAIL Sender address (must be SES-verified) + CURATOR_EMAILS Comma-separated curator addresses for new-submission alerts + PORTAL_URL Public dataset portal base URL (e.g. https://www.materialsdatafacility.org) + CURATION_PORTAL_URL Curation review page URL (e.g. 
https://www.materialsdatafacility.org/curation) +""" + +import logging +import os +from typing import Any, Dict, List + +from v2.metadata import parse_metadata + +logger = logging.getLogger(__name__) + +# --------------------------------------------------------------------------- +# Config helpers +# --------------------------------------------------------------------------- + +def _from_address() -> str: + return os.environ.get("SES_FROM_EMAIL", "noreply@materialsdatafacility.org") + + +def _curator_emails() -> List[str]: + raw = os.environ.get("CURATOR_EMAILS", "") + return [e.strip() for e in raw.split(",") if e.strip()] + + +def _portal_url() -> str: + return os.environ.get("PORTAL_URL", "https://www.materialsdatafacility.org").rstrip("/") + + +def _curation_url() -> str: + return os.environ.get("CURATION_PORTAL_URL", _portal_url() + "/curation").rstrip("/") + + +def _emails_enabled() -> bool: + return bool(os.environ.get("SES_FROM_EMAIL")) + + +# --------------------------------------------------------------------------- +# SES send +# --------------------------------------------------------------------------- + +def _send(to: List[str], subject: str, html: str, text: str) -> bool: + """Send via SES. 
Silently logs and returns False on any failure.""" + if not to: + return True + if not _emails_enabled(): + logger.debug("Email skipped (SES_FROM_EMAIL not set): %s → %s", subject, to) + return True + try: + import boto3 + region = os.environ.get("SES_REGION", "us-east-1") + client = boto3.client("ses", region_name=region) + client.send_email( + Source=_from_address(), + Destination={"ToAddresses": to}, + Message={ + "Subject": {"Data": subject, "Charset": "UTF-8"}, + "Body": { + "Html": {"Data": html, "Charset": "UTF-8"}, + "Text": {"Data": text, "Charset": "UTF-8"}, + }, + }, + ) + logger.info("Email sent subject=%r to=%s", subject, to) + return True + except Exception: + logger.warning("Failed to send email to %s", to, exc_info=True) + return False + + +# --------------------------------------------------------------------------- +# Public notification functions +# --------------------------------------------------------------------------- + +def notify_curators_new_submission(record: Dict[str, Any]) -> bool: + """Email curators when a new submission enters pending_curation.""" + recipients = _curator_emails() + if not recipients: + return True + + meta = parse_metadata(record) + source_id = record.get("source_id", "") + version = record.get("version", "") + review_url = _curation_url() + + subject = f"New Dataset Pending Review: {meta.title}" + html = _build_email( + header_color="#1e3a5f", + header_label="Curation Request", + header_title="New Dataset Awaiting Review", + body_html=_dataset_card(meta, record) + _cta_button(review_url, "Review Submission", "#2563eb"), + footer_note=f"Submission ID: {source_id} · v{version}", + ) + text = _plain_text( + f"New Dataset Awaiting Review\n\n" + f"Title: {meta.title}\n" + f"Authors: {', '.join(a.name for a in meta.authors)}\n" + f"Org: {record.get('organization', '')}\n" + f"ID: {source_id} v{version}\n\n" + f"Review: {review_url}" + ) + return _send(recipients, subject, html, text) + + +def 
notify_submitter_approved(record: Dict[str, Any]) -> bool: + """Email the submitter when their dataset is approved and published.""" + submitter_email = record.get("user_email") + if not submitter_email: + return True + + meta = parse_metadata(record) + source_id = record.get("source_id", "") + version = record.get("version", "") + doi = record.get("dataset_doi") or record.get("doi") + dataset_url = f"{_portal_url()}/detail/{source_id}" + + doi_line = f"\nDOI: https://doi.org/{doi}" if doi else "" + subject = f"Your MDF Dataset is Now Published: {meta.title}" + html = _build_email( + header_color="#15803d", + header_label="Publication Confirmed", + header_title="Your Dataset is Now Live!", + body_html=( + _message_block( + "Congratulations — your submission has been reviewed and is now publicly available " + "in the Materials Data Facility." + ) + + _dataset_card(meta, record, show_doi=True) + + _cta_button(dataset_url, "View Your Dataset →", "#15803d") + ), + footer_note=f"Dataset ID: {source_id} · v{version}{doi_line}", + ) + text = _plain_text( + f"Your Dataset is Now Live!\n\n" + f"Title: {meta.title}\n" + f"Authors: {', '.join(a.name for a in meta.authors)}\n" + f"Version: {version}\n" + + (f"DOI: https://doi.org/{doi}\n" if doi else "") + + f"\nView: {dataset_url}" + ) + return _send([submitter_email], subject, html, text) + + +def notify_submitter_rejected(record: Dict[str, Any], reason: str, suggestions: str = "") -> bool: + """Email the submitter when their dataset is rejected.""" + submitter_email = record.get("user_email") + if not submitter_email: + return True + + meta = parse_metadata(record) + source_id = record.get("source_id", "") + version = record.get("version", "") + # Link to the submitter's dashboard, not /detail/ which requires published status + dashboard_url = f"{_portal_url()}/submissions" + + reason_block = ( + f'
' + f'

Curator Feedback

' + f'

{_escape(reason)}

' + + ( + f'

' + f'{_escape(suggestions)}

' + if suggestions else "" + ) + "
" + ) + + subject = f"MDF Submission Needs Attention: {meta.title}" + html = _build_email( + header_color="#b45309", + header_label="Submission Update", + header_title="Your Submission Needs Revision", + body_html=( + _message_block( + "Thank you for your submission. After review, our curators have requested " + "changes before this dataset can be published." + ) + + _dataset_card(meta, record) + + reason_block + + _cta_button(dashboard_url, "View Feedback & Resubmit →", "#b45309") + ), + footer_note=f"Dataset ID: {source_id} · v{version}", + ) + text = _plain_text( + f"Your Submission Needs Revision\n\n" + f"Title: {meta.title}\n" + f"Version: {version}\n\n" + f"Curator feedback:\n{reason}\n" + + (f"\nSuggestions:\n{suggestions}\n" if suggestions else "") + + f"\nView & resubmit: {dashboard_url}" + ) + return _send([submitter_email], subject, html, text) + + +# --------------------------------------------------------------------------- +# HTML building blocks +# --------------------------------------------------------------------------- + +def _escape(s: str) -> str: + """Escape for safe use in both HTML content and attribute values.""" + return (s or "").replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace('"', "&quot;") + + +def _build_email( + header_color: str, + header_label: str, + header_title: str, + body_html: str, + footer_note: str = "", +) -> str: + return f""" + + + + + + + + + +
+ + + + + + + + + + + +
+

Materials Data Facility

+

{_escape(header_label)}

+

{_escape(header_title)}

+
+ {body_html} +
+

+ Materials Data Facility +  ·  University of Chicago  ·  Argonne National Laboratory +

+ {f'

{_escape(footer_note)}

' if footer_note else ''} +
+
+ +""" + + +def _message_block(text: str) -> str: + return ( + f'

' + f"{_escape(text)}

" + ) + + +def _dataset_card(meta: Any, record: Dict[str, Any], show_doi: bool = False) -> str: + authors_str = ", ".join(a.name for a in meta.authors[:5]) + if len(meta.authors) > 5: + authors_str += f" +{len(meta.authors) - 5} more" + + description = (meta.description or "").strip() + description_html = "" + if description: + snippet = description[:400] + ("…" if len(description) > 400 else "") + description_html = ( + f'

' + f"{_escape(snippet)}

" + ) + + # Stats row + stats: list[str] = [] + org = record.get("organization") or "" + if org: + stats.append(f"Org: {_escape(org)}") + file_count = record.get("file_count") + if file_count: + stats.append(f"Files: {file_count:,}") + total_bytes = record.get("total_bytes") + if total_bytes: + stats.append(f"Size: {_human_bytes(total_bytes)}") + doi = record.get("dataset_doi") or record.get("doi") + if doi and show_doi: + doi_link = f'{_escape(doi)}' + stats.append(f"DOI: {doi_link}") + stats_html = "" + if stats: + stats_html = ( + '

' + + "  ·  ".join(stats) + + "

" + ) + + keywords = meta.keywords[:6] + keywords_html = "" + if keywords: + tags = "".join( + f'{_escape(k)}' + for k in keywords + ) + keywords_html = f'
{tags}
' + + return ( + f'
' + f'

' + f"{_escape(meta.title)}

" + f'

{_escape(authors_str)}

' + f"{description_html}" + f"{stats_html}" + f"{keywords_html}" + f"
" + ) + + +def _cta_button(url: str, label: str, color: str) -> str: + safe_url = _escape(url) + return ( + f'
' + f'{_escape(label)}' + f'
' + f'

' + f'Or copy this link: {safe_url}

' + ) + + +def _human_bytes(n: int) -> str: + for unit in ("B", "KB", "MB", "GB", "TB"): + if n < 1024: + return f"{n:.1f} {unit}" if unit != "B" else f"{n} B" + n /= 1024 + return f"{n:.1f} PB" + + +def _plain_text(body: str) -> str: + return ( + "Materials Data Facility\n" + "─────────────────────────────────────\n" + + body + + "\n\n─────────────────────────────────────\n" + "materialsdatafacility.org\n" + ) diff --git a/aws/v2/local_demo.sh b/aws/v2/local_demo.sh new file mode 100755 index 0000000..d81192d --- /dev/null +++ b/aws/v2/local_demo.sh @@ -0,0 +1,91 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Combined local demo: dataset + stream flow using mdf backend commands + +API_URL=${MDF_API_URL:-http://127.0.0.1:8080} + +SUBMIT_TMP=$(mktemp) +STREAM_TMP=$(mktemp) +trap 'rm -f "${SUBMIT_TMP}" "${STREAM_TMP}"' EXIT + +TMP_PAYLOAD="/tmp/mdf_payload.json" +TMP_FILES="/tmp/mdf_stream_files.json" + +cat <<'JSON' > "${TMP_PAYLOAD}" +{ + "dc": { + "titles": [{"title": "Local Demo Dataset"}], + "creators": [{"creatorName": "Doe, Jane"}] + }, + "data_sources": ["globus://example/collection"], + "test": true +} +JSON + +cat <<'JSON' > "${TMP_FILES}" +{ + "files": [ + {"path": "file1.csv", "size": 1234}, + {"path": "file2.csv", "size": 5678} + ] +} +JSON + +echo "Submitting dataset via mdf backend..." +mdf backend submit --payload "${TMP_PAYLOAD}" --api-url "${API_URL}" | tee "${SUBMIT_TMP}" + +SOURCE_ID=$(python - "$SUBMIT_TMP" <<'PY' +import json,sys +path=sys.argv[1] +with open(path, "r", encoding="utf-8") as handle: + data = json.load(handle) +print(data.get("source_id", "")) +PY +) + +if [[ -z "${SOURCE_ID}" ]]; then + echo "Failed to parse source_id" >&2 + exit 1 +fi + +echo "Checking status..." +mdf backend status --source-id "${SOURCE_ID}" --api-url "${API_URL}" + +echo "Updating status to processing..." +mdf backend update-status --source-id "${SOURCE_ID}" --version "1.0" --status "processing" --api-url "${API_URL}" + +echo "Checking status again..." 
+mdf backend status --source-id "${SOURCE_ID}" --api-url "${API_URL}" + +echo "\nCreating stream..." +mdf backend stream-create --title "Demo Stream" --lab-id "lab-1" --api-url "${API_URL}" | tee "${STREAM_TMP}" + +STREAM_ID=$(python - "$STREAM_TMP" <<'PY' +import json,sys +path=sys.argv[1] +with open(path, "r", encoding="utf-8") as handle: + data = json.load(handle) +stream = data.get("stream", {}) or {} +print(stream.get("stream_id", "")) +PY +) + +if [[ -z "${STREAM_ID}" ]]; then + echo "Failed to parse stream_id" >&2 + exit 1 +fi + +echo "Appending files to stream..." +mdf backend stream-append --stream-id "${STREAM_ID}" --files "${TMP_FILES}" --api-url "${API_URL}" + +echo "Stream status..." +mdf backend stream-status --stream-id "${STREAM_ID}" --api-url "${API_URL}" + +echo "Closing stream..." +mdf backend stream-close --stream-id "${STREAM_ID}" --api-url "${API_URL}" + +echo "Snapshotting stream into dataset..." +mdf backend stream-snapshot --stream-id "${STREAM_ID}" --api-url "${API_URL}" + +echo "Done." diff --git a/aws/v2/local_lab_stream_demo.py b/aws/v2/local_lab_stream_demo.py new file mode 100755 index 0000000..de63f0a --- /dev/null +++ b/aws/v2/local_lab_stream_demo.py @@ -0,0 +1,289 @@ +#!/usr/bin/env python3 +import json +import os +from datetime import datetime, timedelta +from typing import Any, Dict, List, Tuple + +API_URL = os.environ.get("MDF_API_URL", "http://127.0.0.1:8080").rstrip("/") + +LAB_NAME = "Argonne National Laboratory" +LAB_ID = "anl-xrd-tga" +SAMPLE_ID = "ANL-SAMPLE-042" +RUN_ID = "RUN-2026-01-31-01" +OPERATOR = "A. 
Researcher" + +try: + from rich.console import Console + from rich.panel import Panel + from rich.table import Table + from rich import box +except Exception: # pragma: no cover + Console = None + Panel = None + Table = None + box = None + Panel = None + Table = None + box = None + + +class SimpleConsole: + def print(self, message=""): + print(message) + + +console = Console() if Console else SimpleConsole() + + +def request(method: str, path: str, payload: Dict[str, Any] | None = None) -> Dict[str, Any]: + url = f"{API_URL}{path}" + body = json.dumps(payload).encode("utf-8") if payload is not None else None + headers = {"Content-Type": "application/json"} + try: + import httpx + + with httpx.Client(timeout=30.0) as client: + resp = client.request(method, url, json=payload, headers=headers) + return resp.json() + except Exception: + from urllib import request as urlrequest + + req = urlrequest.Request(url, data=body, headers=headers, method=method) + with urlrequest.urlopen(req) as resp: + return json.loads(resp.read().decode("utf-8")) + + +def format_bytes(num_bytes: int) -> str: + if num_bytes < 1024: + return f"{num_bytes} B" + if num_bytes < 1024 * 1024: + return f"{num_bytes / 1024:.1f} KB" + return f"{num_bytes / (1024 * 1024):.1f} MB" + + +def _metadata_field(stream: Dict[str, Any], key: str, default: str = "") -> str: + metadata = stream.get("metadata") or {} + value = metadata.get(key, default) + if isinstance(value, list): + return ", ".join(str(item) for item in value) + return str(value) if value is not None else default + + +def make_stream_table(stream: Dict[str, Any]): + if Table is None: + lines = [ + "Stream Summary:", + f" Stream ID: {stream.get('stream_id', '')}", + f" Status: {stream.get('status', '')}", + f" Title: {stream.get('title', '')}", + f" Lab ID: {stream.get('lab_id', '')}", + f" Organization: {stream.get('organization', '')}", + f" Sample ID: {_metadata_field(stream, 'sample_id')}", + f" Run ID: {_metadata_field(stream, 'run_id')}", + 
f" Instruments: {_metadata_field(stream, 'instruments')}", + f" File Count: {stream.get('file_count', 0)}", + f" Total Bytes: {format_bytes(int(stream.get('total_bytes', 0) or 0))}", + f" Last Append: {stream.get('last_append_at') or '-'}", + f" Updated: {stream.get('updated_at') or '-'}", + ] + return "\n".join(lines) + + table = Table(title="Stream Summary", box=box.SIMPLE, show_header=True) + table.add_column("Field", style="cyan", no_wrap=True) + table.add_column("Value", style="white") + + table.add_row("Stream ID", stream.get("stream_id", "")) + table.add_row("Status", stream.get("status", "")) + table.add_row("Title", stream.get("title", "")) + table.add_row("Lab ID", stream.get("lab_id", "")) + table.add_row("Organization", stream.get("organization", "")) + table.add_row("Sample ID", _metadata_field(stream, "sample_id")) + table.add_row("Run ID", _metadata_field(stream, "run_id")) + table.add_row("Instruments", _metadata_field(stream, "instruments")) + table.add_row("File Count", str(stream.get("file_count", 0))) + table.add_row("Total Bytes", format_bytes(int(stream.get("total_bytes", 0) or 0))) + table.add_row("Last Append", stream.get("last_append_at") or "-") + table.add_row("Updated", stream.get("updated_at") or "-") + + return table + + +def make_dataset_table(snapshot: Dict[str, Any]): + if Table is None: + lines = [ + "Snapshot Dataset:", + f" Source ID: {snapshot.get('source_id', '')}", + f" Version: {snapshot.get('version', '')}", + f" Versioned Source ID: {snapshot.get('versioned_source_id', '')}", + f" Stream ID: {snapshot.get('stream_id', '')}", + ] + return "\n".join(lines) + + table = Table(title="Snapshot Dataset", box=box.SIMPLE, show_header=True) + table.add_column("Field", style="cyan", no_wrap=True) + table.add_column("Value", style="white") + + table.add_row("Source ID", snapshot.get("source_id", "")) + table.add_row("Version", snapshot.get("version", "")) + table.add_row("Versioned Source ID", snapshot.get("versioned_source_id", "")) + 
table.add_row("Stream ID", snapshot.get("stream_id", "")) + return table + + +def make_files_table(title: str, files: List[Dict[str, Any]]): + if Table is None: + lines = [f"{title}:"] + for entry in files: + lines.append( + " - {path} ({size} B, {instrument}, {timestamp})".format( + path=entry.get("path", ""), + size=entry.get("size", 0), + instrument=entry.get("instrument", ""), + timestamp=entry.get("timestamp", ""), + ) + ) + return "\n".join(lines) + + table = Table(title=title, box=box.SIMPLE, show_header=True) + table.add_column("Path", style="cyan") + table.add_column("Size", justify="right") + table.add_column("Instrument") + table.add_column("Timestamp") + for entry in files: + table.add_row( + entry.get("path", ""), + format_bytes(int(entry.get("size", 0))), + entry.get("instrument", ""), + entry.get("timestamp", ""), + ) + return table + + +def print_intro(): + text = ( + "This demo simulates a lab instrument workflow at Argonne National Laboratory.\n\n" + "What will happen:\n" + "- Create a streaming session for a sample that includes XRD and TGA data.\n" + "- Append XRD pattern files, then append TGA run files.\n" + "- Fetch stream status, close the stream, and snapshot to a dataset.\n\n" + "What you can do with MDF Streaming:\n" + "- Continuously append new files as instruments generate data.\n" + "- Track stream status, counts, and total bytes in near real time.\n" + "- Close a stream when a run completes and snapshot it into a dataset submission.\n" + ) + if Console and Panel and box: + console.print(Panel(text, title="MDF Streaming Demo", box=box.SIMPLE)) + else: + console.print(text) + + +def build_files(base_time: datetime) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]: + xrd_files = [ + { + "path": f"xrd/{SAMPLE_ID}/pattern_001.xy", + "size": 24576, + "instrument": "XRD", + "sample_id": SAMPLE_ID, + "run_id": RUN_ID, + "timestamp": (base_time + timedelta(seconds=0)).isoformat() + "Z", + }, + { + "path": f"xrd/{SAMPLE_ID}/pattern_002.xy", 
+ "size": 25120, + "instrument": "XRD", + "sample_id": SAMPLE_ID, + "run_id": RUN_ID, + "timestamp": (base_time + timedelta(seconds=30)).isoformat() + "Z", + }, + ] + tga_files = [ + { + "path": f"tga/{SAMPLE_ID}/tga_run_001.csv", + "size": 10240, + "instrument": "TGA", + "sample_id": SAMPLE_ID, + "run_id": RUN_ID, + "timestamp": (base_time + timedelta(minutes=2)).isoformat() + "Z", + }, + { + "path": f"tga/{SAMPLE_ID}/tga_run_002.csv", + "size": 11264, + "instrument": "TGA", + "sample_id": SAMPLE_ID, + "run_id": RUN_ID, + "timestamp": (base_time + timedelta(minutes=3)).isoformat() + "Z", + }, + ] + return xrd_files, tga_files + + +def main(): + print_intro() + + console.print("Step 1/6: Creating stream...") + create_payload = { + "title": f"{LAB_NAME} XRD+TGA Stream - {SAMPLE_ID}", + "lab_id": LAB_ID, + "organization": "ANL", + "metadata": { + "facility": LAB_NAME, + "instruments": ["XRD", "TGA"], + "operator": OPERATOR, + "sample_id": SAMPLE_ID, + "run_id": RUN_ID, + "beamline": "11-ID-B", + "notes": "Local demo stream with XRD + TGA instrumentation", + }, + } + create_res = request("POST", "/stream/create", create_payload) + stream = create_res.get("stream", {}) + stream_id = stream.get("stream_id") + + if not stream_id: + console.print(f"Failed to create stream: {create_res}") + raise SystemExit(1) + + console.print(make_stream_table(stream)) + + base_time = datetime.utcnow() + xrd_files, tga_files = build_files(base_time) + + console.print("\nStep 2/6: Appending XRD patterns...") + console.print(make_files_table("XRD Files", xrd_files)) + xrd_res = request("POST", f"/stream/{stream_id}/append", {"files": xrd_files}) + stream = xrd_res.get("stream", stream) + console.print(make_stream_table(stream)) + + console.print("\nStep 3/6: Appending TGA runs...") + console.print(make_files_table("TGA Files", tga_files)) + tga_res = request("POST", f"/stream/{stream_id}/append", {"files": tga_files}) + stream = tga_res.get("stream", stream) + 
console.print(make_stream_table(stream)) + + console.print("\nStep 4/6: Fetching stream status...") + status_res = request("GET", f"/stream/{stream_id}") + stream = status_res.get("stream", stream) + console.print(make_stream_table(stream)) + + console.print("\nStep 5/6: Closing stream...") + close_res = request("POST", f"/stream/{stream_id}/close", {"stream_id": stream_id}) + stream = close_res.get("stream", stream) + console.print(make_stream_table(stream)) + + console.print("\nStep 6/6: Snapshotting stream into a dataset...") + snapshot_res = request("POST", f"/stream/{stream_id}/snapshot", {"stream_id": stream_id}) + if snapshot_res.get("success"): + console.print(make_dataset_table(snapshot_res)) + else: + console.print(f"Snapshot failed: {snapshot_res}") + + console.print( + "\nDemo complete. Next steps:\n" + "- Append new files as instruments run.\n" + "- Snapshot streams into datasets for indexing.\n" + "- Start parallel streams for additional samples.\n" + ) + + +if __name__ == "__main__": + main() diff --git a/aws/v2/local_lab_stream_demo.sh b/aws/v2/local_lab_stream_demo.sh new file mode 100755 index 0000000..def0030 --- /dev/null +++ b/aws/v2/local_lab_stream_demo.sh @@ -0,0 +1,5 @@ +#!/usr/bin/env bash +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +python "${SCRIPT_DIR}/local_lab_stream_demo.py" diff --git a/aws/v2/local_runner.py b/aws/v2/local_runner.py new file mode 100644 index 0000000..2e7acb0 --- /dev/null +++ b/aws/v2/local_runner.py @@ -0,0 +1,108 @@ +#!/usr/bin/env python3 +"""Command-line runner for MDF v2 local FastAPI server.""" + +from __future__ import annotations + +import argparse +import json +from pathlib import Path +from typing import Any, Dict, Optional + +import httpx + + +def _request( + method: str, + base_url: str, + path: str, + payload: Optional[Dict[str, Any]] = None, + params: Optional[Dict[str, Any]] = None, +) -> Dict[str, Any]: + url = f"{base_url.rstrip('/')}{path}" + with 
httpx.Client(timeout=30.0) as client: + response = client.request(method, url, json=payload, params=params) + try: + return response.json() + except Exception: + return { + "success": False, + "status_code": response.status_code, + "error": response.text, + } + + +def _load_json(path: Path) -> Dict[str, Any]: + return json.loads(path.read_text(encoding="utf-8")) + + +def main() -> None: + parser = argparse.ArgumentParser(description="Run MDF v2 local API commands") + parser.add_argument( + "command", + choices=[ + "submit", + "status", + "submissions", + "update-status", + "stream-create", + "stream-append", + "stream-status", + "stream-close", + "stream-snapshot", + ], + ) + parser.add_argument("--api-url", default="http://127.0.0.1:8080") + parser.add_argument("--payload", type=Path, help="Path to JSON payload") + parser.add_argument("--source-id", help="Source ID") + parser.add_argument("--version", help="Dataset version") + parser.add_argument("--status", help="Submission status") + parser.add_argument("--organization", help="Organization filter") + parser.add_argument("--title", help="Stream title") + parser.add_argument("--lab-id", help="Stream lab ID") + parser.add_argument("--stream-id", help="Stream ID") + parser.add_argument("--files", type=Path, help="Path to JSON file list") + parser.add_argument("--file-count", type=int, help="Number of files appended") + parser.add_argument("--total-bytes", type=int, help="Bytes appended") + parser.add_argument("--update", action="store_true", help="Snapshot as update") + + args = parser.parse_args() + + payload: Dict[str, Any] + if args.command == "submit": + payload = _load_json(args.payload) if args.payload else {} + result = _request("POST", args.api_url, "/submit", payload=payload) + elif args.command == "status": + params = {"version": args.version} if args.version else None + result = _request("GET", args.api_url, f"/status/{args.source_id}", params=params) + elif args.command == "submissions": + params = 
{"organization": args.organization} if args.organization else None + result = _request("GET", args.api_url, "/submissions", params=params) + elif args.command == "update-status": + payload = {"source_id": args.source_id, "version": args.version, "status": args.status} + result = _request("POST", args.api_url, "/status/update", payload=payload) + elif args.command == "stream-create": + payload = {"title": args.title, "lab_id": args.lab_id, "organization": args.organization} + result = _request("POST", args.api_url, "/stream/create", payload=payload) + elif args.command == "stream-append": + payload = {} + if args.files: + files_data = _load_json(args.files) + payload["files"] = files_data.get("files", files_data) + if args.file_count is not None: + payload["file_count"] = args.file_count + if args.total_bytes is not None: + payload["total_bytes"] = args.total_bytes + result = _request("POST", args.api_url, f"/stream/{args.stream_id}/append", payload=payload) + elif args.command == "stream-status": + result = _request("GET", args.api_url, f"/stream/{args.stream_id}") + elif args.command == "stream-close": + result = _request("POST", args.api_url, f"/stream/{args.stream_id}/close", payload={}) + else: + payload = {"title": args.title, "update": args.update} + result = _request("POST", args.api_url, f"/stream/{args.stream_id}/snapshot", payload=payload) + + print(json.dumps(result, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/aws/v2/local_start.sh b/aws/v2/local_start.sh new file mode 100755 index 0000000..31c404d --- /dev/null +++ b/aws/v2/local_start.sh @@ -0,0 +1,72 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Local MDF v2 FastAPI server startup. 
+# Usage: STORE_BACKEND=sqlite SQLITE_PATH=/tmp/mdf_connect_v2.db ./local_start.sh + +export STORE_BACKEND=${STORE_BACKEND:-sqlite} +export SQLITE_PATH=${SQLITE_PATH:-/tmp/mdf_connect_v2.db} +export STORAGE_BACKEND=${STORAGE_BACKEND:-local} +export ASYNC_DISPATCH_MODE=${ASYNC_DISPATCH_MODE:-inline} +export AUTH_MODE=${AUTH_MODE:-dev} +export ALLOW_ALL_CURATORS=${ALLOW_ALL_CURATORS:-true} +export CURATOR_GROUP_IDS=${CURATOR_GROUP_IDS:-} +export REQUIRED_GROUP_MEMBERSHIP=${REQUIRED_GROUP_MEMBERSHIP:-} +export USE_MOCK_DATACITE=${USE_MOCK_DATACITE:-true} +export LOCAL_HOST=${LOCAL_HOST:-127.0.0.1} +export LOCAL_PORT=${LOCAL_PORT:-8080} +export FORCE_RESTART=${FORCE_RESTART:-false} + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +ROOT_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)" +PID_FILE="${SCRIPT_DIR}/.local_server.pid" +LOG_FILE="${SCRIPT_DIR}/.local_server.log" + +if [[ -f "${PID_FILE}" ]]; then + PID=$(cat "${PID_FILE}") + if ps -p "${PID}" > /dev/null 2>&1; then + if [[ "${FORCE_RESTART}" == "true" || "${FORCE_RESTART}" == "1" ]]; then + echo "Restarting local server (pid ${PID})..." + kill "${PID}" || true + rm -f "${PID_FILE}" + else + echo "Local server already running (pid ${PID})." + echo "Stop it with: kill ${PID} && rm -f ${PID_FILE}" + echo "Or restart with: FORCE_RESTART=true ./local_start.sh" + exit 0 + fi + fi +fi + +( + cd "${ROOT_DIR}" + python3 -m v2.app.main +) >"${LOG_FILE}" 2>&1 & +PID=$! +sleep 1 +if ! kill -0 "${PID}" >/dev/null 2>&1; then + echo "Failed to start local server. Last log lines:" + tail -n 40 "${LOG_FILE}" || true + exit 1 +fi + +printf "%s" "${PID}" > "${PID_FILE}" + +cat < "${TMP_PAYLOAD}" +{ + "dc": { + "titles": [{"title": "Local Test Dataset"}], + "creators": [{"creatorName": "Doe, Jane"}] + }, + "data_sources": ["globus://example/collection"], + "test": true +} +JSON + +echo "Submitting test dataset..." 
+SUBMIT_RES=$(curl -s -X POST "${BASE_URL}/submit" \ + -H "Content-Type: application/json" \ + -d @"${TMP_PAYLOAD}") + +echo "Response:" +echo "${SUBMIT_RES}" + +# Pass the script via -c so stdin stays free for the JSON response. +SOURCE_ID=$(python3 -c ' +import json,sys +res=json.loads(sys.stdin.read()) +print(res["source_id"]) +' <<< "${SUBMIT_RES}") + +if [[ -z "${SOURCE_ID}" ]]; then + echo "Failed to parse source_id from submit response" >&2 + exit 1 +fi + +echo "Source ID: ${SOURCE_ID}" + +printf '\nFetching status...\n' +curl -s "${BASE_URL}/status/${SOURCE_ID}" + +printf '\nUpdating status to processing...\n' +curl -s -X POST "${BASE_URL}/status/update" \ + -H "Content-Type: application/json" \ + -d "{\"source_id\":\"${SOURCE_ID}\",\"version\":\"1.0\",\"status\":\"processing\"}" + +printf '\nFetching status again...\n' +curl -s "${BASE_URL}/status/${SOURCE_ID}" + +printf '\nListing submissions...\n' +curl -s "${BASE_URL}/submissions" + +printf '\nDone.\n' diff --git a/aws/v2/local_test_stream.sh b/aws/v2/local_test_stream.sh new file mode 100755 index 0000000..c145c5d --- /dev/null +++ b/aws/v2/local_test_stream.sh @@ -0,0 +1,58 @@ +#!/usr/bin/env bash +set -euo pipefail + +HOST=${LOCAL_HOST:-127.0.0.1} +PORT=${LOCAL_PORT:-8080} +BASE_URL="http://${HOST}:${PORT}" + +TMP_FILES="/tmp/mdf_stream_files.json" +cat <<'JSON' > "${TMP_FILES}" +{ + "files": [ + {"path": "file1.csv", "size": 1234}, + {"path": "file2.csv", "size": 5678} + ] +} +JSON + +echo "Creating stream..." +CREATE_RES=$(curl -s -X POST "${BASE_URL}/stream/create" \ + -H "Content-Type: application/json" \ + -d '{"title":"Test Stream","lab_id":"lab-1"}') + +echo "Response:" +echo "${CREATE_RES}" + +# Pass the script via -c so stdin stays free for the JSON response. +STREAM_ID=$(python3 -c ' +import json,sys +res=json.loads(sys.stdin.read()) +print(res["stream_id"]) +' <<< "${CREATE_RES}") + +if [[ -z "${STREAM_ID}" ]]; then + echo "Failed to parse stream_id" >&2 + exit 1 +fi + +echo "Stream ID: ${STREAM_ID}" + +echo "Appending files..."
+curl -s -X POST "${BASE_URL}/stream/${STREAM_ID}/append" \ + -H "Content-Type: application/json" \ + -d @"${TMP_FILES}" + +printf '\nFetching stream status...\n' +curl -s "${BASE_URL}/stream/${STREAM_ID}" + +printf '\nClosing stream...\n' +curl -s -X POST "${BASE_URL}/stream/${STREAM_ID}/close" \ + -H "Content-Type: application/json" \ + -d '{}' + +printf '\nSnapshotting stream...\n' +curl -s -X POST "${BASE_URL}/stream/${STREAM_ID}/snapshot" \ + -H "Content-Type: application/json" \ + -d "{\"stream_id\":\"${STREAM_ID}\"}" + +printf '\nDone.\n' diff --git a/aws/v2/metadata.py b/aws/v2/metadata.py new file mode 100644 index 0000000..a727bc6 --- /dev/null +++ b/aws/v2/metadata.py @@ -0,0 +1,661 @@ +"""Canonical metadata schema for MDF v2. + +Defines a flat, validated Pydantic model with researcher-friendly field names. +Translates to DataCite format only when minting DOIs. + +Key components: +- DatasetMetadata: the top-level schema (replaces dc/mdf/custom triple nesting) +- MLMetadata: first-class ML-readiness metadata (replaces projects.foundry) +- to_datacite(): DataCite kernel-4 payload builder +- migrate_v1_payload(): convert old dc/mdf/custom format to flat format +- parse_metadata(): parse a DB record into DatasetMetadata (handles both schemas) +""" + +from __future__ import annotations + +from datetime import datetime +from typing import Any, Dict, List, Optional + +from pydantic import BaseModel, Field, model_validator + + +# --------------------------------------------------------------------------- +# Sub-models +# --------------------------------------------------------------------------- + +class Author(BaseModel): + name: str + given_name: Optional[str] = None + family_name: Optional[str] = None + orcid: Optional[str] = None + affiliations: List[str] = Field(default_factory=list) + + +class FundingReference(BaseModel): + funder_name: str + award_number: Optional[str] = None + award_title: Optional[str] = None + funder_id: Optional[str] = None + funder_id_type:
Optional[str] = None + + +class RelatedWork(BaseModel): + identifier: str + identifier_type: str = "DOI" + relation_type: str = "References" + description: Optional[str] = None + + +class GeoLocation(BaseModel): + place: Optional[str] = None + point: Optional[Dict[str, Any]] = None + box: Optional[Dict[str, Any]] = None + + +class License(BaseModel): + name: str + url: Optional[str] = None + identifier: Optional[str] = None + + +class ExternalSource(BaseModel): + """Provenance for datasets cross-published from another repository.""" + source: str # e.g. "Zenodo", "Figshare", "NOMAD" + doi: Optional[str] = None # original DOI + url: Optional[str] = None # landing page URL + identifier: Optional[str] = None # repo-specific ID (e.g. zenodo record id) + doi_relation: str = "IsVariantFormOf" # DataCite relation type for the link + + +# --------------------------------------------------------------------------- +# ML-ready metadata (replaces projects.foundry) +# --------------------------------------------------------------------------- + +class DataKey(BaseModel): + name: str + role: str = "input" + description: Optional[str] = None + units: Optional[str] = None + dtype: Optional[str] = None + shape: Optional[List[int]] = None + classes: Optional[List[str]] = None + + +class DataSplit(BaseModel): + type: str + path: str + label: Optional[str] = None + n_items: Optional[int] = None + + +class MLMetadata(BaseModel): + data_format: str + task_type: List[str] = Field(default_factory=list) + domain: List[str] = Field(default_factory=list) + n_items: Optional[int] = None + splits: List[DataSplit] = Field(default_factory=list) + keys: List[DataKey] = Field(default_factory=list) + short_name: Optional[str] = None + + +# --------------------------------------------------------------------------- +# Top-level dataset metadata +# --------------------------------------------------------------------------- + +class DatasetMetadata(BaseModel): + # Required (DataCite mandatory) + 
title: str + authors: List[Author] + + # Recommended + description: Optional[str] = None + keywords: List[str] = Field(default_factory=list) + publisher: str = "Materials Data Facility" + publication_year: Optional[int] = None + resource_type: str = "Dataset" + + # Attribution & Provenance + license: Optional[License] = None + funding: List[FundingReference] = Field(default_factory=list) + related_works: List[RelatedWork] = Field(default_factory=list) + + # Scientific Context + methods: List[str] = Field(default_factory=list) + facility: Optional[str] = None + fields_of_science: List[str] = Field(default_factory=list) + domains: List[str] = Field(default_factory=list) + + # ML-Ready Data Structure + ml: Optional[MLMetadata] = None + + # Geospatial + geo_locations: List[GeoLocation] = Field(default_factory=list) + + # Data Description + data_sources: List[str] = Field(default_factory=list) + data_type: Optional[str] = None + formats: List[str] = Field(default_factory=list) + language: str = "en" + download_url: Optional[str] = None + archive_size: Optional[int] = None + + # External Import Provenance + external: Optional[ExternalSource] = None + + # Versioning + version: Optional[str] = None + previous_version: Optional[str] = None + root_version: Optional[str] = None + latest: bool = True + + # MDF Platform + organization: Optional[str] = None + tags: List[str] = Field(default_factory=list) + acl: List[str] = Field(default_factory=list) + extensions: Dict[str, Any] = Field(default_factory=dict) + + # Submission flags (not stored in metadata proper) + test: bool = False + update: bool = False + + @model_validator(mode="before") + @classmethod + def _migrate_external_fields(cls, values): + if isinstance(values, dict) and not values.get("external"): + ext_doi = values.pop("external_doi", None) + ext_url = values.pop("external_url", None) + ext_src = values.pop("external_source", None) + if ext_doi or ext_url or ext_src: + values["external"] = { + "doi": ext_doi, 
"url": ext_url, + "source": ext_src or "Unknown", + } + return values + + @property + def external_doi(self) -> Optional[str]: + return self.external.doi if self.external else None + + @property + def external_url(self) -> Optional[str]: + return self.external.url if self.external else None + + @property + def external_source(self) -> Optional[str]: + return self.external.source if self.external else None + + +# --------------------------------------------------------------------------- +# DataCite translation +# --------------------------------------------------------------------------- + +def _parse_author_name(author: Author) -> dict: + """Build a DataCite creator dict from an Author.""" + creator: Dict[str, Any] = {} + + given = author.given_name or "" + family = author.family_name or "" + + # Auto-parse if given/family not provided + if not given and not family and author.name: + if "," in author.name: + parts = author.name.split(",", 1) + family = parts[0].strip() + given = parts[1].strip() + else: + parts = author.name.rsplit(" ", 1) + if len(parts) == 2: + given = parts[0].strip() + family = parts[1].strip() + else: + family = author.name + + if family and given: + creator["name"] = f"{family}, {given}" + else: + creator["name"] = author.name + + if given: + creator["givenName"] = given + if family: + creator["familyName"] = family + + if author.affiliations: + creator["affiliation"] = [{"name": a} for a in author.affiliations] + + if author.orcid: + creator["nameIdentifiers"] = [{ + "nameIdentifier": f"https://orcid.org/{author.orcid}", + "nameIdentifierScheme": "ORCID", + "schemeUri": "https://orcid.org", + }] + + return creator + + +def to_datacite( + meta: DatasetMetadata, + doi: str, + url: str, + source_id: Optional[str] = None, + created_at: Optional[str] = None, + published_at: Optional[str] = None, +) -> dict: + """Translate DatasetMetadata to a DataCite API payload (kernel 4.5). 
+ + Returns the full payload ready for POST to DataCite /dois endpoint. + """ + pub_year = meta.publication_year or datetime.now().year + + creators = [_parse_author_name(a) for a in meta.authors] + if not creators: + creators = [{"name": "Materials Data Facility"}] + + attributes: Dict[str, Any] = { + "doi": doi, + "url": url, + "titles": [{"title": meta.title}], + "creators": creators, + "publisher": meta.publisher, + "publicationYear": int(pub_year), + "types": {"resourceTypeGeneral": meta.resource_type or "Dataset"}, + "schemaVersion": "http://datacite.org/schema/kernel-4", + } + + # Description + if meta.description: + attributes["descriptions"] = [{ + "description": meta.description, + "descriptionType": "Abstract", + }] + + # Subjects (keywords + fields_of_science) + subjects = [] + for kw in meta.keywords: + subjects.append({"subject": kw}) + for fos in meta.fields_of_science: + subjects.append({"subject": fos, "subjectScheme": "Fields of Science"}) + if subjects: + attributes["subjects"] = subjects + + # Language + if meta.language: + attributes["language"] = meta.language + + # Formats + if meta.formats: + attributes["formats"] = meta.formats + + # Rights / License + if meta.license: + rights_entry: Dict[str, str] = {"rights": meta.license.name} + if meta.license.url: + rights_entry["rightsUri"] = meta.license.url + if meta.license.identifier: + rights_entry["rightsIdentifier"] = meta.license.identifier + attributes["rightsList"] = [rights_entry] + + # Funding references + if meta.funding: + funding_refs = [] + for f in meta.funding: + ref: Dict[str, Any] = {"funderName": f.funder_name} + if f.award_number: + ref["awardNumber"] = f.award_number + if f.award_title: + ref["awardTitle"] = f.award_title + if f.funder_id: + ref["funderIdentifier"] = f.funder_id + ref["funderIdentifierType"] = f.funder_id_type or "Crossref Funder ID" + funding_refs.append(ref) + attributes["fundingReferences"] = funding_refs + + # Related identifiers + if meta.related_works: + 
related = [] + for rw in meta.related_works: + related.append({ + "relatedIdentifier": rw.identifier, + "relatedIdentifierType": rw.identifier_type, + "relationType": rw.relation_type, + }) + attributes["relatedIdentifiers"] = related + + # External DOI relation (cross-publish provenance) + if meta.external and meta.external.doi: + has_ext = any( + rw.identifier == meta.external.doi for rw in meta.related_works + ) + if not has_ext: + if "relatedIdentifiers" not in attributes: + attributes["relatedIdentifiers"] = [] + attributes["relatedIdentifiers"].append({ + "relatedIdentifier": meta.external.doi, + "relatedIdentifierType": "DOI", + "relationType": meta.external.doi_relation, + }) + + # GeoLocations + if meta.geo_locations: + geo_locs = [] + for gl in meta.geo_locations: + loc: Dict[str, Any] = {} + if gl.place: + loc["geoLocationPlace"] = gl.place + if gl.point: + loc["geoLocationPoint"] = { + "pointLatitude": gl.point.get("latitude"), + "pointLongitude": gl.point.get("longitude"), + } + if gl.box: + loc["geoLocationBox"] = gl.box + geo_locs.append(loc) + attributes["geoLocations"] = geo_locs + + # Alternate identifiers (source_id) + if source_id: + attributes["alternateIdentifiers"] = [{ + "alternateIdentifier": source_id, + "alternateIdentifierType": "MDF Source ID", + }] + + # Dates + dates = [] + if created_at: + dates.append({"date": created_at[:10], "dateType": "Created"}) + if published_at: + dates.append({"date": published_at[:10], "dateType": "Available"}) + if dates: + attributes["dates"] = dates + + return { + "data": { + "type": "dois", + "attributes": attributes, + } + } + + +# --------------------------------------------------------------------------- +# v1 migration (dc/mdf/custom/projects.foundry -> flat format) +# --------------------------------------------------------------------------- + +def migrate_v1_payload(old: dict) -> dict: + """Convert a v1 dc/mdf/custom payload to the new flat format. 
+ + Handles: + - dc.titles -> title + - dc.creators -> authors (with name parsing) + - dc.descriptions -> description + - dc.subjects -> keywords + - dc.publisher -> publisher + - dc.publicationYear -> publication_year + - dc.resourceType -> resource_type + - dc.relatedIdentifiers -> related_works + - mdf.organization -> organization + - mdf.instruments -> methods + - mdf.facility -> facility + - mdf.acl -> acl + - mdf.doi -> (stored separately on record) + - projects.foundry -> ml + - custom -> extensions + - tags (subjects) -> keywords or tags + """ + dc = old.get("dc") or {} + mdf = old.get("mdf") or {} + custom = old.get("custom") or {} + projects = old.get("projects") or {} + + result: Dict[str, Any] = {} + + # Title + titles = dc.get("titles") or [] + if titles: + first = titles[0] + result["title"] = first.get("title") if isinstance(first, dict) else str(first) + else: + result["title"] = "Untitled" + + # Authors + authors = [] + for c in dc.get("creators") or []: + if isinstance(c, dict): + author: Dict[str, Any] = {} + author["name"] = c.get("creatorName") or f"{c.get('givenName', '')} {c.get('familyName', '')}".strip() + if c.get("givenName"): + author["given_name"] = c["givenName"] + if c.get("familyName"): + author["family_name"] = c["familyName"] + if c.get("affiliation"): + aff = c["affiliation"] + author["affiliations"] = [aff] if isinstance(aff, str) else aff + if c.get("affiliations"): + author["affiliations"] = c["affiliations"] + # ORCID from nameIdentifiers + for ni in c.get("nameIdentifiers") or []: + if isinstance(ni, dict) and ni.get("nameIdentifierScheme") == "ORCID": + orcid = ni.get("nameIdentifier", "") + # Strip URL prefix if present + orcid = orcid.replace("https://orcid.org/", "").replace("http://orcid.org/", "") + author["orcid"] = orcid + authors.append(author) + elif isinstance(c, str): + authors.append({"name": c}) + result["authors"] = authors if authors else [{"name": "Unknown"}] + + # Description + descriptions = 
dc.get("descriptions") or [] + if descriptions: + first_desc = descriptions[0] + result["description"] = first_desc.get("description") if isinstance(first_desc, dict) else str(first_desc) + + # Keywords (from dc.subjects) + keywords = [] + for subj in dc.get("subjects") or []: + if isinstance(subj, dict): + keywords.append(subj.get("subject", "")) + elif isinstance(subj, str): + keywords.append(subj) + keywords = [k for k in keywords if k] + if keywords: + result["keywords"] = keywords + + # Publisher + result["publisher"] = dc.get("publisher") or "Materials Data Facility" + + # Publication year + pub_year = dc.get("publicationYear") + if pub_year: + try: + result["publication_year"] = int(pub_year) + except (ValueError, TypeError): + pass + + # Resource type + rt = dc.get("resourceType") + if isinstance(rt, dict): + result["resource_type"] = rt.get("resourceType") or rt.get("resourceTypeGeneral") or "Dataset" + elif isinstance(rt, str): + result["resource_type"] = rt + + # Related identifiers -> related_works + related = dc.get("relatedIdentifiers") or [] + if related: + works = [] + for ri in related: + if isinstance(ri, dict): + works.append({ + "identifier": ri.get("relatedIdentifier", ""), + "identifier_type": ri.get("relatedIdentifierType", "DOI"), + "relation_type": ri.get("relationType", "References"), + }) + if works: + result["related_works"] = works + + # Rights -> license + rights = dc.get("rights") or dc.get("rightsList") or [] + if rights and isinstance(rights, list) and rights: + r = rights[0] + if isinstance(r, dict): + result["license"] = { + "name": r.get("rights", ""), + "url": r.get("rightsURI") or r.get("rightsUri"), + } + + # MDF block + if mdf.get("organization"): + result["organization"] = mdf["organization"] + elif mdf.get("organizations"): + orgs = mdf["organizations"] + if isinstance(orgs, list) and orgs: + result["organization"] = orgs[0] + if mdf.get("instruments"): + instr = mdf["instruments"] + result["methods"] = instr if 
isinstance(instr, list) else [str(instr)] + if mdf.get("facility"): + result["facility"] = mdf["facility"] + if mdf.get("acl"): + result["acl"] = mdf["acl"] + if mdf.get("source_id"): + result.setdefault("extensions", {})["mdf_source_id"] = mdf["source_id"] + if mdf.get("source_name"): + result.setdefault("extensions", {})["mdf_source_name"] = mdf["source_name"] + + # Data sources + if old.get("data_sources"): + result["data_sources"] = old["data_sources"] + + # Tags + if old.get("tags"): + result["tags"] = old["tags"] + + # Test / Update flags + if old.get("test"): + result["test"] = old["test"] + if old.get("update"): + result["update"] = old["update"] + + # Custom -> extensions + if custom: + result.setdefault("extensions", {}).update(custom) + + # Projects -> extensions (except foundry which becomes ml) + foundry = projects.get("foundry") + if foundry: + result["ml"] = _migrate_foundry(foundry) + + other_projects = {k: v for k, v in projects.items() if k != "foundry"} + if other_projects: + result.setdefault("extensions", {}).update(other_projects) + + return result + + +def _migrate_foundry(foundry: dict) -> dict: + """Convert projects.foundry schema to MLMetadata dict.""" + ml: Dict[str, Any] = {} + + ml["data_format"] = foundry.get("data_type") or foundry.get("data_format") or "unknown" + + if foundry.get("short_name"): + ml["short_name"] = foundry["short_name"] + + if foundry.get("n_items"): + ml["n_items"] = foundry["n_items"] + + # Splits + splits = [] + for s in foundry.get("splits") or []: + if isinstance(s, dict): + splits.append({ + "type": s.get("type", ""), + "path": s.get("path", ""), + "label": s.get("label"), + "n_items": s.get("n_items"), + }) + if splits: + ml["splits"] = splits + + # Keys + keys = [] + for k in foundry.get("keys") or foundry.get("key") or []: + if isinstance(k, dict): + key_entry: Dict[str, Any] = {} + # Foundry uses key[].key (list of strings) or just a string + key_names = k.get("key") or k.get("name") + if 
isinstance(key_names, list): + # Foundry style: key: ["col1", "col2"] with a shared type + for kn in key_names: + keys.append({ + "name": kn, + "role": k.get("type", "input"), + "units": k.get("units"), + "description": k.get("description"), + }) + continue + else: + key_entry["name"] = str(key_names) if key_names else "" + + key_entry["role"] = k.get("type") or k.get("role") or "input" + if k.get("units"): + key_entry["units"] = k["units"] + if k.get("description"): + key_entry["description"] = k["description"] + if k.get("classes"): + key_entry["classes"] = k["classes"] + keys.append(key_entry) + + if keys: + ml["keys"] = keys + + return ml + + +# --------------------------------------------------------------------------- +# Metadata parsing from DB records +# --------------------------------------------------------------------------- + +def _is_v1_format(mdata: dict) -> bool: + """Check if metadata is in the old dc/mdf/custom v1 format.""" + dc = mdata.get("dc") + if isinstance(dc, dict) and ("titles" in dc or "creators" in dc): + return True + return False + + +def parse_metadata(record: dict) -> DatasetMetadata: + """Parse a submission record into DatasetMetadata. + + Handles both v1 (dc/mdf/custom stored in dataset_mdata) and v2 (flat) + stored formats. Accepts either a full DB record (with dataset_mdata key) + or a raw metadata dict. 
+ """ + import json as _json + + # Extract the metadata dict from the record + mdata = record.get("dataset_mdata") or record + if isinstance(mdata, str): + try: + mdata = _json.loads(mdata) + except Exception: + mdata = {} + if not isinstance(mdata, dict): + mdata = {} + + # If it's v1 format, migrate first + if _is_v1_format(mdata): + mdata = migrate_v1_payload(mdata) + + # If it's already a flat format (has "title" at top level), use directly + if "title" not in mdata: + # Fallback: might be a record without dataset_mdata + mdata.setdefault("title", record.get("title", "Untitled")) + if "authors" not in mdata: + mdata["authors"] = [{"name": "Unknown"}] + + # Ensure authors is well-formed + authors = mdata.get("authors", []) + if not authors: + mdata["authors"] = [{"name": "Unknown"}] + + return DatasetMetadata.model_validate(mdata) diff --git a/aws/v2/preview.py b/aws/v2/preview.py new file mode 100644 index 0000000..9b00aeb --- /dev/null +++ b/aws/v2/preview.py @@ -0,0 +1,235 @@ +"""Dataset and file preview for MDF v2. + +Provides preview capabilities so researchers can inspect data +before downloading entire datasets. 
+ +Supported previews: +- CSV/TSV: First N rows, column statistics +- JSON: Structure/schema, first N keys +- Text: First N lines +- Images: Dimensions, format info +- Binary: File info only +""" + +import base64 +import csv +import io +import json +import os +from typing import Any, Dict, List, Optional + +from v2.storage import get_storage_backend + + +def preview_csv(content: bytes, max_rows: int = 20) -> Dict[str, Any]: + """Preview a CSV file.""" + try: + text = content.decode("utf-8") + except UnicodeDecodeError: + text = content.decode("latin-1") + + reader = csv.reader(io.StringIO(text)) + rows = list(reader) + + if not rows: + return {"type": "csv", "empty": True} + + headers = rows[0] if rows else [] + data_rows = rows[1:max_rows + 1] + total_rows = len(rows) - 1 # Exclude header + + # Calculate column statistics + columns = [] + for i, header in enumerate(headers): + col_values = [row[i] for row in rows[1:] if i < len(row)] + + # Try to detect numeric columns + numeric_values = [] + for v in col_values: + try: + numeric_values.append(float(v)) + except (ValueError, TypeError): + pass + + col_info = { + "name": header, + "index": i, + "non_null_count": len([v for v in col_values if v]), + "sample_values": col_values[:5], + } + + if len(numeric_values) > len(col_values) * 0.5: # Mostly numeric + col_info["type"] = "numeric" + if numeric_values: + col_info["min"] = min(numeric_values) + col_info["max"] = max(numeric_values) + col_info["mean"] = sum(numeric_values) / len(numeric_values) + else: + col_info["type"] = "string" + unique = set(col_values) + col_info["unique_count"] = len(unique) + if len(unique) <= 10: + col_info["unique_values"] = list(unique)[:10] + + columns.append(col_info) + + return { + "type": "csv", + "headers": headers, + "columns": columns, + "total_rows": total_rows, + "preview_rows": len(data_rows), + "rows": data_rows, + "truncated": total_rows > max_rows, + } + + +def preview_json(content: bytes, max_keys: int = 50) -> Dict[str, 
Any]: + """Preview a JSON file.""" + try: + data = json.loads(content.decode("utf-8")) + except (json.JSONDecodeError, UnicodeDecodeError) as e: + return {"type": "json", "error": str(e)} + + def summarize(obj, depth=0, max_depth=3): + """Recursively summarize JSON structure.""" + if depth >= max_depth: + return {"_truncated": True, "_type": type(obj).__name__} + + if isinstance(obj, dict): + result = {} + for i, (k, v) in enumerate(obj.items()): + if i >= max_keys: + result["_more_keys"] = len(obj) - max_keys + break + result[k] = summarize(v, depth + 1, max_depth) + return result + elif isinstance(obj, list): + if not obj: + return [] + # Show first few items + sample = [summarize(item, depth + 1, max_depth) for item in obj[:3]] + if len(obj) > 3: + sample.append({"_more_items": len(obj) - 3}) + return sample + else: + return obj + + return { + "type": "json", + "structure": summarize(data), + "size_bytes": len(content), + "is_array": isinstance(data, list), + "is_object": isinstance(data, dict), + "top_level_keys": list(data.keys())[:20] if isinstance(data, dict) else None, + "array_length": len(data) if isinstance(data, list) else None, + } + + +def preview_text(content: bytes, max_lines: int = 50) -> Dict[str, Any]: + """Preview a text file.""" + try: + text = content.decode("utf-8") + except UnicodeDecodeError: + try: + text = content.decode("latin-1") + except UnicodeDecodeError: + return {"type": "text", "error": "Unable to decode as text"} + + lines = text.split("\n") + total_lines = len(lines) + + return { + "type": "text", + "total_lines": total_lines, + "preview_lines": min(max_lines, total_lines), + "lines": lines[:max_lines], + "truncated": total_lines > max_lines, + "size_bytes": len(content), + } + + +def preview_binary(content: bytes, filename: str) -> Dict[str, Any]: + """Preview info for binary files.""" + ext = filename.rsplit(".", 1)[-1].lower() if "." 
in filename else "" + + info = { + "type": "binary", + "size_bytes": len(content), + "extension": ext, + } + + # Check for known magic bytes + if content[:4] == b"\x89PNG": + info["format"] = "PNG image" + # Parse PNG dimensions + if len(content) > 24: + width = int.from_bytes(content[16:20], "big") + height = int.from_bytes(content[20:24], "big") + info["dimensions"] = {"width": width, "height": height} + + elif content[:2] == b"\xff\xd8": + info["format"] = "JPEG image" + + elif content[:4] == b"PK\x03\x04": + info["format"] = "ZIP archive" + + elif content[:8] == b"\x89HDF\r\n\x1a\n": + info["format"] = "HDF5 file" + + elif content[:4] == b"CDF\x01" or content[:4] == b"CDF\x02": + info["format"] = "NetCDF file" + + elif ext in ("npy", "npz"): + info["format"] = "NumPy array" + + elif ext == "cif": + info["format"] = "Crystallographic Information File" + # Try to extract basic CIF info + try: + text = content.decode("utf-8") + for line in text.split("\n")[:100]: + if line.startswith("_chemical_formula_sum"): + info["formula"] = line.split(None, 1)[1].strip().strip("'\"") + elif line.startswith("_symmetry_space_group_name"): + info["space_group"] = line.split(None, 1)[1].strip().strip("'\"") + except Exception: + pass + + return info + + +def generate_preview( + content: bytes, + filename: str, + content_type: str = "", + max_rows: int = 20, + max_lines: int = 50, +) -> Dict[str, Any]: + """Generate a preview for any file type.""" + ext = filename.rsplit(".", 1)[-1].lower() if "." 
in filename else "" + + # Determine preview type + if ext in ("csv", "tsv") or "csv" in content_type: + return preview_csv(content, max_rows) + + elif ext == "json" or "json" in content_type: + return preview_json(content) + + elif ext in ("txt", "md", "log", "py", "yaml", "yml", "xml", "html") or "text" in content_type: + return preview_text(content, max_lines) + + else: + # Try to detect if it's text + try: + sample = content[:1000].decode("utf-8") + # Check if it looks like text (mostly printable) + printable = sum(1 for c in sample if c.isprintable() or c in "\n\r\t") + if printable > len(sample) * 0.9: + return preview_text(content, max_lines) + except UnicodeDecodeError: + pass + + return preview_binary(content, filename) + + diff --git a/aws/v2/profiler.py b/aws/v2/profiler.py new file mode 100644 index 0000000..74f7f6c --- /dev/null +++ b/aws/v2/profiler.py @@ -0,0 +1,333 @@ +"""Dataset profiling for MDF v2. + +Scans files associated with a submission and produces a structured +profile with column statistics, sample data, and format information. 
+""" + +import csv +import io +import json +import logging +from datetime import datetime, timezone +from typing import Any, Dict, List, Optional + +from pydantic import BaseModel + +from v2.preview import generate_preview +from v2.storage import StorageBackend + +logger = logging.getLogger(__name__) + +MAX_FILE_BYTES = 10 * 1024 * 1024 # 10 MB cap per file for profiling +MAX_SAMPLE_ROWS = 5 + + +class ColumnSummary(BaseModel): + name: str + dtype: str # "float64", "int64", "string", "bool", "datetime" + count: int # non-null count + nulls: int = 0 + unique: Optional[int] = None + min: Optional[float] = None + max: Optional[float] = None + mean: Optional[float] = None + std: Optional[float] = None + top_values: List[str] = [] + + +class FileProfile(BaseModel): + path: str + filename: str + size_bytes: int + content_type: str + format: str # "csv", "json", "hdf5", "cif", "image", "text", "binary" + columns: List[ColumnSummary] = [] + n_rows: Optional[int] = None + sample_rows: List[Dict] = [] + structure: Optional[Dict] = None + preview_lines: List[str] = [] + extra: Dict[str, Any] = {} + + +class DatasetProfile(BaseModel): + source_id: str + profiled_at: str + total_files: int + total_bytes: int + formats: Dict[str, int] = {} + files: List[FileProfile] = [] + + +def _detect_format(filename: str, content: Optional[bytes] = None) -> str: + ext = filename.rsplit(".", 1)[-1].lower() if "." 
in filename else "" + format_map = { + "csv": "csv", "tsv": "csv", + "json": "json", "jsonl": "json", + "hdf5": "hdf5", "h5": "hdf5", "hdf": "hdf5", + "cif": "cif", + "png": "image", "jpg": "image", "jpeg": "image", + "tif": "image", "tiff": "image", "bmp": "image", + "txt": "text", "md": "text", "log": "text", + "py": "text", "yaml": "text", "yml": "text", + "xml": "text", "html": "text", + } + if ext in format_map: + return format_map[ext] + + if content: + if content[:8] == b"\x89HDF\r\n\x1a\n": + return "hdf5" + if content[:4] == b"\x89PNG": + return "image" + if content[:2] == b"\xff\xd8": + return "image" + try: + content[:1000].decode("utf-8") + return "text" + except UnicodeDecodeError: + pass + + return "binary" + + +def _profile_csv(content: bytes, filename: str) -> dict: + """Profile a CSV/TSV file, returning columns, sample_rows, and n_rows.""" + try: + text = content.decode("utf-8") + except UnicodeDecodeError: + text = content.decode("latin-1") + + dialect = csv.Sniffer().sniff(text[:4096]) if text[:4096].strip() else None + reader = csv.reader(io.StringIO(text), dialect or csv.excel) + rows = list(reader) + if not rows: + return {"columns": [], "sample_rows": [], "n_rows": 0} + + headers = rows[0] + data_rows = rows[1:] + n_rows = len(data_rows) + + columns = [] + for i, header in enumerate(headers): + col_values = [row[i] for row in data_rows if i < len(row)] + non_null = [v for v in col_values if v.strip()] + nulls = len(col_values) - len(non_null) + + # Try numeric detection + numeric = [] + for v in non_null: + try: + numeric.append(float(v)) + except (ValueError, TypeError): + pass + + if len(numeric) > len(non_null) * 0.5 and numeric: + mean_val = sum(numeric) / len(numeric) + variance = sum((x - mean_val) ** 2 for x in numeric) / max(len(numeric) - 1, 1) + std_val = variance ** 0.5 + # Detect int vs float + all_int = all(x == int(x) for x in numeric) + dtype = "int64" if all_int else "float64" + columns.append(ColumnSummary( + name=header, + 
dtype=dtype, + count=len(non_null), + nulls=nulls, + min=min(numeric), + max=max(numeric), + mean=round(mean_val, 6), + std=round(std_val, 6), + )) + else: + unique_vals = set(non_null) + top = sorted(unique_vals, key=lambda v: non_null.count(v), reverse=True)[:5] + columns.append(ColumnSummary( + name=header, + dtype="string", + count=len(non_null), + nulls=nulls, + unique=len(unique_vals), + top_values=top, + )) + + # Build sample rows as list of dicts + sample_rows = [] + for row in data_rows[:MAX_SAMPLE_ROWS]: + row_dict = {} + for i, header in enumerate(headers): + row_dict[header] = row[i] if i < len(row) else "" + sample_rows.append(row_dict) + + return {"columns": columns, "sample_rows": sample_rows, "n_rows": n_rows} + + +def _profile_json(content: bytes) -> dict: + """Profile a JSON file, returning structure summary.""" + try: + data = json.loads(content.decode("utf-8")) + except (json.JSONDecodeError, UnicodeDecodeError): + return {"structure": None} + + def summarize(obj, depth=0, max_depth=3): + if depth >= max_depth: + return {"_type": type(obj).__name__} + if isinstance(obj, dict): + return {k: summarize(v, depth + 1) for k, v in list(obj.items())[:20]} + elif isinstance(obj, list): + if not obj: + return [] + return [summarize(obj[0], depth + 1), f"... 
({len(obj)} items)"] + else: + return type(obj).__name__ + + result = {"structure": summarize(data)} + + # If it's an array of dicts (tabular-like JSON), extract sample rows + if isinstance(data, list) and data and isinstance(data[0], dict): + result["n_rows"] = len(data) + result["sample_rows"] = data[:MAX_SAMPLE_ROWS] + # Extract "columns" from the first record's keys + first = data[0] + columns = [] + for key, val in first.items(): + # bool is a subclass of int, so it must be checked before int/float + if isinstance(val, bool): + dtype = "bool" + elif isinstance(val, (int, float)): + dtype = "float64" + elif isinstance(val, str): + dtype = "string" + else: + dtype = type(val).__name__ + columns.append(ColumnSummary( + name=key, dtype=dtype, count=len(data), + )) + result["columns"] = columns + + return result + + +def _profile_cif(content: bytes) -> dict: + """Extract basic CIF info.""" + extra = {} + try: + text = content.decode("utf-8") + for line in text.split("\n")[:200]: + if line.startswith("_chemical_formula_sum"): + extra["formula"] = line.split(None, 1)[1].strip().strip("'\"") + elif line.startswith("_symmetry_space_group_name"): + extra["space_group"] = line.split(None, 1)[1].strip().strip("'\"") + except Exception: + pass + return {"extra": extra} + + +def _profile_image(content: bytes, filename: str) -> dict: + """Extract basic image info.""" + extra = {} + if content[:4] == b"\x89PNG" and len(content) > 24: + extra["width"] = int.from_bytes(content[16:20], "big") + extra["height"] = int.from_bytes(content[20:24], "big") + extra["image_format"] = "PNG" + elif content[:2] == b"\xff\xd8": + extra["image_format"] = "JPEG" + else: + ext = filename.rsplit(".", 1)[-1].upper() if "." 
in filename else "unknown" + extra["image_format"] = ext + return {"extra": extra} + + +def _profile_text(content: bytes, max_lines: int = 30) -> dict: + """Extract preview lines from text files.""" + try: + text = content.decode("utf-8") + except UnicodeDecodeError: + try: + text = content.decode("latin-1") + except UnicodeDecodeError: + return {"preview_lines": []} + + lines = text.split("\n")[:max_lines] + return {"preview_lines": lines} + + +def _profile_file(file_meta, content: Optional[bytes]) -> FileProfile: + """Build a FileProfile for one file.""" + filename = file_meta.filename + fmt = _detect_format(filename, content) + size_bytes = file_meta.size_bytes or (len(content) if content else 0) + + profile = FileProfile( + path=file_meta.path, + filename=filename, + size_bytes=size_bytes, + content_type=file_meta.content_type or "", + format=fmt, + ) + + if content is None or size_bytes > MAX_FILE_BYTES: + return profile + + if fmt == "csv": + result = _profile_csv(content, filename) + profile.columns = result.get("columns", []) + profile.sample_rows = result.get("sample_rows", []) + profile.n_rows = result.get("n_rows") + elif fmt == "json": + result = _profile_json(content) + profile.structure = result.get("structure") + profile.sample_rows = result.get("sample_rows", []) + profile.n_rows = result.get("n_rows") + profile.columns = result.get("columns", []) + elif fmt == "cif": + result = _profile_cif(content) + profile.extra = result.get("extra", {}) + elif fmt == "image": + result = _profile_image(content, filename) + profile.extra = result.get("extra", {}) + elif fmt == "text": + result = _profile_text(content) + profile.preview_lines = result.get("preview_lines", []) + + return profile + + +def build_dataset_profile(source_id: str, storage: StorageBackend) -> DatasetProfile: + """Scan files for a dataset and build a structured profile. + + Args: + source_id: The source_id / stream_id to scan files for. + storage: The storage backend to read files from. 
+ + Returns: + A DatasetProfile with file-level details and aggregate stats. + """ + files = storage.list_files(source_id) + + total_bytes = 0 + format_counts: Dict[str, int] = {} + file_profiles: List[FileProfile] = [] + + for file_meta in files: + size = file_meta.size_bytes or 0 + total_bytes += size + + # Only read file content if under the size cap + content = None + if size <= MAX_FILE_BYTES: + try: + content = storage.get_file(file_meta.path) + except Exception: + logger.debug("Could not read %s for profiling", file_meta.path) + + fp = _profile_file(file_meta, content) + file_profiles.append(fp) + format_counts[fp.format] = format_counts.get(fp.format, 0) + 1 + + return DatasetProfile( + source_id=source_id, + profiled_at=datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"), + total_files=len(files), + total_bytes=total_bytes, + formats=format_counts, + files=file_profiles, + ) diff --git a/aws/v2/requirements-test.txt b/aws/v2/requirements-test.txt new file mode 100644 index 0000000..b199470 --- /dev/null +++ b/aws/v2/requirements-test.txt @@ -0,0 +1,10 @@ +# Backend test dependencies (no AWS credentials needed) +fastapi>=0.100.0 +mangum>=0.17.0 +uvicorn[standard]>=0.23.0 +pydantic>=2.0 +httpx +globus-sdk>=3.0 +click +requests +pytest diff --git a/aws/v2/scripts/convert_production_datasets.py b/aws/v2/scripts/convert_production_datasets.py new file mode 100644 index 0000000..744e2a0 --- /dev/null +++ b/aws/v2/scripts/convert_production_datasets.py @@ -0,0 +1,295 @@ +#!/usr/bin/env python3 +"""Convert extracted MDF production datasets from v1 format to v2 flat format. + +Reads the raw gmeta dump (from extract_mdf_production_datasets.py) and runs +each entry through migrate_v1_payload() to produce the v2 DatasetMetadata +format. Outputs a JSON file with one record per dataset. + +Skips: services, mrr, jarvis, oqmd blocks (not needed). +Organizations: collapsed to first entry when plural. +DOI: preserved from dc.identifier. 
+Version linking: builds previous_version chains from source_name groups. +Download URL: constructed from endpoint_path when available. + +Usage: + python cs/aws/v2/scripts/convert_production_datasets.py + python cs/aws/v2/scripts/convert_production_datasets.py -i datasets.json -o converted.json +""" + +import argparse +import json +import re +import sys +import os +from collections import defaultdict + +# Allow importing v2.metadata from the scripts/ directory +sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..")) + +from v2.metadata import migrate_v1_payload + + +def parse_version_number(version_raw) -> tuple: + """Normalize version to a comparable tuple. + + Handles: 1, "1", "1.0", "1.1", "1.2", etc. + Returns (major, minor) tuple for sorting. + """ + s = str(version_raw) + m = re.match(r"(\d+)(?:\.(\d+))?", s) + if m: + return (int(m.group(1)), int(m.group(2) or 0)) + return (0, 0) + + +def make_version_key(record: dict) -> str: + """Build a unique version key for a record. + + Uses source_id + version to uniquely identify a specific version, + since some datasets share the same source_id across versions. + """ + sid = record["source_id"] or "" + v = record["version"] + if v is not None: + return f"{sid}_v{v}" + return sid + + +def build_version_chains(records: list): + """Build previous_version and root_version for all records. + + Three cases: + 1. Multiple versions present in the index (same source_name, different + versions): chain them together, root is the earliest. + 2. Single entry but version > 1: prior versions were superseded and + aren't in the index. Use source_name as the root identifier to + represent the original dataset lineage. + 3. Single entry, version 1: standalone. root_version = its own source_id. + + Returns (prev_map, root_map, latest_set, stats). prev_map and root_map + are keyed by make_version_key(record); latest_set holds the version + keys of the latest record in each source_name group. 
+ """ + by_source_name = defaultdict(list) + for r in records: + sn = r["source_name"] + if sn: + by_source_name[sn].append(r) + + prev_map = {} + root_map = {} + latest_set = set() # version_keys of the latest version per group + multi_present = 0 + implicit_versioned = 0 + + for source_name, group in by_source_name.items(): + group.sort(key=lambda r: parse_version_number(r["version"])) + + if len(group) > 1: + # Case 1: multiple versions present in index + multi_present += 1 + + unique_sids = set(r["source_id"] for r in group) + needs_version_suffix = len(unique_sids) == 1 + + root_r = group[0] + root_ref = make_version_key(root_r) if needs_version_suffix else root_r["source_id"] + + for r in group: + root_map[make_version_key(r)] = root_ref + + for i in range(1, len(group)): + cur_key = make_version_key(group[i]) + prev_r = group[i - 1] + prev_ref = make_version_key(prev_r) if needs_version_suffix else prev_r["source_id"] + prev_map[cur_key] = prev_ref + + # Last in sorted order is the latest + latest_set.add(make_version_key(group[-1])) + else: + # Single entry — it's the latest (and only) version + r = group[0] + major, _ = parse_version_number(r["version"]) + key = make_version_key(r) + latest_set.add(key) + + if major > 1: + # Case 2: version > 1 but prior versions not in index. + implicit_versioned += 1 + root_map[key] = source_name + else: + # Case 3: standalone v1 — root is itself + root_map[key] = r["source_id"] + + stats = { + "multi_present": multi_present, + "implicit_versioned": implicit_versioned, + "prev_links": len(prev_map), + "root_set": len(root_map), + "latest_count": len(latest_set), + } + return prev_map, root_map, latest_set, stats + + +def build_download_url(endpoint_path: str) -> str | None: + """Build a direct HTTPS download URL from a Globus endpoint path. 
+ + Converts: globus://82f1b5c6-6e9b-11e5-ba47-22000b92c6ec/path/ + To: https://data.materialsdatafacility.org/path/ + """ + if not endpoint_path: + return None + # Strip the globus://UUID/ prefix + m = re.match(r"globus://[a-f0-9-]+/(.+)", endpoint_path) + if m: + return f"https://data.materialsdatafacility.org/{m.group(1)}" + return None + + +def convert_entry(gmeta_entry: dict) -> dict: + """Convert a single gmeta entry to v2 format. + + Returns a record with: + - source_id, source_name, version, ingest_date (from mdf block) + - doi (from dc.identifier if present) + - endpoint_path (from data block) + - metadata: the v2 flat metadata dict + """ + content = gmeta_entry.get("entries", [{}])[0].get("content", {}) + mdf = content.get("mdf", {}) + dc = content.get("dc", {}) + data = content.get("data", {}) + + # Run the v1 -> v2 migration + v2_metadata = migrate_v1_payload(content) + + # Extract DOI from dc.identifier if present + doi = None + dc_id = dc.get("identifier") + if isinstance(dc_id, dict): + doi = dc_id.get("identifier") + elif isinstance(dc_id, str): + doi = dc_id + + endpoint_path = data.get("endpoint_path") + + # Set download_url from endpoint_path + download_url = build_download_url(endpoint_path) + if download_url: + v2_metadata["download_url"] = download_url + + # Build the output record + record = { + "source_id": mdf.get("source_id"), + "source_name": mdf.get("source_name"), + "version": mdf.get("version"), + "ingest_date": mdf.get("ingest_date"), + "doi": doi, + "endpoint_path": endpoint_path, + "metadata": v2_metadata, + } + + return record + + +def main(): + parser = argparse.ArgumentParser( + description="Convert MDF production datasets from v1 to v2 format." 
+ ) + parser.add_argument( + "-i", "--input", + default="datasets.json", + help="Input JSON from extract script (default: datasets.json)", + ) + parser.add_argument( + "-o", "--output", + default="converted_datasets.json", + help="Output JSON file (default: converted_datasets.json)", + ) + args = parser.parse_args() + + with open(args.input) as f: + data = json.load(f) + + gmeta_list = data.get("gmeta", []) + print(f"Input: {len(gmeta_list)} entries from {args.input}") + + converted = [] + errors = [] + for i, entry in enumerate(gmeta_list): + try: + record = convert_entry(entry) + converted.append(record) + except Exception as exc: + subject = entry.get("subject", f"entry_{i}") + errors.append({"subject": subject, "error": str(exc)}) + print(f" ERROR [{subject}]: {exc}", file=sys.stderr) + + print(f"Converted: {len(converted)}") + if errors: + print(f"Errors: {len(errors)}") + + # Build version chains across records sharing the same source_name + prev_map, root_map, latest_set, stats = build_version_chains(converted) + for record in converted: + key = make_version_key(record) + # Version string (normalize to "major.minor") + major, minor = parse_version_number(record["version"]) + record["metadata"]["version"] = f"{major}.{minor}" + # Previous version link + prev = prev_map.get(key) + if prev: + record["metadata"]["previous_version"] = prev + # Root version + root = root_map.get(key) + if root: + record["metadata"]["root_version"] = root + # Latest flag + record["metadata"]["latest"] = key in latest_set + print(f"Versioning: {stats['multi_present']} with multiple versions in index, " + f"{stats['implicit_versioned']} with prior versions superseded, " + f"{stats['prev_links']} previous_version links, " + f"{stats['latest_count']} marked latest") + + # Summary stats + with_doi = sum(1 for r in converted if r["doi"]) + with_ml = sum(1 for r in converted if r["metadata"].get("ml")) + with_org = sum(1 for r in converted if r["metadata"].get("organization")) + 
with_keywords = sum(1 for r in converted if r["metadata"].get("keywords")) + with_license = sum(1 for r in converted if r["metadata"].get("license")) + with_related = sum(1 for r in converted if r["metadata"].get("related_works")) + with_extensions = sum(1 for r in converted if r["metadata"].get("extensions")) + with_download = sum(1 for r in converted if r["metadata"].get("download_url")) + with_prev_ver = sum(1 for r in converted if r["metadata"].get("previous_version")) + with_root_ver = sum(1 for r in converted if r["metadata"].get("root_version")) + with_version = sum(1 for r in converted if r["metadata"].get("version")) + is_latest = sum(1 for r in converted if r["metadata"].get("latest")) + + print(f"\n--- Field coverage ---") + print(f" DOI: {with_doi}/{len(converted)}") + print(f" ML metadata: {with_ml}/{len(converted)}") + print(f" Organization: {with_org}/{len(converted)}") + print(f" Keywords: {with_keywords}/{len(converted)}") + print(f" License: {with_license}/{len(converted)}") + print(f" Related works: {with_related}/{len(converted)}") + print(f" Extensions: {with_extensions}/{len(converted)}") + print(f" Download URL: {with_download}/{len(converted)}") + print(f" Version: {with_version}/{len(converted)}") + print(f" Previous version: {with_prev_ver}/{len(converted)}") + print(f" Root version: {with_root_ver}/{len(converted)}") + print(f" Latest: {is_latest}/{len(converted)}") + + output = { + "source": args.input, + "count": len(converted), + "errors": errors, + "records": converted, + } + + with open(args.output, "w") as f: + json.dump(output, f, indent=2, default=str) + + print(f"\nSaved to {args.output}") + + +if __name__ == "__main__": + main() diff --git a/aws/v2/scripts/extract_mdf_production_datasets.py b/aws/v2/scripts/extract_mdf_production_datasets.py new file mode 100644 index 0000000..81ce7e7 --- /dev/null +++ b/aws/v2/scripts/extract_mdf_production_datasets.py @@ -0,0 +1,202 @@ +#!/usr/bin/env python3 +"""Extract all dataset entries 
from the MDF production Globus Search index. + +Reads all resource_type=dataset entries from the MDF production index +and saves the raw GMetaEntry data to a JSON file. This is READ-ONLY — +nothing is written back to the production index. + +The production index is: 1a57bbe5-5272-477f-9d31-343b8258b7a5 + +Usage: + # Interactive Globus login (opens browser) + python cs/aws/v2/scripts/extract_mdf_production_datasets.py + + # Save to a specific file + python cs/aws/v2/scripts/extract_mdf_production_datasets.py -o datasets.json + + # Limit results (for testing) + python cs/aws/v2/scripts/extract_mdf_production_datasets.py --limit 50 + + # Just print count, don't save + python cs/aws/v2/scripts/extract_mdf_production_datasets.py --count-only +""" + +import argparse +import json +import sys +import time + +MDF_PRODUCTION_INDEX = "1a57bbe5-5272-477f-9d31-343b8258b7a5" +NATIVE_APP_CLIENT_ID = "074cebcc-19ad-4332-bbf2-78402291b659" +SEARCH_SCOPE = "urn:globus:auth:scope:search.api.globus.org:all" + +# Globus Search max limit per request +PAGE_SIZE = 100 + + +def authenticate(): + """Authenticate via interactive Globus OAuth login. Returns a SearchClient.""" + import globus_sdk + + client = globus_sdk.NativeAppAuthClient(NATIVE_APP_CLIENT_ID) + client.oauth2_start_flow(requested_scopes=SEARCH_SCOPE) + + authorize_url = client.oauth2_get_authorize_url() + print(f"Go to this URL and login:\n\n {authorize_url}\n") + auth_code = input("Paste the authorization code here: ").strip() + + token_response = client.oauth2_exchange_code_for_tokens(auth_code) + search_token_data = token_response.by_resource_server.get("search.api.globus.org") + if not search_token_data: + print("ERROR: No search token in response. 
Check app scopes.") + sys.exit(1) + + access_token = ( + search_token_data.get("access_token") + if isinstance(search_token_data, dict) + else getattr(search_token_data, "access_token", None) + ) + if not access_token: + print("ERROR: access_token is None") + sys.exit(1) + + authorizer = globus_sdk.AccessTokenAuthorizer(access_token) + return globus_sdk.SearchClient(authorizer=authorizer) + + +def fetch_all_datasets(search_client, limit=None): + """Fetch all resource_type=dataset entries from the production index. + + Uses offset-based pagination to walk through all results. + Returns (all_gmeta, total): the list of raw gmeta entries fetched (each + with subject, content, etc.) and the total match count reported by the + index. + """ + query = 'mdf.resource_type:"dataset"' + offset = 0 + all_gmeta = [] + total = None + + while True: + fetch_limit = PAGE_SIZE + if limit is not None: + remaining = limit - len(all_gmeta) + if remaining <= 0: + break + fetch_limit = min(PAGE_SIZE, remaining) + + print(f" Fetching offset={offset}, limit={fetch_limit} ...", end=" ", flush=True) + + result = search_client.search( + MDF_PRODUCTION_INDEX, + query, + limit=fetch_limit, + offset=offset, + advanced=True, + ) + data = result.data if hasattr(result, "data") else result + + if total is None: + total = data.get("total", 0) + print(f"(total in index: {total})") + else: + print() + + gmeta_list = data.get("gmeta", []) + if not gmeta_list: + break + + all_gmeta.extend(gmeta_list) + print(f" ... 
got {len(gmeta_list)} entries (cumulative: {len(all_gmeta)})") + + offset += len(gmeta_list) + + # Stop if we've fetched everything + if offset >= total: + break + + # Be polite to the API + time.sleep(0.2) + + return all_gmeta, total + + +def summarize(gmeta_list): + """Print a summary of the fetched datasets.""" + print(f"\nTotal GMetaEntries fetched: {len(gmeta_list)}") + + titles = [] + source_ids = [] + orgs = set() + for entry in gmeta_list: + for content in entry.get("content", []): + dc = content.get("dc", {}) + mdf = content.get("mdf", {}) + title = dc.get("title") or dc.get("titles", [{}])[0].get("title", "?") if dc else "?" + titles.append(title) + sid = mdf.get("source_id", "?") + source_ids.append(sid) + for org in mdf.get("organizations", []): + orgs.add(org) + + print(f"Unique organizations: {sorted(orgs)}") + print(f"\nFirst 10 datasets:") + for i, (title, sid) in enumerate(zip(titles[:10], source_ids[:10])): + print(f" {i+1}. [{sid}] {title}") + if len(titles) > 10: + print(f" ... and {len(titles) - 10} more") + + +def main(): + parser = argparse.ArgumentParser( + description="Extract dataset entries from the MDF production Globus Search index." 
+ ) + parser.add_argument( + "-o", "--output", + default="mdf_production_datasets.json", + help="Output JSON file (default: mdf_production_datasets.json)", + ) + parser.add_argument( + "--limit", + type=int, + default=None, + help="Max number of entries to fetch (default: all)", + ) + parser.add_argument( + "--count-only", + action="store_true", + help="Just count entries, don't save to file", + ) + args = parser.parse_args() + + print(f"MDF Production Index: {MDF_PRODUCTION_INDEX}") + print(f"Query: resource_type=dataset\n") + + print("Authenticating with Globus...") + search_client = authenticate() + print("Authenticated.\n") + + print("Fetching datasets...") + gmeta_list, total = fetch_all_datasets(search_client, limit=args.limit) + + summarize(gmeta_list) + + if args.count_only: + print("\n--count-only: skipping file save.") + return + + # Save raw gmeta entries + output = { + "source_index": MDF_PRODUCTION_INDEX, + "query": 'mdf.resource_type:"dataset"', + "total_in_index": total, + "fetched_count": len(gmeta_list), + "gmeta": gmeta_list, + } + + with open(args.output, "w") as f: + json.dump(output, f, indent=2, default=str) + + print(f"\nSaved {len(gmeta_list)} entries to {args.output}") + + +if __name__ == "__main__": + main() diff --git a/aws/v2/scripts/grant_search_index_role.py b/aws/v2/scripts/grant_search_index_role.py new file mode 100755 index 0000000..503c6c2 --- /dev/null +++ b/aws/v2/scripts/grant_search_index_role.py @@ -0,0 +1,149 @@ +#!/usr/bin/env python3 +"""Grant writer role on a Globus Search index to the MDF confidential app. + +Run once to allow the backend's confidential app to ingest entries into +the search index. Authenticates interactively as **you** (the index owner), +then creates a writer role for the app identity. + +Usage: + # Auto-resolve client ID from AWS SSM (/mdf/globus-client-id): + python grant_search_index_role.py + + # Or pass client ID explicitly: + python grant_search_index_role.py --client-id 86e4853e-... 
+ + # Override index UUID (default: test index): + python grant_search_index_role.py --index ab19b80b-... +""" + +import argparse +import subprocess +import sys + +import globus_sdk + +# MDF v2 test search index +DEFAULT_INDEX_UUID = "ab19b80b-0887-4337-b9f8-b8cc7feb1fdc" + +# Native app client for interactive login (same one used by mdf_agent) +NATIVE_APP_CLIENT_ID = "074cebcc-19ad-4332-bbf2-78402291b659" + + +def resolve_client_id_from_ssm() -> str | None: + """Try to read the confidential app client ID from AWS SSM.""" + try: + result = subprocess.run( + [ + "aws", "ssm", "get-parameter", + "--name", "/mdf/globus-client-id", + "--region", "us-east-1", + "--query", "Parameter.Value", + "--output", "text", + ], + capture_output=True, + text=True, + timeout=10, + ) + if result.returncode == 0 and result.stdout.strip(): + return result.stdout.strip() + except (FileNotFoundError, subprocess.TimeoutExpired): + pass + return None + + +def interactive_login() -> globus_sdk.SearchClient: + """Login interactively and return a SearchClient authorized as the user.""" + client = globus_sdk.NativeAppAuthClient(NATIVE_APP_CLIENT_ID) + client.oauth2_start_flow( + requested_scopes=["urn:globus:auth:scope:search.api.globus.org:all"], + refresh_tokens=False, + ) + + authorize_url = client.oauth2_get_authorize_url() + print(f"\nOpen this URL in your browser:\n {authorize_url}\n") + auth_code = input("Paste the authorization code here: ").strip() + + token_response = client.oauth2_exchange_code_for_tokens(auth_code) + search_token = token_response.by_resource_server["search.api.globus.org"] + access_token = search_token["access_token"] + + authorizer = globus_sdk.AccessTokenAuthorizer(access_token) + return globus_sdk.SearchClient(authorizer=authorizer) + + +def main(): + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("--index", default=DEFAULT_INDEX_UUID, help="Globus Search index UUID") + 
parser.add_argument("--client-id", default=None, help="Confidential app client ID (auto-resolved from SSM if omitted)") + parser.add_argument("--role", default="writer", choices=["writer", "admin"], help="Role to grant (default: writer)") + parser.add_argument("--list-only", action="store_true", help="Just list current roles, don't create") + args = parser.parse_args() + + # Resolve the confidential app's client ID + app_client_id = args.client_id + if not app_client_id: + print("Resolving confidential app client ID from SSM...") + app_client_id = resolve_client_id_from_ssm() + if app_client_id: + print(f" Found: {app_client_id}") + else: + print(" Could not resolve from SSM. Pass --client-id explicitly.") + sys.exit(1) + + # Login interactively as the index owner + print("\nAuthenticating as index owner (interactive login)...") + search_client = interactive_login() + + # List current roles + print(f"\nCurrent roles on index {args.index}:") + try: + roles = search_client.get_role_list(args.index) + role_list = roles.get("role_list", []) if hasattr(roles, "get") else roles.data.get("role_list", []) + if role_list: + for r in role_list: + print(f" {r.get('role', '?'):10s} {r.get('principal', '?')}") + else: + print(" (no roles set)") + except Exception as exc: + print(f" Error listing roles: {exc}") + role_list = [] + + if args.list_only: + return + + # Check if role already exists + app_principal = f"{app_client_id}@clients.auth.globus.org" + existing = [r for r in role_list if r.get("principal") == app_principal and r.get("role") == args.role] + if existing: + print(f"\nRole '{args.role}' already granted to {app_principal}. 
Nothing to do.") + return + + # Create the role + print(f"\nGranting '{args.role}' role to {app_principal} on index {args.index}...") + try: + result = search_client.create_role( + args.index, + data={ + "principal": app_principal, + "principal_type": "identity", + "role": args.role, + }, + ) + print(f" Success: {result.data if hasattr(result, 'data') else result}") + except globus_sdk.GlobusAPIError as exc: + print(f" API error: {exc.message} (code={exc.code}, status={exc.http_status})") + sys.exit(1) + + # Verify + print("\nVerifying roles...") + roles = search_client.get_role_list(args.index) + role_list = roles.get("role_list", []) if hasattr(roles, "get") else roles.data.get("role_list", []) + for r in role_list: + marker = " <-- NEW" if r.get("principal") == app_principal else "" + print(f" {r.get('role', '?'):10s} {r.get('principal', '?')}{marker}") + + print("\nDone.") + + +if __name__ == "__main__": + main() diff --git a/aws/v2/scripts/ingest_converted_datasets.py b/aws/v2/scripts/ingest_converted_datasets.py new file mode 100644 index 0000000..937b400 --- /dev/null +++ b/aws/v2/scripts/ingest_converted_datasets.py @@ -0,0 +1,393 @@ +#!/usr/bin/env python3 +"""Ingest converted MDF production datasets into DynamoDB and Globus Search. + +Reads converted_datasets.json (output of convert_production_datasets.py), +builds proper v2 submission records, and loads them into both the submission +store and the search index. + +Supports: + - Auto-resolve config from a deployed stack (--env prod) + - Local mode (SQLite + MockSearch) for testing + - Dry-run mode to validate without writing + - Resume: skips records that already exist in the store + +Usage: + # Dry run — validate all records, write nothing + cd cs/aws + PYTHONPATH=. python v2/scripts/ingest_converted_datasets.py --dry-run + + # Local SQLite + mock search + PYTHONPATH=. 
STORE_BACKEND=sqlite USE_MOCK_SEARCH=true \ + python v2/scripts/ingest_converted_datasets.py + + # Production — auto-resolve config from deployed stack + PYTHONPATH=. python v2/scripts/ingest_converted_datasets.py --env prod + + # Staging — DynamoDB only, skip search + PYTHONPATH=. python v2/scripts/ingest_converted_datasets.py --env staging --skip-search +""" + +import argparse +import json +import os +import sys +import time +from typing import Any, Dict, List, Tuple + +# Allow imports from aws/ root +sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..")) + +# v2 modules are imported lazily (after env vars are resolved) because +# v2.config reads DYNAMO_SUBMISSIONS_TABLE etc. at import time. + +# Environment variables the ingestion script needs from the Lambda config. +_RELEVANT_ENV_KEYS = { + "STORE_BACKEND", + "DYNAMO_SUBMISSIONS_TABLE", + "SEARCH_INDEX_UUID", + "TEST_SEARCH_INDEX_UUID", + "USE_MOCK_SEARCH", + "GLOBUS_CLIENT_ID", + "GLOBUS_CLIENT_SECRET", + "AWS_REGION", +} + +REGION = "us-east-1" + + +def resolve_env_from_stack(env: str) -> Dict[str, str]: + """Read the deployed Lambda's environment variables from CloudFormation. + + Looks up the ApiFunction in stack mdf-connect-v2-{env}, reads its + resolved environment, and returns the subset relevant to ingestion. + This gives us DYNAMO_SUBMISSIONS_TABLE, SEARCH_INDEX_UUID, + GLOBUS_CLIENT_ID, GLOBUS_CLIENT_SECRET, etc. — exactly matching + what the live backend uses. + """ + import boto3 + + stack_name = f"mdf-connect-v2-{env}" + + cf = boto3.client("cloudformation", region_name=REGION) + try: + resp = cf.describe_stack_resources(StackName=stack_name) + except cf.exceptions.ClientError as exc: + raise SystemExit( + f"Stack '{stack_name}' not found in {REGION}. 
" + f"Deploy first with: ./deploy.sh {env}" + ) from exc + + api_func = None + for r in resp["StackResources"]: + if r["LogicalResourceId"] == "ApiFunction": + api_func = r["PhysicalResourceId"] + break + if not api_func: + raise SystemExit(f"ApiFunction not found in stack {stack_name}") + + lam = boto3.client("lambda", region_name=REGION) + config = lam.get_function_configuration(FunctionName=api_func) + lambda_env = config.get("Environment", {}).get("Variables", {}) + + resolved = {} + for key in _RELEVANT_ENV_KEYS: + if key in lambda_env: + resolved[key] = lambda_env[key] + + # The Lambda always runs with dynamo store + resolved.setdefault("STORE_BACKEND", "dynamo") + + return resolved + + +def apply_env(resolved: Dict[str, str]) -> None: + """Set resolved config as environment variables (without overriding explicit overrides).""" + for key, value in resolved.items(): + if key not in os.environ: + os.environ[key] = value + + +def build_submission_record(converted: Dict[str, Any]) -> Dict[str, Any]: + """Transform a converted_datasets.json record into a v2 submission record. 
+ + The converted record has: + source_id, source_name, version (int), ingest_date, doi, endpoint_path, + metadata (flat v2 dict) + + The submission record needs: + source_id, version (str), versioned_source_id, user_id, user_email, + organization, status, dataset_mdata (JSON str), schema_version, test, + created_at, updated_at, published_at, doi, dataset_doi + """ + metadata = dict(converted["metadata"]) + + # Populate data_sources from endpoint_path (not present in converted metadata) + endpoint_path = converted.get("endpoint_path") + if endpoint_path and not metadata.get("data_sources"): + metadata["data_sources"] = [endpoint_path] + + version_str = metadata.get("version", "1.0") + doi = converted.get("doi") or None + ingest_date = converted.get("ingest_date", "") + + organization = metadata.get("organization") or "" + + record = { + "source_id": converted["source_id"], + "version": version_str, + "versioned_source_id": converted["source_id"], + "user_id": "v1-migration", + "status": "published", + "dataset_mdata": json.dumps(metadata), + "schema_version": "2", + "test": False, + "created_at": ingest_date, + "updated_at": ingest_date, + "published_at": ingest_date, + } + + # DynamoDB rejects empty strings for GSI key attributes. + # organization is the partition key of the org-submissions GSI, + # so omit it when empty (item won't appear in that index). + if organization: + record["organization"] = organization + + if doi: + record["doi"] = doi + record["dataset_doi"] = doi + + return record + + +def validate_record(converted: Dict[str, Any]) -> Tuple[bool, str]: + """Validate that a converted record can parse into DatasetMetadata. + + Returns (ok, error_message). 
+ """ + from v2.metadata import DatasetMetadata + + try: + metadata = dict(converted["metadata"]) + if converted.get("endpoint_path") and not metadata.get("data_sources"): + metadata["data_sources"] = [converted["endpoint_path"]] + DatasetMetadata.model_validate(metadata) + return True, "" + except Exception as exc: + return False, str(exc) + + +SEARCH_BATCH_SIZE = 100 # GMetaList entries per request; well under 10MB limit + + +def ingest_records( + records: List[Dict[str, Any]], + *, + dry_run: bool = False, + skip_search: bool = False, + skip_store: bool = False, +) -> Dict[str, Any]: + """Ingest converted records into the submission store and search index. + + Store ingest is record-by-record (skip duplicates). + Search ingest uses GMetaList batching for speed (~10x faster than individual requests). + + Returns a summary dict with counts. + """ + from v2.store import get_store + from v2.search_client import get_search_client + + store = None if dry_run or skip_store else get_store() + search = None if dry_run or skip_search else get_search_client() + + stats = { + "total": len(records), + "validated": 0, + "validation_errors": [], + "store_inserted": 0, + "store_skipped": 0, + "store_errors": [], + "search_ingested": 0, + "search_errors": [], + } + + t0 = time.time() + + # ── Validate + Store (record-by-record) ── + valid_submissions: List[Tuple[str, Dict[str, Any]]] = [] + + for i, converted in enumerate(records): + source_id = converted.get("source_id", f"unknown-{i}") + version_str = converted.get("metadata", {}).get("version", "1.0") + + ok, err = validate_record(converted) + if not ok: + stats["validation_errors"].append({"source_id": source_id, "error": err}) + continue + stats["validated"] += 1 + + if dry_run: + continue + + submission = build_submission_record(converted) + valid_submissions.append((source_id, submission)) + + if store: + try: + existing = store.get_submission(source_id, version_str) + if existing: + stats["store_skipped"] += 1 + else: + 
store.upsert_submission(submission) + stats["store_inserted"] += 1 + except Exception as exc: + stats["store_errors"].append({"source_id": source_id, "error": str(exc)}) + + if (i + 1) % 100 == 0: + elapsed = time.time() - t0 + print(f" [validate+store {i+1}/{len(records)}] {elapsed:.1f}s elapsed") + + # ── Search ingest: GMetaList batches ── + if search and valid_submissions: + all_subs = [sub for _, sub in valid_submissions] + total_batches = (len(all_subs) + SEARCH_BATCH_SIZE - 1) // SEARCH_BATCH_SIZE + print(f"\n Ingesting {len(all_subs)} records to Globus Search " + f"({total_batches} batch{'es' if total_batches != 1 else ''} of {SEARCH_BATCH_SIZE})...") + + result = search.batch_ingest(all_subs, batch_size=SEARCH_BATCH_SIZE) + stats["search_ingested"] = result.get("ingested", 0) + stats["search_errors"] = result.get("errors", []) + + if result.get("task_ids"): + print(f" Task IDs: {result['task_ids']}") + + stats["elapsed_seconds"] = round(time.time() - t0, 2) + return stats + + +def main(): + parser = argparse.ArgumentParser( + description="Ingest converted MDF datasets into DynamoDB and Globus Search." 
+ ) + parser.add_argument( + "--env", + choices=["dev", "staging", "prod"], + default=None, + help="Auto-resolve config from a deployed CloudFormation stack " + "(reads Lambda env vars for table name, search index, Globus creds)", + ) + parser.add_argument( + "-i", "--input", + default=os.path.join(os.path.dirname(__file__), "..", "..", "..", "..", "converted_datasets.json"), + help="Path to converted_datasets.json", + ) + parser.add_argument( + "--dry-run", + action="store_true", + help="Validate all records without writing to any store", + ) + parser.add_argument( + "--skip-search", + action="store_true", + help="Skip Globus Search ingest (DynamoDB only)", + ) + parser.add_argument( + "--skip-store", + action="store_true", + help="Skip DynamoDB/SQLite store (search only)", + ) + parser.add_argument( + "--limit", + type=int, + default=0, + help="Only process first N records (0 = all)", + ) + args = parser.parse_args() + + # ── Resolve config from deployed stack ── + if args.env: + print(f"Resolving config from stack mdf-connect-v2-{args.env}...") + resolved = resolve_env_from_stack(args.env) + apply_env(resolved) + print(f" STORE_BACKEND: {resolved.get('STORE_BACKEND', '(not set)')}") + print(f" DYNAMO_SUBMISSIONS_TABLE: {resolved.get('DYNAMO_SUBMISSIONS_TABLE', '(not set)')}") + print(f" SEARCH_INDEX_UUID: {resolved.get('SEARCH_INDEX_UUID', '(not set)')}") + print(f" USE_MOCK_SEARCH: {resolved.get('USE_MOCK_SEARCH', '(not set)')}") + has_globus = bool(resolved.get("GLOBUS_CLIENT_ID")) + print(f" GLOBUS_CLIENT_ID: {'***' if has_globus else '(not set)'}") + print(f" GLOBUS_CLIENT_SECRET: {'***' if resolved.get('GLOBUS_CLIENT_SECRET') else '(not set)'}") + print() + + input_path = os.path.abspath(args.input) + print(f"Loading {input_path}") + + with open(input_path) as f: + data = json.load(f) + + records = data.get("records", []) + print(f"Loaded {len(records)} records") + + if args.limit > 0: + records = records[:args.limit] + print(f" Limited to first 
{args.limit}") + + mode_parts = [] + if args.dry_run: + mode_parts.append("DRY RUN") + else: + if not args.skip_store: + backend = os.environ.get("STORE_BACKEND", "dynamo") + mode_parts.append(f"store={backend}") + if not args.skip_search: + use_mock = os.environ.get("USE_MOCK_SEARCH", "true").lower() == "true" + mode_parts.append(f"search={'mock' if use_mock else 'globus'}") + print(f"Mode: {' + '.join(mode_parts) or 'no-op'}\n") + + stats = ingest_records( + records, + dry_run=args.dry_run, + skip_search=args.skip_search, + skip_store=args.skip_store, + ) + + # ── Report ── + print(f"\n{'='*50}") + print(f"Results ({stats['elapsed_seconds']}s)") + print(f"{'='*50}") + print(f" Total records: {stats['total']}") + print(f" Validated: {stats['validated']}") + + if stats["validation_errors"]: + print(f" Validation errors: {len(stats['validation_errors'])}") + for e in stats["validation_errors"][:5]: + print(f" {e['source_id']}: {e['error'][:100]}") + if len(stats["validation_errors"]) > 5: + print(f" ... 
and {len(stats['validation_errors']) - 5} more") + + if not args.dry_run: + if not args.skip_store: + print(f" Store inserted: {stats['store_inserted']}") + print(f" Store skipped: {stats['store_skipped']} (already exist)") + if stats["store_errors"]: + print(f" Store errors: {len(stats['store_errors'])}") + for e in stats["store_errors"][:3]: + print(f" {e['source_id']}: {e['error'][:100]}") + + if not args.skip_search: + print(f" Search ingested: {stats['search_ingested']}") + if stats["search_errors"]: + print(f" Search errors: {len(stats['search_errors'])}") + for e in stats["search_errors"][:3]: + print(f" {e['source_id']}: {e['error'][:100]}") + + # Exit with error code if any failures + has_errors = ( + stats["validation_errors"] + or stats.get("store_errors") + or stats.get("search_errors") + ) + if has_errors: + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/aws/v2/scripts/setup_datacite_ssm.sh b/aws/v2/scripts/setup_datacite_ssm.sh new file mode 100755 index 0000000..e0f60de --- /dev/null +++ b/aws/v2/scripts/setup_datacite_ssm.sh @@ -0,0 +1,68 @@ +#!/bin/bash +# Create DataCite SSM parameters for production. +# +# The deploy script (deploy.sh) already reads these automatically during +# staging/prod deployments and passes them as CloudFormation parameter overrides. +# +# For staging: not needed — test credentials are in samconfig.toml. +# For prod: run this script once with real DataCite production credentials. +# +# Usage: +# ./setup_datacite_ssm.sh +# +# You will be prompted for each value. The password is stored as SecureString. + +set -e + +REGION="us-east-1" + +echo "=== MDF DataCite SSM Parameter Setup ===" +echo "" +echo "These parameters are read by deploy.sh during staging/prod deployments." 
+echo "Region: $REGION" +echo "" + +read -p "DataCite repository ID (e.g., MDF.MDF): " DC_USER +read -s -p "DataCite password: " DC_PASS +echo "" +read -p "DataCite API URL [https://api.datacite.org]: " DC_URL +DC_URL=${DC_URL:-https://api.datacite.org} +read -p "DataCite DOI prefix (e.g., 10.18126): " DC_PREFIX + +echo "" +echo "Creating SSM parameters..." + +aws ssm put-parameter \ + --name "/mdf/datacite-username" \ + --type "String" \ + --value "$DC_USER" \ + --region "$REGION" \ + --overwrite + +aws ssm put-parameter \ + --name "/mdf/datacite-password" \ + --type "SecureString" \ + --value "$DC_PASS" \ + --region "$REGION" \ + --overwrite + +aws ssm put-parameter \ + --name "/mdf/datacite-api-url" \ + --type "String" \ + --value "$DC_URL" \ + --region "$REGION" \ + --overwrite + +aws ssm put-parameter \ + --name "/mdf/datacite-prefix" \ + --type "String" \ + --value "$DC_PREFIX" \ + --region "$REGION" \ + --overwrite + +echo "" +echo "Done. Verify with:" +echo " aws ssm get-parameter --name /mdf/datacite-username --region $REGION --query Parameter.Value --output text" +echo " aws ssm get-parameter --name /mdf/datacite-prefix --region $REGION --query Parameter.Value --output text" +echo "" +echo "deploy.sh will automatically pick these up on next staging/prod deployment." diff --git a/aws/v2/scripts/test_search_token.py b/aws/v2/scripts/test_search_token.py new file mode 100644 index 0000000..c6dbf18 --- /dev/null +++ b/aws/v2/scripts/test_search_token.py @@ -0,0 +1,97 @@ +#!/usr/bin/env python3 +"""Test whether the confidential app can obtain a Globus Search token. + +This verifies that: +1. GLOBUS_CLIENT_ID and GLOBUS_CLIENT_SECRET are set (reads from SSM if not) +2. The app can obtain an access token for search.api.globus.org +3. 
The token can be used to query the search index + +Usage: + python cs/aws/v2/scripts/test_search_token.py +""" + +import json +import os +import subprocess +import sys + + +def get_from_ssm(name: str, region: str = "us-east-1") -> str: + result = subprocess.run( + ["aws", "ssm", "get-parameter", "--name", name, + "--with-decryption", "--region", region, + "--query", "Parameter.Value", "--output", "text"], + capture_output=True, text=True, + ) + return result.stdout.strip() if result.returncode == 0 else "" + + +def main(): + client_id = os.environ.get("GLOBUS_CLIENT_ID") or get_from_ssm("/mdf/globus-client-id") + client_secret = os.environ.get("GLOBUS_CLIENT_SECRET") or get_from_ssm("/mdf/globus-client-secret") + + if not client_id or not client_secret: + print("ERROR: Could not resolve GLOBUS_CLIENT_ID / GLOBUS_CLIENT_SECRET") + sys.exit(1) + + print(f"Client ID: {client_id}") + print("Secret: ***")  # fixed-width mask; do not reveal the secret's length + + import globus_sdk + + cc = globus_sdk.ConfidentialAppAuthClient(client_id, client_secret) + + print("\nRequesting token for search.api.globus.org...") + try: + token_response = cc.oauth2_client_credentials_tokens( + requested_scopes="urn:globus:auth:scope:search.api.globus.org:all" + ) + except Exception as exc: + print(f"ERROR: Token request failed: {exc}") + print("\nThis likely means the Globus app registration does not have the") + print("'urn:globus:auth:scope:search.api.globus.org:all' scope configured.") + print("Go to https://app.globus.org/settings/developers and add it.") + sys.exit(1) + + print(f"by_resource_server keys: {list(token_response.by_resource_server.keys())}") + + search_token_data = token_response.by_resource_server.get("search.api.globus.org") + if not search_token_data: + print("ERROR: No search.api.globus.org entry in token response") + print("The app likely doesn't have the search scope configured.") + sys.exit(1) + + access_token = ( + search_token_data.get("access_token") + if isinstance(search_token_data, dict) 
+ else getattr(search_token_data, "access_token", None) + ) + if not access_token: + print("ERROR: access_token is None") + sys.exit(1) + + print(f"Access token: {access_token[:20]}...") + print("Token obtained successfully!\n") + + # Try a simple query + index_id = "ab19b80b-0887-4337-b9f8-b8cc7feb1fdc" + print(f"Testing search query on index {index_id}...") + authorizer = globus_sdk.AccessTokenAuthorizer(access_token) + sc = globus_sdk.SearchClient(authorizer=authorizer) + + try: + result = sc.search(index_id, "test", limit=5) + data = result.data if hasattr(result, "data") else result + print(f"Search returned {data.get('total', 0)} total results") + for gmeta in data.get("gmeta", []): + for c in gmeta.get("content", []): + print(f" - {c.get('dc', {}).get('title', '?')}") + print("\nSearch query works!") + except Exception as exc: + print(f"Search query failed: {exc}") + print("The token works but the search query failed — check index permissions.") + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/aws/v2/search.py b/aws/v2/search.py new file mode 100644 index 0000000..0ee51a9 --- /dev/null +++ b/aws/v2/search.py @@ -0,0 +1,274 @@ +"""Search for MDF v2. + +Provides full-text search across datasets and streams. +Tries Globus Search first (when configured), falls back to local DynamoDB scan. +Pure helper functions imported by the search router. 
+""" + +import json +import logging +import os +import re +from typing import Any, Dict, List, Optional + +from v2.metadata import parse_metadata +from v2.store import get_store +from v2.stream_store import get_stream_store + +logger = logging.getLogger(__name__) + + +def _env_int(name: str, default: int, minimum: int = 1) -> int: + value = os.environ.get(name) + if value is None: + return default + try: + parsed = int(value) + except (TypeError, ValueError): + return default + return max(minimum, parsed) + + +SEARCH_MAX_DATASET_SCAN = _env_int("SEARCH_MAX_DATASET_SCAN", 1000) +SEARCH_MAX_STREAM_SCAN = _env_int("SEARCH_MAX_STREAM_SCAN", 2000) + + +def _is_searchable_dataset(record: Dict[str, Any]) -> bool: + """Only published datasets are eligible for public search fallback.""" + return record.get("status") == "published" + + +def _extract_searchable_text(record: Dict[str, Any]) -> str: + """Extract all searchable text from a submission record.""" + parts = [] + + # Basic fields + parts.append(record.get("source_id", "")) + parts.append(record.get("organization", "")) + + # Parse metadata using the canonical parser + meta = parse_metadata(record) + + parts.append(meta.title) + for author in meta.authors: + parts.append(author.name) + if author.given_name: + parts.append(author.given_name) + if author.family_name: + parts.append(author.family_name) + parts.append(meta.publisher) + if meta.description: + parts.append(meta.description) + parts.extend(meta.keywords) + parts.extend(meta.methods) + if meta.facility: + parts.append(meta.facility) + parts.extend(meta.fields_of_science) + parts.extend(meta.tags) + parts.extend(meta.domains) + if meta.external_source: + parts.append(meta.external_source) + + # ML metadata + if meta.ml: + parts.append(meta.ml.data_format) + parts.extend(meta.ml.task_type) + parts.extend(meta.ml.domain) + if meta.ml.short_name: + parts.append(meta.ml.short_name) + for key in meta.ml.keys: + parts.append(key.name) + if key.description: + 
parts.append(key.description) + + # Extensions (flatten for full-text) + if meta.extensions: + parts.append(json.dumps(meta.extensions)) + + return " ".join(str(p) for p in parts if p) + + +def _extract_stream_text(stream: Dict[str, Any]) -> str: + """Extract searchable text from a stream record.""" + parts = [] + parts.append(stream.get("stream_id", "")) + parts.append(stream.get("title", "")) + parts.append(stream.get("lab_id", "")) + parts.append(stream.get("organization", "")) + + metadata = stream.get("metadata") or {} + if isinstance(metadata, str): + try: + metadata = json.loads(metadata) + except Exception: + metadata = {} + if metadata is None: + metadata = {} + + parts.append(metadata.get("run_id", "")) + parts.append(metadata.get("facility", "")) + parts.append(metadata.get("operator", "")) + if metadata.get("instruments"): + instr = metadata["instruments"] + parts.extend(instr if isinstance(instr, list) else [str(instr)]) + + return " ".join(str(p) for p in parts if p) + + +def _simple_match(text: str, query: str) -> float: + """Simple relevance scoring - count query term matches.""" + text_lower = text.lower() + query_terms = re.split(r'\s+', query.lower().strip()) + + score = 0.0 + for term in query_terms: + if term in text_lower: + score += text_lower.count(term) + + return score + + +def _format_dataset_result(record: Dict[str, Any], score: float) -> Dict[str, Any]: + """Format a dataset record for search results.""" + meta = parse_metadata(record) + + description = meta.description or "" + return { + "type": "dataset", + "source_id": record.get("source_id"), + "version": record.get("version"), + "title": meta.title, + "authors": [a.name for a in meta.authors], + "keywords": meta.keywords, + "description": description[:300] if len(description) > 300 else description, + "publication_year": meta.publication_year, + "organization": record.get("organization"), + "domains": meta.domains, + "doi": record.get("doi"), + "license": meta.license.identifier or 
meta.license.name if meta.license else None, + "size_bytes": record.get("total_bytes"), + "file_count": record.get("file_count"), + "status": record.get("status"), + "score": score, + } + + +def _format_stream_result(stream: Dict[str, Any], score: float) -> Dict[str, Any]: + """Format a stream record for search results.""" + return { + "type": "stream", + "stream_id": stream.get("stream_id"), + "title": stream.get("title"), + "lab_id": stream.get("lab_id"), + "status": stream.get("status"), + "file_count": stream.get("file_count", 0), + "created_at": stream.get("created_at"), + "score": score, + } + + +def search_datasets( + query: str, + limit: int = 20, + offset: int = 0, + filters: Optional[Dict[str, list]] = None, +) -> Dict[str, Any]: + """Search across all datasets, returning results and facets. + + Tries Globus Search faceted_search first. Falls back to local DynamoDB + scan if Globus Search is not configured or the query fails. The fallback + only returns published datasets and does not support facets. + """ + # Try Globus Search first (faceted) + try: + from v2.search_client import get_search_client + client = get_search_client() + result = client.faceted_search(query, limit=limit, offset=offset, filters=filters) + if result.get("success"): + # For mock clients (no data ingested), fall through to DynamoDB + # so dev/test can search SQLite records. For real Globus Search, + # trust the result even when empty (e.g. offset past all results). 
+ if result.get("results") or not result.get("mock"): + return { + "results": result.get("results", []), + "total": result.get("total", 0), + "facets": result.get("facets", {}), + } + else: + logger.warning("Globus Search faceted_search failed: %s", result.get("error")) + except Exception: + logger.warning("Globus Search unavailable, falling back to local scan", exc_info=True) + + # Fallback: local DynamoDB scan (no faceting) + store = get_store() + all_submissions = store.list_all(limit=max(limit + offset, SEARCH_MAX_DATASET_SCAN)) + + results = [] + for record in all_submissions: + if not _is_searchable_dataset(record): + continue + text = _extract_searchable_text(record) + score = _simple_match(text, query) + if score > 0: + results.append((score, record)) + + results.sort(key=lambda x: x[0], reverse=True) + total = len(results) + page = results[offset:offset + limit] + + return { + "results": [_format_dataset_result(r, s) for s, r in page], + "total": total, + "facets": {}, + } + + +def search_streams(query: str, limit: int = 20) -> List[Dict[str, Any]]: + """Search across all streams.""" + stream_store = get_stream_store() + all_streams = stream_store.list_all(limit=max(limit, SEARCH_MAX_STREAM_SCAN)) + + results = [] + for stream in all_streams: + text = _extract_stream_text(stream) + score = _simple_match(text, query) + if score > 0: + results.append((score, stream)) + + results.sort(key=lambda x: x[0], reverse=True) + + return [_format_stream_result(r, s) for s, r in results[:limit]] + + +def search_all( + query: str, + include_datasets: bool = True, + include_streams: bool = True, + limit: int = 20, + offset: int = 0, + filters: Optional[Dict[str, list]] = None, +) -> Dict[str, Any]: + """Search across datasets and streams, with faceted results.""" + results = [] + facets: Dict[str, Any] = {} + total = 0 + + if include_datasets: + ds = search_datasets(query, limit=limit, offset=offset, filters=filters) + results.extend(ds["results"]) + facets = 
ds.get("facets", {}) + total += ds.get("total", 0) + + if include_streams: + streams = search_streams(query, limit=limit) + results.extend(streams) + + results.sort(key=lambda x: x.get("score", 0), reverse=True) + + return { + "query": query, + "total": total, + "offset": offset, + "results": results[:limit], + "facets": facets, + } diff --git a/aws/v2/search_client.py b/aws/v2/search_client.py new file mode 100644 index 0000000..bd1a688 --- /dev/null +++ b/aws/v2/search_client.py @@ -0,0 +1,569 @@ +"""Globus Search client for MDF v2. + +Provides search ingest and query capabilities via Globus Search indexes. +Falls back to MockGlobusSearchClient when credentials or indexes are not configured. +""" + +import logging +import os +from collections import Counter +from datetime import datetime, timezone +from typing import Any, Dict, List, Optional + +logger = logging.getLogger(__name__) + +MDF_DETAIL_BASE = "https://materialsdatafacility.org/detail" + +DEFAULT_FACETS = [ + {"name": "Year", "field_name": "dc.year", "type": "terms", "size": 20}, + {"name": "Organization", "field_name": "mdf.organization", "type": "terms", "size": 20}, + {"name": "Authors", "field_name": "dc.creators.name", "type": "terms", "size": 20}, + {"name": "Keywords", "field_name": "dc.subjects", "type": "terms", "size": 20}, + {"name": "Domains", "field_name": "mdf.domains", "type": "terms", "size": 20}, +] + + +class GlobusSearchClient: + """Wraps globus_sdk.SearchClient for MDF v2 search operations.""" + + def __init__(self, index_id: str, test_mode: bool = False): + self.index_id = index_id + self.test_mode = test_mode + self._client = None + + def _get_client(self): + if self._client is not None: + return self._client + + import globus_sdk + + client_id = os.environ.get("GLOBUS_CLIENT_ID") + client_secret = os.environ.get("GLOBUS_CLIENT_SECRET") + + if not client_id or not client_secret: + raise RuntimeError("GLOBUS_CLIENT_ID and GLOBUS_CLIENT_SECRET required for Globus Search") + + 
confidential_client = globus_sdk.ConfidentialAppAuthClient(client_id, client_secret) + token_response = confidential_client.oauth2_client_credentials_tokens( + requested_scopes="urn:globus:auth:scope:search.api.globus.org:all" + ) + search_token = token_response.by_resource_server.get("search.api.globus.org", {}) + access_token = search_token.get("access_token") if isinstance(search_token, dict) else getattr(search_token, "access_token", None) + if not access_token: + raise RuntimeError( + "Failed to obtain Globus Search access token. " + "Ensure the app has the 'urn:globus:auth:scope:search.api.globus.org:all' scope configured." + ) + + authorizer = globus_sdk.AccessTokenAuthorizer(access_token) + self._client = globus_sdk.SearchClient(authorizer=authorizer) + return self._client + + def build_gmeta_entry( + self, submission: Dict[str, Any], version_count: Optional[int] = None, + ) -> Dict[str, Any]: + """Build a GMetaEntry from a submission record.""" + from v2.metadata import parse_metadata + + source_id = submission.get("source_id", "unknown") + version = submission.get("version", "1.0") + meta = parse_metadata(submission) + + subject = f"{MDF_DETAIL_BASE}/{source_id}" + + acl = meta.acl or ["public"] + visible_to = ["public"] if "public" in acl else [f"urn:globus:auth:identity:{a}" for a in acl] + + # Extract data location from first data_source + data_sources = meta.data_sources or [] + location = data_sources[0] if data_sources else None + + mdf_block: Dict[str, Any] = { + "source_id": source_id, + "source_name": source_id.rsplit("-", 1)[0] if "-" in source_id else source_id, + "version": version, + "organization": submission.get("organization", ""), + "acl": acl, + "ingest_date": submission.get("created_at", datetime.now(timezone.utc).isoformat()), + } + + mdf_block["domains"] = meta.domains + + if meta.external: + mdf_block["external_source"] = meta.external.source + if meta.external.doi: + mdf_block["external_doi"] = meta.external.doi + if 
meta.external.url: + mdf_block["external_url"] = meta.external.url + + dataset_doi = submission.get("dataset_doi") + if dataset_doi: + mdf_block["dataset_doi"] = dataset_doi + if version_count is not None: + mdf_block["version_count"] = version_count + + # Versioning fields + mdf_block["latest"] = meta.latest + if meta.root_version: + mdf_block["root_version"] = meta.root_version + if meta.previous_version: + mdf_block["previous_version"] = meta.previous_version + if meta.version: + mdf_block["version"] = meta.version + + # Download URL + if meta.download_url: + mdf_block["download_url"] = meta.download_url + + content = { + "mdf": mdf_block, + "dc": { + "title": meta.title, + "creators": [{"name": a.name} for a in meta.authors], + "publisher": meta.publisher, + "year": meta.publication_year or datetime.now().year, + "description": meta.description or "", + "subjects": meta.keywords, + "license": meta.license.identifier or meta.license.name if meta.license else "", + }, + "data": { + "location": location, + "size_bytes": submission.get("total_bytes"), + "file_count": submission.get("file_count"), + }, + } + + # dc.doi = version-specific DOI if present, otherwise dataset DOI + doi = submission.get("doi") or submission.get("dataset_doi") + if doi: + content["dc"]["doi"] = doi + + return { + "subject": subject, + "visible_to": visible_to, + "content": content, + } + + def ingest(self, submission: Dict[str, Any], version_count: Optional[int] = None) -> Dict[str, Any]: + """Ingest a single submission into the Globus Search index.""" + client = self._get_client() + entry = self.build_gmeta_entry(submission, version_count=version_count) + + ingest_doc = { + "ingest_type": "GMetaEntry", + "ingest_data": entry, + } + + try: + result = client.ingest(self.index_id, ingest_doc) + return { + "success": True, + "task_id": getattr(result, "data", {}).get("task_id") if hasattr(result, "data") else str(result), + } + except Exception as exc: + logger.exception("Globus Search ingest 
failed for %s", submission.get("source_id")) + return {"success": False, "error": str(exc)} + + def batch_ingest( + self, submissions: List[Dict[str, Any]], batch_size: int = 100, + ) -> Dict[str, Any]: + """Ingest multiple submissions using GMetaList batches. + + Batches submissions into groups of batch_size and submits each as a + single GMetaList request (one task_id per batch). Much faster than + individual ingest() calls for bulk loading — 10 req/s rate limit and + 10MB per request apply; batch_size=100 stays well within the 10MB cap. + + Returns a summary dict with counts and any per-batch errors. + """ + client = self._get_client() + total = len(submissions) + ingested = 0 + errors = [] + task_ids = [] + + for batch_start in range(0, total, batch_size): + batch = submissions[batch_start:batch_start + batch_size] + gmeta = [] + for sub in batch: + try: + gmeta.append(self.build_gmeta_entry(sub)) + except Exception as exc: + errors.append({"source_id": sub.get("source_id"), "error": str(exc)}) + + if not gmeta: + continue + + ingest_doc = { + "ingest_type": "GMetaList", + "ingest_data": {"gmeta": gmeta}, + } + + try: + result = client.ingest(self.index_id, ingest_doc) + data = result.data if hasattr(result, "data") else {} + task_id = data.get("task_id") + if task_id: + task_ids.append(task_id) + ingested += len(gmeta) + except Exception as exc: + logger.exception( + "Globus Search batch ingest failed (batch %d-%d)", + batch_start, batch_start + len(batch) - 1, + ) + for sub in batch: + errors.append({"source_id": sub.get("source_id"), "error": str(exc)}) + + return { + "success": len(errors) == 0, + "total": total, + "ingested": ingested, + "errors": errors, + "task_ids": task_ids, + } + + def delete_entry(self, source_id: str) -> Dict[str, Any]: + """Delete a subject entry from the index.""" + client = self._get_client() + subject = f"{MDF_DETAIL_BASE}/{source_id}" + + try: + client.delete_entry(self.index_id, subject) + return {"success": True, "source_id": 
source_id} + except Exception as exc: + logger.exception("Globus Search delete failed for %s", source_id) + return {"success": False, "error": str(exc)} + + def _get_read_client(self): + """Return an unauthenticated SearchClient for public index reads. + + The MDF Search index is public, so queries do not require credentials. + Only ingest/delete operations use the authenticated client from _get_client(). + """ + import globus_sdk + return globus_sdk.SearchClient() + + def search(self, query: str, limit: int = 20, offset: int = 0) -> Dict[str, Any]: + """Search the Globus Search index.""" + client = self._get_read_client() + + try: + result = client.search(self.index_id, query, limit=limit, offset=offset) + data = result.data if hasattr(result, "data") else result + return { + "success": True, + "total": data.get("total", 0), + "results": _format_globus_search_results(data), + } + except Exception as exc: + logger.exception("Globus Search query failed") + return {"success": False, "error": str(exc), "total": 0, "results": []} + + def faceted_search( + self, query: str, limit: int = 20, offset: int = 0, filters: Optional[Dict[str, List]] = None, + ) -> Dict[str, Any]: + """Search with facets and optional filters. + + filters: dict mapping facet field_name → list of selected values + e.g. 
{"mdf.organization": ["MDF Open"], "dc.year": [2024, 2025]} + """ + from globus_sdk import SearchQuery + + client = self._get_read_client() + sq = SearchQuery(query) + + for facet in DEFAULT_FACETS: + sq.add_facet(**facet) + + if filters: + for field_name, values in filters.items(): + sq.add_filter(field_name, values, type="match_any") + + sq["limit"] = limit + sq["offset"] = offset + + try: + result = client.post_search(self.index_id, sq) + data = result.data if hasattr(result, "data") else result + return { + "success": True, + "total": data.get("total", 0), + "results": _format_globus_search_results(data), + "facets": _format_facet_results(data.get("facet_results", [])), + } + except Exception as exc: + logger.exception("Globus Search faceted query failed") + return {"success": False, "error": str(exc), "total": 0, "results": [], "facets": {}} + + +class MockGlobusSearchClient: + """In-memory mock for Globus Search. Used when USE_MOCK_SEARCH=true.""" + + def __init__(self, index_id: str = "mock-index", test_mode: bool = False): + self.index_id = index_id + self.test_mode = test_mode + self._entries: Dict[str, Dict[str, Any]] = {} + + def build_gmeta_entry( + self, submission: Dict[str, Any], version_count: Optional[int] = None, + ) -> Dict[str, Any]: + # Re-use the real implementation's logic + real = GlobusSearchClient.__new__(GlobusSearchClient) + return real.build_gmeta_entry(submission, version_count=version_count) + + def ingest(self, submission: Dict[str, Any], version_count: Optional[int] = None) -> Dict[str, Any]: + entry = self.build_gmeta_entry(submission, version_count=version_count) + self._entries[entry["subject"]] = entry + return {"success": True, "mock": True, "subject": entry["subject"]} + + def batch_ingest( + self, submissions: List[Dict[str, Any]], batch_size: int = 100, + ) -> Dict[str, Any]: + errors = [] + for sub in submissions: + result = self.ingest(sub) + if not result.get("success"): + errors.append({"source_id": sub.get("source_id"), 
"error": result.get("error")}) + return { + "success": len(errors) == 0, + "total": len(submissions), + "ingested": len(submissions) - len(errors), + "errors": errors, + "task_ids": [], + "mock": True, + } + + def delete_entry(self, source_id: str) -> Dict[str, Any]: + subject = f"{MDF_DETAIL_BASE}/{source_id}" + self._entries.pop(subject, None) + return {"success": True, "mock": True, "source_id": source_id} + + def search(self, query: str, limit: int = 20, offset: int = 0) -> Dict[str, Any]: + # Simple text match over stored entries + query_lower = query.lower() + matches = [] + for subject, entry in self._entries.items(): + content = entry.get("content", {}) + text = " ".join([ + content.get("dc", {}).get("title", ""), + content.get("dc", {}).get("description", ""), + " ".join(content.get("dc", {}).get("subjects", [])), + content.get("mdf", {}).get("source_id", ""), + ]).lower() + if query_lower in text: + matches.append(entry) + + paginated = matches[offset:offset + limit] + results = [] + for entry in paginated: + content = entry.get("content", {}) + dc = content.get("dc", {}) + mdf = content.get("mdf", {}) + data_block = content.get("data", {}) + description = dc.get("description", "") or "" + results.append({ + "type": "dataset", + "source_id": mdf.get("source_id"), + "title": dc.get("title"), + "authors": [c.get("name", "") for c in dc.get("creators", [])], + "keywords": dc.get("subjects", []), + "description": description[:300] if len(description) > 300 else description, + "publication_year": dc.get("year"), + "organization": mdf.get("organization"), + "domains": mdf.get("domains") or [], + "doi": dc.get("doi") or mdf.get("dataset_doi"), + "license": dc.get("license") or None, + "size_bytes": data_block.get("size_bytes"), + "file_count": data_block.get("file_count"), + "score": 1.0, + }) + + return {"success": True, "total": len(matches), "results": results, "mock": True} + + def faceted_search( + self, query: str, limit: int = 20, offset: int = 0, 
filters: Optional[Dict[str, List]] = None, + ) -> Dict[str, Any]: + """Faceted search over in-memory entries with filter support.""" + query_lower = query.lower() + matches = [] + for subject, entry in self._entries.items(): + content = entry.get("content", {}) + text = " ".join([ + content.get("dc", {}).get("title", ""), + content.get("dc", {}).get("description", ""), + " ".join(content.get("dc", {}).get("subjects", [])), + content.get("mdf", {}).get("source_id", ""), + ]).lower() + if query_lower == "*" or query_lower in text: + if filters and not self._matches_filters(content, filters): + continue + matches.append(entry) + + paginated = matches[offset:offset + limit] + results = [] + for entry in paginated: + content = entry.get("content", {}) + results.append({ + "type": "dataset", + "source_id": content.get("mdf", {}).get("source_id"), + "title": content.get("dc", {}).get("title"), + "authors": [c.get("name", "") for c in content.get("dc", {}).get("creators", [])], + "score": 1.0, + }) + + return { + "success": True, + "total": len(matches), + "results": results, + "facets": self._compute_facets(matches), + "mock": True, + } + + def _matches_filters(self, content: Dict[str, Any], filters: Dict[str, List]) -> bool: + """Check if a content entry matches all active filters.""" + field_map = { + "dc.year": lambda c: [c.get("dc", {}).get("year")], + "mdf.organization": lambda c: [c.get("mdf", {}).get("organization")], + "dc.creators.name": lambda c: [cr.get("name", "") for cr in c.get("dc", {}).get("creators", [])], + "dc.subjects": lambda c: c.get("dc", {}).get("subjects", []), + "mdf.domains": lambda c: c.get("mdf", {}).get("domains", []), + } + for field_name, values in filters.items(): + extractor = field_map.get(field_name) + if not extractor: + continue + entry_values = [str(v) for v in extractor(content) if v is not None] + filter_values = [str(v) for v in values] + if not any(ev in filter_values for ev in entry_values): + return False + return True + + def 
_compute_facets(self, entries: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]: + """Compute facet counts from a list of matched entries.""" + counters: Dict[str, Counter] = { + "Year": Counter(), + "Organization": Counter(), + "Authors": Counter(), + "Keywords": Counter(), + "Domains": Counter(), + } + for entry in entries: + content = entry.get("content", {}) + dc = content.get("dc", {}) + mdf = content.get("mdf", {}) + + year = dc.get("year") + if year is not None: + counters["Year"][str(year)] += 1 + org = mdf.get("organization") + if org: + counters["Organization"][org] += 1 + for creator in dc.get("creators", []): + name = creator.get("name") + if name: + counters["Authors"][name] += 1 + for kw in dc.get("subjects", []): + if kw: + counters["Keywords"][kw] += 1 + for domain in mdf.get("domains", []): + if domain: + counters["Domains"][domain] += 1 + + facets = {} + for name, counter in counters.items(): + buckets = [{"value": val, "count": count} for val, count in counter.most_common(20)] + facets[name] = buckets + return facets + + +def _format_globus_search_results(data: Dict[str, Any]) -> List[Dict[str, Any]]: + """Normalize Globus Search response into the MDF result format. 
+ + Handles both response shapes: + - POST search (post_search): gmeta[i].entries[j].content (dict) + - GET search (search): gmeta[i].content[j] (dict in a list) + """ + results = [] + for gmeta in data.get("gmeta", []): + # POST search wraps entries; GET search uses content list directly + if gmeta.get("entries") is not None: + contents = [e.get("content", {}) for e in gmeta["entries"]] + else: + raw = gmeta.get("content", []) + contents = raw if isinstance(raw, list) else [raw] + + for content in contents: + if not isinstance(content, dict): + continue + mdf = content.get("mdf", {}) + dc = content.get("dc", {}) + data_block = content.get("data", {}) + description = dc.get("description", "") or "" + result_entry = { + "type": "dataset", + "source_id": mdf.get("source_id"), + "version": mdf.get("version"), + "title": dc.get("title"), + "authors": [c.get("name", "") for c in dc.get("creators", [])], + "keywords": dc.get("subjects", []), + "description": description[:300] if len(description) > 300 else description, + "publication_year": dc.get("year"), + "organization": mdf.get("organization"), + "domains": mdf.get("domains") or [], + "doi": dc.get("doi") or mdf.get("dataset_doi"), + "license": dc.get("license") or None, + "size_bytes": data_block.get("size_bytes"), + "file_count": data_block.get("file_count"), + "status": "published", + "score": gmeta.get("score", 0), + "latest": mdf.get("latest", True), + } + if mdf.get("root_version"): + result_entry["root_version"] = mdf["root_version"] + if mdf.get("download_url"): + result_entry["download_url"] = mdf["download_url"] + results.append(result_entry) + return results + + +def _format_facet_results(facet_results: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]: + """Normalize Globus facet_results into frontend-friendly format.""" + facets = {} + for fr in facet_results: + name = fr.get("name", "") + buckets = [ + {"value": b.get("value"), "count": b.get("count", 0)} + for b in fr.get("buckets", []) + if 
b.get("count", 0) > 0 + ] + facets[name] = buckets + return facets + + +# Singleton for mock client to persist in-memory state within a Lambda invocation +_mock_client: Optional[MockGlobusSearchClient] = None + + +def get_search_client(test_mode: bool = False) -> Any: + """Factory: returns GlobusSearchClient or MockGlobusSearchClient.""" + global _mock_client + + use_mock = os.environ.get("USE_MOCK_SEARCH", "true").lower() == "true" + + if use_mock: + if _mock_client is None: + _mock_client = MockGlobusSearchClient(test_mode=test_mode) + return _mock_client + + if test_mode: + index_id = os.environ.get("TEST_SEARCH_INDEX_UUID", "not-configured") + else: + index_id = os.environ.get("SEARCH_INDEX_UUID", "not-configured") + + if index_id == "not-configured": + logger.warning("Search index UUID not configured, falling back to mock") + if _mock_client is None: + _mock_client = MockGlobusSearchClient(test_mode=test_mode) + return _mock_client + + return GlobusSearchClient(index_id=index_id, test_mode=test_mode) diff --git a/aws/v2/storage/__init__.py b/aws/v2/storage/__init__.py new file mode 100644 index 0000000..8252690 --- /dev/null +++ b/aws/v2/storage/__init__.py @@ -0,0 +1,28 @@ +"""Storage backends for MDF v2. 
+ +Supports multiple storage backends: +- Globus HTTPS endpoints (primary, 1PB free storage) +- S3 (secondary) +- Local filesystem (development only) + +Configuration via environment variables: + STORAGE_BACKEND=globus|s3|local + + # Globus settings + GLOBUS_ENDPOINT_ID= + GLOBUS_BASE_PATH=/mdf/streams + GLOBUS_CLIENT_ID= + GLOBUS_CLIENT_SECRET= + + # S3 settings + S3_BUCKET=mdf-stream-files + S3_PREFIX=streams/ + + # Local settings + FILE_STORE_PATH=/tmp/mdf_files +""" + +from v2.storage.base import StorageBackend, FileMetadata +from v2.storage.factory import get_storage_backend, reset_storage_backend + +__all__ = ["StorageBackend", "FileMetadata", "get_storage_backend", "reset_storage_backend"] diff --git a/aws/v2/storage/base.py b/aws/v2/storage/base.py new file mode 100644 index 0000000..1e08267 --- /dev/null +++ b/aws/v2/storage/base.py @@ -0,0 +1,234 @@ +"""Base storage interface for MDF v2.""" + +from abc import ABC, abstractmethod +from dataclasses import dataclass, field +from datetime import datetime, timezone +import re +from typing import Any, BinaryIO, Dict, List, Optional + + +@dataclass +class FileMetadata: + """Metadata for a stored file.""" + + filename: str + path: str # Full path in storage (e.g., streams/{stream_id}/2026-01-31/file.csv) + size_bytes: int + checksum_md5: str + content_type: str = "application/octet-stream" + stored_at: str = "" + storage_backend: str = "" # globus, s3, local + download_url: Optional[str] = None # Direct download URL if available + custom_metadata: Dict[str, Any] = field(default_factory=dict) + + def __post_init__(self): + if not self.stored_at: + self.stored_at = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + + def to_dict(self) -> Dict[str, Any]: + return { + "filename": self.filename, + "path": self.path, + "size_bytes": self.size_bytes, + "checksum_md5": self.checksum_md5, + "content_type": self.content_type, + "stored_at": self.stored_at, + "storage_backend": self.storage_backend, + 
"download_url": self.download_url, + "metadata": self.custom_metadata, + } + + +class StorageBackend(ABC): + """Abstract base class for storage backends.""" + + @property + @abstractmethod + def backend_name(self) -> str: + """Return the backend name (globus, s3, local).""" + pass + + @abstractmethod + def store_file( + self, + stream_id: str, + filename: str, + content: bytes, + content_type: str = "application/octet-stream", + metadata: Optional[Dict[str, Any]] = None, + ) -> FileMetadata: + """Store a file and return its metadata. + + Args: + stream_id: The stream ID (used for path organization) + filename: Name of the file + content: File contents as bytes + content_type: MIME type + metadata: Optional custom metadata + + Returns: + FileMetadata with storage details + """ + pass + + @abstractmethod + def store_file_stream( + self, + stream_id: str, + filename: str, + file_obj: BinaryIO, + size_bytes: int, + content_type: str = "application/octet-stream", + metadata: Optional[Dict[str, Any]] = None, + ) -> FileMetadata: + """Store a file from a file-like object (for large files). + + Args: + stream_id: The stream ID + filename: Name of the file + file_obj: File-like object to read from + size_bytes: Total size in bytes + content_type: MIME type + metadata: Optional custom metadata + + Returns: + FileMetadata with storage details + """ + pass + + @abstractmethod + def get_file(self, path: str) -> Optional[bytes]: + """Retrieve file contents by path. + + Args: + path: Full path in storage + + Returns: + File contents as bytes, or None if not found + """ + pass + + @abstractmethod + def get_download_url(self, path: str, expires_in: int = 3600) -> Optional[str]: + """Get a direct download URL for a file. 
+ + Args: + path: Full path in storage + expires_in: URL expiration time in seconds + + Returns: + Direct download URL, or None if not supported + """ + pass + + @abstractmethod + def get_upload_url( + self, + stream_id: str, + filename: str, + content_type: str = "application/octet-stream", + expires_in: int = 3600, + ) -> Optional[Dict[str, Any]]: + """Get a pre-signed upload URL for direct client upload. + + This allows clients to upload directly to storage without + going through the API (useful for large files). + + Args: + stream_id: The stream ID + filename: Name of the file + content_type: MIME type + expires_in: URL expiration time in seconds + + Returns: + Dict with 'url', 'method', 'headers', and 'path' + or None if not supported + """ + pass + + @abstractmethod + def list_files(self, stream_id: str) -> List[FileMetadata]: + """List all files in a stream. + + Args: + stream_id: The stream ID + + Returns: + List of FileMetadata objects + """ + pass + + @abstractmethod + def delete_file(self, path: str) -> bool: + """Delete a file. + + Args: + path: Full path in storage + + Returns: + True if deleted, False if not found + """ + pass + + @abstractmethod + def delete_stream_files(self, stream_id: str) -> int: + """Delete all files for a stream. + + Args: + stream_id: The stream ID + + Returns: + Number of files deleted + """ + pass + + @abstractmethod + def get_stream_size(self, stream_id: str) -> int: + """Get total size of all files in a stream. 
+ + Args: + stream_id: The stream ID + + Returns: + Total size in bytes + """ + pass + + def _build_path(self, stream_id: str, filename: str) -> str: + """Build a storage path for a file.""" + safe_stream_id = self._sanitize_stream_id(stream_id) + safe_filename = self._sanitize_filename(filename) + date_prefix = datetime.now(timezone.utc).strftime("%Y-%m-%d") + return f"streams/{safe_stream_id}/{date_prefix}/{safe_filename}" + + def _sanitize_stream_id(self, stream_id: str) -> str: + value = (stream_id or "").strip() + if not value: + raise ValueError("stream_id is required") + if not re.fullmatch(r"[A-Za-z0-9._:-]+", value): + raise ValueError(f"Invalid stream_id: {stream_id!r}") + return value + + def _sanitize_filename(self, filename: str) -> str: + normalized = (filename or "").replace("\\", "/").strip("/") + if not normalized: + raise ValueError("filename is required") + parts = normalized.split("/") + if any(part in ("", ".", "..") for part in parts): + raise ValueError(f"Invalid filename: {filename!r}") + allowed = re.compile(r"^[A-Za-z0-9._()+=,@ -]+$") + for part in parts: + if not allowed.fullmatch(part): + raise ValueError(f"Invalid filename segment: {part!r}") + return normalized + + def _compute_checksum(self, content: bytes) -> str: + """Compute MD5 checksum of content.""" + import hashlib + return hashlib.md5(content).hexdigest() + + def _guess_content_type(self, filename: str) -> str: + """Guess content type from filename.""" + import mimetypes + content_type, _ = mimetypes.guess_type(filename) + return content_type or "application/octet-stream" diff --git a/aws/v2/storage/factory.py b/aws/v2/storage/factory.py new file mode 100644 index 0000000..6fca700 --- /dev/null +++ b/aws/v2/storage/factory.py @@ -0,0 +1,62 @@ +"""Storage backend factory for MDF v2. + +Creates the appropriate storage backend based on configuration. 
+ +Configuration: + STORAGE_BACKEND: Backend type (globus, s3, local) + Default: local (for development) +""" + +import os +from typing import Optional + +from v2.storage.base import StorageBackend + + +# Singleton instance +_storage_backend: Optional[StorageBackend] = None + + +def get_storage_backend(backend_type: Optional[str] = None) -> StorageBackend: + """Get the configured storage backend. + + Args: + backend_type: Override the backend type (globus, s3, local) + + Returns: + StorageBackend instance + """ + global _storage_backend + + # Allow override, otherwise use environment + backend = backend_type or os.environ.get("STORAGE_BACKEND", "local") + + # Return cached instance if same type + if _storage_backend is not None and _storage_backend.backend_name == backend: + return _storage_backend + + if backend == "globus": + from v2.storage.globus_https import GlobusHTTPSStorage + _storage_backend = GlobusHTTPSStorage() + + elif backend == "s3": + from v2.storage.s3 import S3Storage + _storage_backend = S3Storage() + + elif backend == "local": + from v2.storage.local import LocalStorage + _storage_backend = LocalStorage() + + else: + raise ValueError( + f"Unknown storage backend: {backend}. " + f"Use 'globus', 's3', or 'local'" + ) + + return _storage_backend + + +def reset_storage_backend(): + """Reset the cached storage backend (for testing).""" + global _storage_backend + _storage_backend = None diff --git a/aws/v2/storage/globus_https.py b/aws/v2/storage/globus_https.py new file mode 100644 index 0000000..ba24cea --- /dev/null +++ b/aws/v2/storage/globus_https.py @@ -0,0 +1,418 @@ +"""Globus HTTPS storage backend for MDF v2. + +Uses Globus endpoints with HTTPS access for file storage. +This is the primary storage backend for MDF - we have 1PB of free storage. 
+ +Globus HTTPS endpoints provide: +- Direct HTTPS GET/PUT/DELETE operations +- Bearer token authentication +- High-performance data transfer +- Integration with Globus Transfer for large datasets + +Configuration: + GLOBUS_ENDPOINT_ID: The Globus endpoint UUID (default: NCSA MDF endpoint) + GLOBUS_BASE_PATH: Base path on the endpoint (default: /tmp/testing) + GLOBUS_HTTPS_SERVER: Override HTTPS server (default: data.materialsdatafacility.org) + +Authentication (in order of priority): + 1. access_token parameter + 2. GLOBUS_ACCESS_TOKEN environment variable + 3. Cached tokens from ~/.mdf/v2_https_tokens.json + 4. Client credentials flow (GLOBUS_CLIENT_ID + GLOBUS_CLIENT_SECRET) +""" + +import hashlib +import json +import logging +import os +from datetime import datetime, timezone +from io import BytesIO +from typing import Any, BinaryIO, Dict, List, Optional + +import httpx + +logger = logging.getLogger(__name__) + +from v2.storage.base import FileMetadata, StorageBackend + +# NCSA MDF endpoint - default for MDF +NCSA_ENDPOINT_UUID = "82f1b5c6-6e9b-11e5-ba47-22000b92c6ec" +NCSA_HTTPS_SERVER = "data.materialsdatafacility.org" + +# Token cache location +TOKEN_CACHE_FILE = os.path.expanduser("~/.mdf/v2_https_tokens.json") + + +def load_cached_token() -> Optional[str]: + """Load the cached access token from disk; return None if it is missing or unreadable.""" + if os.path.exists(TOKEN_CACHE_FILE): + try: + with open(TOKEN_CACHE_FILE) as f: + data = json.load(f) + return data.get("access_token") + except Exception: + pass + return None + + +class GlobusHTTPSStorage(StorageBackend): + """Storage backend using Globus HTTPS endpoints.""" + + def __init__( + self, + endpoint_id: Optional[str] = None, + base_path: Optional[str] = None, + https_server: Optional[str] = None, + access_token: Optional[str] = None, + ): + """Initialize Globus HTTPS storage. 
+ + Args: + endpoint_id: Globus endpoint UUID (default: NCSA MDF endpoint) + base_path: Base path on endpoint (default: /tmp/testing for dev) + https_server: Override HTTPS server hostname + access_token: Globus access token (or will use cached/env token) + """ + self.endpoint_id = endpoint_id or os.environ.get("GLOBUS_ENDPOINT_ID", NCSA_ENDPOINT_UUID) + + self.base_path = (base_path or os.environ.get("GLOBUS_BASE_PATH", "/tmp/testing")).rstrip("/") + + # Build HTTPS server URL - use MDF's custom domain by default + self.https_server = https_server or os.environ.get( + "GLOBUS_HTTPS_SERVER", + NCSA_HTTPS_SERVER + ) + self.base_url = f"https://{self.https_server}{self.base_path}" + + # Authentication + self._access_token = access_token + self._token_expires_at: Optional[datetime] = None + + # HTTP client (60 s timeout, follows redirects; no retry is configured) + self._client = httpx.Client( + timeout=60.0, + follow_redirects=True, + ) + + # Metadata storage (could be DynamoDB in production) + self._metadata_cache: Dict[str, FileMetadata] = {} + + @property + def backend_name(self) -> str: + return "globus" + + def _get_token(self) -> str: + """Get a valid Globus access token.""" + # 1. Explicit or disk-cached token (no expiry tracked); client-credentials tokens are re-validated in step 4 + if self._access_token and self._token_expires_at is None: + return self._access_token + + # 2. Environment variable + token = os.environ.get("GLOBUS_ACCESS_TOKEN") + if token: + return token + + # 3. Cached token from disk (from test_globus_upload.py auth flow) + cached = load_cached_token() + if cached: + self._access_token = cached # Cache in memory too + return cached + + # 4. 
Client credentials flow (for server deployment) + client_id = os.environ.get("GLOBUS_CLIENT_ID") + client_secret = os.environ.get("GLOBUS_CLIENT_SECRET") + + if client_id and client_secret: + logger.info("Using client credentials flow for Globus HTTPS token") + return self._get_client_credentials_token(client_id, client_secret) + + raise ValueError( + "No Globus authentication configured. 
Run test_globus_upload.py to authenticate, " + "or set GLOBUS_ACCESS_TOKEN or GLOBUS_CLIENT_ID + GLOBUS_CLIENT_SECRET" + ) + + def _get_client_credentials_token(self, client_id: str, client_secret: str) -> str: + """Get token using client credentials flow for HTTPS endpoint access.""" + # Check if we have a cached valid token + if self._access_token and self._token_expires_at: + if datetime.now(timezone.utc) < self._token_expires_at: + return self._access_token + + # Request token scoped to the HTTPS endpoint (data access) + scope = f"urn:globus:auth:scope:{self.https_server}:all" + logger.info("Requesting client credentials token for scope: %s", scope) + + response = self._client.post( + "https://auth.globus.org/v2/oauth2/token", + data={ + "grant_type": "client_credentials", + "scope": scope, + }, + auth=(client_id, client_secret), + ) + if response.status_code != 200: + logger.error("Client credentials token request failed: %s %s", response.status_code, response.text) + response.raise_for_status() + + data = response.json() + self._access_token = data["access_token"] + # Cache token with some buffer before expiry + expires_in = data.get("expires_in", 3600) + from datetime import timedelta + self._token_expires_at = datetime.now(timezone.utc) + timedelta(seconds=expires_in - 60) + + return self._access_token + + def _headers(self, content_type: str = "application/octet-stream") -> Dict[str, str]: + """Build request headers with auth.""" + return { + "Authorization": f"Bearer {self._get_token()}", + "Content-Type": content_type, + } + + def _full_url(self, path: str) -> str: + """Build full URL for a path.""" + safe_path = self._sanitize_remote_path(path) + # Ensure path doesn't double up the base + if safe_path.startswith(self.base_path): + safe_path = safe_path[len(self.base_path):] + return f"{self.base_url}/{safe_path.lstrip('/')}" + + def _sanitize_remote_path(self, path: str) -> str: + value = (path or "").replace("\\", "/").strip() + if not value: + raise 
ValueError("path is required") + if "://" in value: + raise ValueError("path must not be a URL") + if value.startswith("/"): + value = value.lstrip("/") + if ".." in value.split("/"): + raise ValueError("path traversal is not allowed") + return value + + def _build_path(self, stream_id: str, filename: str) -> str: + """Build a flat storage path for Globus HTTPS. + + Uses a flat structure to avoid directory creation issues: + {stream_id}_{date}_{filename} + + Globus HTTPS doesn't auto-create parent directories, so we use + a flat naming scheme instead of nested directories. + """ + date_prefix = datetime.now(timezone.utc).strftime("%Y%m%d") + safe_stream_id = self._sanitize_stream_id(stream_id) + safe_filename = self._sanitize_filename(filename).replace("/", "_") + return f"{safe_stream_id}_{date_prefix}_{safe_filename}" + + def store_file( + self, + stream_id: str, + filename: str, + content: bytes, + content_type: str = "application/octet-stream", + metadata: Optional[Dict[str, Any]] = None, + user_token: Optional[str] = None, + ) -> FileMetadata: + """Store a file via HTTPS PUT. + + Args: + user_token: User's Globus token for authorization. If provided, + the action is performed on behalf of the user. + """ + path = self._build_path(stream_id, filename) + url = self._full_url(path) + + # Compute checksum before upload + checksum = self._compute_checksum(content) + + # Prefer the user token (the user authenticated with the data scope and + # has write access to the endpoint); fall back to server credentials + # for background operations (e.g., async worker) that have no user token. 
+ if user_token: + token = user_token + logger.info("Using user token for Globus upload") + else: + try: + token = self._get_token() + logger.info("Using server token for Globus upload (len=%d)", len(token)) + except Exception as exc: + raise ValueError(f"No authentication available for Globus storage: {exc}") from exc + + # Upload file + response = self._client.put( + url, + content=content, + headers={ + "Authorization": f"Bearer {token}", + "Content-Type": content_type, + }, + ) + response.raise_for_status() + + # Build metadata + file_meta = FileMetadata( + filename=filename, + path=path, + size_bytes=len(content), + checksum_md5=checksum, + content_type=content_type, + storage_backend=self.backend_name, + download_url=url, + custom_metadata=metadata or {}, + ) + + # Cache metadata (in production, store in DynamoDB) + self._metadata_cache[path] = file_meta + + return file_meta + + def store_file_stream( + self, + stream_id: str, + filename: str, + file_obj: BinaryIO, + size_bytes: int, + content_type: str = "application/octet-stream", + metadata: Optional[Dict[str, Any]] = None, + ) -> FileMetadata: + """Store a file from a stream via HTTPS PUT.""" + path = self._build_path(stream_id, filename) + url = self._full_url(path) + + # Not truly streamed: the whole file is read into memory so that the + # MD5 checksum can be computed before the upload starts + content = file_obj.read() + checksum = self._compute_checksum(content) + + response = self._client.put( + url, + content=content, + headers={ + **self._headers(content_type), + "Content-Length": str(len(content)), + }, + ) + response.raise_for_status() + + file_meta = FileMetadata( + filename=filename, + path=path, + size_bytes=len(content), + checksum_md5=checksum, + content_type=content_type, + storage_backend=self.backend_name, + download_url=url, + custom_metadata=metadata or {}, + ) + + self._metadata_cache[path] = file_meta + return file_meta + + def get_file(self, path: str) -> Optional[bytes]: + 
"""Retrieve file contents via 
HTTPS GET.""" + try: + url = self._full_url(path) + except ValueError: + return None + + try: + response = self._client.get(url, headers=self._headers()) + response.raise_for_status() + return response.content + except httpx.HTTPStatusError as e: + if e.response.status_code == 404: + return None + raise + + def get_download_url(self, path: str, expires_in: int = 3600) -> Optional[str]: + """Get direct download URL. + + For Globus HTTPS, the URL is the same but requires auth. + For unauthenticated access, you'd need to use Globus sharing. + """ + # The download URL is just the HTTPS endpoint URL + # Client will need to provide auth token + try: + return self._full_url(path) + except ValueError: + return None + + def get_upload_url( + self, + stream_id: str, + filename: str, + content_type: str = "application/octet-stream", + expires_in: int = 3600, + ) -> Optional[Dict[str, Any]]: + """Get upload URL for direct client upload. + + Returns the URL and headers needed for direct PUT. + """ + path = self._build_path(stream_id, filename) + url = self._full_url(path) + + return { + "url": url, + "method": "PUT", + "path": path, + "headers": { + "Content-Type": content_type, + }, + "auth_type": "bearer", + "expires_in": expires_in, + } + + def list_files(self, stream_id: str) -> List[FileMetadata]: + """List files in a stream. + + Uses cached metadata. In production, query DynamoDB. 
+ """ + # Flat structure: {stream_id}_{date}_{filename} + try: + safe_stream_id = self._sanitize_stream_id(stream_id) + except ValueError: + return [] + prefix = f"{safe_stream_id}_" + files = [ + meta for path, meta in self._metadata_cache.items() + if path.startswith(prefix) + ] + # Sort by stored_at descending + files.sort(key=lambda x: x.stored_at, reverse=True) + return files + + def delete_file(self, path: str) -> bool: + """Delete a file via HTTPS DELETE.""" + try: + safe_path = self._sanitize_remote_path(path) + url = self._full_url(safe_path) + except ValueError: + return False + + try: + response = self._client.delete(url, headers=self._headers()) + response.raise_for_status() + self._metadata_cache.pop(safe_path, None) + return True + except httpx.HTTPStatusError as e: + if e.response.status_code == 404: + return False + raise + + def delete_stream_files(self, stream_id: str) -> int: + """Delete all files for a stream.""" + files = self.list_files(stream_id) + count = 0 + for f in files: + if self.delete_file(f.path): + count += 1 + return count + + def get_stream_size(self, stream_id: str) -> int: + """Get total size of all files in a stream.""" + files = self.list_files(stream_id) + return sum(f.size_bytes for f in files) + + def close(self): + """Close the HTTP client.""" + self._client.close() diff --git a/aws/v2/storage/local.py b/aws/v2/storage/local.py new file mode 100644 index 0000000..7f3d947 --- /dev/null +++ b/aws/v2/storage/local.py @@ -0,0 +1,223 @@ +"""Local filesystem storage backend for MDF v2. + +For development and testing only. In production, use Globus or S3. 
+ +Configuration: + FILE_STORE_PATH: Base directory for file storage (default: /tmp/mdf_files) +""" + +import json +import os +import shutil +from datetime import datetime +from pathlib import Path +from typing import Any, BinaryIO, Dict, List, Optional + +from v2.storage.base import FileMetadata, StorageBackend + + +class LocalStorage(StorageBackend): + """Local filesystem storage backend for development.""" + + def __init__(self, base_path: Optional[str] = None): + """Initialize local storage. + + Args: + base_path: Base directory for storage + """ + self.base_path = Path( + base_path or os.environ.get("FILE_STORE_PATH", "/tmp/mdf_files") + ).resolve() + self.base_path.mkdir(parents=True, exist_ok=True) + + @property + def backend_name(self) -> str: + return "local" + + def _full_path(self, path: str) -> Path: + """Get full filesystem path.""" + candidate = (self.base_path / path).resolve() + try: + candidate.relative_to(self.base_path) + except ValueError as exc: + raise ValueError(f"Invalid storage path: {path!r}") from exc + return candidate + + def _meta_path(self, file_path: Path) -> Path: + """Get metadata file path for a file.""" + return file_path.with_suffix(file_path.suffix + ".meta.json") + + def store_file( + self, + stream_id: str, + filename: str, + content: bytes, + content_type: str = "application/octet-stream", + metadata: Optional[Dict[str, Any]] = None, + **kwargs, # Accept user_token etc. 
(ignored for local storage) + ) -> FileMetadata: + """Store a file locally.""" + path = self._build_path(stream_id, filename) + full_path = self._full_path(path) + + # Ensure directory exists + full_path.parent.mkdir(parents=True, exist_ok=True) + + # Write file + full_path.write_bytes(content) + + # Compute checksum + checksum = self._compute_checksum(content) + + # Build metadata + file_meta = FileMetadata( + filename=filename, + path=path, + size_bytes=len(content), + checksum_md5=checksum, + content_type=content_type, + storage_backend=self.backend_name, + download_url=f"file://{full_path}", + custom_metadata=metadata or {}, + ) + + # Store metadata file + meta_path = self._meta_path(full_path) + meta_path.write_text(json.dumps(file_meta.to_dict(), indent=2)) + + return file_meta + + def store_file_stream( + self, + stream_id: str, + filename: str, + file_obj: BinaryIO, + size_bytes: int, + content_type: str = "application/octet-stream", + metadata: Optional[Dict[str, Any]] = None, + ) -> FileMetadata: + """Store a file from a file-like object.""" + content = file_obj.read() + return self.store_file(stream_id, filename, content, content_type, metadata) + + def get_file(self, path: str) -> Optional[bytes]: + """Retrieve file contents.""" + try: + full_path = self._full_path(path) + except ValueError: + return None + if full_path.exists() and full_path.is_file(): + return full_path.read_bytes() + return None + + def get_download_url(self, path: str, expires_in: int = 3600) -> Optional[str]: + """Get download URL (file:// URL for local).""" + try: + full_path = self._full_path(path) + except ValueError: + return None + if full_path.exists(): + return f"file://{full_path}" + return None + + def get_upload_url( + self, + stream_id: str, + filename: str, + content_type: str = "application/octet-stream", + expires_in: int = 3600, + ) -> Optional[Dict[str, Any]]: + """Local storage doesn't support pre-signed upload URLs.""" + # Return info for direct upload through API 
+ path = self._build_path(stream_id, filename) + return { + "url": f"/stream/{stream_id}/upload", + "method": "POST", + "path": path, + "note": "Local storage - upload through API", + } + + def list_files(self, stream_id: str) -> List[FileMetadata]: + """List all files in a stream.""" + try: + safe_stream_id = self._sanitize_stream_id(stream_id) + stream_path = self._full_path(f"streams/{safe_stream_id}") + except ValueError: + return [] + files = [] + + if not stream_path.exists(): + return files + + for meta_path in stream_path.rglob("*.meta.json"): + try: + meta_dict = json.loads(meta_path.read_text()) + files.append(FileMetadata( + filename=meta_dict.get("filename", ""), + path=meta_dict.get("path", ""), + size_bytes=meta_dict.get("size_bytes", 0), + checksum_md5=meta_dict.get("checksum_md5", ""), + content_type=meta_dict.get("content_type", "application/octet-stream"), + stored_at=meta_dict.get("stored_at", ""), + storage_backend=meta_dict.get("storage_backend", "local"), + download_url=meta_dict.get("download_url"), + custom_metadata=meta_dict.get("metadata", {}), + )) + except Exception: + continue + + # Sort by stored_at descending + files.sort(key=lambda x: x.stored_at, reverse=True) + return files + + def delete_file(self, path: str) -> bool: + """Delete a file.""" + try: + full_path = self._full_path(path) + except ValueError: + return False + meta_path = self._meta_path(full_path) + + deleted = False + if full_path.exists(): + full_path.unlink() + deleted = True + if meta_path.exists(): + meta_path.unlink() + + return deleted + + def delete_stream_files(self, stream_id: str) -> int: + """Delete all files for a stream.""" + try: + safe_stream_id = self._sanitize_stream_id(stream_id) + stream_path = self._full_path(f"streams/{safe_stream_id}") + except ValueError: + return 0 + + if not stream_path.exists(): + return 0 + + # Count files (not including .meta.json) + count = sum(1 for f in stream_path.rglob("*") if f.is_file() and not 
f.name.endswith(".meta.json")) + + shutil.rmtree(stream_path) + return count + + def get_stream_size(self, stream_id: str) -> int: + """Get total size of all files in a stream.""" + try: + safe_stream_id = self._sanitize_stream_id(stream_id) + stream_path = self._full_path(f"streams/{safe_stream_id}") + except ValueError: + return 0 + + if not stream_path.exists(): + return 0 + + total = 0 + for file_path in stream_path.rglob("*"): + if file_path.is_file() and not file_path.name.endswith(".meta.json"): + total += file_path.stat().st_size + + return total diff --git a/aws/v2/storage/s3.py b/aws/v2/storage/s3.py new file mode 100644 index 0000000..3627ba7 --- /dev/null +++ b/aws/v2/storage/s3.py @@ -0,0 +1,239 @@ +"""S3 storage backend for MDF v2. + +For staging and production deployments where Globus HTTPS is not needed. + +Configuration: + S3_BUCKET: S3 bucket name for file storage + S3_PREFIX: Key prefix within the bucket (default: "streams/") +""" + +import io +import os +from typing import Any, BinaryIO, Dict, List, Optional + +import boto3 +from botocore.exceptions import ClientError + +from v2.storage.base import FileMetadata, StorageBackend + + +class S3Storage(StorageBackend): + """S3 storage backend for staging/production.""" + + def __init__( + self, + bucket: Optional[str] = None, + prefix: Optional[str] = None, + ): + self.bucket = bucket or os.environ.get("S3_BUCKET", "") + if not self.bucket: + raise ValueError("S3_BUCKET environment variable is required for S3 storage backend") + self.prefix = prefix or os.environ.get("S3_PREFIX", "streams/") + self._s3 = boto3.client("s3") + + @property + def backend_name(self) -> str: + return "s3" + + def _s3_key(self, path: str) -> str: + """Build a full S3 key from a storage path.""" + return f"{self.prefix}{path}" if not path.startswith(self.prefix) else path + + def store_file( + self, + stream_id: str, + filename: str, + content: bytes, + content_type: str = "application/octet-stream", + metadata: 
Optional[Dict[str, Any]] = None, + **kwargs, + ) -> FileMetadata: + path = self._build_path(stream_id, filename) + key = self._s3_key(path) + checksum = self._compute_checksum(content) + + s3_metadata = {} + if metadata: + s3_metadata = {k: str(v) for k, v in metadata.items()} + + self._s3.put_object( + Bucket=self.bucket, + Key=key, + Body=content, + ContentType=content_type, + Metadata=s3_metadata, + ) + + return FileMetadata( + filename=filename, + path=path, + size_bytes=len(content), + checksum_md5=checksum, + content_type=content_type, + storage_backend=self.backend_name, + custom_metadata=metadata or {}, + ) + + def store_file_stream( + self, + stream_id: str, + filename: str, + file_obj: BinaryIO, + size_bytes: int, + content_type: str = "application/octet-stream", + metadata: Optional[Dict[str, Any]] = None, + ) -> FileMetadata: + path = self._build_path(stream_id, filename) + key = self._s3_key(path) + + s3_metadata = {} + if metadata: + s3_metadata = {k: str(v) for k, v in metadata.items()} + + self._s3.upload_fileobj( + file_obj, + self.bucket, + key, + ExtraArgs={ + "ContentType": content_type, + "Metadata": s3_metadata, + }, + ) + + # Read back for checksum if possible, otherwise use empty + checksum = "" + try: + file_obj.seek(0) + content = file_obj.read() + checksum = self._compute_checksum(content) + except Exception: + pass + + return FileMetadata( + filename=filename, + path=path, + size_bytes=size_bytes, + checksum_md5=checksum, + content_type=content_type, + storage_backend=self.backend_name, + custom_metadata=metadata or {}, + ) + + def get_file(self, path: str) -> Optional[bytes]: + key = self._s3_key(path) + try: + resp = self._s3.get_object(Bucket=self.bucket, Key=key) + return resp["Body"].read() + except ClientError as e: + if e.response["Error"]["Code"] == "NoSuchKey": + return None + raise + + def get_download_url(self, path: str, expires_in: int = 3600) -> Optional[str]: + key = self._s3_key(path) + try: + 
self._s3.head_object(Bucket=self.bucket, Key=key) + except ClientError: + return None + + return self._s3.generate_presigned_url( + "get_object", + Params={"Bucket": self.bucket, "Key": key}, + ExpiresIn=expires_in, + ) + + def get_upload_url( + self, + stream_id: str, + filename: str, + content_type: str = "application/octet-stream", + expires_in: int = 3600, + ) -> Optional[Dict[str, Any]]: + path = self._build_path(stream_id, filename) + key = self._s3_key(path) + + url = self._s3.generate_presigned_url( + "put_object", + Params={ + "Bucket": self.bucket, + "Key": key, + "ContentType": content_type, + }, + ExpiresIn=expires_in, + ) + + return { + "url": url, + "method": "PUT", + "headers": {"Content-Type": content_type}, + "path": path, + "expires_in": expires_in, + } + + def list_files(self, stream_id: str) -> List[FileMetadata]: + safe_stream_id = self._sanitize_stream_id(stream_id) + prefix = self._s3_key(f"streams/{safe_stream_id}/") + + files = [] + paginator = self._s3.get_paginator("list_objects_v2") + for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix): + for obj in page.get("Contents", []): + key = obj["Key"] + # Extract filename from key (last path component) + filename = key.rsplit("/", 1)[-1] + # Strip the prefix to get the storage path + path = key[len(self.prefix):] if key.startswith(self.prefix) else key + + files.append(FileMetadata( + filename=filename, + path=path, + size_bytes=obj["Size"], + checksum_md5=obj.get("ETag", "").strip('"'), + content_type=self._guess_content_type(filename), + stored_at=obj["LastModified"].isoformat().replace("+00:00", "Z"), + storage_backend=self.backend_name, + )) + + files.sort(key=lambda x: x.stored_at, reverse=True) + return files + + def delete_file(self, path: str) -> bool: + key = self._s3_key(path) + try: + self._s3.head_object(Bucket=self.bucket, Key=key) + except ClientError: + return False + + self._s3.delete_object(Bucket=self.bucket, Key=key) + return True + + def 
delete_stream_files(self, stream_id: str) -> int: + safe_stream_id = self._sanitize_stream_id(stream_id) + prefix = self._s3_key(f"streams/{safe_stream_id}/") + + count = 0 + paginator = self._s3.get_paginator("list_objects_v2") + for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix): + objects = page.get("Contents", []) + if not objects: + continue + delete_keys = [{"Key": obj["Key"]} for obj in objects] + self._s3.delete_objects( + Bucket=self.bucket, + Delete={"Objects": delete_keys}, + ) + count += len(delete_keys) + + return count + + def get_stream_size(self, stream_id: str) -> int: + safe_stream_id = self._sanitize_stream_id(stream_id) + prefix = self._s3_key(f"streams/{safe_stream_id}/") + + total = 0 + paginator = self._s3.get_paginator("list_objects_v2") + for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix): + for obj in page.get("Contents", []): + total += obj["Size"] + + return total diff --git a/aws/v2/store.py b/aws/v2/store.py new file mode 100644 index 0000000..2a2adba --- /dev/null +++ b/aws/v2/store.py @@ -0,0 +1,432 @@ +import json +import os +import sqlite3 +from datetime import datetime, timezone +from typing import Any, Dict, List, Optional, Tuple + +from v2.config import AWS_REGION, DYNAMO_ENDPOINT_URL, DYNAMO_SUBMISSIONS_TABLE +from v2.submission_utils import latest_version + + +class SubmissionStore: + def get(self, source_id: str, version: Optional[str] = None) -> Optional[Dict[str, Any]]: + """Get a submission, optionally by version. 
If no version, gets latest.""" + if version: + return self.get_submission(source_id, version) + # Get latest version using semantic version sorting + versions = self.list_versions(source_id) + if not versions: + return None + latest = latest_version(versions) + if not latest: + return None + for item in versions: + if item.get("version") == latest: + return item + return None + + def get_submission(self, source_id: str, version: str) -> Optional[Dict[str, Any]]: + raise NotImplementedError + + def list_versions(self, source_id: str) -> List[Dict[str, Any]]: + raise NotImplementedError + + def put_submission(self, record: Dict[str, Any]) -> None: + raise NotImplementedError + + def upsert_submission(self, record: Dict[str, Any]) -> None: + """Put a submission without condition check (for updates like curation).""" + raise NotImplementedError + + def update_status(self, source_id: str, version: str, status: str) -> None: + raise NotImplementedError + + def list_by_user( + self, user_id: str, limit: int = 50, start_key: Optional[Dict[str, Any]] = None + ) -> Tuple[List[Dict[str, Any]], Optional[Dict[str, Any]]]: + raise NotImplementedError + + def list_by_org( + self, organization: str, limit: int = 50, start_key: Optional[Dict[str, Any]] = None + ) -> Tuple[List[Dict[str, Any]], Optional[Dict[str, Any]]]: + raise NotImplementedError + + def list_by_status(self, statuses: List[str], limit: int = 100) -> List[Dict[str, Any]]: + raise NotImplementedError + + def update_profile(self, source_id: str, version: str, profile_json: str) -> None: + raise NotImplementedError + + def scan_by_transfer_status(self, transfer_status: str) -> List[Dict[str, Any]]: + """Return submissions with the given transfer_status (e.g. 
'active').""" + raise NotImplementedError + + def list_all(self, limit: int = 1000) -> List[Dict[str, Any]]: + """List all submissions (for search).""" + raise NotImplementedError + + +class DynamoSubmissionStore(SubmissionStore): + def __init__(self): + import boto3 + from boto3.dynamodb.conditions import Key + + resource_kwargs = {"region_name": AWS_REGION} + if DYNAMO_ENDPOINT_URL: + resource_kwargs["endpoint_url"] = DYNAMO_ENDPOINT_URL + self._resource = boto3.resource("dynamodb", **resource_kwargs) + self.table = self._resource.Table(DYNAMO_SUBMISSIONS_TABLE) + self._key = Key + + def get_submission(self, source_id: str, version: str) -> Optional[Dict[str, Any]]: + resp = self.table.get_item(Key={"source_id": source_id, "version": version}) + return resp.get("Item") + + def list_versions(self, source_id: str) -> List[Dict[str, Any]]: + resp = self.table.query(KeyConditionExpression=self._key("source_id").eq(source_id)) + return resp.get("Items", []) + + def put_submission(self, record: Dict[str, Any]) -> None: + self.table.put_item( + Item=record, + ConditionExpression="attribute_not_exists(source_id) AND attribute_not_exists(version)", + ) + + def upsert_submission(self, record: Dict[str, Any]) -> None: + self.table.put_item(Item=record) + + def update_status(self, source_id: str, version: str, status: str) -> None: + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + self.table.update_item( + Key={"source_id": source_id, "version": version}, + UpdateExpression="SET #status = :status, updated_at = :updated_at", + ExpressionAttributeNames={"#status": "status"}, + ExpressionAttributeValues={":status": status, ":updated_at": now}, + ) + + def list_by_user(self, user_id: str, limit: int = 50, start_key: Optional[Dict[str, Any]] = None): + kwargs = { + "IndexName": os.environ.get("GSI_USER_INDEX", "user-submissions"), + "KeyConditionExpression": self._key("user_id").eq(user_id), + "Limit": limit, + "ScanIndexForward": False, + } + if start_key: 
+ kwargs["ExclusiveStartKey"] = start_key + resp = self.table.query(**kwargs) + return resp.get("Items", []), resp.get("LastEvaluatedKey") + + def list_by_org(self, organization: str, limit: int = 50, start_key: Optional[Dict[str, Any]] = None): + kwargs = { + "IndexName": os.environ.get("GSI_ORG_INDEX", "org-submissions"), + "KeyConditionExpression": self._key("organization").eq(organization), + "Limit": limit, + "ScanIndexForward": False, + } + if start_key: + kwargs["ExclusiveStartKey"] = start_key + resp = self.table.query(**kwargs) + return resp.get("Items", []), resp.get("LastEvaluatedKey") + + def list_by_status(self, statuses: List[str], limit: int = 100) -> List[Dict[str, Any]]: + # Scan with pagination because FilterExpression is applied post-scan. + from boto3.dynamodb.conditions import Attr + if not statuses: + return [] + filter_expr = Attr("status").eq(statuses[0]) + for s in statuses[1:]: + filter_expr = filter_expr | Attr("status").eq(s) + items: List[Dict[str, Any]] = [] + last_key = None + while len(items) < limit: + kwargs: Dict[str, Any] = {"FilterExpression": filter_expr} + if last_key: + kwargs["ExclusiveStartKey"] = last_key + resp = self.table.scan(**kwargs) + for item in resp.get("Items", []): + items.append(item) + if len(items) >= limit: + break + last_key = resp.get("LastEvaluatedKey") + if not last_key: + break + return items[:limit] + + def scan_by_transfer_status(self, transfer_status: str) -> List[Dict[str, Any]]: + from boto3.dynamodb.conditions import Attr + + items: List[Dict[str, Any]] = [] + last_key = None + while True: + kwargs: Dict[str, Any] = { + "FilterExpression": Attr("transfer_status").eq(transfer_status), + } + if last_key: + kwargs["ExclusiveStartKey"] = last_key + resp = self.table.scan(**kwargs) + items.extend(resp.get("Items", [])) + last_key = resp.get("LastEvaluatedKey") + if not last_key: + break + return items + + def update_profile(self, source_id: str, version: str, profile_json: str) -> None: + now = 
datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + self.table.update_item( + Key={"source_id": source_id, "version": version}, + UpdateExpression="SET dataset_profile = :profile, updated_at = :updated_at", + ExpressionAttributeValues={":profile": profile_json, ":updated_at": now}, + ) + + def list_all(self, limit: int = 1000) -> List[Dict[str, Any]]: + # Scan is expensive but acceptable for search + items: List[Dict[str, Any]] = [] + last_key = None + while len(items) < limit: + page_limit = min(limit - len(items), 1000) + kwargs: Dict[str, Any] = {"Limit": page_limit} + if last_key: + kwargs["ExclusiveStartKey"] = last_key + resp = self.table.scan(**kwargs) + items.extend(resp.get("Items", [])) + last_key = resp.get("LastEvaluatedKey") + if not last_key: + break + return items[:limit] + + +class SqliteSubmissionStore(SubmissionStore): + def __init__(self, path: Optional[str] = None): + db_path = path or os.environ.get("SQLITE_PATH", "/tmp/mdf_connect_v2.db") + self.conn = sqlite3.connect(db_path, check_same_thread=False) + self.conn.row_factory = sqlite3.Row + self._init_schema() + + def _init_schema(self) -> None: + with self.conn: + self.conn.execute( + """ + CREATE TABLE IF NOT EXISTS submissions ( + source_id TEXT NOT NULL, + version TEXT NOT NULL, + versioned_source_id TEXT, + user_id TEXT, + user_email TEXT, + organization TEXT, + status TEXT, + dataset_mdata TEXT, + test INTEGER, + created_at TEXT, + updated_at TEXT, + action_id TEXT, + doi TEXT, + dataset_doi TEXT, + published_at TEXT, + approved_at TEXT, + approved_by TEXT, + rejected_at TEXT, + rejected_by TEXT, + rejection_reason TEXT, + curation_history TEXT, + dataset_profile TEXT, + PRIMARY KEY (source_id, version) + ) + """ + ) + self.conn.execute( + "CREATE INDEX IF NOT EXISTS idx_submissions_user ON submissions(user_id, updated_at)" + ) + self.conn.execute( + "CREATE INDEX IF NOT EXISTS idx_submissions_org ON submissions(organization, source_id)" + ) + self.conn.execute( + "CREATE 
INDEX IF NOT EXISTS idx_submissions_status ON submissions(status)" + ) + # Migrations: add columns if missing + cur = self.conn.execute("PRAGMA table_info(submissions)") + col_names = {row["name"] for row in cur.fetchall()} + if "dataset_profile" not in col_names: + self.conn.execute("ALTER TABLE submissions ADD COLUMN dataset_profile TEXT") + if "dataset_doi" not in col_names: + self.conn.execute("ALTER TABLE submissions ADD COLUMN dataset_doi TEXT") + + def _row_to_dict(self, row: sqlite3.Row) -> Dict[str, Any]: + data = dict(row) + # Deserialize JSON fields + if data.get("dataset_mdata"): + try: + data["dataset_mdata"] = json.loads(data["dataset_mdata"]) + except Exception: + pass + if data.get("curation_history"): + try: + data["curation_history"] = json.loads(data["curation_history"]) + except Exception: + pass + if data.get("dataset_profile"): + try: + data["dataset_profile"] = json.loads(data["dataset_profile"]) + except Exception: + pass + return data + + def get_submission(self, source_id: str, version: str) -> Optional[Dict[str, Any]]: + cur = self.conn.execute( + "SELECT * FROM submissions WHERE source_id = ? 
AND version = ?", + (source_id, version), + ) + row = cur.fetchone() + return self._row_to_dict(row) if row else None + + def list_versions(self, source_id: str) -> List[Dict[str, Any]]: + cur = self.conn.execute( + "SELECT * FROM submissions WHERE source_id = ?", + (source_id,), + ) + return [self._row_to_dict(row) for row in cur.fetchall()] + + def _write_submission(self, record: Dict[str, Any]) -> None: + dataset_mdata = record.get("dataset_mdata") + if isinstance(dataset_mdata, dict): + dataset_mdata = json.dumps(dataset_mdata) + + curation_history = record.get("curation_history") + if isinstance(curation_history, list): + curation_history = json.dumps(curation_history) + + dataset_profile = record.get("dataset_profile") + if isinstance(dataset_profile, dict): + dataset_profile = json.dumps(dataset_profile) + + with self.conn: + self.conn.execute( + """ + INSERT OR REPLACE INTO submissions ( + source_id, version, versioned_source_id, user_id, user_email, + organization, status, dataset_mdata, test, created_at, updated_at, action_id, + doi, dataset_doi, published_at, approved_at, approved_by, rejected_at, rejected_by, + rejection_reason, curation_history, dataset_profile + ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) 
+ """, + ( + record.get("source_id"), + record.get("version"), + record.get("versioned_source_id"), + record.get("user_id"), + record.get("user_email"), + record.get("organization"), + record.get("status"), + dataset_mdata, + int(record.get("test") or 0), + record.get("created_at"), + record.get("updated_at"), + record.get("action_id"), + record.get("doi"), + record.get("dataset_doi"), + record.get("published_at"), + record.get("approved_at"), + record.get("approved_by"), + record.get("rejected_at"), + record.get("rejected_by"), + record.get("rejection_reason"), + curation_history, + dataset_profile, + ), + ) + + def put_submission(self, record: Dict[str, Any]) -> None: + self._write_submission(record) + + def upsert_submission(self, record: Dict[str, Any]) -> None: + self._write_submission(record) + + def update_status(self, source_id: str, version: str, status: str) -> None: + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + with self.conn: + self.conn.execute( + "UPDATE submissions SET status = ?, updated_at = ? WHERE source_id = ? AND version = ?", + (status, now, source_id, version), + ) + + def list_by_user(self, user_id: str, limit: int = 50, start_key: Optional[Dict[str, Any]] = None): + offset = int(start_key.get("offset")) if start_key and "offset" in start_key else 0 + cur = self.conn.execute( + """ + SELECT * FROM submissions + WHERE user_id = ? + ORDER BY updated_at DESC + LIMIT ? OFFSET ? + """, + (user_id, limit, offset), + ) + rows = [self._row_to_dict(row) for row in cur.fetchall()] + next_key = {"offset": offset + limit} if len(rows) == limit else None + return rows, next_key + + def list_by_org(self, organization: str, limit: int = 50, start_key: Optional[Dict[str, Any]] = None): + offset = int(start_key.get("offset")) if start_key and "offset" in start_key else 0 + cur = self.conn.execute( + """ + SELECT * FROM submissions + WHERE organization = ? + ORDER BY updated_at DESC + LIMIT ? OFFSET ? 
+ """, + (organization, limit, offset), + ) + rows = [self._row_to_dict(row) for row in cur.fetchall()] + next_key = {"offset": offset + limit} if len(rows) == limit else None + return rows, next_key + + def list_by_status(self, statuses: List[str], limit: int = 100) -> List[Dict[str, Any]]: + if not statuses: + return [] + placeholders = ",".join("?" for _ in statuses) + query = ( + f"SELECT * FROM submissions WHERE status IN ({placeholders}) " + "ORDER BY updated_at DESC LIMIT ?" + ) + cur = self.conn.execute(query, (*statuses, limit)) + return [self._row_to_dict(row) for row in cur.fetchall()] + + def scan_by_transfer_status(self, transfer_status: str) -> List[Dict[str, Any]]: + # Transfer fields aren't persisted in the SQLite schema (dev-only store). + # Globus transfers only run in production with DynamoDB. + return [] + + def update_profile(self, source_id: str, version: str, profile_json: str) -> None: + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + with self.conn: + self.conn.execute( + "UPDATE submissions SET dataset_profile = ?, updated_at = ? WHERE source_id = ? 
AND version = ?", + (profile_json, now, source_id, version), + ) + + def list_all(self, limit: int = 1000) -> List[Dict[str, Any]]: + cur = self.conn.execute( + "SELECT * FROM submissions ORDER BY updated_at DESC LIMIT ?", + (limit,), + ) + return [self._row_to_dict(row) for row in cur.fetchall()] + + +def get_store() -> SubmissionStore: + backend = os.environ.get("STORE_BACKEND", "dynamo").lower() + if backend == "sqlite": + return SqliteSubmissionStore() + return DynamoSubmissionStore() + + +def parse_pagination_key(key_str: Optional[str]) -> Optional[Dict[str, Any]]: + if not key_str: + return None + try: + return json.loads(key_str) + except Exception: + return None + + +def serialize_pagination_key(key: Optional[Dict[str, Any]]) -> Optional[str]: + if not key: + return None + return json.dumps(key) diff --git a/aws/v2/stream_store.py b/aws/v2/stream_store.py new file mode 100644 index 0000000..e3b0d0b --- /dev/null +++ b/aws/v2/stream_store.py @@ -0,0 +1,294 @@ +import json +import os +import sqlite3 +from datetime import datetime, timezone +from typing import Any, Dict, List, Optional + +from v2.config import AWS_REGION, DYNAMO_ENDPOINT_URL, DYNAMO_STREAMS_TABLE + + +class StreamStore: + def create_stream(self, record: Dict[str, Any]) -> Dict[str, Any]: + raise NotImplementedError + + def get_stream(self, stream_id: str) -> Optional[Dict[str, Any]]: + raise NotImplementedError + + def append_stream(self, stream_id: str, file_count: int, total_bytes: int, last_file: Optional[Dict[str, Any]] = None) -> Optional[Dict[str, Any]]: + raise NotImplementedError + + def close_stream(self, stream_id: str) -> Optional[Dict[str, Any]]: + raise NotImplementedError + + def update_stream_metadata(self, stream_id: str, updates: Dict[str, Any]) -> Optional[Dict[str, Any]]: + raise NotImplementedError + + def list_all(self, limit: int = 1000) -> List[Dict[str, Any]]: + """List all streams (for search).""" + raise NotImplementedError + + +class DynamoStreamStore(StreamStore): + 
def __init__(self): + import boto3 + from boto3.dynamodb.conditions import Key + + resource_kwargs = {"region_name": AWS_REGION} + if DYNAMO_ENDPOINT_URL: + resource_kwargs["endpoint_url"] = DYNAMO_ENDPOINT_URL + self._resource = boto3.resource("dynamodb", **resource_kwargs) + self.table = self._resource.Table(DYNAMO_STREAMS_TABLE) + self._key = Key + + def create_stream(self, record: Dict[str, Any]) -> Dict[str, Any]: + # Serialize metadata if dict + item = dict(record) + if isinstance(item.get("metadata"), dict): + item["metadata"] = json.dumps(item["metadata"]) + if isinstance(item.get("last_file"), dict): + item["last_file"] = json.dumps(item["last_file"]) + self.table.put_item( + Item=item, + ConditionExpression="attribute_not_exists(stream_id)", + ) + return record + + def _deserialize(self, item: Dict[str, Any]) -> Dict[str, Any]: + if item.get("metadata") and isinstance(item["metadata"], str): + try: + item["metadata"] = json.loads(item["metadata"]) + except Exception: + pass + if item.get("last_file") and isinstance(item["last_file"], str): + try: + item["last_file"] = json.loads(item["last_file"]) + except Exception: + pass + return item + + def get_stream(self, stream_id: str) -> Optional[Dict[str, Any]]: + resp = self.table.get_item(Key={"stream_id": stream_id}) + item = resp.get("Item") + return self._deserialize(item) if item else None + + def append_stream(self, stream_id: str, file_count: int, total_bytes: int, last_file: Optional[Dict[str, Any]] = None) -> Optional[Dict[str, Any]]: + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + update_expr = ( + "SET file_count = file_count + :fc, total_bytes = total_bytes + :tb, " + "last_append_at = :now, updated_at = :now" + ) + expr_values = { + ":fc": file_count, + ":tb": total_bytes, + ":now": now, + } + if last_file: + update_expr += ", last_file = :lf" + expr_values[":lf"] = json.dumps(last_file) if isinstance(last_file, dict) else last_file + + self.table.update_item( + 
Key={"stream_id": stream_id}, + UpdateExpression=update_expr, + ExpressionAttributeValues=expr_values, + ) + return self.get_stream(stream_id) + + def close_stream(self, stream_id: str) -> Optional[Dict[str, Any]]: + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + self.table.update_item( + Key={"stream_id": stream_id}, + UpdateExpression="SET #status = :status, updated_at = :now", + ExpressionAttributeNames={"#status": "status"}, + ExpressionAttributeValues={":status": "closed", ":now": now}, + ) + return self.get_stream(stream_id) + + def update_stream_metadata(self, stream_id: str, updates: Dict[str, Any]) -> Optional[Dict[str, Any]]: + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + stream = self.get_stream(stream_id) + if not stream: + return None + + existing_metadata = stream.get("metadata") or {} + if isinstance(existing_metadata, str): + try: + existing_metadata = json.loads(existing_metadata) + except Exception: + existing_metadata = {} + existing_metadata.update(updates) + + self.table.update_item( + Key={"stream_id": stream_id}, + UpdateExpression="SET metadata = :meta, updated_at = :now", + ExpressionAttributeValues={ + ":meta": json.dumps(existing_metadata), + ":now": now, + }, + ) + return self.get_stream(stream_id) + + def list_all(self, limit: int = 1000) -> List[Dict[str, Any]]: + items: List[Dict[str, Any]] = [] + last_key = None + while len(items) < limit: + page_limit = min(limit - len(items), 1000) + kwargs: Dict[str, Any] = {"Limit": page_limit} + if last_key: + kwargs["ExclusiveStartKey"] = last_key + resp = self.table.scan(**kwargs) + items.extend(resp.get("Items", [])) + last_key = resp.get("LastEvaluatedKey") + if not last_key: + break + return [self._deserialize(item) for item in items[:limit]] + + +class SqliteStreamStore(StreamStore): + def __init__(self, path: Optional[str] = None): + db_path = path or os.environ.get("SQLITE_PATH", "/tmp/mdf_connect_v2.db") + self.conn = 
sqlite3.connect(db_path, check_same_thread=False) + self.conn.row_factory = sqlite3.Row + self._init_schema() + + def _init_schema(self) -> None: + with self.conn: + self.conn.execute( + """ + CREATE TABLE IF NOT EXISTS streams ( + stream_id TEXT PRIMARY KEY, + lab_id TEXT, + title TEXT, + status TEXT, + file_count INTEGER, + total_bytes INTEGER, + last_append_at TEXT, + created_at TEXT, + updated_at TEXT, + user_id TEXT, + organization TEXT, + last_file TEXT, + metadata TEXT + ) + """ + ) + self.conn.execute( + "CREATE INDEX IF NOT EXISTS idx_streams_user ON streams(user_id, updated_at)" + ) + + def _row_to_dict(self, row: sqlite3.Row) -> Dict[str, Any]: + data = dict(row) + if data.get("metadata"): + try: + data["metadata"] = json.loads(data["metadata"]) + except Exception: + pass + if data.get("last_file"): + try: + data["last_file"] = json.loads(data["last_file"]) + except Exception: + pass + return data + + def create_stream(self, record: Dict[str, Any]) -> Dict[str, Any]: + with self.conn: + self.conn.execute( + """ + INSERT INTO streams ( + stream_id, lab_id, title, status, file_count, total_bytes, + last_append_at, created_at, updated_at, user_id, organization, + last_file, metadata + ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) 
+ """, + ( + record.get("stream_id"), + record.get("lab_id"), + record.get("title"), + record.get("status"), + record.get("file_count"), + record.get("total_bytes"), + record.get("last_append_at"), + record.get("created_at"), + record.get("updated_at"), + record.get("user_id"), + record.get("organization"), + json.dumps(record.get("last_file")) if record.get("last_file") else None, + json.dumps(record.get("metadata")) if record.get("metadata") else None, + ), + ) + return record + + def get_stream(self, stream_id: str) -> Optional[Dict[str, Any]]: + cur = self.conn.execute( + "SELECT * FROM streams WHERE stream_id = ?", + (stream_id,), + ) + row = cur.fetchone() + return self._row_to_dict(row) if row else None + + def append_stream(self, stream_id: str, file_count: int, total_bytes: int, last_file: Optional[Dict[str, Any]] = None): + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + with self.conn: + self.conn.execute( + """ + UPDATE streams + SET file_count = file_count + ?, total_bytes = total_bytes + ?, + last_append_at = ?, updated_at = ?, last_file = ? + WHERE stream_id = ? + """, + ( + file_count, + total_bytes, + now, + now, + json.dumps(last_file) if last_file else None, + stream_id, + ), + ) + return self.get_stream(stream_id) + + def close_stream(self, stream_id: str) -> Optional[Dict[str, Any]]: + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + with self.conn: + self.conn.execute( + "UPDATE streams SET status = ?, updated_at = ? 
WHERE stream_id = ?", + ("closed", now, stream_id), + ) + return self.get_stream(stream_id) + + def update_stream_metadata(self, stream_id: str, updates: Dict[str, Any]) -> Optional[Dict[str, Any]]: + """Update stream metadata fields (like DOI, published_at).""" + now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + stream = self.get_stream(stream_id) + if not stream: + return None + + # Merge updates into existing metadata + existing_metadata = stream.get("metadata") or {} + if isinstance(existing_metadata, str): + try: + existing_metadata = json.loads(existing_metadata) + except Exception: + existing_metadata = {} + + existing_metadata.update(updates) + + with self.conn: + self.conn.execute( + "UPDATE streams SET metadata = ?, updated_at = ? WHERE stream_id = ?", + (json.dumps(existing_metadata), now, stream_id), + ) + return self.get_stream(stream_id) + + def list_all(self, limit: int = 1000) -> List[Dict[str, Any]]: + cur = self.conn.execute( + "SELECT * FROM streams ORDER BY updated_at DESC LIMIT ?", + (limit,), + ) + return [self._row_to_dict(row) for row in cur.fetchall()] + + +def get_stream_store() -> StreamStore: + backend = os.environ.get("STORE_BACKEND", "dynamo").lower() + if backend == "sqlite": + return SqliteStreamStore() + return DynamoStreamStore() diff --git a/aws/v2/submission_utils.py b/aws/v2/submission_utils.py new file mode 100644 index 0000000..fda0b94 --- /dev/null +++ b/aws/v2/submission_utils.py @@ -0,0 +1,38 @@ +import uuid +from typing import Any, Dict, List, Optional + + +def latest_version(items: List[Dict[str, Any]]) -> Optional[str]: + if not items: + return None + versions = [item.get("version") for item in items if item.get("version")] + if not versions: + return None + + def sort_key(value: str): + parts = [] + for part in value.split("."): + if part.isdigit(): + parts.append((0, int(part)))  # tagged so int and str parts never compare directly + else: + parts.append((1, part))  # non-numeric parts sort after numeric ones + return parts + + versions_sorted = sorted(versions, key=sort_key) + return versions_sorted[-1] + 
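The ordering that `latest_version` depends on can be sketched in isolation. This is a minimal illustration under stated assumptions, not the module's code: `version_sort_key` is a hypothetical helper name, the sample versions are made up, and the tagged-tuple key is one assumed way to keep numeric and text parts comparable when they appear in the same position.

```python
from typing import List, Tuple, Union


# Hypothetical helper for illustration only; it is not defined in this diff.
def version_sort_key(value: str) -> List[Tuple[int, Union[int, str]]]:
    """Split a dotted version string into comparable (tag, part) pairs."""
    key: List[Tuple[int, Union[int, str]]] = []
    for part in value.split("."):
        if part.isdigit():
            key.append((0, int(part)))  # numeric parts compare as integers
        else:
            key.append((1, part))  # text parts sort after numeric parts
    return key


versions = ["1.9", "1.10", "1.2"]

# Plain string sorting compares character by character, so "1.9" sorts last.
lexicographic_latest = sorted(versions)[-1]

# Numeric-aware sorting treats 10 as greater than 9, so "1.10" sorts last.
numeric_latest = sorted(versions, key=version_sort_key)[-1]

print(lexicographic_latest, numeric_latest)
```

The tag in each pair decides the comparison whenever one part is numeric and the other is text, so Python never has to compare an `int` with a `str`, which would otherwise raise `TypeError`.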
+ +def increment_version(current: Optional[str], major: bool = False) -> str: + if not current: + return "1.0" + try: + maj, _min = current.split(".") + if major: + return "{}.0".format(int(maj) + 1) + return "{}.{}".format(maj, int(_min) + 1) + except Exception: + return "1.0" + + +def generate_source_id(prefix: str = "mdf") -> str: + return "{}-{}".format(prefix, uuid.uuid4().hex) diff --git a/aws/v2/test_v2_async_jobs.py b/aws/v2/test_v2_async_jobs.py new file mode 100644 index 0000000..2325bc6 --- /dev/null +++ b/aws/v2/test_v2_async_jobs.py @@ -0,0 +1,121 @@ +from __future__ import annotations + +import base64 +from pathlib import Path + +import pytest +from fastapi.testclient import TestClient + +import sys + +sys.path.insert(0, str(Path(__file__).resolve().parents[1])) + +from v2.app import app +from v2.app.middleware import reset_middleware_state +from v2.async_jobs import run_sqlite_worker_once +from v2.storage import reset_storage_backend + + +@pytest.fixture() +def async_sqlite_env(tmp_path: Path, monkeypatch: pytest.MonkeyPatch): + db_path = tmp_path / "store.db" + file_store = tmp_path / "files" + monkeypatch.setenv("STORE_BACKEND", "sqlite") + monkeypatch.setenv("SQLITE_PATH", str(db_path)) + monkeypatch.setenv("ASYNC_SQLITE_PATH", str(db_path)) + monkeypatch.setenv("ASYNC_DISPATCH_MODE", "sqlite") + monkeypatch.setenv("STORAGE_BACKEND", "local") + monkeypatch.setenv("FILE_STORE_PATH", str(file_store)) + monkeypatch.setenv("AUTH_MODE", "dev") + monkeypatch.setenv("ALLOW_ALL_CURATORS", "true") + monkeypatch.setenv("USE_MOCK_DATACITE", "true") + reset_storage_backend() + reset_middleware_state() + yield + reset_storage_backend() + reset_middleware_state() + + +def test_async_profile_job_with_sqlite_worker(async_sqlite_env): + client = TestClient(app) + headers = {"X-User-Id": "owner-user"} + + stream = client.post("/stream/create", headers=headers, json={"title": "Async Profile Stream"}) + assert stream.status_code == 200 + stream_id = 
stream.json()["stream_id"] + + content_b64 = base64.b64encode(b"a,b\n1,2\n3,4\n").decode("ascii") + upload = client.post( + f"/stream/{stream_id}/upload", + headers=headers, + json={"filename": "sample.csv", "content_base64": content_b64, "content_type": "text/csv"}, + ) + assert upload.status_code == 200 + + snap = client.post( + f"/stream/{stream_id}/snapshot", + headers=headers, + json={"title": "Snapshot"}, + ) + assert snap.status_code == 200 + body = snap.json() + source_id = body["source_id"] + assert body["profile_job"]["queued"] is True + assert body["profile_job"]["mode"] == "sqlite" + + before = client.get(f"/status/{source_id}") + assert before.status_code == 200 + assert before.json()["submission"]["status"] == "pending_curation" + assert before.json()["submission"].get("dataset_profile") in (None, "") + + worker_result = run_sqlite_worker_once(limit=10) + assert worker_result["processed"] >= 1 + assert worker_result["failed"] == 0 + + after = client.get(f"/status/{source_id}") + assert after.status_code == 200 + profile = after.json()["submission"].get("dataset_profile") + assert isinstance(profile, dict) + assert profile.get("total_files", 0) >= 1 + + +def test_async_submission_doi_job_with_sqlite_worker(async_sqlite_env): + client = TestClient(app) + headers = {"X-User-Id": "curator-user"} + + submit = client.post( + "/submit", + headers=headers, + json={ + "title": "Async DOI Dataset", + "authors": [{"name": "Curator"}], + "data_sources": ["https://example.com/data.csv"], + }, + ) + assert submit.status_code == 200 + source_id = submit.json()["source_id"] + + approve = client.post( + f"/curation/{source_id}/approve", + headers=headers, + json={"mint_doi": True}, + ) + assert approve.status_code == 200 + approve_body = approve.json() + assert approve_body["status"] == "approved" + assert approve_body["publish_job"]["queued"] is True + assert approve_body["publish_job"]["mode"] == "sqlite" + + before = client.get(f"/status/{source_id}") + assert 
before.status_code == 200 + assert before.json()["submission"]["status"] == "approved" + assert not before.json()["submission"].get("doi") + + worker_result = run_sqlite_worker_once(limit=10) + assert worker_result["processed"] >= 1 + assert worker_result["failed"] == 0 + + after = client.get(f"/status/{source_id}") + assert after.status_code == 200 + assert after.json()["submission"]["status"] == "published" + assert after.json()["submission"].get("doi") diff --git a/aws/v2/test_v2_hardening.py b/aws/v2/test_v2_hardening.py new file mode 100644 index 0000000..9c0a734 --- /dev/null +++ b/aws/v2/test_v2_hardening.py @@ -0,0 +1,287 @@ +from __future__ import annotations + +import os +import tempfile +from pathlib import Path + +import pytest +from fastapi.testclient import TestClient + +import sys + +sys.path.insert(0, str(Path(__file__).resolve().parents[1])) + +from v2.app import app +from v2.clone import StreamCloner +from v2.storage import reset_storage_backend +from v2.storage.globus_https import GlobusHTTPSStorage +from v2.storage.local import LocalStorage +from v2.store import SqliteSubmissionStore, SubmissionStore + + +class _DummyStore(SubmissionStore): + def get_submission(self, source_id, version): + return {"source_id": source_id, "version": version} + + def list_versions(self, source_id): + return [{"version": "1.9"}, {"version": "1.10"}] + + def put_submission(self, record): + raise NotImplementedError + + def upsert_submission(self, record): + raise NotImplementedError + + def update_status(self, source_id, version, status): + raise NotImplementedError + + def list_by_user(self, user_id, limit=50, start_key=None): + raise NotImplementedError + + def list_by_org(self, organization, limit=50, start_key=None): + raise NotImplementedError + + def list_by_status(self, statuses, limit=100): + raise NotImplementedError + + def update_profile(self, source_id, version, profile_json): + raise NotImplementedError + + def list_all(self, limit=1000): + raise 
NotImplementedError + + +def test_submission_store_get_uses_semver(): + store = _DummyStore() + latest = store.get("source") + assert latest["version"] == "1.10" + + +def test_sqlite_upsert_preserves_dataset_profile(): + fd, db_path = tempfile.mkstemp(prefix="mdf_v2_", suffix=".db") + os.close(fd) + try: + store = SqliteSubmissionStore(path=db_path) + record = { + "source_id": "source-1", + "version": "1.0", + "versioned_source_id": "source-1-1.0", + "user_id": "user-1", + "user_email": "user@example.com", + "organization": "org", + "status": "submitted", + "dataset_mdata": {"title": "t", "authors": [{"name": "a"}], "data_sources": ["x"]}, + "test": 0, + "created_at": "2026-01-01T00:00:00Z", + "updated_at": "2026-01-01T00:00:00Z", + } + store.put_submission(record) + store.update_profile("source-1", "1.0", '{"total_files": 1}') + current = store.get_submission("source-1", "1.0") + current["status"] = "approved" + store.upsert_submission(current) + updated = store.get_submission("source-1", "1.0") + assert updated["dataset_profile"]["total_files"] == 1 + finally: + os.remove(db_path) + + +def test_local_storage_rejects_path_traversal(tmp_path: Path): + storage = LocalStorage(str(tmp_path)) + with pytest.raises(ValueError): + storage.store_file("stream-1", "../../escape.txt", b"bad") + + secret = tmp_path.parent / "secret.txt" + secret.write_text("top-secret") + assert storage.get_file("../secret.txt") is None + assert storage.get_download_url("../secret.txt") is None + + +def test_globus_upload_url_does_not_expose_server_auth_header(): + storage = GlobusHTTPSStorage(access_token="server-secret") + try: + upload = storage.get_upload_url("stream-1", "file.csv", content_type="text/csv") + assert "Authorization" not in upload.get("headers", {}) + assert upload.get("auth_type") == "bearer" + finally: + storage.close() + + +def test_stream_append_requires_owner(tmp_path: Path, monkeypatch: pytest.MonkeyPatch): + db_path = tmp_path / "store.db" + file_store = tmp_path / 
"files" + monkeypatch.setenv("STORE_BACKEND", "sqlite") + monkeypatch.setenv("SQLITE_PATH", str(db_path)) + monkeypatch.setenv("STORAGE_BACKEND", "local") + monkeypatch.setenv("FILE_STORE_PATH", str(file_store)) + monkeypatch.setenv("AUTH_MODE", "dev") + monkeypatch.setenv("ALLOW_ALL_CURATORS", "false") + reset_storage_backend() + + client = TestClient(app) + create = client.post( + "/stream/create", + headers={"X-User-Id": "owner-user"}, + json={"title": "Owner stream"}, + ) + assert create.status_code == 200 + stream_id = create.json()["stream_id"] + + denied = client.post( + f"/stream/{stream_id}/append", + headers={"X-User-Id": "other-user"}, + json={"file_count": 1, "total_bytes": 10}, + ) + assert denied.status_code == 403 + + +def test_status_update_requires_curator(tmp_path: Path, monkeypatch: pytest.MonkeyPatch): + db_path = tmp_path / "store.db" + file_store = tmp_path / "files" + monkeypatch.setenv("STORE_BACKEND", "sqlite") + monkeypatch.setenv("SQLITE_PATH", str(db_path)) + monkeypatch.setenv("STORAGE_BACKEND", "local") + monkeypatch.setenv("FILE_STORE_PATH", str(file_store)) + monkeypatch.setenv("AUTH_MODE", "dev") + monkeypatch.setenv("ALLOW_ALL_CURATORS", "false") + reset_storage_backend() + + client = TestClient(app) + submit = client.post( + "/submit", + headers={"X-User-Id": "submitter"}, + json={"title": "Dataset", "authors": [{"name": "A"}], "data_sources": ["https://example.com/a.csv"]}, + ) + assert submit.status_code == 200 + source_id = submit.json()["source_id"] + + denied = client.post( + "/status/update", + headers={"X-User-Id": "submitter"}, + json={"source_id": source_id, "version": "1.0", "status": "processing"}, + ) + assert denied.status_code == 403 + + monkeypatch.setenv("ALLOW_ALL_CURATORS", "true") + allowed = client.post( + "/status/update", + headers={"X-User-Id": "submitter"}, + json={"source_id": source_id, "version": "1.0", "status": "approved"}, + ) + assert allowed.status_code == 200 + + +def 
test_clone_rejects_untrusted_host_for_token_use(monkeypatch: pytest.MonkeyPatch): + monkeypatch.setenv("GLOBUS_HTTPS_SERVER", "data.materialsdatafacility.org") + cloner = StreamCloner(dest_dir=".", token="dummy-token", verbose=False) + try: + with pytest.raises(ValueError): + cloner._validate_globus_url("https://evil.example.org/path/file.csv") + cloner._validate_globus_url("https://data.materialsdatafacility.org/path/file.csv") + finally: + cloner.close() + + +def test_upload_confirm_requires_existing_file(tmp_path: Path, monkeypatch: pytest.MonkeyPatch): + db_path = tmp_path / "store.db" + file_store = tmp_path / "files" + monkeypatch.setenv("STORE_BACKEND", "sqlite") + monkeypatch.setenv("SQLITE_PATH", str(db_path)) + monkeypatch.setenv("STORAGE_BACKEND", "local") + monkeypatch.setenv("FILE_STORE_PATH", str(file_store)) + monkeypatch.setenv("AUTH_MODE", "dev") + monkeypatch.setenv("ALLOW_ALL_CURATORS", "false") + reset_storage_backend() + + client = TestClient(app) + create = client.post( + "/stream/create", + headers={"X-User-Id": "owner-user"}, + json={"title": "Owner stream"}, + ) + assert create.status_code == 200 + stream_id = create.json()["stream_id"] + + missing = client.post( + f"/stream/{stream_id}/upload-confirm", + headers={"X-User-Id": "owner-user"}, + json={"path": f"streams/{stream_id}/2026-02-06/not-there.csv", "size_bytes": 10}, + ) + assert missing.status_code == 400 + + +def test_curation_without_version_uses_latest(tmp_path: Path, monkeypatch: pytest.MonkeyPatch): + db_path = tmp_path / "store.db" + file_store = tmp_path / "files" + monkeypatch.setenv("STORE_BACKEND", "sqlite") + monkeypatch.setenv("SQLITE_PATH", str(db_path)) + monkeypatch.setenv("STORAGE_BACKEND", "local") + monkeypatch.setenv("FILE_STORE_PATH", str(file_store)) + monkeypatch.setenv("AUTH_MODE", "dev") + monkeypatch.setenv("ALLOW_ALL_CURATORS", "true") + reset_storage_backend() + + store = SqliteSubmissionStore(path=str(db_path)) + common = { + "source_id": "src-1", + 
"user_id": "submitter", + "user_email": "submitter@example.com", + "organization": "org", + "status": "pending_curation", + "dataset_mdata": {"title": "Dataset", "authors": [{"name": "Author"}], "data_sources": ["https://example.com/a.csv"]}, + "test": 0, + "created_at": "2026-01-01T00:00:00Z", + "updated_at": "2026-01-01T00:00:00Z", + } + v1 = dict(common) + v1.update({"version": "1.0", "versioned_source_id": "src-1-1.0"}) + v2 = dict(common) + v2.update({"version": "1.10", "versioned_source_id": "src-1-1.10"}) + store.put_submission(v1) + store.put_submission(v2) + + client = TestClient(app) + resp = client.get("/curation/src-1", headers={"X-User-Id": "curator"}) + assert resp.status_code == 200 + assert resp.json()["submission"]["version"] == "1.10" + + +def test_submit_requires_submitter_group(tmp_path: Path, monkeypatch: pytest.MonkeyPatch): + """In production auth mode the submitter group is enforced; in dev mode it's bypassed.""" + db_path = tmp_path / "store.db" + file_store = tmp_path / "files" + monkeypatch.setenv("STORE_BACKEND", "sqlite") + monkeypatch.setenv("SQLITE_PATH", str(db_path)) + monkeypatch.setenv("STORAGE_BACKEND", "local") + monkeypatch.setenv("FILE_STORE_PATH", str(file_store)) + monkeypatch.setenv("AUTH_MODE", "dev") + monkeypatch.setenv("ALLOW_ALL_CURATORS", "false") + monkeypatch.setenv("REQUIRED_GROUP_MEMBERSHIP", "cc192dca-3751-11e8-90c1-0a7c735d220a") + reset_storage_backend() + + client = TestClient(app) + payload = {"title": "Dataset", "authors": [{"name": "A"}], "data_sources": ["https://example.com/a.csv"]} + + # In dev mode, group check is bypassed — submit should succeed + resp = client.post("/submit", headers={"X-User-Id": "anybody"}, json=payload) + assert resp.status_code == 200 + + # Switch to production auth mode — without group membership, should be denied. + # We can't do full Globus auth in tests, so we test the is_submitter function directly. 
+ from v2.app.auth import is_submitter + from v2.app.models import AuthContext + + monkeypatch.setenv("AUTH_MODE", "production") + + no_groups = AuthContext(user_id="outsider", group_info={}) + assert is_submitter(no_groups) is False + + has_group = AuthContext( + user_id="member", + group_info={"cc192dca-3751-11e8-90c1-0a7c735d220a": {"name": "MDF"}}, + ) + assert is_submitter(has_group) is True + + # Empty REQUIRED_GROUP_MEMBERSHIP means everyone is allowed + monkeypatch.setenv("REQUIRED_GROUP_MEMBERSHIP", "") + assert is_submitter(no_groups) is True diff --git a/aws/v2/test_v2_integration.py b/aws/v2/test_v2_integration.py new file mode 100644 index 0000000..eaad332 --- /dev/null +++ b/aws/v2/test_v2_integration.py @@ -0,0 +1,316 @@ +from __future__ import annotations + +import base64 +import json +from datetime import datetime, timedelta, timezone +from pathlib import Path + +import pytest +from fastapi.testclient import TestClient + +import sys + +sys.path.insert(0, str(Path(__file__).resolve().parents[1])) + +from v2.app import app +from v2.app.middleware import reset_middleware_state +from v2.storage import reset_storage_backend +from v2.store import SqliteSubmissionStore + + +@pytest.fixture() +def local_env(tmp_path: Path, monkeypatch: pytest.MonkeyPatch): + db_path = tmp_path / "store.db" + file_store = tmp_path / "files" + monkeypatch.setenv("STORE_BACKEND", "sqlite") + monkeypatch.setenv("SQLITE_PATH", str(db_path)) + monkeypatch.setenv("STORAGE_BACKEND", "local") + monkeypatch.setenv("FILE_STORE_PATH", str(file_store)) + monkeypatch.setenv("AUTH_MODE", "dev") + monkeypatch.setenv("ALLOW_ALL_CURATORS", "true") + monkeypatch.setenv("USE_MOCK_DATACITE", "true") + monkeypatch.setenv("MAX_REQUEST_BYTES", str(1024 * 1024)) + monkeypatch.setenv("MAX_SUBMIT_METADATA_BYTES", str(256 * 1024)) + monkeypatch.setenv("RATE_LIMIT_DEFAULT_PER_MIN", "200") + monkeypatch.setenv("RATE_LIMIT_SUBMIT_PER_MIN", "100") + monkeypatch.setenv("RATE_LIMIT_WINDOW_SECONDS", "60") + 
reset_storage_backend() + reset_middleware_state() + yield db_path + reset_storage_backend() + reset_middleware_state() + + +def test_happy_path_submit_to_curation_to_card(local_env: Path): + client = TestClient(app) + headers = {"X-User-Id": "owner-user"} + + submit = client.post( + "/submit", + headers=headers, + json={ + "title": "Integration Dataset", + "authors": [{"name": "Jane Scientist"}], + "description": "integration test dataset", + "data_sources": ["https://example.com/data.csv"], + }, + ) + assert submit.status_code == 200 + source_id = submit.json()["source_id"] + + pending = client.post( + "/status/update", + headers=headers, + json={"source_id": source_id, "version": "1.0", "status": "pending_curation"}, + ) + assert pending.status_code == 200 + + queue = client.get("/curation/pending", headers=headers) + assert queue.status_code == 200 + queued = {(x["source_id"], x["version"]) for x in queue.json().get("submissions", [])} + assert (source_id, "1.0") in queued + + approve = client.post( + f"/curation/{source_id}/approve", + headers=headers, + json={"notes": "looks good", "mint_doi": True}, + ) + assert approve.status_code == 200 + approve_body = approve.json() + assert approve_body["status"] == "published" + assert approve_body["doi"]["success"] is True + + card = client.get(f"/card/{source_id}") + assert card.status_code == 200 + card_body = card.json()["card"] + assert card_body["source_id"] == source_id + assert card_body["status"] == "published" + assert card_body.get("doi") + + +def test_throttling_and_request_size_limits(local_env: Path, monkeypatch: pytest.MonkeyPatch): + client = TestClient(app) + headers = {"X-User-Id": "limited-user"} + + monkeypatch.setenv("RATE_LIMIT_SUBMIT_PER_MIN", "2") + monkeypatch.setenv("RATE_LIMIT_WINDOW_SECONDS", "60") + reset_middleware_state() + payload = { + "title": "RL dataset", + "authors": [{"name": "A"}], + "data_sources": ["https://example.com/a.csv"], + } + assert client.post("/submit", 
headers=headers, json=payload).status_code == 200 + assert client.post("/submit", headers=headers, json=payload).status_code == 200 + limited = client.post("/submit", headers=headers, json=payload) + assert limited.status_code == 429 + assert limited.json()["error"] == "Rate limit exceeded" + + monkeypatch.setenv("MAX_REQUEST_BYTES", "300") + reset_middleware_state() + oversized_payload = { + "title": "X" * 500, + "authors": [{"name": "A"}], + "data_sources": ["https://example.com/a.csv"], + } + too_large = client.post("/submit", headers=headers, json=oversized_payload) + assert too_large.status_code == 413 + + +def test_submissions_pagination(local_env: Path): + client = TestClient(app) + headers = {"X-User-Id": "pager-user"} + store = SqliteSubmissionStore(path=str(local_env)) + base_time = datetime(2026, 1, 1, tzinfo=timezone.utc) + + for i in range(55): + ts = (base_time + timedelta(minutes=i)).isoformat().replace("+00:00", "Z") + source_id = f"src-{i:03d}" + store.put_submission( + { + "source_id": source_id, + "version": "1.0", + "versioned_source_id": f"{source_id}-1.0", + "user_id": "pager-user", + "user_email": "pager@example.com", + "organization": "org", + "status": "submitted", + "dataset_mdata": json.dumps( + {"title": f"Dataset {i}", "authors": [{"name": "A"}], "data_sources": ["https://example.com/a.csv"]} + ), + "test": 0, + "created_at": ts, + "updated_at": ts, + } + ) + + p1 = client.get("/submissions", headers=headers, params={"limit": 20}) + assert p1.status_code == 200 + b1 = p1.json() + assert len(b1["submissions"]) == 20 + assert b1["next_key"] + + p2 = client.get("/submissions", headers=headers, params={"limit": 20, "start_key": b1["next_key"]}) + assert p2.status_code == 200 + b2 = p2.json() + assert len(b2["submissions"]) == 20 + assert b2["next_key"] + + p3 = client.get("/submissions", headers=headers, params={"limit": 20, "start_key": b2["next_key"]}) + assert p3.status_code == 200 + b3 = p3.json() + assert len(b3["submissions"]) == 15 + 
assert b3["next_key"] is None + + +def test_path_validation_edge_cases(local_env: Path): + client = TestClient(app) + headers = {"X-User-Id": "owner-user"} + + created = client.post("/stream/create", headers=headers, json={"title": "Edge Stream"}) + assert created.status_code == 200 + stream_id = created.json()["stream_id"] + + content_b64 = base64.b64encode(b"col1,col2\n1,2\n").decode("ascii") + uploaded = client.post( + f"/stream/{stream_id}/upload", + headers=headers, + json={"filename": "edge.csv", "content_base64": content_b64, "content_type": "text/csv"}, + ) + assert uploaded.status_code == 200 + path = uploaded.json()["files"][0]["path"] + + valid_download = client.post( + f"/stream/{stream_id}/download-url", + headers=headers, + json={"path": path}, + ) + assert valid_download.status_code == 200 + + invalid_download = client.post( + f"/stream/{stream_id}/download-url", + headers=headers, + json={"path": "streams/other-stream/2026-02-06/edge.csv"}, + ) + assert invalid_download.status_code == 400 + + +def test_curation_reject_transition_rules(local_env: Path): + """Submissions land as pending_curation; reject works once, then fails on double-reject.""" + client = TestClient(app) + headers = {"X-User-Id": "curator-user"} + + submit = client.post( + "/submit", + headers=headers, + json={ + "title": "Reject Transition Dataset", + "authors": [{"name": "Reviewer"}], + "data_sources": ["https://example.com/data.csv"], + }, + ) + assert submit.status_code == 200 + source_id = submit.json()["source_id"] + + # Submission is already pending_curation — reject should succeed immediately + reject = client.post( + f"/curation/{source_id}/reject", + headers=headers, + json={"reason": "missing metadata"}, + ) + assert reject.status_code == 200 + assert reject.json()["status"] == "rejected" + + # Rejecting again should fail — it's no longer pending_curation + reject_again = client.post( + f"/curation/{source_id}/reject", + headers=headers, + json={"reason": "double 
reject"}, + ) + assert reject_again.status_code == 400 + + +def test_search_limit_is_capped(local_env: Path): + client = TestClient(app) + store = SqliteSubmissionStore(path=str(local_env)) + base_time = datetime(2026, 1, 1, tzinfo=timezone.utc) + + for i in range(60): + ts = (base_time + timedelta(minutes=i)).isoformat().replace("+00:00", "Z") + source_id = f"search-cap-{i:03d}" + store.put_submission( + { + "source_id": source_id, + "version": "1.0", + "versioned_source_id": f"{source_id}-1.0", + "user_id": "search-user", + "user_email": "search@example.com", + "organization": "org", + "status": "published", + "dataset_mdata": json.dumps( + { + "title": f"Search Cap Dataset {i}", + "authors": [{"name": "Cap Tester"}], + "description": "Used to test search limit clamping", + "data_sources": ["https://example.com/a.csv"], + } + ), + "test": 0, + "created_at": ts, + "updated_at": ts, + } + ) + + resp = client.get("/search", params={"q": "Search Cap Dataset", "type": "datasets", "limit": 500}) + assert resp.status_code == 200 + body = resp.json() + assert len(body["results"]) == 50 + + +def test_search_fallback_excludes_unpublished_datasets( + local_env: Path, + monkeypatch: pytest.MonkeyPatch, +): + client = TestClient(app) + store = SqliteSubmissionStore(path=str(local_env)) + + def _search_unavailable(): + raise RuntimeError("search temporarily unavailable") + + monkeypatch.setattr("v2.search_client.get_search_client", _search_unavailable) + + for source_id, status in ( + ("search-published", "published"), + ("search-pending", "pending_curation"), + ): + store.put_submission( + { + "source_id": source_id, + "version": "1.0", + "versioned_source_id": f"{source_id}-1.0", + "user_id": "search-user", + "user_email": "search@example.com", + "organization": "org", + "status": status, + "dataset_mdata": json.dumps( + { + "title": "Fallback Visibility Dataset", + "authors": [{"name": "Visibility Tester"}], + "description": "Used to verify fallback search filtering", + 
"data_sources": ["https://example.com/a.csv"], + } + ), + "test": 0, + "created_at": "2026-01-01T00:00:00Z", + "updated_at": "2026-01-01T00:00:00Z", + } + ) + + resp = client.get( + "/search", + params={"q": "Fallback Visibility Dataset", "type": "datasets"}, + ) + assert resp.status_code == 200 + body = resp.json() + assert body["total"] == 1 + assert [item["source_id"] for item in body["results"]] == ["search-published"] diff --git a/aws/v2/test_v2_publish_pipeline.py b/aws/v2/test_v2_publish_pipeline.py new file mode 100644 index 0000000..bd32104 --- /dev/null +++ b/aws/v2/test_v2_publish_pipeline.py @@ -0,0 +1,575 @@ +"""Tests for the v2 publication pipeline. + +Covers: +- Data source format validation +- Status transitions (pending_curation on submit) +- Publish pipeline (approve → DOI + search ingest + published) +- Search client (mock) +""" + +from __future__ import annotations + +import sys +from pathlib import Path + +import pytest +from fastapi.testclient import TestClient + +sys.path.insert(0, str(Path(__file__).resolve().parents[1])) + +from v2.app import app +from v2.app.middleware import reset_middleware_state +from v2.async_jobs import run_sqlite_worker_once +from v2.storage import reset_storage_backend + + +@pytest.fixture() +def env(tmp_path: Path, monkeypatch: pytest.MonkeyPatch): + """Standard test environment with SQLite store and inline dispatch.""" + db_path = tmp_path / "store.db" + file_store = tmp_path / "files" + monkeypatch.setenv("STORE_BACKEND", "sqlite") + monkeypatch.setenv("SQLITE_PATH", str(db_path)) + monkeypatch.setenv("ASYNC_DISPATCH_MODE", "inline") + monkeypatch.setenv("STORAGE_BACKEND", "local") + monkeypatch.setenv("FILE_STORE_PATH", str(file_store)) + monkeypatch.setenv("AUTH_MODE", "dev") + monkeypatch.setenv("ALLOW_ALL_CURATORS", "true") + monkeypatch.setenv("USE_MOCK_DATACITE", "true") + monkeypatch.setenv("USE_MOCK_SEARCH", "true") + reset_storage_backend() + reset_middleware_state() + yield + reset_storage_backend() + 
reset_middleware_state() + + +@pytest.fixture() +def sqlite_env(tmp_path: Path, monkeypatch: pytest.MonkeyPatch): + """Test environment with SQLite async dispatch (queued jobs).""" + db_path = tmp_path / "store.db" + file_store = tmp_path / "files" + monkeypatch.setenv("STORE_BACKEND", "sqlite") + monkeypatch.setenv("SQLITE_PATH", str(db_path)) + monkeypatch.setenv("ASYNC_SQLITE_PATH", str(db_path)) + monkeypatch.setenv("ASYNC_DISPATCH_MODE", "sqlite") + monkeypatch.setenv("STORAGE_BACKEND", "local") + monkeypatch.setenv("FILE_STORE_PATH", str(file_store)) + monkeypatch.setenv("AUTH_MODE", "dev") + monkeypatch.setenv("ALLOW_ALL_CURATORS", "true") + monkeypatch.setenv("USE_MOCK_DATACITE", "true") + monkeypatch.setenv("USE_MOCK_SEARCH", "true") + reset_storage_backend() + reset_middleware_state() + yield + reset_storage_backend() + reset_middleware_state() + + +HEADERS = {"X-User-Id": "test-user"} + +VALID_SUBMISSION = { + "title": "Test Dataset", + "authors": [{"name": "Test User"}], + "data_sources": ["https://example.com/data.csv"], +} + + +# ========================================================================= +# Status transitions +# ========================================================================= + + +class TestStatusTransitions: + """Submissions land as pending_curation and follow the v2 lifecycle.""" + + def test_submit_lands_as_pending_curation(self, env): + client = TestClient(app) + resp = client.post("/submit", headers=HEADERS, json=VALID_SUBMISSION) + assert resp.status_code == 200 + source_id = resp.json()["source_id"] + + status = client.get(f"/status/{source_id}") + assert status.json()["submission"]["status"] == "pending_curation" + + def test_allowed_status_updates(self, env): + client = TestClient(app) + resp = client.post("/submit", headers=HEADERS, json=VALID_SUBMISSION) + source_id = resp.json()["source_id"] + + for target in ["approved", "published", "rejected"]: + r = client.post( + "/status/update", + headers=HEADERS, + 
json={"source_id": source_id, "version": "1.0", "status": target}, + ) + assert r.status_code == 200, f"Failed to set status to {target}" + + def test_disallowed_status_update(self, env): + client = TestClient(app) + resp = client.post("/submit", headers=HEADERS, json=VALID_SUBMISSION) + source_id = resp.json()["source_id"] + + r = client.post( + "/status/update", + headers=HEADERS, + json={"source_id": source_id, "version": "1.0", "status": "processing"}, + ) + assert r.status_code == 400 + + +# ========================================================================= +# Data source validation +# ========================================================================= + + +class TestDataSourceValidation: + """Server-side format validation of data_sources.""" + + def test_valid_globus_uri(self, env): + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "data_sources": ["globus://82f1b5c6-6e9b-11e5-ba47-22000b92c6ec/tmp/data.csv"], + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 200 + + def test_valid_https_url(self, env): + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "data_sources": ["https://example.com/data.csv"], + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 200 + + def test_valid_stream_uri(self, env): + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "data_sources": ["stream://my-stream-id"], + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 200 + + def test_globus_uri_invalid_uuid(self, env): + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "data_sources": ["globus://not-a-uuid/path/data"], + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 400 + assert "invalid collection UUID" in resp.json()["detail"] + + def test_globus_uri_missing_path(self, env): + client = TestClient(app) + 
submission = { + **VALID_SUBMISSION, + "data_sources": ["globus://82f1b5c6-6e9b-11e5-ba47-22000b92c6ec"], + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 400 + assert "missing path" in resp.json()["detail"] + + def test_stream_uri_empty_id(self, env): + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "data_sources": ["stream://"], + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 400 + assert "empty ID" in resp.json()["detail"] + + def test_mixed_valid_sources(self, env): + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "data_sources": [ + "globus://82f1b5c6-6e9b-11e5-ba47-22000b92c6ec/path/data.csv", + "https://zenodo.org/record/12345/files/data.zip", + "stream://my-stream", + ], + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 200 + + +# ========================================================================= +# Publish pipeline (inline dispatch) +# ========================================================================= + + +class TestPublishPipelineInline: + """Full publish pipeline with inline (synchronous) dispatch.""" + + def test_approve_triggers_publish_and_doi(self, env): + """Approve with mint_doi=true → published status + mock DOI.""" + client = TestClient(app) + + # Submit + resp = client.post("/submit", headers=HEADERS, json=VALID_SUBMISSION) + assert resp.status_code == 200 + source_id = resp.json()["source_id"] + + # Approve (inline dispatch → runs synchronously) + approve = client.post( + f"/curation/{source_id}/approve", + headers=HEADERS, + json={"mint_doi": True}, + ) + assert approve.status_code == 200 + body = approve.json() + assert body["success"] is True + # Inline dispatch: publish job ran immediately + assert body.get("publish_job", {}).get("mode") == "inline" + assert body["status"] == "published" + + # Verify final state + status = 
client.get(f"/status/{source_id}") + sub = status.json()["submission"] + assert sub["status"] == "published" + assert sub.get("doi") is not None + assert sub.get("published_at") is not None + + def test_approve_without_doi(self, env): + """Approve with mint_doi=false → published but no DOI.""" + client = TestClient(app) + + resp = client.post("/submit", headers=HEADERS, json=VALID_SUBMISSION) + source_id = resp.json()["source_id"] + + approve = client.post( + f"/curation/{source_id}/approve", + headers=HEADERS, + json={"mint_doi": False}, + ) + assert approve.status_code == 200 + assert approve.json()["status"] == "published" + + status = client.get(f"/status/{source_id}") + sub = status.json()["submission"] + assert sub["status"] == "published" + assert sub.get("doi") is None # No DOI minted + + def test_reject_does_not_publish(self, env): + """Rejection does not trigger publish pipeline.""" + client = TestClient(app) + + resp = client.post("/submit", headers=HEADERS, json=VALID_SUBMISSION) + source_id = resp.json()["source_id"] + + reject = client.post( + f"/curation/{source_id}/reject", + headers=HEADERS, + json={"reason": "Incomplete metadata"}, + ) + assert reject.status_code == 200 + assert reject.json()["status"] == "rejected" + + status = client.get(f"/status/{source_id}") + assert status.json()["submission"]["status"] == "rejected" + + def test_cannot_approve_non_pending(self, env): + """Cannot approve a submission that is not pending_curation.""" + client = TestClient(app) + + resp = client.post("/submit", headers=HEADERS, json=VALID_SUBMISSION) + source_id = resp.json()["source_id"] + + # Approve it first + client.post( + f"/curation/{source_id}/approve", + headers=HEADERS, + json={"mint_doi": False}, + ) + + # Try to approve again + second = client.post( + f"/curation/{source_id}/approve", + headers=HEADERS, + json={"mint_doi": True}, + ) + assert second.status_code == 400 + + +# ========================================================================= 
+# Publish pipeline (SQLite async dispatch) +# ========================================================================= + + +class TestPublishPipelineAsync: + """Publish pipeline with SQLite async dispatch (queued then processed).""" + + def test_approve_queues_publish_job(self, sqlite_env): + client = TestClient(app) + + resp = client.post("/submit", headers=HEADERS, json=VALID_SUBMISSION) + assert resp.status_code == 200 + source_id = resp.json()["source_id"] + + approve = client.post( + f"/curation/{source_id}/approve", + headers=HEADERS, + json={"mint_doi": True}, + ) + assert approve.status_code == 200 + body = approve.json() + assert body["publish_job"]["queued"] is True + assert body["publish_job"]["mode"] == "sqlite" + assert body["status"] == "approved" # Not yet published + + # Status should be approved (job not yet processed) + status = client.get(f"/status/{source_id}") + assert status.json()["submission"]["status"] == "approved" + + # Process the async job + result = run_sqlite_worker_once(limit=10) + assert result["processed"] >= 1 + assert result["failed"] == 0 + + # Now it should be published + status = client.get(f"/status/{source_id}") + sub = status.json()["submission"] + assert sub["status"] == "published" + assert sub.get("doi") is not None + + +# ========================================================================= +# Mock search client +# ========================================================================= + + +class TestMockSearchClient: + """Tests for the MockGlobusSearchClient.""" + + def test_ingest_and_search(self): + from v2.search_client import MockGlobusSearchClient + + client = MockGlobusSearchClient() + + submission = { + "source_id": "test-dataset-1", + "version": "1.0", + "organization": "MDF", + "dataset_mdata": '{"title":"Test Dataset","authors":[{"name":"Test"}],"data_sources":["https://example.com"]}', + } + + # Ingest + result = client.ingest(submission) + assert result["success"] is True + assert result.get("mock") 
is True + + # Search — should find it + search = client.search("Test Dataset") + assert search["total"] == 1 + assert search["results"][0]["source_id"] == "test-dataset-1" + + # Search — no match + search = client.search("nonexistent") + assert search["total"] == 0 + + def test_delete_entry(self): + from v2.search_client import MockGlobusSearchClient + + client = MockGlobusSearchClient() + + submission = { + "source_id": "to-delete", + "version": "1.0", + "dataset_mdata": '{"title":"Delete Me","authors":[{"name":"X"}],"data_sources":[]}', + } + client.ingest(submission) + assert client.search("Delete Me")["total"] == 1 + + client.delete_entry("to-delete") + assert client.search("Delete Me")["total"] == 0 + + def test_search_pagination(self): + from v2.search_client import MockGlobusSearchClient + + client = MockGlobusSearchClient() + + for i in range(5): + client.ingest({ + "source_id": f"ds-{i}", + "version": "1.0", + "dataset_mdata": f'{{"title":"Dataset {i}","authors":[{{"name":"A"}}],"data_sources":[]}}', + }) + + result = client.search("Dataset", limit=2) + assert result["total"] == 5 + assert len(result["results"]) == 2 + + result2 = client.search("Dataset", limit=2, offset=2) + assert len(result2["results"]) == 2 + + def test_search_datasets_uses_mock(self, env): + """Unpublished submissions should not leak through fallback search.""" + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "title": "Search Fallback Unpublished Isolation Title", + } + + # Submit a dataset so fallback has a matching unpublished record available. 
+ client.post("/submit", headers=HEADERS, json=submission) + + # Search via API + search = client.get( + "/search", + params={"q": "Search Fallback Unpublished Isolation Title", "type": "datasets"}, + ) + assert search.status_code == 200 + body = search.json() + assert body["total"] == 0 + assert body["results"] == [] + + +# ========================================================================= +# Domains and external import fields +# ========================================================================= + + +class TestDomainsAndExternalImport: + """Round-trip tests for domains and external import provenance fields.""" + + def test_domains_round_trip(self, env): + """Submit with domains, verify they appear in /status.""" + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "domains": ["materials", "chemistry"], + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 200 + source_id = resp.json()["source_id"] + + status = client.get(f"/status/{source_id}") + mdata = status.json()["submission"]["dataset_mdata"] + assert mdata["domains"] == ["materials", "chemistry"] + + def test_external_import_round_trip(self, env): + """Submit with external import fields, verify they appear in /status.""" + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "external_doi": "10.1234/ext-dataset", + "external_url": "https://zenodo.org/record/12345", + "external_source": "Zenodo", + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 200 + source_id = resp.json()["source_id"] + + status = client.get(f"/status/{source_id}") + mdata = status.json()["submission"]["dataset_mdata"] + assert mdata["external_doi"] == "10.1234/ext-dataset" + assert mdata["external_url"] == "https://zenodo.org/record/12345" + assert mdata["external_source"] == "Zenodo" + + def test_combined_domains_and_external_import(self, env): + """Submit with both domains and external fields, verify 
all appear.""" + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "domains": ["physics"], + "external_doi": "10.5678/phys", + "external_url": "https://arxiv.org/abs/2301.00001", + "external_source": "arXiv", + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 200 + source_id = resp.json()["source_id"] + + status = client.get(f"/status/{source_id}") + mdata = status.json()["submission"]["dataset_mdata"] + assert mdata["domains"] == ["physics"] + assert mdata["external_doi"] == "10.5678/phys" + assert mdata["external_url"] == "https://arxiv.org/abs/2301.00001" + assert mdata["external_source"] == "arXiv" + + def test_domains_empty_by_default(self, env): + """Submit without domains, verify dataset_mdata has domains: [].""" + client = TestClient(app) + resp = client.post("/submit", headers=HEADERS, json=VALID_SUBMISSION) + assert resp.status_code == 200 + source_id = resp.json()["source_id"] + + status = client.get(f"/status/{source_id}") + mdata = status.json()["submission"]["dataset_mdata"] + assert mdata["domains"] == [] + + def test_external_fields_absent_by_default(self, env): + """Submit without external fields, verify they are None in dataset_mdata.""" + client = TestClient(app) + resp = client.post("/submit", headers=HEADERS, json=VALID_SUBMISSION) + assert resp.status_code == 200 + source_id = resp.json()["source_id"] + + status = client.get(f"/status/{source_id}") + mdata = status.json()["submission"]["dataset_mdata"] + assert mdata.get("external_doi") is None + assert mdata.get("external_url") is None + assert mdata.get("external_source") is None + + def test_domains_in_search_index(self, env): + """Approve with domains, verify GMetaEntry mdf block contains them.""" + import v2.search_client as sc + + # Reset mock singleton so we get a fresh client + sc._mock_client = None + + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "domains": ["materials", "chemistry"], + } + resp = 
client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 200 + source_id = resp.json()["source_id"] + + # Approve (inline dispatch triggers search ingest) + approve = client.post( + f"/curation/{source_id}/approve", + headers=HEADERS, + json={"mint_doi": False}, + ) + assert approve.status_code == 200 + + # Inspect the mock search client's stored entries + mock_client = sc.get_search_client() + assert len(mock_client._entries) >= 1 + entry = list(mock_client._entries.values())[0] + mdf_block = entry["content"]["mdf"] + assert mdf_block["domains"] == ["materials", "chemistry"] + + def test_external_import_still_mints_own_doi(self, env): + """Submit with external_doi, approve with mint_doi=True, verify MDF mints its own DOI.""" + client = TestClient(app) + submission = { + **VALID_SUBMISSION, + "external_doi": "10.9999/someone-elses-doi", + } + resp = client.post("/submit", headers=HEADERS, json=submission) + assert resp.status_code == 200 + source_id = resp.json()["source_id"] + + approve = client.post( + f"/curation/{source_id}/approve", + headers=HEADERS, + json={"mint_doi": True}, + ) + assert approve.status_code == 200 + assert approve.json()["status"] == "published" + + status = client.get(f"/status/{source_id}") + sub = status.json()["submission"] + # MDF minted its own DOI, distinct from the external one + assert sub.get("doi") is not None + assert sub["doi"] != "10.9999/someone-elses-doi" + # External DOI is preserved in metadata + mdata = sub["dataset_mdata"] + assert mdata["external_doi"] == "10.9999/someone-elses-doi" diff --git a/aws/v2/test_v2_versioning.py b/aws/v2/test_v2_versioning.py new file mode 100644 index 0000000..66a0407 --- /dev/null +++ b/aws/v2/test_v2_versioning.py @@ -0,0 +1,604 @@ +"""Tests for dataset versioning: DOI inheritance, version-specific DOIs, search metadata. 
+ +Covers: +- v1.0 publish → dataset DOI minted, stored as both doi and dataset_doi +- v1.1 publish (update, mint_doi=False) → inherits dataset_doi, updates DataCite metadata +- v1.2 publish (update, mint_doi=True) → version-specific DOI with -v suffix + IsVersionOf +- Search index includes dataset_doi and version_count +- dataset_doi propagated on submit with update=True +""" + +from __future__ import annotations + +import json +import sys +from pathlib import Path + +import pytest +from fastapi.testclient import TestClient + +sys.path.insert(0, str(Path(__file__).resolve().parents[1])) + +from v2.app import app +from v2.app.middleware import reset_middleware_state +from v2.storage import reset_storage_backend + + +@pytest.fixture() +def env(tmp_path: Path, monkeypatch: pytest.MonkeyPatch): + db_path = tmp_path / "store.db" + file_store = tmp_path / "files" + monkeypatch.setenv("STORE_BACKEND", "sqlite") + monkeypatch.setenv("SQLITE_PATH", str(db_path)) + monkeypatch.setenv("ASYNC_DISPATCH_MODE", "inline") + monkeypatch.setenv("STORAGE_BACKEND", "local") + monkeypatch.setenv("FILE_STORE_PATH", str(file_store)) + monkeypatch.setenv("AUTH_MODE", "dev") + monkeypatch.setenv("ALLOW_ALL_CURATORS", "true") + monkeypatch.setenv("USE_MOCK_DATACITE", "true") + monkeypatch.setenv("USE_MOCK_SEARCH", "true") + reset_storage_backend() + reset_middleware_state() + yield + reset_storage_backend() + reset_middleware_state() + + +HEADERS = {"X-User-Id": "test-user"} + +BASE_SUBMISSION = { + "title": "Test Versioning Dataset", + "authors": [{"name": "Test Author"}], + "data_sources": ["https://example.com/data.csv"], +} + + +def _submit(client, extra=None): + payload = {**BASE_SUBMISSION, **(extra or {})} + resp = client.post("/submit", headers=HEADERS, json=payload) + assert resp.status_code == 200, resp.json() + return resp.json() + + +def _approve(client, source_id, mint_doi=True, version=None): + body = {"mint_doi": mint_doi} + if version: + body["version"] = version + resp = 
client.post( + f"/curation/{source_id}/approve", + headers=HEADERS, + json=body, + ) + assert resp.status_code == 200, resp.json() + return resp.json() + + +def _status(client, source_id, version=None): + url = f"/status/{source_id}" + params = {"version": version} if version else {} + resp = client.get(url, params=params) + assert resp.status_code == 200 + return resp.json().get("submission", {}) + + +class TestVersioningDOILifecycle: + """Full versioning lifecycle: v1.0 → v2.0 (inherit) → v3.0 (version DOI). + + Note: updates with new data_sources produce major version bumps. + """ + + def test_v10_gets_dataset_doi(self, env): + """v1.0 with mint_doi=True gets both doi and dataset_doi.""" + client = TestClient(app) + + result = _submit(client) + source_id = result["source_id"] + + _approve(client, source_id, mint_doi=True) + + sub = _status(client, source_id, version="1.0") + assert sub["status"] == "published" + assert sub["doi"] is not None + assert sub["dataset_doi"] is not None + assert sub["doi"] == sub["dataset_doi"] + + def test_major_update_inherits_dataset_doi(self, env): + """v2.0 (update with new data, mint_doi=False) inherits dataset_doi.""" + client = TestClient(app) + + # Publish v1.0 + r1 = _submit(client) + source_id = r1["source_id"] + _approve(client, source_id, mint_doi=True) + v10 = _status(client, source_id, version="1.0") + dataset_doi = v10["dataset_doi"] + + # Submit v2.0 (update with new data_sources → major bump) + r2 = _submit(client, extra={ + "title": "Updated Dataset v2.0", + "update": True, + "extensions": {"mdf_source_id": source_id}, + }) + assert r2["version"] == "2.0" + + _approve(client, source_id, mint_doi=False, version="2.0") + + v20 = _status(client, source_id, version="2.0") + assert v20["status"] == "published" + assert v20["dataset_doi"] == dataset_doi + + def test_major_update_gets_version_specific_doi(self, env): + """v3.0 (update with data, mint_doi=True) gets version-specific DOI.""" + client = TestClient(app) + + # 
Publish v1.0 + r1 = _submit(client) + source_id = r1["source_id"] + _approve(client, source_id, mint_doi=True) + v10 = _status(client, source_id, version="1.0") + dataset_doi = v10["dataset_doi"] + + # Submit and publish v2.0 + _submit(client, extra={ + "title": "Updated Dataset v2.0", + "update": True, + "extensions": {"mdf_source_id": source_id}, + }) + _approve(client, source_id, mint_doi=False, version="2.0") + + # Submit and publish v3.0 (new version DOI) + r3 = _submit(client, extra={ + "title": "Updated Dataset v3.0", + "update": True, + "extensions": {"mdf_source_id": source_id}, + }) + assert r3["version"] == "3.0" + _approve(client, source_id, mint_doi=True, version="3.0") + + v30 = _status(client, source_id, version="3.0") + assert v30["status"] == "published" + assert v30["doi"] is not None + assert v30["dataset_doi"] == dataset_doi + assert v30["doi"] != dataset_doi + assert "-v3.0" in v30["doi"] + + def test_full_lifecycle_three_versions(self, env): + """Full lifecycle: three major versions with different DOI strategies.""" + client = TestClient(app) + + # v1.0: mint dataset DOI + r1 = _submit(client) + source_id = r1["source_id"] + _approve(client, source_id, mint_doi=True) + v10 = _status(client, source_id, version="1.0") + + # v2.0: inherit DOI (update with data → major) + _submit(client, extra={ + "title": "Updated v2.0", + "update": True, + "extensions": {"mdf_source_id": source_id}, + }) + _approve(client, source_id, mint_doi=False, version="2.0") + v20 = _status(client, source_id, version="2.0") + + # v3.0: version-specific DOI (update with data → major) + _submit(client, extra={ + "title": "Updated v3.0", + "update": True, + "extensions": {"mdf_source_id": source_id}, + }) + _approve(client, source_id, mint_doi=True, version="3.0") + v30 = _status(client, source_id, version="3.0") + + # Verify DOI structure + dataset_doi = v10["dataset_doi"] + assert v10["doi"] == dataset_doi + assert v20["dataset_doi"] == dataset_doi + assert 
v30["dataset_doi"] == dataset_doi + assert v30["doi"] != dataset_doi + assert "-v3.0" in v30["doi"] + + +class TestVersioningDatasetDOIPropagation: + """dataset_doi is propagated at submission time for update=True.""" + + def test_dataset_doi_set_on_submit_update(self, env): + """When submitting with update=True, dataset_doi is inherited from published versions.""" + client = TestClient(app) + + # Publish v1.0 + r1 = _submit(client) + source_id = r1["source_id"] + _approve(client, source_id, mint_doi=True) + v10 = _status(client, source_id, version="1.0") + dataset_doi = v10["dataset_doi"] + + # Submit v2.0 (has data_sources → major bump) — check dataset_doi set before approval + _submit(client, extra={ + "title": "Updated v2.0", + "update": True, + "extensions": {"mdf_source_id": source_id}, + }) + v20_pending = _status(client, source_id, version="2.0") + assert v20_pending["status"] == "pending_curation" + assert v20_pending.get("dataset_doi") == dataset_doi + + +class TestVersioningSearchIndex: + """Search index includes dataset_doi and version_count.""" + + def test_search_entry_includes_dataset_doi(self, env): + """build_gmeta_entry includes mdf.dataset_doi when set on submission.""" + from v2.search_client import MockGlobusSearchClient + + search_client = MockGlobusSearchClient() + + submission = { + "source_id": "test-ds-1", + "version": "1.1", + "organization": "MDF", + "dataset_doi": "10.99999/test-ds-1", + "dataset_mdata": json.dumps({ + "title": "Test Dataset", + "authors": [{"name": "Test"}], + "data_sources": ["https://example.com"], + }), + } + + entry = search_client.build_gmeta_entry(submission, version_count=2) + content = entry["content"] + + assert content["mdf"]["dataset_doi"] == "10.99999/test-ds-1" + assert content["mdf"]["version_count"] == 2 + # dc.doi falls back to dataset_doi when no version-specific doi + assert content["dc"]["doi"] == "10.99999/test-ds-1" + + def test_search_entry_prefers_version_doi(self, env): + """dc.doi uses 
version-specific DOI when available.""" + from v2.search_client import MockGlobusSearchClient + + search_client = MockGlobusSearchClient() + + submission = { + "source_id": "test-ds-1", + "version": "1.2", + "organization": "MDF", + "doi": "10.99999/test-ds-1-v1.2", + "dataset_doi": "10.99999/test-ds-1", + "dataset_mdata": json.dumps({ + "title": "Test Dataset v1.2", + "authors": [{"name": "Test"}], + "data_sources": ["https://example.com"], + }), + } + + entry = search_client.build_gmeta_entry(submission, version_count=3) + content = entry["content"] + + assert content["dc"]["doi"] == "10.99999/test-ds-1-v1.2" + assert content["mdf"]["dataset_doi"] == "10.99999/test-ds-1" + assert content["mdf"]["version_count"] == 3 + + def test_search_entry_without_doi(self, env): + """Entry without any DOI should not have dc.doi.""" + from v2.search_client import MockGlobusSearchClient + + search_client = MockGlobusSearchClient() + + submission = { + "source_id": "no-doi-ds", + "version": "1.0", + "dataset_mdata": json.dumps({ + "title": "No DOI", + "authors": [{"name": "Test"}], + "data_sources": [], + }), + } + + entry = search_client.build_gmeta_entry(submission) + assert "doi" not in entry["content"]["dc"] + assert "dataset_doi" not in entry["content"]["mdf"] + + +class TestVersioningMockDataCite: + """Tests for MockDataCiteClient version-aware methods.""" + + def test_mock_mint_with_suffix_and_relations(self): + from v2.datacite import MockDataCiteClient + + client = MockDataCiteClient(prefix="10.99999") + + # Mint dataset DOI + r1 = client.mint_doi( + source_id="test-ds", + metadata={"titles": [{"title": "Test"}], "creators": [{"name": "X"}]}, + ) + assert r1["success"] + assert r1["doi"] == "10.99999/test-ds" + + # Mint version DOI with custom suffix and relations + r2 = client.mint_doi( + source_id="test-ds", + metadata={"titles": [{"title": "Test v1.2"}], "creators": [{"name": "X"}]}, + doi_suffix="test-ds-v1.2", + related_identifiers=[{ + "relatedIdentifier": 
"10.99999/test-ds", + "relatedIdentifierType": "DOI", + "relationType": "IsVersionOf", + }], + ) + assert r2["success"] + assert r2["doi"] == "10.99999/test-ds-v1.2" + + # Check stored data + stored = client.get_doi("10.99999/test-ds-v1.2") + assert stored["related_identifiers"][0]["relationType"] == "IsVersionOf" + + def test_mock_update_metadata(self): + from v2.datacite import MockDataCiteClient + + client = MockDataCiteClient(prefix="10.99999") + + # Mint initial + client.mint_doi( + source_id="test-ds", + metadata={"titles": [{"title": "Original"}], "creators": [{"name": "A"}]}, + ) + + # Update metadata + result = client.update_metadata( + doi="10.99999/test-ds", + metadata={"titles": [{"title": "Updated"}], "creators": [{"name": "B"}]}, + related_identifiers=[{ + "relatedIdentifier": "10.99999/test-ds-v1.2", + "relatedIdentifierType": "DOI", + "relationType": "HasVersion", + }], + ) + assert result["success"] + assert result["updated"] + + stored = client.get_doi("10.99999/test-ds") + assert stored["metadata"]["titles"][0]["title"] == "Updated" + assert stored["related_identifiers"][0]["relationType"] == "HasVersion" + + +class TestVersioningCurationLogic: + """Unit tests for version-aware _mint_doi_for_submission.""" + + def test_first_version_mints_dataset_doi(self, env): + from v2.curation import _mint_doi_for_submission + + submission = { + "source_id": "first-ds", + "version": "1.0", + "dataset_mdata": json.dumps({ + "title": "First Dataset", + "authors": [{"name": "Author"}], + "data_sources": [], + }), + } + + result = _mint_doi_for_submission(submission, all_versions=[], mint_doi=True) + assert result["success"] + assert result["doi"] is not None + assert result["dataset_doi"] == result["doi"] + + def test_subsequent_version_mint_false_updates_metadata(self, env): + from v2.curation import _mint_doi_for_submission + + prior_versions = [ + {"version": "1.0", "status": "published", "doi": "10.99999/test-ds", "dataset_doi": "10.99999/test-ds"}, + ] + + 
submission = { + "source_id": "test-ds", + "version": "1.1", + "dataset_mdata": json.dumps({ + "title": "Updated Dataset", + "authors": [{"name": "New Author"}], + "data_sources": [], + }), + } + + result = _mint_doi_for_submission(submission, all_versions=prior_versions, mint_doi=False) + assert result["success"] + assert result["dataset_doi"] == "10.99999/test-ds" + assert result.get("doi") is None + assert result.get("metadata_updated") is True + + def test_subsequent_version_mint_true_creates_version_doi(self, env): + from v2.curation import _mint_doi_for_submission + + prior_versions = [ + {"version": "1.0", "status": "published", "doi": "10.99999/test-ds", "dataset_doi": "10.99999/test-ds"}, + {"version": "1.1", "status": "published", "dataset_doi": "10.99999/test-ds"}, + ] + + submission = { + "source_id": "test-ds", + "version": "1.2", + "dataset_mdata": json.dumps({ + "title": "Version 1.2", + "authors": [{"name": "Author v1.2"}], + "data_sources": [], + }), + } + + result = _mint_doi_for_submission(submission, all_versions=prior_versions, mint_doi=True) + assert result["success"] + assert result["doi"] is not None + assert "-v1.2" in result["doi"] + assert result["dataset_doi"] == "10.99999/test-ds" + assert result["doi"] != result["dataset_doi"] + + +class TestMajorMinorVersioning: + """Major/minor version detection: new data → major bump, metadata-only → minor bump.""" + + def test_new_data_causes_major_bump(self, env): + """Update with data_sources → version goes from 1.0 to 2.0.""" + client = TestClient(app) + + r1 = _submit(client) + source_id = r1["source_id"] + assert r1["version"] == "1.0" + + r2 = _submit(client, extra={ + "title": "New data update", + "update": True, + "extensions": {"mdf_source_id": source_id}, + "data_sources": ["https://example.com/new-data.csv"], + }) + assert r2["version"] == "2.0" + + def test_metadata_only_causes_minor_bump(self, env): + """Update without data_sources → version goes from 1.0 to 1.1.""" + client = 
TestClient(app) + + r1 = _submit(client) + source_id = r1["source_id"] + assert r1["version"] == "1.0" + + r2 = _submit(client, extra={ + "title": "Metadata-only update", + "update": True, + "data_sources": [], + "extensions": {"mdf_source_id": source_id}, + }) + assert r2["version"] == "1.1" + + def test_metadata_only_inherits_data_sources(self, env): + """Metadata-only update inherits data_sources from prior version.""" + client = TestClient(app) + + r1 = _submit(client, extra={ + "data_sources": ["https://example.com/original-data.csv"], + }) + source_id = r1["source_id"] + + r2 = _submit(client, extra={ + "title": "Metadata-only update", + "update": True, + "data_sources": [], + "extensions": {"mdf_source_id": source_id}, + }) + assert r2["version"] == "1.1" + + # Check that v1.1 inherited data_sources from v1.0 + sub = _status(client, source_id, version="1.1") + mdata = sub.get("dataset_mdata") + if isinstance(mdata, str): + mdata = json.loads(mdata) + assert mdata["data_sources"] == ["https://example.com/original-data.csv"] + + def test_version_chain_major_minor_mixed(self, env): + """Chain: 1.0 → 2.0 → 2.1 → 3.0 with correct version numbers.""" + client = TestClient(app) + + # v1.0: initial submission with data + r1 = _submit(client, extra={ + "data_sources": ["https://example.com/v1-data.csv"], + }) + source_id = r1["source_id"] + assert r1["version"] == "1.0" + + # v2.0: update with new data → major bump + r2 = _submit(client, extra={ + "title": "Major update", + "update": True, + "data_sources": ["https://example.com/v2-data.csv"], + "extensions": {"mdf_source_id": source_id}, + }) + assert r2["version"] == "2.0" + + # v2.1: metadata-only update → minor bump + r3 = _submit(client, extra={ + "title": "Minor metadata tweak", + "update": True, + "data_sources": [], + "extensions": {"mdf_source_id": source_id}, + }) + assert r3["version"] == "2.1" + + # v3.0: another data update → major bump + r4 = _submit(client, extra={ + "title": "Another major update", + 
"update": True, + "data_sources": ["https://example.com/v3-data.csv"], + "extensions": {"mdf_source_id": source_id}, + }) + assert r4["version"] == "3.0" + + def test_version_chain_metadata_inherits_correct_sources(self, env): + """Minor update after major update inherits the major version's data_sources.""" + client = TestClient(app) + + # v1.0 + r1 = _submit(client, extra={ + "data_sources": ["https://example.com/v1.csv"], + }) + source_id = r1["source_id"] + + # v2.0 with new data + _submit(client, extra={ + "title": "Major update", + "update": True, + "data_sources": ["https://example.com/v2.csv"], + "extensions": {"mdf_source_id": source_id}, + }) + + # v2.1 metadata-only — should inherit v2.0's data_sources + r3 = _submit(client, extra={ + "title": "Minor tweak after v2", + "update": True, + "data_sources": [], + "extensions": {"mdf_source_id": source_id}, + }) + assert r3["version"] == "2.1" + + sub = _status(client, source_id, version="2.1") + mdata = sub.get("dataset_mdata") + if isinstance(mdata, str): + mdata = json.loads(mdata) + assert mdata["data_sources"] == ["https://example.com/v2.csv"] + + def test_previous_and_root_version_across_bumps(self, env): + """previous_version and root_version are correct across major/minor bumps.""" + client = TestClient(app) + + r1 = _submit(client, extra={ + "data_sources": ["https://example.com/data.csv"], + }) + source_id = r1["source_id"] + + # v2.0 + _submit(client, extra={ + "update": True, + "data_sources": ["https://example.com/new-data.csv"], + "extensions": {"mdf_source_id": source_id}, + }) + + # v2.1 + _submit(client, extra={ + "title": "Minor tweak", + "update": True, + "data_sources": [], + "extensions": {"mdf_source_id": source_id}, + }) + + # Check v2.0 metadata + v20 = _status(client, source_id, version="2.0") + mdata20 = v20.get("dataset_mdata") + if isinstance(mdata20, str): + mdata20 = json.loads(mdata20) + assert mdata20["previous_version"] == f"{source_id}-1.0" + assert mdata20["root_version"] == 
f"{source_id}-1.0" + + # Check v2.1 metadata + v21 = _status(client, source_id, version="2.1") + mdata21 = v21.get("dataset_mdata") + if isinstance(mdata21, str): + mdata21 = json.loads(mdata21) + assert mdata21["previous_version"] == f"{source_id}-2.0" + assert mdata21["root_version"] == f"{source_id}-1.0" diff --git a/aws/v2/transfer.py b/aws/v2/transfer.py new file mode 100644 index 0000000..8c0f0eb --- /dev/null +++ b/aws/v2/transfer.py @@ -0,0 +1,324 @@ +"""Globus Transfer orchestration for MDF Connect v2. + +Handles data movement from user Globus endpoints to MDF storage: +- Creates destination directories on MDF NCSA endpoint +- Manages ACL rules (grant/revoke) for user access to destination +- Submits and monitors transfer tasks + +Auth pattern: +- Server credentials (GLOBUS_CLIENT_ID/SECRET): mkdir, ACL create/delete + (requires endpoint manager role on NCSA MDF endpoint) +- User's transfer token: submitting the actual transfer task + (runs as user so no source endpoint permissions needed) +""" + +import logging +import os +from datetime import datetime, timedelta, timezone +from typing import Any, Dict, List, Optional +from urllib.parse import urlparse + +logger = logging.getLogger(__name__) + +# NCSA MDF collection (Globus Connect Server endpoint) +NCSA_MDF_COLLECTION_UUID = "82f1b5c6-6e9b-11e5-ba47-22000b92c6ec" + +# Base path for open datasets on the MDF endpoint +MDF_BASE_PATH = "/mdf_open" + +# Globus Transfer auto-cancels tasks after this deadline. +TRANSFER_DEADLINE_HOURS = 24 + + +def _get_server_transfer_client(): + """Get a TransferClient authenticated with server (confidential app) credentials. + + Used for operations requiring endpoint manager privileges: + mkdir, ACL create/delete, task status checks. 
+ """ + import globus_sdk + + client_id = os.environ.get("GLOBUS_CLIENT_ID") + client_secret = os.environ.get("GLOBUS_CLIENT_SECRET") + if not client_id or not client_secret: + raise RuntimeError( + "GLOBUS_CLIENT_ID and GLOBUS_CLIENT_SECRET are required for transfer operations" + ) + + confidential_client = globus_sdk.ConfidentialAppAuthClient(client_id, client_secret) + token_response = confidential_client.oauth2_client_credentials_tokens( + requested_scopes="urn:globus:auth:scope:transfer.api.globus.org:all", + ) + transfer_token = token_response.by_resource_server["transfer.api.globus.org"]["access_token"] + return globus_sdk.TransferClient( + authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token) + ) + + +def _get_user_transfer_client(user_transfer_token: str): + """Get a TransferClient authenticated with the user's transfer token. + + Used for submitting transfer tasks (runs as user). + """ + import globus_sdk + + return globus_sdk.TransferClient( + authorizer=globus_sdk.AccessTokenAuthorizer(user_transfer_token) + ) + + +def parse_globus_uri(uri: str) -> tuple[str, str]: + """Parse a globus://collection_uuid/path URI into (collection_uuid, path).""" + if not uri.startswith("globus://"): + raise ValueError(f"Not a globus:// URI: {uri}") + rest = uri[len("globus://"):] + slash_idx = rest.find("/") + if slash_idx <= 0: + raise ValueError(f"globus:// URI missing path: {uri}") + return rest[:slash_idx], rest[slash_idx:] + + +def initiate_transfer( + source_endpoint: str, + source_path: str, + source_id: str, + version: str, + user_transfer_token: str, + user_identity_id: str, +) -> Dict[str, Any]: + """Initiate a Globus transfer from user endpoint to MDF storage. + + Steps: + 1. Create destination directory on NCSA endpoint (server creds) + 2. Create ACL rule granting user rw access to destination (server creds) + 3. 
Submit transfer task (user's token, runs as user) + + Args: + source_endpoint: Source Globus collection UUID + source_path: Path on source collection + source_id: MDF dataset source_id + version: Dataset version string + user_transfer_token: User's Globus Transfer access token + user_identity_id: User's Globus identity UUID (for ACL) + + Returns: + Dict with task_id, acl_rule_id, destination_path + """ + import globus_sdk + + destination_path = f"{MDF_BASE_PATH}/{source_id}/{version}/" + + server_tc = _get_server_transfer_client() + + # Step 1: Create destination directory + try: + server_tc.operation_mkdir(NCSA_MDF_COLLECTION_UUID, destination_path) + logger.info("Created destination directory: %s", destination_path) + except globus_sdk.GlobusAPIError as exc: + # 502 "Exists" is fine — directory already created + if exc.http_status == 502 and "Exists" in str(exc): + logger.info("Destination directory already exists: %s", destination_path) + else: + raise + + # Step 2: Create ACL rule granting user rw access + acl_rule_id = None + try: + acl_result = server_tc.add_endpoint_acl_rule( + NCSA_MDF_COLLECTION_UUID, + dict( + DATA_TYPE="access", + principal_type="identity", + principal=user_identity_id, + path=destination_path, + permissions="rw", + ), + ) + acl_rule_id = acl_result.get("access_id") + logger.info("Created ACL rule %s for user %s on %s", acl_rule_id, user_identity_id, destination_path) + except globus_sdk.GlobusAPIError as exc: + # If ACL creation fails, log but continue — user may already have access + logger.warning("ACL creation failed (continuing): %s", exc) + + # Step 3: Submit transfer using user's token + user_tc = _get_user_transfer_client(user_transfer_token) + + deadline = datetime.now(timezone.utc) + timedelta(hours=TRANSFER_DEADLINE_HOURS) + transfer_data = globus_sdk.TransferData( + source_endpoint=source_endpoint, + destination_endpoint=NCSA_MDF_COLLECTION_UUID, + label=f"MDF Connect: {source_id} v{version}", + sync_level="checksum", + 
deadline=deadline.isoformat(), + ) + + # If source_path ends with /, transfer entire directory recursively + if source_path.endswith("/"): + transfer_data.add_item(source_path, destination_path, recursive=True) + else: + # Single file — extract filename for destination + filename = source_path.rsplit("/", 1)[-1] + transfer_data.add_item(source_path, f"{destination_path}{filename}") + + task_result = user_tc.submit_transfer(transfer_data) + task_id = task_result["task_id"] + initiated_at = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z") + logger.info("Submitted transfer task %s: %s -> %s:%s", task_id, source_endpoint, NCSA_MDF_COLLECTION_UUID, destination_path) + + return { + "task_id": task_id, + "acl_rule_id": acl_rule_id, + "destination_path": destination_path, + "source_endpoint": source_endpoint, + "source_path": source_path, + "initiated_at": initiated_at, + } + + +def check_transfer_status(task_id: str) -> Dict[str, Any]: + """Check the status of a Globus transfer task. + + Uses server credentials to check status (any task is visible + to the endpoint manager). + + Returns: + Dict with status, bytes_transferred, files_transferred, etc. + """ + server_tc = _get_server_transfer_client() + task = server_tc.get_task(task_id) + + return { + "task_id": task_id, + "status": task["status"], + "bytes_transferred": task.get("bytes_transferred", 0), + "files_transferred": task.get("files_transferred", 0), + "files_skipped": task.get("files_skipped", 0), + "is_ok": task.get("is_ok"), + "nice_status": task.get("nice_status"), + "label": task.get("label"), + } + + +def cleanup_transfer_acl(acl_rule_id: str) -> bool: + """Remove an ACL rule from the MDF NCSA endpoint. + + Called after transfer completes (success or failure) to revoke + the temporary write access. + + Returns: + True if successfully removed, False otherwise. 
+ """ + if not acl_rule_id: + return True + + try: + server_tc = _get_server_transfer_client() + server_tc.delete_endpoint_acl_rule(NCSA_MDF_COLLECTION_UUID, acl_rule_id) + logger.info("Removed ACL rule %s", acl_rule_id) + return True + except Exception: + logger.exception("Failed to remove ACL rule %s", acl_rule_id) + return False + + +def cleanup_stale_transfers( + submissions: List[Dict[str, Any]], + max_age_hours: float = 26, +) -> List[Dict[str, Any]]: + """Bulk-check submissions with active transfers, clean up completed/stale ones. + + For each submission whose transfer is still marked active: + - Check Globus task status + - If done/failed/timed-out or older than max_age_hours: clean up ACLs, update record + - If still running within deadline: update progress fields + + Returns the list of submissions that were modified (caller should persist them). + """ + now = datetime.now(timezone.utc) + modified: List[Dict[str, Any]] = [] + + for submission in submissions: + task_ids = submission.get("transfer_task_ids", []) + if not task_ids: + continue + + # Check age + initiated_str = submission.get("transfer_initiated_at") + is_stale = False + if initiated_str: + try: + initiated = datetime.fromisoformat(initiated_str.replace("Z", "+00:00")) + is_stale = (now - initiated).total_seconds() > max_age_hours * 3600 + except (ValueError, TypeError): + pass + + all_succeeded = True + any_failed = False + total_bytes = 0 + total_files = 0 + + for task_id in task_ids: + try: + status = check_transfer_status(task_id) + total_bytes += status.get("bytes_transferred", 0) + total_files += status.get("files_transferred", 0) + + if status["status"] == "SUCCEEDED": + continue + elif status["status"] in ("FAILED", "INACTIVE"): + any_failed = True + all_succeeded = False + else: + all_succeeded = False + except Exception: + logger.exception("Failed to check transfer task %s", task_id) + all_succeeded = False + + submission["transfer_bytes_transferred"] = total_bytes + 
submission["transfer_files_transferred"] = total_files + + if all_succeeded or any_failed or is_stale: + # Terminal state — clean up ACLs + if all_succeeded: + submission["transfer_status"] = "succeeded" + elif any_failed or is_stale: + submission["transfer_status"] = "failed" if any_failed else "stale" + for acl_id in submission.get("transfer_acl_rule_ids", []): + cleanup_transfer_acl(acl_id) + logger.info( + "Transfer cleanup for %s v%s: %s", + submission.get("source_id"), + submission.get("version"), + submission["transfer_status"], + ) + modified.append(submission) + else: + # Still active — update progress only + modified.append(submission) + + return modified + + +def extract_transfer_sources(data_sources: list[str]) -> list[dict]: + """Identify data sources that need Globus transfer. + + Returns entries for globus:// URIs that point to endpoints OTHER + than the MDF NCSA endpoint (data already on MDF doesn't need transfer). + """ + transfer_needed = [] + for uri in data_sources: + if not uri.startswith("globus://"): + continue + try: + collection_uuid, path = parse_globus_uri(uri) + except ValueError: + continue + # Skip sources already on the MDF endpoint + if collection_uuid.lower() == NCSA_MDF_COLLECTION_UUID.lower(): + continue + transfer_needed.append({ + "uri": uri, + "source_endpoint": collection_uuid, + "source_path": path, + }) + return transfer_needed diff --git a/aws/v2/v2.md b/aws/v2/v2.md new file mode 100644 index 0000000..dafb0e2 --- /dev/null +++ b/aws/v2/v2.md @@ -0,0 +1,712 @@ +# MDF Connect v2 Backend Architecture + +## Overview + +MDF Connect v2 is a FastAPI application deployed to AWS Lambda via Mangum. It provides a REST API for dataset submission, streaming data ingestion, curation, search, DOI minting, and data preview for the Materials Data Facility. + +**Stack**: Python 3.12, FastAPI, Pydantic v2, DynamoDB (prod) / SQLite (dev), Globus HTTPS storage (prod) / local filesystem (dev), DataCite API for DOIs, AWS SAM for deployment. 
+ +**Frontend developers:** See [`FRONTEND_API.md`](FRONTEND_API.md) for the complete API contract with request/response shapes, authentication, permissions, and the submission lifecycle diagram. + +--- + +## Directory Layout + +``` +cs/aws/ + template.yaml # SAM CloudFormation template + samconfig.toml # SAM deploy configs (dev/staging/prod) + deploy.sh # Deployment script (dev/staging/prod/quick/teardown/local) + Makefile # make local, make test, make deploy-dev, etc. + requirements.txt # Python dependencies + + v2/ + config.py # Environment variable declarations + metadata.py # DatasetMetadata schema, to_datacite(), migrate_v1_payload(), parse_metadata() + store.py # SubmissionStore (abstract) + DynamoDB + SQLite implementations + stream_store.py # StreamStore (abstract) + DynamoDB + SQLite implementations + profiler.py # DatasetProfile model + build_dataset_profile() + preview.py # File preview (CSV stats, JSON structure, text lines, binary info) + dataset_card.py # build_dataset_card() - compact dataset summaries + citation.py # BibTeX, RIS, APA, DataCite XML export + search.py # Full-text search across datasets and streams + curation.py # Curation helpers + DOI minting for approved submissions + datacite.py # DataCiteClient + MockDataCiteClient + email_utils.py # AWS SES email notifications (new submission, approved, rejected) + submission_utils.py # generate_source_id(), increment_version(), latest_version(), deep_merge() + + storage/ + __init__.py # Exports StorageBackend, FileMetadata, get_storage_backend + base.py # StorageBackend abstract class + FileMetadata dataclass + factory.py # get_storage_backend() singleton factory + local.py # LocalStorage - filesystem backend for dev + globus_https.py # GlobusHTTPSStorage - Globus endpoint backend for prod + + app/ + __init__.py # FastAPI app assembly, router registration, CORS + main.py # Mangum handler (Lambda) + uvicorn __main__ (local dev) + auth.py # Globus token introspection (prod) / X-User-Id headers 
(dev) + deps.py # FastAPI dependency injection: get_submission_store, get_stream_store_dep, get_storage + models.py # Pydantic request models (AuthContext, all *Request models) + routers/ + submissions.py # POST /submit, GET /status, POST /status/update, GET /submissions, POST /submissions/{id}/metadata, POST /submissions/{id}/withdraw, POST /submissions/{id}/resubmit, POST /submissions/{id}/delete, GET /versions/{id}/diff, GET /stats/{source_id} + streams.py # POST /stream/create, POST /stream/{id}/append, GET /stream/{id}, POST /stream/{id}/close, POST /stream/{id}/snapshot (disabled) + files.py # POST /stream/{id}/upload, POST /stream/{id}/upload-url, POST /stream/{id}/upload-confirm, POST /stream/{id}/download-url, GET /stream/{id}/files + preview.py # GET /stream/{id}/preview, GET /stream/{id}/files/{name}/preview, GET /preview/{source_id}, GET /preview/{source_id}/files, GET /preview/{source_id}/files/{path}, GET /preview/{source_id}/sample + search.py # GET /search + cards.py # GET /card/{source_id}, GET /detail/{slug}, GET /citation/{source_id} + curation.py # GET /curation/pending, GET /curation/{source_id}, POST /curation/{source_id}/approve, POST /curation/{source_id}/reject + admin.py # GET /admin/stats +``` + +--- + +## Data Flow + +### Submit a dataset +``` +Client -> POST /submit (flat v2 metadata JSON) + -> Validate via DatasetMetadata.model_validate() + -> Auto-detect & migrate v1 dc/mdf/custom format if needed + -> Generate source_id, version + -> store.put_submission(record) + -> If data_sources contain stream:// URIs, trigger build_dataset_profile() + -> Return {source_id, version, versioned_source_id, organization} +``` + +### Stream upload workflow +``` +1. POST /stream/create -> Creates stream record (status: open) +2. POST /stream/{id}/upload -> Base64-encoded file content -> storage.store_file() + OR /stream/{id}/upload-url -> Get pre-signed URL for direct upload + AND /stream/{id}/upload-confirm -> Confirm external upload +3. 
POST /stream/{id}/snapshot -> Creates a submission from the stream + -> Builds flat v2 metadata + -> Triggers build_dataset_profile() +4. POST /stream/{id}/close -> Marks stream as closed, optionally mints DOI +``` + +### Curation flow +``` +submitted -> pending_curation -> approved -> published (DOI minted) + -> rejected (with reason) -> edit metadata -> resubmit -> pending_curation + -> withdrawn (by submitter) + -> deleted (curator soft-delete, any status) +published -> edit metadata -> auto minor version bump (1.0 -> 1.1, stays published, DOI updated) +any status -> deleted (curator-only, soft-delete with reason, recorded in curation_history) +``` + +Email notifications fire at each curation event (when `EnableEmails=true`): +- `pending_curation` → curators receive a "New Dataset Awaiting Review" email with a link to `CURATION_PORTAL_URL` +- `published` → submitter receives a "Your Dataset is Now Live" email with a link to the dataset +- `rejected` → submitter receives a "Submission Needs Revision" email with curator feedback + +--- + +## Metadata Schema + +### v2 format (current, flat) +Defined in `metadata.py` as `DatasetMetadata` Pydantic model: +```python +{ + "title": "...", + "authors": [{"name": "...", "orcid": "...", "affiliations": [...]}], + "description": "...", + "keywords": [...], + "data_sources": ["stream://stream-id", "https://..."], + "organization": "MDF Open", + "ml": {"data_format": "...", "task_type": [...], "keys": [...]}, # optional + "extensions": {...}, # arbitrary extra data + "tags": [...], + "test": false, + "update": false, + ... +} +``` + +### v1 format (legacy, auto-migrated) +Old `dc/mdf/custom` triple-nested format. Detected by `_is_v1_payload()` (checks for `dc.titles` or `dc.creators`). Automatically converted via `migrate_v1_payload()`. + +### parse_metadata(record) +**Always use this** to read metadata from a DB record. 
Handles: +- Deserializing `dataset_mdata` from JSON string +- Detecting v1 vs v2 format +- Migrating v1 if needed +- Returning a validated `DatasetMetadata` instance + +--- + +## Store Layer + +### SubmissionStore (cs/aws/v2/store.py) +Abstract base with two implementations: +- **DynamoSubmissionStore** - production (DynamoDB) +- **SqliteSubmissionStore** - development (SQLite, `check_same_thread=False`) + +**Key methods:** +| Method | Notes | +|--------|-------| +| `get(source_id, version=None)` | Gets latest version if no version specified | +| `get_submission(source_id, version)` | Gets specific version | +| `list_versions(source_id)` | All versions of a submission | +| `put_submission(record)` | Insert with uniqueness check (DynamoDB condition) | +| `upsert_submission(record)` | Insert/update without condition (for curation) | +| `update_status(source_id, version, status)` | Status transition | +| `update_profile(source_id, version, profile_json)` | Store dataset profile | +| `list_by_user(user_id)` | Paginated, by GSI | +| `list_by_org(organization)` | Paginated, by GSI | +| `list_by_status(statuses)` | For curation queue | + +**DynamoDB schema:** +- Table: `mdf-submissions-{env}` +- PK: `source_id` (HASH), `version` (RANGE) +- GSI `user-submissions`: `user_id` (HASH), `updated_at` (RANGE) +- GSI `org-submissions`: `organization` (HASH), `source_id` (RANGE) +- Billing: PAY_PER_REQUEST + +**SQLite schema:** +- Same columns, `INSERT OR REPLACE` for upserts +- `dataset_profile TEXT` column (added via ALTER TABLE migration for existing DBs) +- JSON fields (`dataset_mdata`, `curation_history`, `dataset_profile`) deserialized in `_row_to_dict()` + +### StreamStore (cs/aws/v2/stream_store.py) +Same abstract + DynamoDB/SQLite pattern. 
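The abstract-plus-backend pattern shared by both stores can be sketched roughly as follows. This is an illustration only, not the code in `stream_store.py`: the method names `create_stream` and `get_stream` come from this document, but their signatures, the `streams` table layout, and the `SqliteStreamStore` constructor argument are assumptions.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class StreamStore(ABC):
    """Backend-agnostic interface (signatures are assumed for this sketch)."""

    @abstractmethod
    def create_stream(self, stream_id: str, user_id: str) -> Dict[str, Any]: ...

    @abstractmethod
    def get_stream(self, stream_id: str) -> Optional[Dict[str, Any]]: ...


class SqliteStreamStore(StreamStore):
    """Development backend; a DynamoDB class would implement the same methods."""

    def __init__(self, path: str = ":memory:") -> None:
        # check_same_thread=False lets thread-pool workers share the connection.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._conn.row_factory = sqlite3.Row
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS streams ("
            "stream_id TEXT PRIMARY KEY, user_id TEXT, status TEXT, "
            "file_count INTEGER, total_bytes INTEGER)"
        )

    def create_stream(self, stream_id: str, user_id: str) -> Dict[str, Any]:
        # New streams start open with zeroed counters.
        self._conn.execute(
            "INSERT INTO streams VALUES (?, ?, 'open', 0, 0)", (stream_id, user_id)
        )
        self._conn.commit()
        return self.get_stream(stream_id)

    def get_stream(self, stream_id: str) -> Optional[Dict[str, Any]]:
        row = self._conn.execute(
            "SELECT * FROM streams WHERE stream_id = ?", (stream_id,)
        ).fetchone()
        return dict(row) if row else None
```

Callers that depend only on the abstract `StreamStore` type can switch backends through configuration without code changes.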
+ +**Key methods:** `create_stream`, `get_stream`, `append_stream` (increments file_count/total_bytes), `close_stream`, `update_stream_metadata`, `list_all` + +**DynamoDB schema:** +- Table: `mdf-streams-{env}` +- PK: `stream_id` (HASH) +- GSI `user-streams`: `user_id` (HASH), `updated_at` (RANGE) + +--- + +## Storage Layer (cs/aws/v2/storage/) + +### StorageBackend (abstract) +Methods: `store_file()`, `store_file_stream()`, `get_file()`, `list_files()`, `delete_file()`, `delete_stream_files()`, `get_stream_size()`, `get_download_url()`, `get_upload_url()`, `backend_name` property. + +### FileMetadata +Dataclass: `filename`, `path`, `size_bytes`, `checksum_md5`, `content_type`, `stored_at`, `storage_backend`, `download_url`, `custom_metadata`. + +### Implementations + +**LocalStorage** (dev): +- Stores at `FILE_STORE_PATH` (default `/tmp/mdf_files`) +- Path: `streams/{stream_id}/{date}/{filename}` +- Metadata: `.meta.json` sidecar files + +**GlobusHTTPSStorage** (prod): +- Globus HTTPS endpoint at `data.materialsdatafacility.org` +- NCSA endpoint UUID: `82f1b5c6-6e9b-11e5-ba47-22000b92c6ec` +- Flat path: `{stream_id}_{date}_{filename}` (avoids directory creation) +- Token hierarchy: explicit param > env var > cached file > client credentials +- In-memory metadata cache for `list_files()` + +### Factory +`get_storage_backend()` returns singleton based on `STORAGE_BACKEND` env var (`local` or `globus`). + +--- + +## Dataset Profiler (cs/aws/v2/profiler.py) + +Scans files in storage and produces a `DatasetProfile`: + +```python +DatasetProfile: + source_id, profiled_at, total_files, total_bytes + formats: {"csv": 5, "json": 2} + files: [FileProfile, ...] + +FileProfile: + path, filename, size_bytes, content_type, format + columns: [ColumnSummary, ...] 
# for tabular files + n_rows, sample_rows # for tabular files + structure # for JSON files + preview_lines # for text files + extra # format-specific (CIF formula, image dims) + +ColumnSummary: + name, dtype, count, nulls, unique, min, max, mean, std, top_values +``` + +**Format detection:** CSV/TSV, JSON, HDF5, CIF, images (PNG/JPEG), text, binary. +**Size cap:** 10MB per file. Files over the cap get format/size only. +**Trigger:** Called after `put_submission()` (for stream:// sources) and after `stream_snapshot`. +**Storage:** Stored as `dataset_profile` JSON string on the submission record. + +--- + +## Authentication (cs/aws/v2/app/auth.py) + +### Dev mode (`AUTH_MODE=dev`) +Reads identity from headers: `X-User-Id`, `X-User-Email`, `X-User-Name`. Falls back to env vars `LOCAL_USER_ID`, etc. + +### Production mode (`AUTH_MODE=production`) +Full Globus OAuth2 flow: +1. Extract Bearer token from `Authorization` header +2. `globus_sdk.ConfidentialAppAuthClient(GLOBUS_CLIENT_ID, GLOBUS_CLIENT_SECRET)` +3. `oauth2_token_introspect(token, include="identities_set")` - validates token, gets user info +4. `oauth2_get_dependent_tokens(token)` - gets tokens for Groups API, Transfer, etc. +5. `GroupsClient` - fetches user's group memberships (for curator authorization) + +**Credentials:** Stored in AWS SSM Parameter Store at `/mdf/globus-client-id` and `/mdf/globus-client-secret`. Deployed via `deploy.sh` which reads them and passes as SAM parameters. + +**Globus App:** Confidential OAuth client, UUID `86e4853e-9bdd-4ea5-9130-e4a0b0638400`. 
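Step 1 of the production flow is plain header parsing, so it can be sketched without the Globus SDK. The helper name `extract_bearer_token` and its error handling are hypothetical; the actual logic in `auth.py` may differ.

```python
from typing import Optional


def extract_bearer_token(authorization: Optional[str]) -> str:
    """Return the token from an 'Authorization: Bearer <token>' header value.

    Raises ValueError when the header is missing or malformed.
    """
    if not authorization:
        raise ValueError("missing Authorization header")
    # Split on the first space: "<scheme> <token>".
    scheme, _, token = authorization.strip().partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        raise ValueError("expected 'Bearer <token>'")
    return token
```

The returned string is what steps 3 and 4 would pass to `oauth2_token_introspect` and `oauth2_get_dependent_tokens`.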
+ +### Curator authorization +`require_curator` dependency checks: +- `CURATOR_USER_IDS` env var (comma-separated) +- `CURATOR_GROUP_IDS` env var (matched against user's Globus groups) +- `ALLOW_ALL_CURATORS=true` (dev bypass) + +--- + +## API Endpoints + +### Submissions +| Method | Path | Auth | Description | +|--------|------|------|-------------| +| POST | `/submit` | Required | Submit dataset (flat v2 metadata) | +| GET | `/status/{source_id}` | None | Get submission status (latest version) | +| GET | `/status?source_id=X` | None | Same via query param | +| POST | `/status/update` | Curator | Update submission status | +| GET | `/submissions` | Required | List caller submissions (`?status=` filter, `?include_counts=true`); org-wide view requires curator | +| POST | `/submissions/{source_id}/metadata` | Owner/Curator | Edit metadata; auto minor-bump if published (e.g. 1.0→1.1) | +| POST | `/submissions/{source_id}/withdraw` | Owner/Curator | Withdraw a pending_curation submission | +| POST | `/submissions/{source_id}/resubmit` | Owner/Curator | Resubmit a rejected submission back to pending_curation | +| POST | `/submissions/{source_id}/delete` | Curator | Soft-delete a submission (any status except already deleted) | +| GET | `/versions/{source_id}` | Optional | List versions (`?limit=&offset=`); unauthenticated sees published only | +| GET | `/versions/{source_id}/diff?from=&to=` | None | Structured diff of metadata between two versions | +| GET | `/stats/{source_id}` | None | Public access/download stats for published datasets | + +### Admin +| Method | Path | Auth | Description | +|--------|------|------|-------------| +| GET | `/admin/stats` | Curator | Aggregate submission counts by status + access totals | + +### Streams +| Method | Path | Auth | Description | +|--------|------|------|-------------| +| POST | `/stream/create` | Required | Create new stream | +| POST | `/stream/{id}/append` | Required | Record appended files | +| GET | `/stream/{id}` | 
Required | Get stream status | +| POST | `/stream/{id}/close` | Required | Close stream, optionally mint DOI | +| POST | `/stream/{id}/snapshot` | Required | Create submission from stream | + +### Files +| Method | Path | Auth | Description | +|--------|------|------|-------------| +| POST | `/stream/{id}/upload` | Required | Upload file (base64 in JSON body) | +| POST | `/stream/{id}/upload-url` | Required | Get direct upload URL (auth token not returned) | +| POST | `/stream/{id}/upload-confirm` | Required | Confirm external upload | +| POST | `/stream/{id}/download-url` | Required | Get download URL | +| GET | `/stream/{id}/files` | Required | List stream files | + +### Preview +| Method | Path | Auth | Description | +|--------|------|------|-------------| +| GET | `/stream/{id}/preview` | Required | Preview all stream files | +| GET | `/stream/{id}/files/{name}/preview` | Required | Preview specific stream file | +| GET | `/preview/{source_id}` | None | Full DatasetProfile for a dataset | +| GET | `/preview/{source_id}/files` | None | List files with format/size metadata | +| GET | `/preview/{source_id}/files/{path}` | None | Detailed profile of one file | +| GET | `/preview/{source_id}/sample` | None | Sample rows from first tabular file | + +### Cards & Citations +| Method | Path | Auth | Description | +|--------|------|------|-------------| +| GET | `/card/{source_id}` | Optional | Dataset card + permissions (includes profile_summary if profiled) | +| GET | `/detail/{slug}` | Optional | Same as `/card` but parses frontend URL slug (`{source_id}-{version}`) | +| GET | `/citation/{source_id}` | None | Citations (BibTeX, RIS, APA, DataCite XML) | + +### Search +| Method | Path | Auth | Description | +|--------|------|------|-------------| +| GET | `/search?q=X` | None | Faceted full-text search (datasets + streams) with filter params (`year`, `organization`, `author`, `keyword`, `domain`) and `offset` for pagination. 
Response includes `facets` with per-bucket counts. Uses Globus Search `post_search` with `SearchQuery.add_facet()` / `add_filter()`. | + +### Curation +| Method | Path | Auth | Description | +|--------|------|------|-------------| +| GET | `/curation/pending` | Curator | List pending submissions | +| GET | `/curation/{source_id}` | Curator | Get submission for review | +| POST | `/curation/{source_id}/approve` | Curator | Approve + optionally mint DOI | +| POST | `/curation/{source_id}/reject` | Curator | Reject with reason | + +--- + +## Deployment + +### Local development +```bash +cd cs/aws +STORE_BACKEND=sqlite AUTH_MODE=dev python -m v2.app.main +# or: make local +# or: ./deploy.sh local +``` +Server runs at `http://127.0.0.1:8080` with hot reload. + +### AWS deployment +```bash +./deploy.sh dev # Dev: SQLite-like, no Globus, mock auth/DataCite +./deploy.sh staging # Staging: DynamoDB, Globus auth, Globus storage +./deploy.sh prod # Production: same as staging +./deploy.sh quick dev # Code-only deploy (skip CloudFormation) +``` + +**Infrastructure per environment:** +- CloudFormation stack: `mdf-connect-v2-{env}` +- S3 bucket: `mdf-sam-deployments-{env}` (deployment artifacts) +- DynamoDB: `mdf-submissions-{env}`, `mdf-streams-{env}` (RetainOnDelete) +- Lambda function + HTTP API Gateway +- Region: `us-east-1` + +### Environment variables + +| Variable | Dev | Prod | Description | +|----------|-----|------|-------------| +| `STORE_BACKEND` | sqlite | dynamo | Database backend | +| `AUTH_MODE` | dev | production | Auth strategy | +| `STORAGE_BACKEND` | local | globus | File storage | +| `GLOBUS_CLIENT_ID` | - | from SSM | Globus OAuth client | +| `GLOBUS_CLIENT_SECRET` | - | from SSM | Globus OAuth secret | +| `USE_MOCK_DATACITE` | true | false | Mock DOI minting | +| `ALLOW_ALL_CURATORS` | true | false | Skip curator check | +| `SQLITE_PATH` | /tmp/mdf_connect_v2.db | - | SQLite DB location | +| `DYNAMO_SUBMISSIONS_TABLE` | - | from stack | Submissions table 
name | +| `DYNAMO_STREAMS_TABLE` | - | from stack | Streams table name | +| `MAX_REQUEST_BYTES` | 1048576 | 1048576 | Request body hard cap (middleware) | +| `MAX_SUBMIT_METADATA_BYTES` | 262144 | 262144 | `/submit` metadata JSON cap | +| `RATE_LIMIT_DEFAULT_PER_MIN` | 120 | 120 | Per-actor default request limit | +| `RATE_LIMIT_SUBMIT_PER_MIN` | 20 | 20 | Per-actor `/submit` limit | +| `RATE_LIMIT_STREAM_CREATE_PER_MIN` | 30 | 30 | Per-actor `/stream/create` limit | +| `LOG_LEVEL` | DEBUG | INFO | Logging level | +| `SES_FROM_EMAIL` | _(blank)_ | `noreply@materialsdatafacility.org` | SES sender address; blank disables all emails | +| `CURATOR_EMAILS` | _(blank)_ | comma-separated addresses | Curator alert recipients for new submissions | +| `PORTAL_URL` | `https://www.materialsdatafacility.org` | same | Public portal base URL used in email links | +| `CURATION_PORTAL_URL` | `https://www.materialsdatafacility.org/curation` | same | Curation queue URL linked in curator alert emails | +| `SES_REGION` | _(Lambda region)_ | _(Lambda region)_ | AWS region for SES client (auto-set from `AWS::Region`) | +| `EnableEmails` (SAM param) | `false` | `true` | CloudFormation flag; when `false`, `SES_FROM_EMAIL` is blanked in both Lambdas | + +--- + +## Email Notifications + +Implemented in `v2/email_utils.py` using AWS SES via `boto3`. All notifications are fire-and-forget — failures are logged as warnings and never surface to the caller. 
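The fire-and-forget behavior can be sketched as a small wrapper. The name `fire_and_forget` and the logger name are hypothetical, and `send` stands in for whatever function in `email_utils.py` actually makes the SES call.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("email_utils_sketch")


def fire_and_forget(send: Callable[..., Any], *args: Any, **kwargs: Any) -> bool:
    """Call `send`, logging any failure as a warning instead of raising.

    Returns True if the send succeeded and False otherwise, so callers can
    ignore the result without risking an exception.
    """
    try:
        send(*args, **kwargs)
        return True
    except Exception as exc:  # deliberately broad: a failed email must not fail the request
        logger.warning("Email notification failed (ignored): %s", exc)
        return False
```

The broad `except` is intentional in this sketch: a failed notification should never change the outcome of the curation request that triggered it.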
+ +### Events + +| Trigger | Recipient | Subject | +|---------|-----------|---------| +| New submission enters `pending_curation` | All addresses in `CURATOR_EMAILS` | "New Dataset Pending Review: {title}" | +| Submission reaches `published` status | `user_email` on submission record | "Your MDF Dataset is Now Published: {title}" | +| Submission set to `rejected` | `user_email` on submission record | "MDF Submission Needs Attention: {title}" | + +### Email content + +Each email is a responsive HTML message (600px, inline CSS) containing: +- Branded MDF header (color-coded by event: navy / green / amber) +- Dataset info card: title, authors, description excerpt, org, file count, size, keyword tags +- Call-to-action button linking to the curation queue or dataset detail page +- Plain-text fallback for email clients that don't render HTML + +### Configuration + +| Env var | Description | +|---------|-------------| +| `SES_FROM_EMAIL` | Verified sender address. **If blank, all emails are silently skipped.** | +| `CURATOR_EMAILS` | Comma-separated recipient list for new-submission alerts | +| `PORTAL_URL` | Base URL for dataset links in emails (default: `https://www.materialsdatafacility.org`) | +| `CURATION_PORTAL_URL` | URL curators are directed to (default: `https://www.materialsdatafacility.org/curation`) | +| `SES_REGION` | AWS region for SES (auto-set to Lambda deployment region) | + +Both the API Lambda and the async worker Lambda have `ses:SendEmail` IAM permissions. The published notification fires from the async worker (which is where `status = "published"` is written), ensuring the email goes out in both inline (dev) and SQS (prod) dispatch modes. + +--- + +## Production Deploy Checklist + +> **These steps must be completed before deploying to the production AWS account.** + +### AWS SES +- [ ] **Verify sending domain** — `materialsdatafacility.org` must be verified as an SES identity in the *production* AWS account (not just staging). 
Go to SES → Verified Identities → Create Identity → Domain. +- [ ] **Check sandbox status** — New AWS accounts are in SES sandbox and can only send to verified addresses. Request production access (SES → Account dashboard → "Request production access") to send to arbitrary submitter emails. Approval typically takes 1–2 business days. +- [ ] **Set `SesFromEmail`** — Use `noreply@materialsdatafacility.org` (or any address at the verified domain). + +### CloudFormation parameters for prod deploy +```bash +sam deploy --config-env prod --parameter-overrides \ + Environment=prod \ + AuthMode=production \ + EnableEmails=true \ + SesFromEmail=noreply@materialsdatafacility.org \ + CuratorEmails=materialsdatafacility@uchicago.edu \ + PortalUrl=https://www.materialsdatafacility.org \ + CurationPortalUrl=https://www.materialsdatafacility.org/curation \ + GlobusClientId= \ + GlobusClientSecret= \ + ... +``` + +### Globus +- [ ] Register confidential OAuth client in the production Globus account (not the dev/staging client) +- [ ] Store `GLOBUS_CLIENT_ID` and `GLOBUS_CLIENT_SECRET` in AWS SSM Parameter Store under `/mdf/globus-client-id` and `/mdf/globus-client-secret` +- [ ] Configure the MDF curator Globus group UUID in `CuratorGroupIds` +- [ ] Configure the required submitter group UUID in `RequiredGroupMembership` + +### DataCite +- [ ] Switch `DataCiteApiUrl` from `https://api.test.datacite.org` to `https://api.datacite.org` +- [ ] Set `DataCiteUsername`, `DataCitePassword`, and `DataCitePrefix` to production repository credentials + +### Search +- [ ] Set `SearchIndexUUID` to the production Globus Search index UUID +- [ ] Confirm the index is publicly readable + +### General +- [ ] Set `EnableEmails=false` on **staging** (default) to prevent test emails going to real addresses before SES is set up in the company account +- [ ] Set `EnableEmails=true` only on **prod** once SES sandbox access is lifted + +--- + +## Key Patterns & Gotchas + +1. 
**SQLite + ASGI threading**: Must use `check_same_thread=False`. Uvicorn runs sync endpoints in a thread pool. + +2. **Route ordering**: `/stream/create` must be registered before `/stream/{stream_id}` to prevent "create" being captured as a path parameter. + +3. **Curation upsert**: `put_submission()` has a DynamoDB condition expression preventing overwrites. Use `upsert_submission()` (no condition) for curation approve/reject which modify existing records. + +4. **Client-server boundary**: `src/mdf_agent/models/config.py` (the CLI client) must NOT import from `v2.*`. The client's `to_metadata_payload()` produces the flat dict that the server validates. + +5. **Metadata parsing**: Always use `parse_metadata(record)` to read metadata from a DB record. It handles JSON deserialization, v1 detection, migration, and validation. + +6. **Profile generation is best-effort**: If profiling fails, the submission/snapshot still succeeds. Errors are logged at DEBUG level. + +7. **DynamoDB pagination**: Uses `ExclusiveStartKey`/`LastEvaluatedKey`. SQLite simulates pagination with OFFSET. + +8. **Storage singleton**: `get_storage_backend()` returns a cached singleton. Call `reset_storage_backend()` in tests. + +9. **Globus storage paths are flat**: `{stream_id}_{date}_{filename}` to avoid directory creation issues on Globus HTTPS endpoints. + +10. **DataCite mock**: When `USE_MOCK_DATACITE=true` or no credentials configured, `get_datacite_client()` returns `MockDataCiteClient` which generates fake DOIs in memory. + +11. **Ownership enforcement**: Stream mutation/view endpoints enforce owner-or-curator access checks, and submission status updates are curator-only. + +12. **Storage path safety**: Local and Globus backends reject traversal paths (`..`, absolute paths, and URL-style paths) for reads/writes. + +13. 
**Rate limiting and payload caps**: API middleware applies per-actor rate limits and request-body limits; submit and stream append also enforce endpoint-level size/count caps. + +14. **Request correlation logging**: Every request receives an `X-Request-Id` and emits structured completion/error logs with duration and status. + +15. **Metadata edit on published datasets**: Editing metadata on a published dataset auto-creates a minor version bump (e.g. 1.0→1.1) that goes directly to `published` status without re-curation. The new version inherits `dataset_doi` and enqueues a publish job with `mint_doi=False` to update DataCite metadata and re-index in Globus Search. + +16. **Withdrawal restores prior version**: When withdrawing a submission that was `latest=True` in a version chain, the endpoint restores `latest=True` on the prior version so the version chain remains consistent. + +17. **Frontend URL slug parsing**: `GET /detail/{slug}` parses `{source_id}-{version}` slugs using the regex `^(.+)-(\d+\.\d+)$`. Slugs without a version suffix resolve to the latest published version. This supports both UUID-style (`81d55710-...-1.0`) and name-style (`levine_abo2179_database_v2.1-1.0`) source IDs. + +18. **Inline permissions on card responses**: `GET /card` and `GET /detail` accept optional auth and return a `permissions` object (`can_edit`, `can_delete`, `can_curate`) so the frontend can show/hide action buttons without a separate round-trip. Unauthenticated requests get all-false permissions. + +--- + +## Self-Service & Lifecycle Endpoints + +### Metadata Edit — `POST /submissions/{source_id}/metadata` + +Allows submitters (or curators) to edit metadata after submission. Accepts any subset of editable metadata fields. 
+ +**Request body** (`MetadataEditRequest`): +```json +{ + "title": "Updated Title", + "authors": [{"name": "New Author"}], + "description": "...", + "keywords": ["..."], + "license": {"name": "CC-BY-4.0"}, + "funding": [{"funder_name": "NSF", "award_number": "123"}], + "related_works": [{"identifier": "10.1234/...", "identifier_type": "DOI"}], + "methods": ["XRD"], + "facility": "APS", + "fields_of_science": ["Materials Science"], + "domains": ["batteries"], + "ml": {"data_format": "csv", "task_type": ["regression"]}, + "geo_locations": [{"place": "Argonne, IL"}], + "tags": ["featured"], + "extensions": {"custom_key": "value"}, + "version": "1.0" +} +``` +All fields are optional. Only non-null fields are applied. `version` targets a specific version (defaults to latest). + +**Behavior by status:** +| Current Status | Edit Behavior | +|----------------|---------------| +| `pending_curation` | In-place update via `upsert_submission` | +| `rejected` | In-place update (fix issues before resubmit) | +| `published` | Auto-creates minor version bump (e.g. 1.0→1.1), new version is `published`, inherits `dataset_doi`, enqueues publish job to update DataCite + search | +| Other (`withdrawn`, `approved`) | Returns 400 | + +**Response:** +```json +{ + "success": true, + "source_id": "mdf-abc123", + "version": "1.0", + "updated_fields": ["title", "description"], + "new_version": "1.1" +} +``` +`new_version` only present when editing published metadata. + +### Withdraw — `POST /submissions/{source_id}/withdraw` + +Withdraw a pending submission. Only allowed when status is `pending_curation`. 
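A minimal sketch of the status gate and history entry for withdrawal, operating on a plain dict. The function name `withdraw` is hypothetical, `ValueError` stands in for whatever HTTP error the endpoint returns, and restoring `latest=True` on the prior version is omitted.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def withdraw(record: Dict[str, Any], user_id: str, reason: str) -> Dict[str, Any]:
    """Return a withdrawn copy of `record`, or raise if it is not pending."""
    if record.get("status") != "pending_curation":
        raise ValueError(f"cannot withdraw from status {record.get('status')!r}")
    updated = dict(record)  # shallow copy; the input record is left untouched
    updated["status"] = "withdrawn"
    updated["curation_history"] = list(record.get("curation_history", [])) + [{
        "action": "withdrawn",
        "user_id": user_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }]
    return updated
```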
+
+**Request body** (`WithdrawRequest`):
+```json
+{
+  "reason": "Duplicate submission",
+  "version": "1.0"
+}
+```
+
+**Behavior:**
+- Sets `status = "withdrawn"`
+- Appends `{action: "withdrawn", user_id, timestamp, reason}` to `curation_history`
+- If this was `latest=True` in a version chain, restores `latest=True` on the prior version
+
+### Resubmit — `POST /submissions/{source_id}/resubmit`
+
+Return a rejected submission to the curation queue. Only allowed when status is `rejected`.
+
+**Request body** (`ResubmitRequest`):
+```json
+{
+  "notes": "Fixed the title and added missing authors",
+  "version": "1.0"
+}
+```
+
+**Typical workflow:** Reject → edit metadata → resubmit → curator sees it in the pending queue again.
+
+### Version Diff — `GET /versions/{source_id}/diff?from=1.0&to=2.0`
+
+Compare metadata between two versions. No authentication required (matches the existing `/versions/{source_id}` pattern).
+
+**Response:**
+```json
+{
+  "success": true,
+  "source_id": "mdf-abc123",
+  "from_version": {"version": "1.0", "status": "published", "created_at": "..."},
+  "to_version": {"version": "2.0", "status": "published", "created_at": "..."},
+  "diff": {
+    "added": {"new_field": "value"},
+    "removed": {"old_field": "was_value"},
+    "changed": {"title": {"from": "Old Title", "to": "New Title"}},
+    "unchanged": ["authors", "keywords"]
+  }
+}
+```
+System fields (`version`, `latest`, `previous_version`, `root_version`, `update`, `test`) are excluded from the diff.
+
+### Submissions Filtering — `GET /submissions?status=&include_counts=true`
+
+`GET /submissions` accepts two new query parameters:
+
+| Param | Type | Description |
+|-------|------|-------------|
+| `status` | string | Comma-separated status filter (e.g. `?status=published,pending_curation`) |
+| `include_counts` | bool | When `true`, the response includes per-status counts over all submissions |
+
+**Response with counts:**
+```json
+{
+  "success": true,
+  "submissions": [...],
+  "counts": {"pending_curation": 3, "published": 12, "rejected": 1},
+  "total": 16,
+  "next_key": null
+}
+```
+Counts reflect all submissions before status filtering. The `submissions` list respects the `status` filter and `limit`.
+
+### Soft-Delete — `POST /submissions/{source_id}/delete`
+
+Curator-only endpoint to soft-delete any submission. Records the deletion in `curation_history` and sets `status = "deleted"`, `deleted_at`, and `deleted_by`.
+
+**Request body** (`DeleteSubmissionRequest`):
+```json
+{
+  "reason": "Spam submission",
+  "version": "1.0"
+}
+```
+
+`reason` is required. `version` is optional (defaults to latest). Returns 400 if the submission is already deleted.
+
+### Dataset Access Stats — `GET /stats/{source_id}`
+
+Public endpoint returning aggregate access metrics for a published dataset.
+
+**Response:**
+```json
+{
+  "success": true,
+  "source_id": "mdf-abc123",
+  "view_count": 142,
+  "download_count": 37,
+  "version_count": 3,
+  "first_published": "2025-06-15T12:00:00Z",
+  "last_updated": "2026-02-28T09:30:00Z"
+}
+```
+
+Counters are incremented atomically:
+- `view_count`: incremented on `GET /card/{id}` and `GET /detail/{slug}` (not on citation or preview fetches)
+- `download_count`: incremented on `POST /stream/{id}/download-url`
+
+### Admin Stats — `GET /admin/stats`
+
+Curator-only endpoint returning aggregate statistics.
+
+**Response:**
+```json
+{
+  "success": true,
+  "total": 156,
+  "by_status": {"published": 100, "pending_curation": 20, "rejected": 5, "deleted": 3, ...},
+  "access_totals": {"view_count": 15230, "download_count": 4521}
+}
+```
+
+### Version Pagination — `GET /versions/{source_id}?limit=&offset=`
+
+The versions endpoint supports pagination with `limit` (default 50, max 500) and `offset` (default 0). The response includes `total_count` for client-side pagination. Unauthenticated requests see only published versions.
+
+---
+
+## Infrastructure Hardening
+
+### Global Exception Handler
+
+Every unhandled exception returns a structured error with a `request_id`:
+```json
+{"detail": "Internal server error", "request_id": "a1b2c3d4e5f6"}
+```
+The same `request_id` is logged server-side for correlation.
+
+### CORS Credential Fix
+
+When `CORS_ALLOWED_ORIGINS` is set to specific origins (not `*`), `allow_credentials=True` is enabled to support authenticated cross-origin requests. When it is set to `*`, credentials are disabled, because browsers reject credentialed requests when `Access-Control-Allow-Origin` is the wildcard.
+
+### SQS Dead-Letter Queue
+
+Failed async jobs (profile generation, publish, transfer) are sent to a DLQ after 3 retries. The DLQ is configured in `template.yaml` as `MdfDLQ` with a 14-day retention period.
+
+### Stream Endpoints
+
+Stream-related routers (`streams.py`) are currently disabled in `app/__init__.py` pending feature completion. File upload/download endpoints in `files.py` remain active for existing streams.
+
+### Published-Only Gates
+
+Public endpoints (`/card`, `/citation`, `/preview`, `/stats`) only return data for submissions with `status == "published"`. The `/status/{source_id}` and `/versions/{source_id}` endpoints restrict non-owner/non-curator access to published records only.
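
The two gating rules above can be sketched as a pair of predicates. This is a minimal illustration, not the backend's actual code: the helper names `is_public` and `can_view` and the `owner_id` field are assumptions made for the example, and only the `status == "published"` check comes from the description above.

```python
from typing import Optional


def is_public(record: dict) -> bool:
    """Gate for public endpoints: only published records are served."""
    return record.get("status") == "published"


def can_view(record: dict, user_id: Optional[str] = None,
             is_curator: bool = False) -> bool:
    """Gate for status/versions-style endpoints.

    Published records are visible to everyone; anything else is visible
    only to curators or to the owner. `owner_id` is a hypothetical field
    name used for this sketch.
    """
    if is_public(record) or is_curator:
        return True
    # Anonymous callers (user_id is None) never match an owner.
    return user_id is not None and record.get("owner_id") == user_id


def visible_versions(records: list, user_id: Optional[str] = None,
                     is_curator: bool = False) -> list:
    """Filter a version list down to what the caller may see."""
    return [r for r in records if can_view(r, user_id, is_curator)]
```

Keeping the check in one predicate means every endpoint applies the same rule, and an anonymous caller falls through to the published-only branch by default.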