# INSDC Benchmarking Scripts

A benchmarking toolkit for INSDC data access (ENA, SRA, DDBJ).

This project focuses on:

- reproducible benchmarking using deterministic datasets
- performance measurement across protocols (FTP, HTTP via wget)
- checksum validation for data integrity
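
The performance-measurement idea can be sketched as timing a chunked transfer and deriving an average speed. This is a minimal illustration only; the function name and the returned keys are chosen for the example and are not the toolkit's actual API:

```python
import time

def measure_transfer(chunks):
    """Time a transfer and report size, duration, and average speed.

    `chunks` is any iterable of byte chunks, e.g. a streaming download.
    """
    start = time.monotonic()
    total_bytes = 0
    for chunk in chunks:
        total_bytes += len(chunk)
    duration = time.monotonic() - start
    # Average speed in megabits per second.
    speed_mbps = (total_bytes * 8 / 1_000_000) / duration if duration > 0 else 0.0
    return {
        "file_size_bytes": total_bytes,
        "duration_sec": duration,
        "average_speed_mbps": speed_mbps,
    }

# In-memory stand-in for a real download stream:
result = measure_transfer([b"x" * 1024] * 4)
```

In the real tools the chunk source would be the wget or ftplib transfer; here a list of byte strings stands in.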

---

## 🚀 Primary Usage (Recommended)

The main supported entry points are:

- `benchmark-http` (uses wget over HTTP/HTTPS)
- `benchmark-ftp` (uses Python ftplib)

These commands provide the core benchmarking functionality.

---

## 🧪 Example Usage

### HTTP (via wget)

```bash
poetry run benchmark-http \
  --dataset ERR3853594 \
  --repository ENA \
  --deterministic-dataset-file scripts/data/deterministic_datasets_v2.csv \
  --no-submit
```

### FTP

```bash
poetry run benchmark-ftp \
  --dataset ERR3853594 \
  --repository ENA \
  --deterministic-dataset-file scripts/data/deterministic_datasets_v2.csv \
  --no-submit
```

---

## 📊 Deterministic Dataset

A curated dataset (`deterministic_datasets_v2.csv`) is used to ensure:

- stable file URLs
- correct MD5 checksums
- reproducible benchmarking

This dataset was rebuilt after the initial dataset was found to contain incorrect checksums.
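
A deterministic dataset file of this kind can be consumed with the standard `csv` module. The column names below (`accession`, `url`, `md5`) are assumptions for illustration and may not match the real CSV header:

```python
import csv
import io

# Hypothetical sketch: a two-column lookup from accession to its row.
# The real deterministic_datasets_v2.csv may use different column names.
sample = io.StringIO(
    "accession,url,md5\n"
    "ERR3853594,https://example.org/ERR3853594.fastq.gz,d41d8cd98f00b204e9800998ecf8427e\n"
)

datasets = {row["accession"]: row for row in csv.DictReader(sample)}
expected_md5 = datasets["ERR3853594"]["md5"]
```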

---

## ✅ Checksum Validation

For each run:

1. File is downloaded
2. MD5 checksum is computed
3. Compared against expected checksum

### Result logic

- success → download OK + checksum match
- fail → download failed OR checksum mismatch

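
The three steps above can be sketched with Python's `hashlib`, streaming the file so large downloads are never held in memory (the function names are illustrative, not the toolkit's API):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute a file's MD5 in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def run_status(download_ok, actual_md5, expected_md5):
    """Mirror the result logic: success only if download and checksum both pass."""
    return "success" if download_ok and actual_md5 == expected_md5 else "fail"
```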
---

## 📐 Schema Alignment

Results are designed to align with the INSDC benchmarking schema:

https://github.com/AustralianBioCommons/insdc-benchmarking-schema

Important:

- HTTP benchmarking uses `"protocol": "wget"`
- FTP benchmarking uses `"protocol": "ftp"`
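
A minimal, hypothetical result record illustrating that naming convention (only a few representative fields are shown; the authoritative field set is defined by the schema repository):

```python
import json

# Sketch of a schema-aligned record. Field values are examples only;
# note the HTTP tool reports "wget" as its protocol, not "http".
record = {
    "site": "nci",
    "protocol": "wget",
    "repository": "ENA",
    "dataset_id": "ERR3853594",
    "status": "success",
}
payload = json.dumps(record)
```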

---

## 📌 Scope

- currently benchmarks the first file per run
- checksum validation applies per run
- multi-file benchmarking is not yet implemented

---

## ⚙️ Optional: Benchmark Runner

A batch runner (`benchmark-runner`) exists for:

- running multiple datasets
- filtering by category/status
- aggregating results into CSV

However:

- it is not the primary interface
- its CLI differs from the main commands
- it is best suited for internal or large-scale runs
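
Conceptually, the aggregation step can be pictured as flattening per-run JSON results into one CSV. The sketch below is hypothetical and does not reflect the runner's actual CLI or field names:

```python
import csv
import io
import json

# Hypothetical aggregation: two per-run JSON results flattened to CSV rows.
runs = [
    json.loads('{"dataset_id": "ERR3853594", "protocol": "wget", "status": "success"}'),
    json.loads('{"dataset_id": "ERR3853594", "protocol": "ftp", "status": "fail"}'),
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["dataset_id", "protocol", "status"])
writer.writeheader()
writer.writerows(runs)
summary_csv = buffer.getvalue()
```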

---

## 🛣️ Roadmap

- multi-file benchmarking
- category-based batch execution
- schema validation (pending Python upgrade)
- reporting and summaries

---

## 📄 License

Apache 2.0

Maintained by Australian BioCommons, University of Melbourne.