D-SCRIPT (Deep Learning for Structure-Aware Protein-Protein Interaction Prediction) is a deep learning method for predicting physical interactions between proteins using only their sequences. The project uses structure-aware design to enhance interpretability and cross-species generalizability.
- Current Version: 0.3.1
- License: MIT
- Python Requirement: >=3.10
- Primary Frameworks: PyTorch (>=1.13), NumPy, Pandas, scikit-learn
- Protein-protein interaction (PPI) prediction from sequences
- Language model-based protein embeddings (Bepler+Berger)
- Structure-aware contact map predictions
- Multi-GPU support for inference with blocked memory-efficient processing
- Pre-trained models available via HuggingFace
- Supports Topsy-Turvy (network information) and TT3D (structure information)
- Original D-SCRIPT: Cell Systems (2021) DOI: 10.1016/j.cels.2021.08.010
- Topsy-Turvy: Bioinformatics (2022)
- TT3D: Bioinformatics (2023)
- BMPI: Bioinformatics (2024) DOI: 10.1093/bioinformatics/btaf564
D-SCRIPT/
├── dscript/ # Main Python package
│ ├── __init__.py # Package initialization (v0.3.1)
│ ├── __main__.py # CLI entry point with argparse subcommands
│ ├── alphabets.py # Protein sequence alphabets
│ ├── fasta.py # FASTA file parsing (uses biotite)
│ ├── foldseek.py # FoldSeek 3Di structure sequence support
│ ├── glider.py # Data loading utilities
│ ├── language_model.py # Bepler+Berger language model
│ ├── pretrained.py # Pre-trained model loading
│ ├── utils.py # Logging, device parsing, data structures
│ ├── loading.py # Parallel HDF5 loading
│ ├── load_worker.py # Worker for parallel loading
│ ├── commands/ # CLI command implementations
│ │ ├── embed.py # Generate embeddings from sequences
│ │ ├── train.py # Model training
│ │ ├── evaluate.py # Model evaluation metrics
│ │ ├── predict_serial.py # Serial prediction (legacy)
│ │ ├── predict_block.py # Blocked multi-GPU prediction (default)
│ │ ├── predict_bipartite.py # Cross-species prediction
│ │ ├── extract_3di.py # Extract 3Di sequences
│ │ ├── par_worker.py # Parallel prediction worker
│ │ └── par_writer.py # Parallel prediction writer
│ ├── models/ # Neural network architectures
│ │ ├── embedding.py # Projection layers (FullyConnectedEmbed, SkipLSTM)
│ │ ├── contact.py # Contact map CNN
│ │ └── interaction.py # Main interaction model (ModelInteraction, DSCRIPTModel)
│ └── tests/ # Unit tests (pytest)
│ ├── test_fasta.py
│ ├── test_alphabets.py
│ ├── test_models_*.py
│ ├── test_commands.py
│ └── ...
├── data/ # Sample data and test files
│ ├── seqs/ # FASTA sequences
│ ├── pairs/ # TSV pair files
│ └── *.fa # FoldSeek sequences
├── docs/ # Sphinx documentation
│ └── source/
├── bash_files/ # Training/testing bash scripts
├── scripts/ # Utility scripts
│ ├── bmpi_bench/ # BMPI benchmarking
│ └── push_huggingface.py # HuggingFace model upload
├── notebooks/ # Jupyter notebooks
├── pyproject.toml # Package configuration
├── environment.yml # Conda environment
├── .pre-commit-config.yaml # Pre-commit hooks
├── .github/workflows/ # CI/CD pipelines
└── CHANGELOG.md # Version history
The D-SCRIPT model follows a three-stage pipeline:

1. Embedding: Projects protein language model embeddings to lower dimensions
   - FullyConnectedEmbed: Fully-connected projection with dropout
   - SkipLSTM: LSTM-based projection (for the language model)
2. Contact Prediction: Predicts inter-protein contact maps
   - ContactCNN: CNN operating on the outer product of embeddings
   - Outputs a contact probability map (N × M)
3. Interaction Prediction: Aggregates contact maps into an interaction probability
   - ModelInteraction: Main model combining embedding + contact modules
   - Weighted pooling with learnable parameters (θ, λ, γ)
   - Logistic activation for the final prediction
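The shape flow through the three stages can be sketched with plain tensors. This is a hedged illustration only: the dimensions N, M, d are placeholders, and the dot-product contact scoring and mean pooling stand in for the learned ContactCNN and weighted pooling, not the actual internals of ModelInteraction.

```python
import torch

# Hypothetical dimensions: two proteins of lengths N and M,
# projected embedding dimension d (placeholders, not real defaults).
N, M, d = 50, 70, 100

# Stage 1 (Embedding): per-residue embeddings after projection
z0 = torch.randn(N, d)
z1 = torch.randn(M, d)

# Stage 2 (Contact Prediction): every residue pair gets a score;
# a simple dot product stands in for the CNN on the outer product.
contact_map = torch.sigmoid(z0 @ z1.T)  # shape (N, M)

# Stage 3 (Interaction Prediction): pool the contact map into a
# single probability (mean pooling stands in for the learned
# weighted pooling with theta/lambda/gamma).
p_interact = contact_map.mean()

print(contact_map.shape, float(p_interact))
```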
ModelInteraction (dscript/models/interaction.py:51)
- Core model class combining all components
- Key methods:
  - embed(x): Project embeddings
  - cpred(z0, z1, ...): Predict contact map
  - predict(z0, z1, ...): Predict interaction probability
  - map_predict(z0, z1, ...): Return both contact map and probability

DSCRIPTModel (dscript/models/interaction.py:265)
- HuggingFace-compatible wrapper
- Inherits from ModelInteraction and PyTorchModelHubMixin
- Used for saving/loading models to the HuggingFace Hub
Pre-trained models are managed via dscript/pretrained.py:
- human_v1: Original D-SCRIPT (Cell Systems 2021)
- human_v2: Topsy-Turvy (default, recommended)
- human_tt3d: TT3D with FoldSeek 3Di sequences
- lm_v1: Bepler & Berger language model

Models can be loaded from:
- HuggingFace Hub (e.g., samsl/topsy_turvy_human_v1)
- Local disk via get_pretrained(version)
- Direct state dict download from the MIT server
Input Formats:
- FASTA (.fasta): Protein sequences (parsed with biotite)
- TSV (.tsv): Protein pairs (tab-separated, no header)
  - Format: protein1\tprotein2\t[optional_label]
- HDF5 (.h5): Embeddings storage
  - Keys: protein identifiers
  - Values: embedding tensors
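A pairs file in this format can be read with pandas. The column names below are illustrative (the file itself has no header):

```python
import io

import pandas as pd

# Simulated pairs TSV: tab-separated, no header, optional label column
tsv = "P12345\tQ67890\t1\nP12345\tA0A0B4\t0\n"

pairs = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    header=None,
    names=["protein0", "protein1", "label"],  # illustrative names
)
print(pairs.shape)
```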
Output Formats:
- Predictions: TSV with scores
- Embeddings: HDF5 files
- Trained models: PyTorch state dicts (.pt) or HuggingFace format
The package provides a unified CLI via the dscript command:

- dscript embed - Generate embeddings

  ```bash
  dscript embed --seqs data/seqs/ecoli.fasta --outfile ecoli_embed.h5
  ```

- dscript predict - Predict interactions (blocked mode, default)

  ```bash
  dscript predict --pairs data/pairs/ecoli.tsv \
      --embeddings ecoli_embed.h5 \
      --model samsl/topsy_turvy_human_v1 \
      --outfile predictions.tsv \
      --blocks 16 -d 0
  ```

- dscript predict_serial - Serial prediction (legacy)
- dscript predict_bipartite - Cross-species prediction
- dscript train - Train a new model

  ```bash
  dscript train --train train.tsv --test test.tsv \
      --embedding embeddings.h5 \
      --output output_dir --save-prefix model
  ```

- dscript evaluate - Evaluate model performance
- dscript extract-3di - Extract FoldSeek 3Di sequences

The CLI is implemented using argparse subcommands (dscript/__main__.py):
- Each command is a module in dscript/commands/
- Each module provides add_args(parser) and main(args) functions
- Type hints use union types for argument classes
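The dispatch pattern can be sketched as follows. This is a minimal stand-in, not the real entry point: the modules dict contents and the embed stub here are hypothetical, while the add_args(parser)/main(args) contract matches the one described above.

```python
import argparse

# Each command module exposes add_args(parser) and main(args).
def add_args(parser):
    parser.add_argument("--seqs", required=True)
    return parser

def main(args):
    # Stub standing in for the real embedding command
    return f"embedding {args.seqs}"

# Hypothetical registry mirroring the modules dict in __main__.py
modules = {"embed": (add_args, main)}

parser = argparse.ArgumentParser(prog="dscript")
subparsers = parser.add_subparsers(dest="command", required=True)
for name, (adder, _) in modules.items():
    adder(subparsers.add_parser(name))

args = parser.parse_args(["embed", "--seqs", "proteins.fasta"])
result = modules[args.command][1](args)
print(result)
```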
Using Conda (recommended):

```bash
conda env create -f environment.yml
conda activate dscript
```

Using pip:

```bash
pip install -e ".[dev]"  # Editable install with dev dependencies
```

Ruff - Linting and formatting (configured in pyproject.toml:66-87):
- Line length: 90 characters
- Target: Python 3.10+
- Enabled rules: pycodestyle (E/W), pyflakes (F), isort (I), pyupgrade (UP)
- Ignored: E501 (line too long)
- Quote style: double quotes
- Indentation: spaces
Pre-commit Hooks (.pre-commit-config.yaml):
- Check merge conflicts
- Check YAML syntax
- Fix end-of-file
- Trim trailing whitespace
- Ruff check with auto-fix
- Ruff format
Setting up pre-commit:

```bash
pip install pre-commit
pre-commit install
```

```bash
# Check code
ruff check . --statistics

# Format code
ruff format .

# Run pre-commit on all files
pre-commit run --all-files
```

- Framework: pytest
- Location: dscript/tests/
- Coverage: pytest-cov

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=dscript --cov-report=xml --cov-report=term-missing

# Run specific test file
pytest dscript/tests/test_fasta.py

# Run specific test
pytest dscript/tests/test_models_interaction.py::test_model_forward
```

Configuration in pyproject.toml:92-101:
- Test file pattern: test_*.py
- Test path: dscript/tests
- Warnings filtered: UserWarning, DeprecationWarning, PendingDeprecationWarning
Current test files cover:
- test_fasta.py: FASTA parsing
- test_alphabets.py: Protein alphabets
- test_models_embedding.py: Embedding layers
- test_models_contact.py: Contact prediction
- test_models_interaction.py: Full interaction model
- test_pretrained.py: Model loading
- test_language_model.py: Language model
- test_foldseek.py: 3Di sequence handling
- test_commands.py: CLI commands
1. pytest (.github/workflows/autorun-tests.yml)
- Triggers: Push/PR to main branch
- Steps:
- Setup Python 3.10
- Install dependencies with dev/test extras
- Run ruff check and format
- Run pytest with coverage
- Generate coverage.xml
2. docs-build (.github/workflows/docs-build.yml)
- Builds Sphinx documentation
- Publishes to ReadTheDocs
3. python-publish (.github/workflows/python-publish.yml)
- Publishes package to PyPI
- Triggered on release tags
- Line Length: 90 characters (pyproject.toml:67)
- Quotes: Double quotes for strings
- Imports: Sorted with isort, grouped by:
- Standard library
- Third-party packages
- Local imports (dscript.*)
- Type Hints: Use type hints for function signatures where applicable
- Docstrings: Use reStructuredText format for Sphinx
- Variables: snake_case (e.g., state_dict_path)
- Functions: snake_case (e.g., get_pretrained)
- Classes: PascalCase (e.g., ModelInteraction, ContactCNN)
- Constants: UPPER_CASE (e.g., VALID_MODELS, ROOT_URL)
- Private: Leading underscore (e.g., self._internal_method)
1. Module Composition: Models compose smaller modules

   ```python
   embedding = FullyConnectedEmbed(...)
   contact = ContactCNN(...)
   model = ModelInteraction(embedding, contact, ...)
   ```

2. Device Handling: Pass the use_cuda flag to models

   ```python
   model = DSCRIPTModel(..., use_cuda=True)
   ```

3. Parameter Clamping: Use .clip() methods to constrain parameters

   ```python
   def clip(self):
       self.theta.data.clamp_(min=0, max=1)
   ```
The project uses loguru for logging (migrated in v0.3.0):

```python
from dscript.utils import log, setup_logger

# Legacy compatibility wrapper
log("Message", file=log_file, print_also=True)

# Direct loguru usage
from loguru import logger
logger.info("Message")
```

- Validate inputs: Check file existence, data shapes, CUDA availability
- Informative errors: Provide context in error messages
- Graceful degradation: Handle download failures with retries
- Exit codes: Use sys.exit(1) for fatal errors
Example (dscript/utils.py:96-130):

```python
def parse_device(device_arg, logFile):
    ...
    if use_cuda and not torch.cuda.is_available():
        log("CUDA not available but GPU requested...", ...)
        sys.exit(1)
```

To add a new model:
- Create a model class in dscript/models/
- Inherit from the appropriate base (nn.Module, PyTorchModelHubMixin)
- Implement forward(), and clip() if needed
- Add to pretrained.py if a pre-trained version exists
- Write unit tests in dscript/tests/test_models_*.py
To add a new CLI command:
- Create a file at dscript/commands/new_command.py
- Implement the command interface:

  ```python
  def add_args(parser):
      parser.add_argument(...)

  def main(args):
      ...  # Implementation
  ```

- Add the command to the __main__.py modules dict
- Create a TypedDict for arguments (optional)
- Write tests in dscript/tests/test_commands.py
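Putting the steps together, a new command module might look like the sketch below. The greet command and its --name argument are hypothetical placeholders; only the add_args(parser)/main(args) contract comes from the project's conventions.

```python
# dscript/commands/greet.py (hypothetical example command)
import argparse

def add_args(parser):
    """Register this command's arguments on the given subparser."""
    parser.add_argument("--name", required=True, help="Name to greet")
    return parser

def main(args):
    """Entry point invoked by the CLI dispatcher."""
    print(f"Hello, {args.name}!")
    return 0

# Standalone check of the add_args/main contract:
parser = argparse.ArgumentParser()
add_args(parser)
exit_code = main(parser.parse_args(["--name", "D-SCRIPT"]))
```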
Data loading is parallelized using LoadingPool (dscript/loading.py):
- HDF5 files are loaded via load_hdf5_parallel()
- A multiprocessing pool provides parallel access
- Used extensively in prediction commands

When modifying:
- Test with various n_jobs settings
- Ensure thread-safety for HDF5 access
- Validate key existence before loading
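A thread-based sketch of the pattern, not the actual load_hdf5_parallel() implementation: each worker opens its own read-only handle (so handles are never shared between workers) and validates the key before reading.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import h5py
import numpy as np

# Create a small embeddings file to load from.
path = os.path.join(tempfile.mkdtemp(), "embeddings.h5")
with h5py.File(path, "w") as f:
    for name in ["protein1", "protein2", "protein3"]:
        f[name] = np.random.rand(10, 4)

def load_one(key):
    # Each worker opens its own read-only handle (no shared handles).
    with h5py.File(path, "r") as f:
        if key not in f:  # validate key existence before loading
            raise KeyError(key)
        return key, f[key][:]

with ThreadPoolExecutor(max_workers=2) as pool:  # cf. n_jobs
    embeddings = dict(pool.map(load_one, ["protein1", "protein2", "protein3"]))

print(sorted(embeddings))
```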
Embeddings are stored in HDF5 format with the structure:

```
embeddings.h5
├── protein1 -> [N × d] array
├── protein2 -> [N × d] array
└── ...
```

Key considerations:
- Protein names (keys) must match between the pairs TSV and the embeddings file
- Sequences can have variable length (N)
- Embedding dimension (d) must match model expectations
- Use glider.py utilities for reading/writing
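A minimal h5py round-trip under this layout (the protein names, lengths, and the dimension d below are illustrative, not the model's actual values):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "embeddings.h5")
d = 100  # must match the model's expected dimension; illustrative here

# Write: one dataset per protein, variable length N, fixed dimension d
with h5py.File(path, "w") as f:
    f["protein1"] = np.zeros((120, d), dtype=np.float32)  # N = 120
    f["protein2"] = np.zeros((87, d), dtype=np.float32)   # N = 87

# Read back: keys must match the identifiers used in the pairs TSV
with h5py.File(path, "r") as f:
    keys = sorted(f.keys())
    shape1 = f["protein1"].shape
print(keys, shape1)
```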
```bash
# 1. Generate embeddings
dscript embed --seqs proteins.fasta --outfile embeddings.h5

# 2. Make predictions
dscript predict --pairs pairs.tsv \
    --embeddings embeddings.h5 \
    --model samsl/topsy_turvy_human_v1 \
    --outfile predictions.tsv

# 3. Evaluate (if labels are available)
dscript evaluate --pairs test_pairs.tsv \
    --embeddings embeddings.h5 \
    --model samsl/topsy_turvy_human_v1 \
    --outfile metrics.json
```

Training a new model:

```bash
dscript train --train train.tsv \
    --test test.tsv \
    --embedding embeddings.h5 \
    --output output_dir \
    --save-prefix my_model \
    --device 0 \
    --batch-size 32 \
    --num-epochs 100
```

Large-scale multi-GPU prediction:

```bash
# Use all GPUs with 16 blocks
dscript predict --pairs large_pairs.tsv \
    --embeddings embeddings.h5 \
    --model samsl/topsy_turvy_human_v1 \
    --blocks 16 \
    -d all \
    --outfile predictions.tsv
```

Loading pre-trained models in Python:

```python
from dscript.pretrained import get_pretrained
from dscript.models.interaction import DSCRIPTModel
from huggingface_hub import hf_hub_download

# Option 1: Load a built-in model
model = get_pretrained("human_v2")

# Option 2: Load from HuggingFace
model = DSCRIPTModel.from_pretrained("samsl/topsy_turvy_human_v1")

# Option 3: Load from a local file
model = DSCRIPTModel.from_pretrained("/path/to/model/")
```

Writing a unit test:

```python
# dscript/tests/test_new_feature.py
import pytest
import torch

from dscript.models.interaction import ModelInteraction

def test_model_forward():
    """Test model forward pass"""
    model = ModelInteraction(...)
    z0 = torch.randn(1, 100, 100)
    z1 = torch.randn(1, 100, 100)
    output = model(z0, z1)
    assert output.shape == torch.Size([1])
    assert 0 <= output.item() <= 1
```

- Always read files before modifying: Don't propose changes to code you haven't seen
- Run tests after changes: Use pytest to verify functionality
- Check code style: Run ruff check and ruff format before committing
- Update documentation: If changing APIs, update docstrings and docs/
- Follow version history: Check CHANGELOG.md for context on previous changes
- Input validation: Always validate file paths and URLs (dscript/pretrained.py:107-111)
- Device validation: Check CUDA availability before GPU operations
- URL schemes: Only allow http/https for downloads
- No arbitrary code execution: Don't use eval() or exec()

Performance considerations:
- Memory efficiency: Use blocked loading for large datasets (predict_block.py)
- GPU utilization: Support multi-GPU with device=-1
- Parallel processing: Use multiprocessing for HDF5 loading
- Sparse loading: Only load required embeddings when possible
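The blocked strategy can be sketched as follows. This is a simplified stand-in for predict_block.py, not its actual code: pairs are partitioned into blocks, and only the embeddings each block touches are held in memory at once. The pair list, embedding store, and predict stub are all hypothetical.

```python
# Hypothetical pair list and embedding store (stand-ins for the TSV / HDF5)
pairs = [(f"p{i}", f"p{j}") for i in range(6) for j in range(i + 1, 6)]
store = {f"p{i}": [float(i)] for i in range(6)}  # fake per-protein embeddings

def predict(e0, e1):
    # Placeholder for the model forward pass
    return abs(e0[0] - e1[0])

n_blocks = 4
block_size = -(-len(pairs) // n_blocks)  # ceiling division
scores = {}
for b in range(n_blocks):
    block = pairs[b * block_size : (b + 1) * block_size]
    # Sparse loading: fetch only the embeddings this block needs
    needed = {p for pair in block for p in pair}
    cache = {p: store[p] for p in needed}
    for p0, p1 in block:
        scores[(p0, p1)] = predict(cache[p0], cache[p1])

print(len(scores))  # one score per pair
```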
- CUDA availability: Always check torch.cuda.is_available() before GPU ops
- Model eval mode: Set model.eval() before inference (disables dropout)
- Protein key matching: Ensure FASTA headers match TSV pair identifiers
- Embedding dimensions: Verify the embedding dimension matches model expectations
- File handles: Close log files and HDF5 files properly
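Several of these pitfalls can be guarded against up front. A hedged sketch (the model, embedding store, and pair list here are trivial stand-ins, not D-SCRIPT objects):

```python
import torch

model = torch.nn.Sequential(torch.nn.Dropout(0.5), torch.nn.Linear(4, 1))

# CUDA availability: fall back to CPU rather than assuming a GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Eval mode: disables dropout so predictions are deterministic
model.eval()

# Protein key matching: verify pair identifiers exist in the embedding store
embeddings = {"protein1": torch.zeros(4), "protein2": torch.zeros(4)}
pairs = [("protein1", "protein2")]
missing = {p for pair in pairs for p in pair if p not in embeddings}
if missing:
    raise KeyError(f"Pairs reference unknown proteins: {sorted(missing)}")

with torch.no_grad():  # inference only, no autograd bookkeeping
    score = model(embeddings["protein1"].to(device))

print(model.training, len(missing))
```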
- Documentation: https://d-script.readthedocs.io/
- PyPI Package: https://pypi.org/project/dscript/
- GitHub: https://github.com/samsledje/D-SCRIPT
- HuggingFace Demo: https://huggingface.co/spaces/samsl/D-SCRIPT
- Paper: https://doi.org/10.1016/j.cels.2021.08.010
- main: Stable release branch
- Feature branches: feature/description or claude/session-id
- Bug fixes: fix/description
Follow conventional commits style:
- feat: Add new model architecture
- fix: Resolve CUDA device error
- docs: Update API documentation
- test: Add unit tests for contact model
- refactor: Simplify embedding loading
- chore: Update dependencies
```bash
# Run pre-commit hooks
pre-commit run --all-files

# Run tests
pytest

# Check for common issues
ruff check .
```

- Update the version in dscript/__init__.py
- Update CHANGELOG.md
- Commit changes
- Create a git tag: git tag v0.3.1
- Push the tag: git push origin v0.3.1
- The GitHub Action automatically publishes to PyPI
Use scripts/push_huggingface.py to upload trained models to HuggingFace Hub.
Core dependencies (pyproject.toml:15-29):
- torch (>=1.13): Deep learning framework
- biotite (==1.2.0): FASTA parsing (replaced BioPython in v0.3.1)
- numpy, scipy, pandas: Numerical computing
- scikit-learn: Metrics and utilities
- h5py: HDF5 file handling
- huggingface_hub: Model hosting
- loguru: Logging (migrated in v0.3.0)
- tqdm: Progress bars
From CHANGELOG.md:
- Expand the test suite to maximize coverage
- Continue improving documentation

Version history:
- v0.3.1: Replaced BioPython with biotite, removed the local foldseek requirement
- v0.3.0: Major modernization - BMPI support, loguru migration, Ruff adoption
- v0.2.0: Topsy-Turvy integration, parallel HDF5 loading
- v0.1.8: Training bug fixes for paper replication
This guide is current as of v0.3.1 (2025). For the latest updates, refer to CHANGELOG.md and the official documentation.