D-SCRIPT - AI Assistant Development Guide

Project Overview

D-SCRIPT (Deep Learning for Structure-Aware Protein-Protein Interaction Prediction) is a deep learning method for predicting physical interactions between proteins using only their sequences. The project uses structure-aware design to enhance interpretability and cross-species generalizability.

Current Version: 0.3.1 License: MIT Python Requirements: >=3.10 Primary Frameworks: PyTorch (>=1.13), NumPy, Pandas, SciKit-Learn

Key Features

Protein-protein interaction (PPI) prediction from sequences
Language model-based protein embeddings (Bepler+Berger)
Structure-aware contact map predictions
Multi-GPU support for inference with blocked memory-efficient processing
Pre-trained models available via HuggingFace
Supports Topsy-Turvy (network information) and TT3D (structure information)

Published Research

Original D-SCRIPT: Cell Systems (2021) DOI: 10.1016/j.cels.2021.08.010
Topsy-Turvy: Bioinformatics (2022)
TT3D: Bioinformatics (2023)
BMPI: Bioinformatics (2024) DOI: 10.1093/bioinformatics/btaf564

Repository Structure

D-SCRIPT/
├── dscript/                    # Main Python package
│   ├── __init__.py            # Package initialization (v0.3.1)
│   ├── __main__.py            # CLI entry point with argparse subcommands
│   ├── alphabets.py           # Protein sequence alphabets
│   ├── fasta.py               # FASTA file parsing (uses biotite)
│   ├── foldseek.py            # FoldSeek 3Di structure sequence support
│   ├── glider.py              # Data loading utilities
│   ├── language_model.py      # Bepler+Berger language model
│   ├── pretrained.py          # Pre-trained model loading
│   ├── utils.py               # Logging, device parsing, data structures
│   ├── loading.py             # Parallel HDF5 loading
│   ├── load_worker.py         # Worker for parallel loading
│   ├── commands/              # CLI command implementations
│   │   ├── embed.py           # Generate embeddings from sequences
│   │   ├── train.py           # Model training
│   │   ├── evaluate.py        # Model evaluation metrics
│   │   ├── predict_serial.py  # Serial prediction (legacy)
│   │   ├── predict_block.py   # Blocked multi-GPU prediction (default)
│   │   ├── predict_bipartite.py # Cross-species prediction
│   │   ├── extract_3di.py     # Extract 3Di sequences
│   │   ├── par_worker.py      # Parallel prediction worker
│   │   └── par_writer.py      # Parallel prediction writer
│   ├── models/                # Neural network architectures
│   │   ├── embedding.py       # Projection layers (FullyConnectedEmbed, SkipLSTM)
│   │   ├── contact.py         # Contact map CNN
│   │   └── interaction.py     # Main interaction model (ModelInteraction, DSCRIPTModel)
│   └── tests/                 # Unit tests (pytest)
│       ├── test_fasta.py
│       ├── test_alphabets.py
│       ├── test_models_*.py
│       ├── test_commands.py
│       └── ...
├── data/                      # Sample data and test files
│   ├── seqs/                  # FASTA sequences
│   ├── pairs/                 # TSV pair files
│   └── *.fa                   # FoldSeek sequences
├── docs/                      # Sphinx documentation
│   └── source/
├── bash_files/                # Training/testing bash scripts
├── scripts/                   # Utility scripts
│   ├── bmpi_bench/           # BMPI benchmarking
│   └── push_huggingface.py   # HuggingFace model upload
├── notebooks/                 # Jupyter notebooks
├── pyproject.toml            # Package configuration
├── environment.yml           # Conda environment
├── .pre-commit-config.yaml   # Pre-commit hooks
├── .github/workflows/        # CI/CD pipelines
└── CHANGELOG.md             # Version history

Architecture & Core Components

Neural Network Architecture

The D-SCRIPT model follows a three-stage pipeline:

Embedding: Projects protein language model embeddings to lower dimensions
- FullyConnectedEmbed: Fully-connected projection with dropout
- SkipLSTM: LSTM-based projection (for language model)
Contact Prediction: Predicts inter-protein contact maps
- ContactCNN: CNN operating on outer product of embeddings
- Outputs contact probability map (N × M)
Interaction Prediction: Aggregates contact maps to interaction probability
- ModelInteraction: Main model combining embedding + contact
- Weighted pooling with learnable parameters (θ, λ, γ)
- Logistic activation for final prediction

Model Classes

ModelInteraction (dscript/models/interaction.py:51)

Core model class combining all components
Key methods:
- embed(x): Project embeddings
- cpred(z0, z1, ...): Predict contact map
- predict(z0, z1, ...): Predict interaction probability
- map_predict(z0, z1, ...): Return both contact map and probability

DSCRIPTModel (dscript/models/interaction.py:265)

HuggingFace-compatible wrapper
Inherits from ModelInteraction and PyTorchModelHubMixin
Used for saving/loading models to HuggingFace Hub

Pre-trained Models

Pre-trained models are managed via dscript/pretrained.py:

human_v1: Original D-SCRIPT (Cell Systems 2021)
human_v2: Topsy-Turvy (default, recommended)
human_tt3d: TT3D with FoldSeek 3Di sequences
lm_v1: Bepler & Berger language model

Models can be loaded from:

HuggingFace Hub (e.g., samsl/topsy_turvy_human_v1)
Local disk via get_pretrained(version)
Direct state dict download from MIT server

Data Formats

Input Formats:

FASTA (.fasta): Protein sequences (parsed with biotite)
TSV (.tsv): Protein pairs (tab-separated, no header)
- Format: protein1\tprotein2\t[optional_label]
HDF5 (.h5): Embeddings storage
- Keys: protein identifiers
- Values: embedding tensors

Output Formats:

Predictions: TSV with scores
Embeddings: HDF5 files
Trained models: PyTorch state dicts (.pt) or HuggingFace format

CLI Commands

The package provides a unified CLI via dscript command:

Main Commands

dscript embed - Generate embeddings

dscript embed --seqs data/seqs/ecoli.fasta --outfile ecoli_embed.h5

dscript predict - Predict interactions (blocked mode, default)

dscript predict --pairs data/pairs/ecoli.tsv \
                --embeddings ecoli_embed.h5 \
                --model samsl/topsy_turvy_human_v1 \
                --outfile predictions.tsv \
                --blocks 16 -d 0

dscript predict_serial - Serial prediction (legacy)
dscript predict_bipartite - Cross-species prediction

dscript train - Train new model

dscript train --train train.tsv --test test.tsv \
              --embedding embeddings.h5 \
              --output output_dir --save-prefix model

dscript evaluate - Evaluate model performance
dscript extract-3di - Extract FoldSeek 3Di sequences

CLI Architecture

The CLI is implemented using argparse subcommands (dscript/main.py):

Each command is a module in dscript/commands/
Each module provides add_args(parser) and main(args) functions
Type hints use union types for argument classes

Development Workflows

Environment Setup

Using Conda (recommended):

conda env create -f environment.yml
conda activate dscript

Using pip:

pip install -e ".[dev]"  # Editable install with dev dependencies

Code Quality Tools

Ruff - Linting and formatting (configured in pyproject.toml:66-87)

Line length: 90 characters
Target: Python 3.10+
Enabled rules: pycodestyle (E/W), pyflakes (F), isort (I), pyupgrade (UP)
Ignored: E501 (line too long)
Quote style: double quotes
Indentation: spaces

Pre-commit Hooks (.pre-commit-config.yaml):

Check merge conflicts
Check YAML syntax
Fix end-of-file
Trim trailing whitespace
Ruff check with auto-fix
Ruff format

Setting up pre-commit:

pip install pre-commit
pre-commit install

Running Linters

# Check code
ruff check . --statistics

# Format code
ruff format .

# Run pre-commit on all files
pre-commit run --all-files

Testing

Test Framework

Framework: pytest
Location: dscript/tests/
Coverage: pytest-cov

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=dscript --cov-report=xml --cov-report=term-missing

# Run specific test file
pytest dscript/tests/test_fasta.py

# Run specific test
pytest dscript/tests/test_models_interaction.py::test_model_forward

Test Configuration

Configuration in pyproject.toml:92-101:

Test file pattern: test_*.py
Test path: dscript/tests
Warnings filtered: UserWarning, DeprecationWarning, PendingDeprecationWarning

Test Coverage

Current test files cover:

test_fasta.py: FASTA parsing
test_alphabets.py: Protein alphabets
test_models_embedding.py: Embedding layers
test_models_contact.py: Contact prediction
test_models_interaction.py: Full interaction model
test_pretrained.py: Model loading
test_language_model.py: Language model
test_foldseek.py: 3Di sequence handling
test_commands.py: CLI commands

CI/CD Pipelines

GitHub Actions Workflows

1. pytest (.github/workflows/autorun-tests.yml)

Triggers: Push/PR to main branch
Steps:
1. Setup Python 3.10
2. Install dependencies with dev/test extras
3. Run ruff check and format
4. Run pytest with coverage
5. Generate coverage.xml

2. docs-build (.github/workflows/docs-build.yml)

Builds Sphinx documentation
Publishes to ReadTheDocs

3. python-publish (.github/workflows/python-publish.yml)

Publishes package to PyPI
Triggered on release tags

Key Coding Conventions

Code Style

Line Length: 90 characters (pyproject.toml:67)
Quotes: Double quotes for strings
Imports: Sorted with isort, grouped by:
- Standard library
- Third-party packages
- Local imports (dscript.*)
Type Hints: Use type hints for function signatures where applicable
Docstrings: Use reStructuredText format for Sphinx

Naming Conventions

Variables: snake_case (e.g., state_dict_path)
Functions: snake_case (e.g., get_pretrained)
Classes: PascalCase (e.g., ModelInteraction, ContactCNN)
Constants: UPPER_CASE (e.g., VALID_MODELS, ROOT_URL)
Private: Leading underscore (e.g., self._internal_method)

Model Design Patterns

Module Composition: Models compose smaller modules

embedding = FullyConnectedEmbed(...)
contact = ContactCNN(...)
model = ModelInteraction(embedding, contact, ...)

Device Handling: Pass use_cuda flag to models

model = DSCRIPTModel(..., use_cuda=True)

Parameter Clamping: Use .clip() methods to constrain parameters

def clip(self):
    self.theta.data.clamp_(min=0, max=1)

Logging

The project uses loguru for logging (migrated in v0.3.0):

from dscript.utils import log, setup_logger

# Legacy compatibility wrapper
log("Message", file=log_file, print_also=True)

# Direct loguru usage
from loguru import logger
logger.info("Message")

Error Handling

Validate inputs: Check file existence, data shapes, CUDA availability
Informative errors: Provide context in error messages
Graceful degradation: Handle download failures with retries
Exit codes: Use sys.exit(1) for fatal errors

Example (dscript/utils.py:96-130):

def parse_device(device_arg, logFile):
    if use_cuda and not torch.cuda.is_available():
        log("CUDA not available but GPU requested...", ...)
        sys.exit(1)

Working with the Codebase

Adding a New Model

Create model class in dscript/models/
Inherit from appropriate base (nn.Module, PyTorchModelHubMixin)
Implement forward(), clip() if needed
Add to pretrained.py if pre-trained version exists
Write unit tests in dscript/tests/test_models_*.py

Adding a New Command

Create file in dscript/commands/new_command.py

Implement:

def add_args(parser):
    parser.add_argument(...)

def main(args):
    # Implementation

Add to __main__.py modules dict
Create TypedDict for arguments (optional)
Write tests in dscript/tests/test_commands.py

Modifying Data Loading

Data loading is parallelized using LoadingPool (dscript/loading.py):

HDF5 files loaded via load_hdf5_parallel()
Multiprocessing pool for parallel access
Used extensively in prediction commands

When modifying:

Test with various n_jobs settings
Ensure thread-safety for HDF5 access
Validate key existence before loading

Working with Embeddings

Embeddings are stored in HDF5 format with structure:

embeddings.h5
├── protein1 -> [N × d] array
├── protein2 -> [N × d] array
└── ...

Key considerations:

Protein names (keys) must match between pairs TSV and embeddings
Sequences can have variable length (N)
Embedding dimension (d) must match model expectations
Use glider.py utilities for reading/writing

Common Tasks

Running the Full Pipeline

# 1. Generate embeddings
dscript embed --seqs proteins.fasta --outfile embeddings.h5

# 2. Make predictions
dscript predict --pairs pairs.tsv \
                --embeddings embeddings.h5 \
                --model samsl/topsy_turvy_human_v1 \
                --outfile predictions.tsv

# 3. Evaluate (if labels available)
dscript evaluate --pairs test_pairs.tsv \
                 --embeddings embeddings.h5 \
                 --model samsl/topsy_turvy_human_v1 \
                 --outfile metrics.json

Training a Custom Model

dscript train --train train.tsv \
              --test test.tsv \
              --embedding embeddings.h5 \
              --output output_dir \
              --save-prefix my_model \
              --device 0 \
              --batch-size 32 \
              --num-epochs 100

Using Multi-GPU Prediction

# Use all GPUs with 16 blocks
dscript predict --pairs large_pairs.tsv \
                --embeddings embeddings.h5 \
                --model samsl/topsy_turvy_human_v1 \
                --blocks 16 \
                -d all \
                --outfile predictions.tsv

Loading Pre-trained Models in Code

from dscript.pretrained import get_pretrained
from dscript.models.interaction import DSCRIPTModel
from huggingface_hub import hf_hub_download

# Option 1: Load from built-in
model = get_pretrained("human_v2")

# Option 2: Load from HuggingFace
model = DSCRIPTModel.from_pretrained("samsl/topsy_turvy_human_v1")

# Option 3: Load from local file
model = DSCRIPTModel.from_pretrained("/path/to/model/")

Adding Tests

# dscript/tests/test_new_feature.py
import pytest
import torch
from dscript.models.interaction import ModelInteraction

def test_model_forward():
    """Test model forward pass"""
    model = ModelInteraction(...)
    z0 = torch.randn(1, 100, 100)
    z1 = torch.randn(1, 100, 100)

    output = model(z0, z1)

    assert output.shape == torch.Size([1])
    assert 0 <= output.item() <= 1

Important Notes for AI Assistants

When Making Changes

Always read files before modifying: Don't propose changes to code you haven't seen
Run tests after changes: Use pytest to verify functionality
Check code style: Run ruff check and ruff format before committing
Update documentation: If changing APIs, update docstrings and docs/
Follow version history: Check CHANGELOG.md for context on previous changes

Security Considerations

Input validation: Always validate file paths, URLs (dscript/pretrained.py:107-111)
Device validation: Check CUDA availability before GPU operations
URL schemes: Only allow http/https for downloads
No arbitrary code execution: Don't use eval() or exec()

Performance Considerations

Memory efficiency: Use blocked loading for large datasets (predict_block.py)
GPU utilization: Support multi-GPU with device=-1
Parallel processing: Use multiprocessing for HDF5 loading
Sparse loading: Only load required embeddings when possible

Common Pitfalls

CUDA availability: Always check torch.cuda.is_available() before GPU ops
Model eval mode: Set model.eval() before inference (disables dropout)
Protein key matching: Ensure FASTA headers match TSV pair identifiers
Embedding dimensions: Verify embedding dimension matches model expectations
File handles: Close log files and HDF5 files properly

Useful References

Documentation: https://d-script.readthedocs.io/
PyPI Package: https://pypi.org/project/dscript/
GitHub: https://github.com/samsledje/D-SCRIPT
HuggingFace Demo: https://huggingface.co/spaces/samsl/D-SCRIPT
Paper: https://doi.org/10.1016/j.cels.2021.08.010

Git Workflow

Branching Strategy

main: Stable release branch
feature branches: feature/description or claude/session-id
bug fixes: fix/description

Commit Messages

Follow conventional commits style:

feat: Add new model architecture
fix: Resolve CUDA device error
docs: Update API documentation
test: Add unit tests for contact model
refactor: Simplify embedding loading
chore: Update dependencies

Before Pushing

# Run pre-commit hooks
pre-commit run --all-files

# Run tests
pytest

# Check for common issues
ruff check .

Package Distribution

PyPI Release Process

Update version in dscript/__init__.py
Update CHANGELOG.md
Commit changes
Create git tag: git tag v0.3.1
Push tag: git push origin v0.3.1
GitHub Action automatically publishes to PyPI

HuggingFace Model Upload

Use scripts/push_huggingface.py to upload trained models to HuggingFace Hub.

Project Maintenance

Dependencies

Core dependencies (pyproject.toml:15-29):

torch (>=1.13): Deep learning framework
biotite (==1.2.0): FASTA parsing (replaced BioPython in v0.3.1)
numpy, scipy, pandas: Numerical computing
scikit-learn: Metrics and utilities
h5py: HDF5 file handling
huggingface_hub: Model hosting
loguru: Logging (migrated in v0.3.0)
tqdm: Progress bars

Known Issues & TODOs

From CHANGELOG.md:

Expand test suite to maximize coverage
Continue improving documentation

Version History Highlights

v0.3.1: Replaced BioPython with biotite, removed local foldseek requirement
v0.3.0: Major modernization - BMPI support, loguru migration, Ruff adoption
v0.2.0: Topsy-Turvy integration, parallel HDF5 loading
v0.1.8: Training bug fixes for paper replication

This guide is current as of v0.3.1 (2025). For the latest updates, refer to CHANGELOG.md and the official documentation.

FilesExpand file tree

AGENTS.md

Latest commit

History