Pathfinder: Protein Structure Ensemble Clustering and Representative Selection

License: MIT Python 3.8+

Protocol Fig. 1: Overview of the Pathfinder clustering pipeline, from PDB input to ranked representatives.

Pathfinder is a tool for clustering protein structure ensembles (e.g., from AlphaFold predictions) and selecting representative conformations. It extracts features as residue distance maps or TM-score distance matrices, optionally reduces them with PCA, and clusters the result with K-means, hierarchical clustering, DBSCAN, spectral clustering, or Gaussian mixture models (GMM). When reference structures are supplied, representatives can also be ranked by their TM-align scores against those references.

The pipeline processes PDB files in ensembles, extracts features, clusters them, and outputs ranked representatives with metrics.
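
As a rough illustration of the steps just described (features, optional PCA, clustering, representative selection), here is a minimal sketch using NumPy and scikit-learn. It is not Pathfinder's own code: the function names `distance_map_features` and `cluster_and_pick`, the use of C-alpha coordinates, and the nearest-to-centroid rule for picking representatives are all assumptions made for the example, and the ensemble is synthetic rather than read from PDB files.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def distance_map_features(coords):
    """Flatten the upper triangle of each model's pairwise distance map.

    coords: array of shape (n_models, n_residues, 3), e.g. C-alpha positions
    (an assumption; the real tool may use other atoms).
    """
    diff = coords[:, :, None, :] - coords[:, None, :, :]
    dmaps = np.linalg.norm(diff, axis=-1)        # (n_models, L, L)
    iu = np.triu_indices(coords.shape[1], k=1)   # skip the zero diagonal
    return dmaps[:, iu[0], iu[1]]


def cluster_and_pick(features, n_clusters, n_pca_components, seed=0):
    """Reduce with PCA, cluster with K-means, pick the model nearest each centroid."""
    reduced = PCA(n_components=n_pca_components, random_state=seed).fit_transform(features)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(reduced)
    reps = []
    for k in range(n_clusters):
        members = np.flatnonzero(km.labels_ == k)
        dist = np.linalg.norm(reduced[members] - km.cluster_centers_[k], axis=1)
        reps.append(int(members[np.argmin(dist)]))
    return km.labels_, reps


# Synthetic stand-in for an ensemble: two states of a 20-residue chain,
# six noisy copies of each.
rng = np.random.default_rng(0)
state_a = rng.normal(size=(20, 3)) * 5.0
state_b = rng.normal(size=(20, 3)) * 5.0
coords = np.stack(
    [state_a + rng.normal(scale=0.1, size=(20, 3)) for _ in range(6)]
    + [state_b + rng.normal(scale=0.1, size=(20, 3)) for _ in range(6)]
)

features = distance_map_features(coords)
labels, reps = cluster_and_pick(features, n_clusters=2, n_pca_components=2)
print("cluster labels:", labels)
print("representative model indices:", reps)
```

Distance maps are invariant to rotation and translation, so no structural superposition is needed before clustering in this sketch.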

Features

  • Feature Extraction: Residue distance maps or TM-score distance matrices of given ensemble(s).
  • Clustering: Multiple algorithms with/without PCA dimensionality reduction.
  • Ranking: Confidence-weighted selection and ranking; optional TM-score comparison to references.
  • Parallelism: Multi-process support for efficiency.
  • Batch Processing: Handle a single protein, multiple proteins, or whole directories via a wrapper script.
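
To make the TM-score route in the list above more concrete, the sketch below converts a pairwise TM-score matrix into a distance matrix and clusters it hierarchically with SciPy. It is an illustration under stated assumptions, not Pathfinder's implementation: the `1 - TM` distance, the symmetrisation step, average linkage, and the medoid rule for choosing a representative are all choices made for this example, and the score matrix is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def tm_to_distance(tm):
    """Turn a pairwise TM-score matrix into a symmetric distance matrix."""
    tm = np.asarray(tm, dtype=float)
    sym = 0.5 * (tm + tm.T)      # average the two directions of each pair
    dist = 1.0 - sym             # high similarity -> small distance
    np.fill_diagonal(dist, 0.0)
    return np.clip(dist, 0.0, 1.0)


def cluster_medoids(dist, n_clusters):
    """Average-linkage clustering on a precomputed distance matrix.

    Returns the cluster labels and, per cluster, the member with the
    smallest summed distance to the other members (the medoid).
    """
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    medoids = {}
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        sub = dist[np.ix_(members, members)]
        medoids[int(k)] = int(members[np.argmin(sub.sum(axis=1))])
    return labels, medoids


# Made-up TM-scores for five models: models 0-2 resemble each other,
# as do models 3-4.
tm = np.array([
    [1.00, 0.92, 0.90, 0.41, 0.38],
    [0.91, 1.00, 0.93, 0.40, 0.39],
    [0.90, 0.94, 1.00, 0.42, 0.37],
    [0.40, 0.41, 0.43, 1.00, 0.88],
    [0.39, 0.38, 0.36, 0.89, 1.00],
])

dist = tm_to_distance(tm)
labels, medoids = cluster_medoids(dist, n_clusters=2)
print("cluster labels:", labels)
print("medoid per cluster:", medoids)
```

Working from a precomputed distance matrix restricts the choice of algorithm to those that accept pairwise distances, which is why this sketch uses hierarchical clustering rather than K-means.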

Prerequisites

  • Python 3.8+ with NumPy, Pandas, scikit-learn, and SciPy.
  • External tools (installed with conda):
  • A conda environment (example provided in scripts).

Installation

  1. Clone the repository:
    git clone https://github.com/yourusername/pathfinder.git
    cd pathfinder
  2. Create and activate a mamba environment:
    mamba create -n pathfinder -f environment.yml 
    mamba activate pathfinder
    

Quick Start

Activate your environment and make sure src/ and the scripts are either on your PATH or in the current directory.

Run the test script to see an example:

chmod +x run_test.sh
./run_test.sh

To run the pipeline on your own ensemble, call main.py directly:

python src/main.py \
    --ensemble_dir /path/to/ensemble/dir \
    --output_dir /path/to/output/dir \
    --cluster_method kmeans \
    --n_clusters 10 \
    --n_pca_components 10 \
    --transformer tmscore \
    --alpha 1.0 \
    --n_processes 32 \
    --ref_list_txt /path/to/refs.txt  # Optional
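
When several ensembles need the same settings, the command above can be assembled programmatically. The following is a hedged sketch, not the wrapper script shipped with the repository: `build_command` and `run_batch` are hypothetical helpers, the directory names are placeholders, and only the flags shown in the command above are used. By default it prints the commands instead of running them.

```python
import shlex
import subprocess
from pathlib import Path


def build_command(ensemble_dir, output_dir, n_clusters=10, n_processes=4,
                  ref_list_txt=None):
    """Assemble the main.py call shown above as an argument list."""
    cmd = [
        "python", "src/main.py",
        "--ensemble_dir", str(ensemble_dir),
        "--output_dir", str(output_dir),
        "--cluster_method", "kmeans",
        "--n_clusters", str(n_clusters),
        "--n_pca_components", "10",
        "--transformer", "tmscore",
        "--alpha", "1.0",
        "--n_processes", str(n_processes),
    ]
    if ref_list_txt is not None:   # the reference list is optional
        cmd += ["--ref_list_txt", str(ref_list_txt)]
    return cmd


def run_batch(ensemble_dirs, output_root, dry_run=True):
    """Build one command per ensemble directory; run them unless dry_run is set."""
    commands = []
    for ens in map(Path, ensemble_dirs):
        cmd = build_command(ens, Path(output_root) / ens.name)
        commands.append(cmd)
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)   # raises if main.py exits non-zero
    return commands


# Placeholder directories; replace with real ensemble folders.
commands = run_batch(["ensembles/proteinA", "ensembles/proteinB"], "results")
```

Passing `dry_run=False` executes each command in turn and stops at the first failure.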

Interactive ensemble analysis and state identification

cd dashapp
python app.py

Protocol Fig. 2: Overview of the interactive utility for ensemble analysis and state identification.

Cite

To be added