Pathfinder: Protein Structure Ensemble Clustering and Representative Selection

Fig. 1: Overview of the Pathfinder clustering pipeline, from PDB input to ranked representatives.

Pathfinder is a tool for clustering protein structure ensembles (e.g., from AlphaFold predictions) and selecting representative conformations. It describes each structure by a residue distance map or by a TM-score matrix over the ensemble, optionally reduces dimensionality with PCA, and then clusters with K-means, hierarchical clustering, DBSCAN, spectral clustering, or Gaussian mixture models (GMM). When reference structures are supplied, representatives can also be ranked by their TM-align scores against those references.

The pipeline processes PDB files in ensembles, extracts features, clusters them, and outputs ranked representatives with metrics.
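As a rough illustration of the distance-map route, the sketch below flattens per-model distance maps into feature vectors, reduces them with PCA, clusters them with K-means, and picks the member closest to each centroid. It is not Pathfinder's implementation: the function names, the synthetic coordinates, and the closest-to-centroid selection rule are all assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def distance_map_features(coords):
    """Flatten the upper triangle of each model's residue-residue distance map.

    coords: array of shape (n_models, n_residues, 3), e.g. C-alpha positions.
    """
    return np.array([pdist(model) for model in coords])


def cluster_and_pick(features, n_clusters=3, n_pca_components=5, seed=0):
    """Reduce with PCA, cluster with K-means, return labels and one index per cluster."""
    reduced = PCA(n_components=n_pca_components, random_state=seed).fit_transform(features)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(reduced)
    representatives = []
    for k in range(n_clusters):
        members = np.flatnonzero(km.labels_ == k)
        # Assumed selection rule: the member nearest to the cluster centroid.
        dists = np.linalg.norm(reduced[members] - km.cluster_centers_[k], axis=1)
        representatives.append(int(members[np.argmin(dists)]))
    return km.labels_, representatives


# Synthetic stand-in for an ensemble: 30 models of a 20-residue chain,
# drawn as small perturbations of three hypothetical conformational states.
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 20, 3)) * 5.0
coords = np.concatenate([b + rng.normal(scale=0.1, size=(10, 20, 3)) for b in base])

features = distance_map_features(coords)
labels, representatives = cluster_and_pick(features)
print("cluster labels:", labels)
print("representative model indices:", representatives)
```

In a real run the coordinates would come from the PDB files in the ensemble directory rather than from a random generator.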

Features

Feature Extraction: Residue distance maps or TM-score distance matrices of given ensemble(s).
Clustering: Multiple algorithms with/without PCA dimensionality reduction.
Ranking: Confidence-weighted selection and ranking; optional TM-score comparison to references.
Parallelism: Multi-process support for efficiency.
Batch Processing: Handle a single protein, multiple proteins, or whole directories via a wrapper script.
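For the TM-score route, one common approach (assumed here, not confirmed as Pathfinder's method) is to convert a pairwise TM-score matrix into distances with d = 1 - TM and cluster those distances hierarchically. The sketch below uses a hand-written 4x4 toy matrix, and the function name and the medoid rule for choosing representatives are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_from_tm_scores(tm, n_clusters):
    """Average-linkage clustering on d = 1 - TM; returns labels and medoid indices."""
    tm = (tm + tm.T) / 2.0            # symmetrize in case the two directions differ slightly
    dist = 1.0 - tm
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    medoids = {}
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        # Medoid: the member with the smallest summed distance to its cluster mates.
        within = dist[np.ix_(members, members)].sum(axis=1)
        medoids[int(k)] = int(members[np.argmin(within)])
    return labels, medoids


# Toy matrix: models 0 and 1 are similar, models 2 and 3 are similar.
tm = np.array([
    [1.00, 0.90, 0.30, 0.35],
    [0.90, 1.00, 0.32, 0.30],
    [0.30, 0.32, 1.00, 0.85],
    [0.35, 0.30, 0.85, 1.00],
])
labels, medoids = cluster_from_tm_scores(tm, n_clusters=2)
print("labels:", labels)
print("medoids per cluster:", medoids)
```

Working from a precomputed distance matrix restricts the choice of algorithm to those that accept pairwise distances, which is one reason a PCA step on feature vectors may be offered as an alternative.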

Prerequisites

Python 3.8+ with NumPy, Pandas, scikit-learn, and SciPy.
External tools (installed with conda):
- Foldseek (for TM-score matrices).
- TM-align (binary in PATH).
A conda environment (example provided in scripts).
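Before a first run it can help to confirm that the listed packages and tools are visible from the active environment. The following check is a sketch only: it assumes the executables are named `foldseek` and `TMalign`, which may differ on your system, and the helper is not part of the repository.

```python
import importlib.util
import shutil

PYTHON_PACKAGES = ["numpy", "pandas", "sklearn", "scipy"]
# Assumed executable names; adjust if your installation uses different ones.
EXTERNAL_TOOLS = ["foldseek", "TMalign"]


def check_prerequisites():
    """Return {name: True/False} for each required package and external tool."""
    status = {pkg: importlib.util.find_spec(pkg) is not None for pkg in PYTHON_PACKAGES}
    status.update({tool: shutil.which(tool) is not None for tool in EXTERNAL_TOOLS})
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:10s} {'found' if ok else 'MISSING'}")
```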

Installation

Clone the repository:

git clone https://github.com/yourusername/pathfinder.git
cd pathfinder

Create and activate a mamba environment:

mamba create -n pathfinder -f environment.yml 
mamba activate pathfinder

Quick Start

Activate your environment and ensure src/ and scripts are in your PATH or current directory.

Run the test script to see an example:

chmod +x run_test.sh
./run_test.sh

To run the pipeline directly on your own ensemble:

python src/main.py \
    --ensemble_dir /path/to/ensemble/dir \
    --output_dir /path/to/output/dir \
    --cluster_method kmeans \
    --n_clusters 10 \
    --n_pca_components 10 \
    --transformer tmscore \
    --alpha 1.0 \
    --n_processes 32 \
    --ref_list_txt /path/to/refs.txt  # Optional
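When several ensembles need the same settings, the command can be assembled programmatically. The sketch below only builds and prints the argument list, using the flags shown in the command; the helper names and the one-subdirectory-per-ensemble layout are hypothetical and do not describe the repository's own wrapper script.

```python
import shlex
from pathlib import Path


def build_command(ensemble_dir, output_dir, n_clusters=10, ref_list_txt=None):
    """Assemble the argument list for one src/main.py run."""
    cmd = [
        "python", "src/main.py",
        "--ensemble_dir", str(ensemble_dir),
        "--output_dir", str(output_dir),
        "--cluster_method", "kmeans",
        "--n_clusters", str(n_clusters),
        "--n_pca_components", "10",
        "--transformer", "tmscore",
        "--alpha", "1.0",
        "--n_processes", "32",
    ]
    if ref_list_txt is not None:      # the reference list is optional
        cmd += ["--ref_list_txt", str(ref_list_txt)]
    return cmd


def batch_commands(parent_dir, output_root):
    """One command per subdirectory of parent_dir (hypothetical layout)."""
    parent, out = Path(parent_dir), Path(output_root)
    return [build_command(d, out / d.name) for d in sorted(parent.iterdir()) if d.is_dir()]


if __name__ == "__main__":
    # Print rather than execute, so the command can be inspected first.
    print(shlex.join(build_command("/path/to/ensemble/dir", "/path/to/output/dir")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the paths have been verified.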

Interactive ensemble analysis and state identification

Launch the interactive app from the dashapp directory:

cd dashapp
python app.py

Fig. 2: Overview of the interactive ensemble analysis and state identification utility.

Cite

To be added
