THEMAP

Task Hardness Estimation for Molecular Activity Prediction

A Python library for calculating distances between chemical datasets to enable intelligent dataset selection for molecular activity prediction tasks.

Overview

THEMAP computes distances between chemical datasets and uses them to guide dataset selection for molecular activity prediction. Typical uses include:

  • Transfer Learning: Identify the most relevant source datasets for your target prediction task
  • Domain Adaptation: Measure dataset similarity to guide model adaptation strategies
  • Task Hardness Assessment: Quantify how difficult a prediction task will be based on dataset characteristics
  • Dataset Curation: Select optimal training datasets from large chemical databases like ChEMBL

Installation

Quick Start (Recommended)

The easiest way to install THEMAP with all features:

git clone https://github.com/HFooladi/THEMAP.git
cd THEMAP
source install.sh

This automatically:

  • Installs uv (fast Python package manager) if needed
  • Creates a virtual environment in .venv
  • Installs all dependencies
  • Activates the environment

After installation, try an example:

python examples/quickstart.py

To reactivate the environment later:

source .venv/bin/activate

Manual Installation

For more control, install with pip (the editable -e variants must be run from the root of a cloned repository):

pip install themap                # Basic installation from PyPI
pip install -e ".[all]"           # Full installation (editable)
pip install -e ".[protein]"       # Protein analysis only
pip install -e ".[otdd]"          # Optimal transport only
pip install -e ".[dev,test]"      # Development + testing

Conda Alternative

For GPU support with specific CUDA versions:

conda env create -f environment.yml
conda activate themap
pip install -e . --no-deps

Prerequisites

  • Python 3.10 or higher
  • For GPU features: CUDA-compatible GPU and drivers

Quick Start

Compute Dataset Distances

The simplest way to compute distances between molecular datasets:

from themap import quick_distance

results = quick_distance(
    data_dir="datasets",          # Directory with train/ and test/ folders
    output_dir="output",          # Where to save results
    molecule_featurizer="ecfp",   # Fingerprint type (ecfp, maccs, etc.)
    molecule_method="euclidean",  # Distance metric
)

# Results saved to output/molecule_distances.csv

Using a Config File

For reproducible experiments, use a YAML configuration:

from themap import run_pipeline

results = run_pipeline("config.yaml")

Example config.yaml:

data:
  directory: "datasets"

distances:
  molecule:
    enabled: true
    featurizer: "ecfp"
    method: "euclidean"

output:
  directory: "output"
  format: "csv"
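When comparing several featurizer and metric combinations, it can help to generate one config file per combination rather than editing a single file by hand. The sketch below is only an illustration: write_config and CONFIG_TEMPLATE are hypothetical helpers, not part of THEMAP, and the template simply mirrors the keys of the example config above.

```python
from pathlib import Path

# Mirrors the example config.yaml above; only the quoted values vary.
CONFIG_TEMPLATE = """\
data:
  directory: "{data_dir}"

distances:
  molecule:
    enabled: true
    featurizer: "{featurizer}"
    method: "{method}"

output:
  directory: "{output_dir}"
  format: "csv"
"""


def write_config(path, featurizer, method, data_dir="datasets", output_dir="output"):
    """Render the template with the given settings and write it to `path`.

    Hypothetical helper for illustration; it does not validate that the
    featurizer or method names are supported by THEMAP.
    """
    text = CONFIG_TEMPLATE.format(
        data_dir=data_dir,
        featurizer=featurizer,
        method=method,
        output_dir=output_dir,
    )
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

For example, write_config("configs/maccs_cosine.yaml", "maccs", "cosine", output_dir="output/maccs_cosine") produces a file that can then be passed to run_pipeline, keeping each run's settings and results in separate locations.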

Data Format

Organize your data in this structure:

datasets/
├── train/                        # Source datasets
│   ├── CHEMBL123456.jsonl.gz
│   └── ...
└── test/                         # Target datasets
    ├── CHEMBL111111.jsonl.gz
    └── ...
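Before running a computation, it may be worth confirming that the directory actually follows this layout. The following is a minimal sketch using only the standard library; list_task_files is a hypothetical helper, not a THEMAP function, and it assumes that only files ending in .jsonl.gz count as datasets.

```python
from pathlib import Path


def list_task_files(data_dir):
    """Return the sorted .jsonl.gz file names found under train/ and test/.

    Raises FileNotFoundError if either folder is missing.
    """
    root = Path(data_dir)
    found = {}
    for fold in ("train", "test"):
        fold_dir = root / fold
        if not fold_dir.is_dir():
            raise FileNotFoundError(f"missing folder: {fold_dir}")
        # Files with other extensions are ignored.
        found[fold] = sorted(p.name for p in fold_dir.glob("*.jsonl.gz"))
    return found
```

Calling list_task_files("datasets") returns a dictionary with one list of file names per fold, so an empty list signals a fold with no datasets in it.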

Each .jsonl.gz file contains molecules in JSON lines format:

{"SMILES": "CCO", "Property": 1}
{"SMILES": "CCCO", "Property": 0}
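Files in this format can be produced with the Python standard library alone. The sketch below writes and reads gzip-compressed JSON lines; write_jsonl_gz and read_jsonl_gz are hypothetical helper names, and the records use only the two keys shown above (whether THEMAP accepts or requires further keys is not covered here).

```python
import gzip
import json
from pathlib import Path


def write_jsonl_gz(records, path):
    """Write an iterable of dicts to a gzip-compressed JSON Lines file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            # One JSON object per line.
            handle.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Read a gzip-compressed JSON Lines file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Two toy molecules, using the keys from the example above.
records = [
    {"SMILES": "CCO", "Property": 1},
    {"SMILES": "CCCO", "Property": 0},
]
```

For example, write_jsonl_gz(records, "datasets/train/CHEMBL123456.jsonl.gz") creates a source dataset in the layout described above, and read_jsonl_gz on the same path returns the original records.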

CLI Reference

THEMAP provides a command-line interface for all core operations. After installation, the themap command is available in your terminal.

themap --help              # Show all available commands
themap <command> --help    # Show help for a specific command

Quick Distance Computation

Compute distances between datasets with minimal setup — no config file needed:

themap quick datasets/ -f ecfp -m euclidean -o output/
themap quick datasets/ -f maccs -m cosine -j 4

Full Pipeline with Config File

For reproducible experiments, use a YAML configuration:

themap init                              # Generate a config.yaml template
themap run config.yaml                   # Run the full pipeline
themap run config.yaml -o results/       # Custom output directory
themap run config.yaml --molecule-only   # Skip protein distances
themap run config.yaml -j 4              # Set parallel workers

Pre-compute Features

Featurize datasets and cache to disk (useful before running multiple distance computations):

# Single featurizer
themap featurize datasets/ -f ecfp

# Multiple featurizers at once
themap featurize datasets/ -f ecfp -f maccs -f desc2D

# Featurize a specific fold or file
themap featurize datasets/ -f ecfp --fold train
themap featurize datasets/test/CHEMBL123.jsonl.gz -f ecfp

# Force recompute (ignore cached features)
themap featurize datasets/ -f ecfp --force

Data Utilities

# Convert CSV to THEMAP's JSONL.GZ format
themap convert data.csv CHEMBL123456
themap convert data.csv CHEMBL123456 --smiles-column SMILES --activity-column pIC50

# Inspect a dataset directory
themap info datasets/

# List all available featurizers (27 molecule + 5 protein featurizers)
themap list-featurizers

Add -v before any command for verbose/debug output: themap -v quick datasets/

Usage Examples

Analyzing Distance Results

import pandas as pd

# Load computed distances
distances = pd.read_csv("output/molecule_distances.csv", index_col=0)

# Find closest source for each target (transfer learning selection)
for target in distances.columns:
    closest = distances[target].idxmin()
    dist = distances[target].min()
    print(f"{target} <- {closest} (distance: {dist:.4f})")

# Estimate task hardness (average distance to k-nearest sources)
k = 3
for target in distances.columns:
    hardness = distances[target].nsmallest(k).mean()
    print(f"Task hardness for {target}: {hardness:.4f}")
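The hardness estimate in the second loop can be written as a formula. The notation here is introduced for illustration only: d(s, t) is the distance between source dataset s and target dataset t, and N_k(t) is the set of the k source datasets closest to t. This assumes, as the loops above do, that rows of the distance matrix index sources and columns index targets.

```latex
H_k(t) = \frac{1}{k} \sum_{s \in \mathcal{N}_k(t)} d(s, t),
\qquad
\mathcal{N}_k(t) = \text{the } k \text{ sources } s \text{ with the smallest } d(s, t)
```

With k = 1 this reduces to the distance to the single closest source, which is the value printed by the first loop. Under this estimate, a larger value means the target lies further from its nearest sources and is expected to be a harder task.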

Reproducing FS-Mol Experiments

Pre-computed molecular embeddings and distance matrices for the FS-Mol dataset are available on Zenodo.

Setup

  1. Download data from Zenodo
  2. Extract to datasets/fsmol_hardness/
  3. See examples/ directory for usage examples

Documentation

Full documentation is available at hfooladi.github.io/THEMAP or can be built locally:

mkdocs serve  # Serve locally at http://127.0.0.1:8000

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

git clone https://github.com/HFooladi/THEMAP.git
cd THEMAP
source install.sh           # creates .venv and installs all deps

Or manually:

pip install -e ".[dev,test,ml]"

Running Tests

source .venv/bin/activate    # always activate venv first
python run_tests.py          # all tests
python run_tests.py fast     # skip slow tests
python run_tests.py coverage # with coverage
pytest -k "test_name"        # specific test by name

Code Quality

ruff check .                 # linting
ruff format .                # formatting
mypy -p themap               # type checking

Citation

If you use THEMAP in your research, please cite our paper:

@article{fooladi2024quantifying,
  title={Quantifying the hardness of bioactivity prediction tasks for transfer learning},
  author={Fooladi, Hosein and Hirte, Steffen and Kirchmair, Johannes},
  journal={Journal of Chemical Information and Modeling},
  volume={64},
  number={10},
  pages={4031-4046},
  year={2024},
  publisher={ACS Publications}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
