Task Hardness Estimation for Molecular Activity Prediction
A Python library for calculating distances between chemical datasets to enable intelligent dataset selection for molecular activity prediction tasks.
- Overview
- Installation
- Quick Start
- CLI Reference
- Usage Examples
- Reproducing FS-Mol Experiments
- Documentation
- Contributing
- Citation
- License
THEMAP is a Python library designed to calculate distances between chemical datasets for molecular activity prediction tasks. The primary goal is to enable intelligent dataset selection for:
- Transfer Learning: Identify the most relevant source datasets for your target prediction task
- Domain Adaptation: Measure dataset similarity to guide model adaptation strategies
- Task Hardness Assessment: Quantify how difficult a prediction task will be based on dataset characteristics
- Dataset Curation: Select optimal training datasets from large chemical databases like ChEMBL
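To make the selection and hardness ideas above concrete, here is a minimal, self-contained sketch on made-up numbers. It does not use THEMAP's API: the `distances` dictionary, the dataset names, and the helper functions `closest_source` and `knn_hardness` are all hypothetical, and the hardness proxy shown (mean distance to the k nearest sources) is just one simple choice.

```python
# Toy distances[target][source]; the values are invented for illustration only.
distances = {
    "target_A": {"source_1": 0.42, "source_2": 0.10, "source_3": 0.77},
    "target_B": {"source_1": 0.35, "source_2": 0.90, "source_3": 0.20},
}


def closest_source(row: dict[str, float]) -> str:
    """Return the source dataset with the smallest distance to the target."""
    return min(row, key=row.get)


def knn_hardness(row: dict[str, float], k: int = 2) -> float:
    """Average distance to the k nearest sources (a simple hardness proxy)."""
    nearest = sorted(row.values())[:k]
    return sum(nearest) / len(nearest)


for target, row in distances.items():
    print(f"{target}: closest={closest_source(row)}, hardness={knn_hardness(row):.3f}")
```

A target whose nearest sources are all far away gets a larger value, which is the intuition behind treating distance as a hardness signal.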
The easiest way to install THEMAP with all features:
git clone https://github.com/HFooladi/THEMAP.git
cd THEMAP
source install.sh
This automatically:
- Installs uv (fast Python package manager) if needed
- Creates a virtual environment in .venv
- Installs all dependencies
- Activates the environment
After installation, try an example:
python examples/quickstart.py
To reactivate the environment later:
source .venv/bin/activate
For more control, install with pip:
pip install themap # Basic installation from PyPI
pip install -e ".[all]" # Full installation (editable)
pip install -e ".[protein]" # Protein analysis only
pip install -e ".[otdd]" # Optimal transport only
pip install -e ".[dev,test]" # Development + testing
For GPU support with specific CUDA versions:
conda env create -f environment.yml
conda activate themap
pip install -e . --no-deps
Requirements:
- Python 3.10 or higher
- For GPU features: CUDA-compatible GPU and drivers
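The Python version requirement can be checked from the interpreter before installing. The sketch below is only a convenience; the helper name `meets_minimum` is hypothetical and not part of THEMAP.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this README


def meets_minimum(version: tuple[int, ...], minimum: tuple[int, int] = MIN_PYTHON) -> bool:
    """Return True if `version` (e.g. sys.version_info[:3]) is at least `minimum`."""
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    ok = meets_minimum(sys.version_info[:3])
    print(f"Python {sys.version.split()[0]}: {'OK' if ok else 'too old for THEMAP'}")
```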
The simplest way to compute distances between molecular datasets:
from themap import quick_distance
results = quick_distance(
    data_dir="datasets",            # Directory with train/ and test/ folders
    output_dir="output",            # Where to save results
    molecule_featurizer="ecfp",     # Fingerprint type (ecfp, maccs, etc.)
    molecule_method="euclidean",    # Distance metric
)
# Results saved to output/molecule_distances.csv
For reproducible experiments, use a YAML configuration:
from themap import run_pipeline
results = run_pipeline("config.yaml")
Example config.yaml:
data:
  directory: "datasets"
distances:
  molecule:
    enabled: true
    featurizer: "ecfp"
    method: "euclidean"
output:
  directory: "output"
  format: "csv"
Organize your data in this structure:
datasets/
├── train/                      # Source datasets
│   ├── CHEMBL123456.jsonl.gz
│   └── ...
└── test/                       # Target datasets
    ├── CHEMBL111111.jsonl.gz
    └── ...
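Before running a distance computation, it can help to confirm that a directory follows the layout above. The following is a minimal sketch using only the standard library; `list_dataset_files` is a hypothetical helper, not a THEMAP function, and it assumes dataset files sit directly inside train/ and test/.

```python
from pathlib import Path


def list_dataset_files(data_dir: str) -> dict[str, list[str]]:
    """Return the sorted .jsonl.gz file names found in train/ and test/."""
    root = Path(data_dir)
    found: dict[str, list[str]] = {}
    for fold in ("train", "test"):
        fold_dir = root / fold
        if fold_dir.is_dir():
            found[fold] = sorted(p.name for p in fold_dir.glob("*.jsonl.gz"))
        else:
            # A missing fold directory yields an empty list rather than an error.
            found[fold] = []
    return found


if __name__ == "__main__":
    for fold, names in list_dataset_files("datasets").items():
        print(f"{fold}: {len(names)} dataset file(s)")
```

An empty list for either fold is a hint that the files are misplaced or use a different extension.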
Each .jsonl.gz file contains molecules in JSON lines format:
{"SMILES": "CCO", "Property": 1}
{"SMILES": "CCCO", "Property": 0}
THEMAP provides a command-line interface for all core operations. After installation, the themap command is available in your terminal.
themap --help # Show all available commands
themap <command> --help # Show help for a specific command
Compute distances between datasets with minimal setup, no config file needed:
themap quick datasets/ -f ecfp -m euclidean -o output/
themap quick datasets/ -f maccs -m cosine -j 4
For reproducible experiments, use a YAML configuration:
themap init # Generate a config.yaml template
themap run config.yaml # Run the full pipeline
themap run config.yaml -o results/ # Custom output directory
themap run config.yaml --molecule-only # Skip protein distances
themap run config.yaml -j 4 # Set parallel workers
Featurize datasets and cache to disk (useful before running multiple distance computations):
# Single featurizer
themap featurize datasets/ -f ecfp
# Multiple featurizers at once
themap featurize datasets/ -f ecfp -f maccs -f desc2D
# Featurize a specific fold or file
themap featurize datasets/ -f ecfp --fold train
themap featurize datasets/test/CHEMBL123.jsonl.gz -f ecfp
# Force recompute (ignore cached features)
themap featurize datasets/ -f ecfp --force
# Convert CSV to THEMAP's JSONL.GZ format
themap convert data.csv CHEMBL123456
themap convert data.csv CHEMBL123456 --smiles-column SMILES --activity-column pIC50
# Inspect a dataset directory
themap info datasets/
# List all available featurizers (27 molecule + 5 protein featurizers)
themap list-featurizers
Add -v before any command for verbose/debug output: themap -v quick datasets/
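To see what the JSONL.GZ format amounts to on disk, the sketch below writes CSV rows into a gzip-compressed JSON lines file and reads them back, using only the standard library. It is not the implementation of themap convert: the helpers `csv_to_jsonl_gz` and `read_jsonl_gz` are hypothetical, and the sketch assumes the activity column already holds 0/1 labels like the "Property" values shown earlier.

```python
import csv
import gzip
import io
import json


def csv_to_jsonl_gz(csv_text, out_path, smiles_column="SMILES", activity_column="Property"):
    """Write one JSON object per CSV row to a gzip-compressed JSON lines file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as handle:
        for row in reader:
            # Assumption: labels are already binary integers (0 or 1).
            record = {"SMILES": row[smiles_column], "Property": int(row[activity_column])}
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_jsonl_gz(path):
    """Read a gzip-compressed JSON lines file into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Reading a file back with `read_jsonl_gz` is also a quick way to inspect a dataset produced by other tools.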
import pandas as pd
# Load computed distances
distances = pd.read_csv("output/molecule_distances.csv", index_col=0)
# Find closest source for each target (transfer learning selection)
for target in distances.columns:
    closest = distances[target].idxmin()
    dist = distances[target].min()
    print(f"{target} <- {closest} (distance: {dist:.4f})")
# Estimate task hardness (average distance to k-nearest sources)
k = 3
for target in distances.columns:
    hardness = distances[target].nsmallest(k).mean()
    print(f"Task hardness for {target}: {hardness:.4f}")
Pre-computed molecular embeddings and distance matrices for the FS-Mol dataset are available on Zenodo.
- Download data from Zenodo
- Extract to datasets/fsmol_hardness/
- See the examples/ directory for usage examples
Full documentation is available at hfooladi.github.io/THEMAP or can be built locally:
mkdocs serve # Serve locally at http://127.0.0.1:8000
We welcome contributions! Please see our Contributing Guidelines for details.
git clone https://github.com/HFooladi/THEMAP.git
cd THEMAP
source install.sh # creates .venv and installs all deps
Or manually:
pip install -e ".[dev,test,ml]"
source .venv/bin/activate # always activate venv first
python run_tests.py # all tests
python run_tests.py fast # skip slow tests
python run_tests.py coverage # with coverage
pytest -k "test_name" # specific test by name
ruff check . # linting
ruff format . # formatting
mypy -p themap # type checking
If you use THEMAP in your research, please cite our paper:
@article{fooladi2024quantifying,
  title={Quantifying the hardness of bioactivity prediction tasks for transfer learning},
  author={Fooladi, Hosein and Hirte, Steffen and Kirchmair, Johannes},
  journal={Journal of Chemical Information and Modeling},
  volume={64},
  number={10},
  pages={4031-4046},
  year={2024},
  publisher={ACS Publications}
}
This project is licensed under the MIT License - see the LICENSE file for details.
