
# python-chebifier

An AI ensemble model for predicting chemical classes in the ChEBI ontology.

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/python-chebifier.git
cd python-chebifier

# Install the package
pip install -e .
```

`chebai-graph` and its dependencies cannot be installed automatically. If you want to use Graph Neural Networks, follow the installation instructions in the chebai-graph repository.

## Usage

### Command Line Interface

The package provides a command-line interface (CLI) for making predictions with an ensemble model.

```bash
# Get help
python -m chebifier.cli --help

# Make predictions using a configuration file
python -m chebifier.cli predict configs/example_config.yml --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" "C1=CC=C(C=C1)C(=O)O"

# Make predictions using SMILES from a file
python -m chebifier.cli predict configs/example_config.yml --smiles-file smiles.txt
```

### Configuration File

The CLI requires a YAML configuration file that defines the ensemble model. An example can be found in `configs/example_config.yml`.

The models and other required files are trained or generated by our chebai package. Example models can be found on Kaggle.

### Python API

You can also use the package programmatically:

```python
from chebifier.ensemble.base_ensemble import BaseEnsemble
import yaml

# Load configuration from YAML file
with open('configs/example_config.yml', 'r') as f:
    config = yaml.safe_load(f)

# Instantiate ensemble model
ensemble = BaseEnsemble(config)

# Make predictions
smiles_list = ["CC(=O)OC1=CC=CC=C1C(=O)O", "C1=CC=C(C=C1)C(=O)O"]
predictions = ensemble.predict_smiles_list(smiles_list)

# Print results
for smiles, prediction in zip(smiles_list, predictions):
    print(f"SMILES: {smiles}")
    if prediction:
        print(f"Predicted classes: {prediction}")
    else:
        print("No predictions")
```

## The ensemble

Given a sample (i.e., a SMILES string) and models $m_1, m_2, \ldots, m_n$, the ensemble works as follows:

  1. Get predictions from each model $m_i$ for the sample.
  2. For each class $c$, aggregate predictions $p_c^{m_i}$ from all models that made a prediction for that class. The aggregation happens separately for all positive predictions (i.e., $p_c^{m_i} \geq 0.5$) and all negative predictions ($p_c^{m_i} < 0.5$). If the aggregated value is larger for the positive predictions than for the negative predictions, the ensemble makes a positive prediction for class $c$:

$$ \text{ensemble}(c) = \begin{cases} 1 & \text{if } \sum_{i: p_c^{m_i} \geq 0.5} \left[\text{confidence}_c^{m_i} \cdot \text{model\_weight}_{m_i} \cdot \text{trust}_c^{m_i}\right] > \sum_{i: p_c^{m_i} < 0.5} \left[\text{confidence}_c^{m_i} \cdot \text{model\_weight}_{m_i} \cdot \text{trust}_c^{m_i}\right] \\ 0 & \text{otherwise} \end{cases} $$

Here, confidence is the model's (self-reported) confidence in its prediction, calculated as $$ \text{confidence}_c^{m_i} = 2|p_c^{m_i} - 0.5| $$ For example, if a model makes a positive prediction with $p_c^{m_i} = 0.55$, the confidence is $2|0.55 - 0.5| = 0.1$. One could say that the model is not very confident in its prediction and very close to switching to a negative prediction. If another model is very sure about its negative prediction with $p_c^{m_j} = 0.1$, the confidence is $2|0.1 - 0.5| = 0.8$. Therefore, if in doubt, we are more confident in the negative prediction.

Confidence can be disabled via the `use_confidence` parameter of the `predict` method (default: `True`).
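In code, the confidence formula above is a one-liner. The function name here is chosen for illustration, not taken from the package:

```python
def confidence(p):
    """Distance of a probability from the 0.5 decision boundary, scaled to [0, 1]."""
    return 2 * abs(p - 0.5)

# A barely-positive prediction is low-confidence,
# a strongly negative one is high-confidence:
print(round(confidence(0.55), 2))  # 0.1
print(round(confidence(0.10), 2))  # 0.8
```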

The `model_weight` can be set for each model in the configuration file (default: 1). This is used to favor a certain model independently of any given class. Trust is based on the model's performance on a validation set: after training, we evaluate the machine-learning models on a validation set for each class. If the `ensemble_type` is set to `wmv-f1`, the trust is calculated as 1 + the F1 score. If the `ensemble_type` is set to `mv` (the default), the trust is set to 1 for all models.
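The weighted decision rule for a single class can be sketched as follows. This is an illustrative re-implementation of the formula above, not the package's actual code; the dictionary-based inputs are assumptions made for the example:

```python
def ensemble_decision(preds, model_weights=None, trust=None, use_confidence=True):
    """Weighted majority vote for one class.

    preds: {model_name: predicted probability p_c for this class}
    model_weights, trust: {model_name: weight}, defaulting to 1 per model
    Returns 1 for a positive ensemble prediction, else 0.
    """
    model_weights = model_weights or {}
    trust = trust or {}
    pos, neg = 0.0, 0.0
    for m, p in preds.items():
        conf = 2 * abs(p - 0.5) if use_confidence else 1.0
        vote = conf * model_weights.get(m, 1.0) * trust.get(m, 1.0)
        if p >= 0.5:
            pos += vote
        else:
            neg += vote
    return 1 if pos > neg else 0

# A hesitant positive vote (p = 0.55) loses against
# a confident negative vote (p = 0.10):
print(ensemble_decision({"m1": 0.55, "m2": 0.10}))  # 0
```

Without confidence weighting, the same two votes cancel out (1.0 vs. 1.0) and the tie falls to the negative side, since a positive prediction requires a strictly larger positive sum.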

  3. After a decision has been made for each class independently, the predictions are checked for consistency with the ChEBI hierarchy and disjointness axioms. This is done in three steps:
  - (1) First, the hierarchy is corrected. For each pair of classes $A$ and $B$ where $A$ is a subclass of $B$ (following the is-a relation in ChEBI), we set the ensemble prediction of $B$ to 1 if the prediction of $A$ is 1. Intuitively speaking, if we have determined that a molecule belongs to a specific class (e.g., aromatic primary alcohol), it also belongs to its direct and indirect superclasses (e.g., primary alcohol, aromatic alcohol, alcohol).
  - (2) Next, we check for disjointness. This is not specified directly in ChEBI, but in an additional ChEBI module (chebi-disjoints.owl). We have extracted these disjointness axioms into a CSV file and added some more disjointness axioms ourselves (see `data/disjoint_chebi.csv` and `data/disjoint_additional.csv`). If two classes $A$ and $B$ are disjoint and we predict both, we select one of them at random and set the other to 0.
  - (3) Since the second step might have introduced new inconsistencies into the hierarchy, we repeat the first step with a small change: for a pair of classes $A \subseteq B$ with predictions $1$ and $0$, instead of setting $B$ to $1$, we now set $A$ to $0$. This has the advantage that we cannot introduce new disjointness inconsistencies and do not have to repeat step 2.
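The three repair steps can be sketched as below. This is an illustrative reconstruction under stated assumptions: the function name and the dictionary/pair representations of predictions, subclass relations, and disjointness axioms are chosen for the example, not taken from the package:

```python
import random

def enforce_consistency(pred, subclass_of, disjoint_pairs, rng=random):
    """Repair class predictions against hierarchy and disjointness axioms.

    pred: {class_name: 0 or 1} ensemble predictions
    subclass_of: list of (A, B) pairs meaning A is-a B
    disjoint_pairs: list of (A, B) pairs that cannot both hold
    """
    # (1) Propagate positives upward: a positive subclass
    # implies all its (direct and indirect) superclasses.
    changed = True
    while changed:
        changed = False
        for a, b in subclass_of:
            if pred.get(a) == 1 and pred.get(b) == 0:
                pred[b] = 1
                changed = True
    # (2) Resolve disjointness: if both classes of a disjoint
    # pair are predicted, randomly keep one and drop the other.
    for a, b in disjoint_pairs:
        if pred.get(a) == 1 and pred.get(b) == 1:
            pred[rng.choice([a, b])] = 0
    # (3) Repair downward: a negated superclass negates its
    # subclasses, which cannot reintroduce disjointness violations.
    changed = True
    while changed:
        changed = False
        for a, b in subclass_of:
            if pred.get(a) == 1 and pred.get(b) == 0:
                pred[a] = 0
                changed = True
    return pred

# A positive leaf class pulls its superclasses to 1:
hierarchy = [("aromatic primary alcohol", "primary alcohol"),
             ("primary alcohol", "alcohol")]
pred = {"aromatic primary alcohol": 1, "primary alcohol": 0, "alcohol": 0}
print(enforce_consistency(pred, hierarchy, []))
# {'aromatic primary alcohol': 1, 'primary alcohol': 1, 'alcohol': 1}
```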