An AI ensemble model for predicting chemical classes in the ChEBI ontology.
```bash
# Clone the repository
git clone https://github.com/yourusername/python-chebifier.git
cd python-chebifier

# Install the package
pip install -e .
```

Note: chebai-graph and its dependencies cannot be installed automatically. If you want to use Graph Neural Networks, follow the instructions in the chebai-graph repository.
The package provides a command-line interface (CLI) for making predictions using an ensemble model.
```bash
# Get help
python -m chebifier.cli --help

# Make predictions using a configuration file
python -m chebifier.cli predict configs/example_config.yml --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" "C1=CC=C(C=C1)C(=O)O"

# Make predictions using SMILES from a file
python -m chebifier.cli predict configs/example_config.yml --smiles-file smiles.txt
```

The CLI requires a YAML configuration file that defines the ensemble model. An example can be found in `configs/example_config.yml`.
The models and other required files are trained or generated by our chebai package. Example models can be found on Kaggle.
You can also use the package programmatically:
```python
from chebifier.ensemble.base_ensemble import BaseEnsemble
import yaml

# Load configuration from YAML file
with open('configs/example_config.yml', 'r') as f:
    config = yaml.safe_load(f)

# Instantiate ensemble model
ensemble = BaseEnsemble(config)

# Make predictions
smiles_list = ["CC(=O)OC1=CC=CC=C1C(=O)O", "C1=CC=C(C=C1)C(=O)O"]
predictions = ensemble.predict_smiles_list(smiles_list)

# Print results
for smiles, prediction in zip(smiles_list, predictions):
    print(f"SMILES: {smiles}")
    if prediction:
        print(f"Predicted classes: {prediction}")
    else:
        print("No predictions")
```

Given a sample (i.e., a SMILES string) and models $m_1, \ldots, m_n$, the ensemble works as follows:
- Get predictions from each model $m_i$ for the sample.
- For each class $c$, aggregate the predictions $p_c^{m_i}$ from all models that made a prediction for that class. The aggregation happens separately for all positive predictions (i.e., $p_c^{m_i} \geq 0.5$) and all negative predictions ($p_c^{m_i} < 0.5$). If the aggregated value is larger for the positive predictions than for the negative predictions, the ensemble makes a positive prediction for class $c$.
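Written out, and assuming the aggregated value is simply the sum of the confidence, trust, and model-weight terms described below (a plausible reading of the aggregation, not necessarily the exact implementation), the decision rule is:

$$
\sum_{i \,:\, p_c^{m_i} \geq 0.5} \text{confidence}_c^{m_i} \cdot \text{trust}_{m_i} \cdot \text{weight}_{m_i} \;>\; \sum_{i \,:\, p_c^{m_i} < 0.5} \text{confidence}_c^{m_i} \cdot \text{trust}_{m_i} \cdot \text{weight}_{m_i}
$$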
Here, confidence is the model's (self-reported) confidence in its prediction, calculated as
$$
\text{confidence}_c^{m_i} = 2|p_c^{m_i} - 0.5|
$$
For example, if a model makes a positive prediction with $p_c^{m_i} = 0.9$, its confidence is $2 \cdot |0.9 - 0.5| = 0.8$, while a borderline prediction of $p_c^{m_i} = 0.55$ only has a confidence of $0.1$.
Confidence can be disabled via the `use_confidence` parameter of the predict method (default: True).

The `model_weight` can be set for each model in the configuration file (default: 1). It can be used to favor a certain model independently of the class in question.

Trust is based on a model's performance on a validation set: after training, we evaluate the machine learning models on a validation set for each class. If the `ensemble_type` is set to `wmv-f1`, the trust is calculated as 1 + the F1 score. If the `ensemble_type` is set to `mv` (the default), the trust is set to 1 for all models.
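Putting these pieces together, here is a minimal sketch of a per-class decision under the assumptions above; `decide_class` and its signature are illustrative, not part of the package's API:

```python
def decide_class(model_probs, trusts, weights, use_confidence=True):
    """Weighted-majority decision for a single class (illustrative sketch,
    not the package's actual API)."""
    pos, neg = 0.0, 0.0
    for p, trust, weight in zip(model_probs, trusts, weights):
        confidence = 2 * abs(p - 0.5) if use_confidence else 1.0
        vote = confidence * trust * weight
        if p >= 0.5:
            pos += vote  # positive prediction: p >= 0.5
        else:
            neg += vote  # negative prediction: p < 0.5
    return 1 if pos > neg else 0

# Two confident positives (0.9, 0.8) outweigh one weak negative (0.45)
assert decide_class([0.9, 0.8, 0.45], trusts=[1, 1, 1], weights=[1, 1, 1]) == 1
```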
- After a decision has been made for each class independently, the consistency of the predictions with regard to the ChEBI hierarchy and disjointness axioms is checked. This is done in 3 steps:
- (1) First, the hierarchy is corrected. For each pair of classes $A$ and $B$ where $A$ is a subclass of $B$ (following the is-a relation in ChEBI), we set the ensemble prediction of $B$ to 1 if the prediction of $A$ is 1. Intuitively speaking, if we have determined that a molecule belongs to a specific class (e.g., aromatic primary alcohol), it also belongs to the direct and indirect superclasses (e.g., primary alcohol, aromatic alcohol, alcohol).
- (2) Next, we check for disjointness. This is not specified directly in ChEBI, but in an additional ChEBI module (chebi-disjoints.owl). We have extracted these disjointness axioms into a CSV file and added some more disjointness axioms ourselves (see `data/disjoint_chebi.csv` and `data/disjoint_additional.csv`). If two classes $A$ and $B$ are disjoint and we predict both, we select one of them randomly and set the other to 0.
- (3) Since the second step might have introduced new inconsistencies into the hierarchy, we repeat the first step, but with a small change: for a pair of classes $A \subseteq B$ with predictions $1$ and $0$, instead of setting $B$ to $1$, we now set $A$ to $0$. This has the advantage that we cannot introduce new disjointness inconsistencies and don't have to repeat step 2. A sketch of all three steps is given below.
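A minimal sketch of this three-step correction, assuming the hierarchy and disjointness axioms are given as pairs of class IDs; `make_consistent` and its inputs are illustrative names, not the package's API:

```python
import random

def make_consistent(predictions, subclass_pairs, disjoint_pairs):
    """Three-step consistency correction (illustrative sketch).

    predictions: dict mapping class ID -> 0 or 1 (per-class ensemble decision)
    subclass_pairs: pairs (a, b) where a is a subclass of b (is-a in ChEBI)
    disjoint_pairs: pairs (a, b) of disjoint classes
    """
    # Step 1: propagate positives upwards (a = 1 implies b = 1). Looping to a
    # fixpoint also covers indirect superclasses.
    changed = True
    while changed:
        changed = False
        for a, b in subclass_pairs:
            if predictions[a] == 1 and predictions[b] == 0:
                predictions[b] = 1
                changed = True

    # Step 2: if two disjoint classes are both predicted, randomly keep one.
    for a, b in disjoint_pairs:
        if predictions[a] == 1 and predictions[b] == 1:
            predictions[random.choice([a, b])] = 0

    # Step 3: repair the hierarchy again, now downwards (b = 0 implies a = 0),
    # which cannot create new disjointness violations.
    changed = True
    while changed:
        changed = False
        for a, b in subclass_pairs:
            if predictions[b] == 0 and predictions[a] == 1:
                predictions[a] = 0
                changed = True
    return predictions
```

Looping each propagation to a fixpoint handles indirect superclasses without requiring a topological order of the hierarchy.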