
Data Management

Simon Flügel edited this page Apr 2, 2024 · 7 revisions

Loading ChEBI Ontology Data

ChEBai loads the ChEBI ontology data from the following URL, where {version} is the ChEBI version number: http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo
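As a minimal sketch of how the {version} placeholder is filled in, the snippet below builds the download URL for a given version. The function name `chebi_obo_url` and the constant name are hypothetical and not taken from the ChEBai code base; only the URL pattern comes from the text above.

```python
# URL pattern quoted from the wiki text; {version} is the ChEBI version number.
CHEBI_URL_TEMPLATE = "http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo"


def chebi_obo_url(version: int) -> str:
    """Return the ontology URL for one ChEBI version (hypothetical helper)."""
    return CHEBI_URL_TEMPLATE.format(version=version)


if __name__ == "__main__":
    # 200 is the default version mentioned on this page.
    print(chebi_obo_url(200))
```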

You can find more information on the ChEBI ontology here: https://www.ebi.ac.uk/chebi

ChEBI versions

To change the ChEBI version used for all sets (default: 200), use:

--data.init_args.chebi_version=VERSION

To change only the version of the train and validation sets, independently of the test set, use:

--data.init_args.chebi_version_train=VERSION
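The following sketch shows how the two options can be combined into a list of command-line arguments. The helper `version_flags` is hypothetical, and the version numbers in the usage example are illustrative only. The command these flags are appended to is not shown, because it is not specified on this page.

```python
from typing import List, Optional


def version_flags(chebi_version: int,
                  chebi_version_train: Optional[int] = None) -> List[str]:
    """Build the version-related flags (hypothetical helper).

    chebi_version applies to all sets; chebi_version_train, if given,
    is passed in addition for the train and validation sets.
    """
    flags = [f"--data.init_args.chebi_version={chebi_version}"]
    if chebi_version_train is not None:
        flags.append(
            f"--data.init_args.chebi_version_train={chebi_version_train}"
        )
    return flags


if __name__ == "__main__":
    # Illustrative values: one version overall, another for train/validation.
    for flag in version_flags(227, chebi_version_train=200):
        print(flag)
```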

Data Preprocessing

After loading the ontology data, ChEBai preprocesses it: it extracts the class hierarchy and splits the data into train, validation, and test sets. During preprocessing, a filter keeps only chemical entities that have at least a minimum number of subclasses (e.g., 50 or 100) annotated with SMILES (Simplified Molecular Input Line Entry System) strings.
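The filter described above can be illustrated with a small sketch. This is not ChEBai's implementation: the function name, the dictionary-based hierarchy, and the toy data are assumptions chosen to show the idea of counting SMILES-annotated subclasses against a threshold.

```python
from typing import Dict, List, Set


def select_classes(subclasses: Dict[str, Set[str]],
                   has_smiles: Set[str],
                   threshold: int) -> List[str]:
    """Keep classes with at least `threshold` SMILES-annotated subclasses.

    subclasses: maps each class to the set of all its subclasses
                (assumed to be precomputed from the hierarchy).
    has_smiles: entities that carry a SMILES annotation.
    """
    selected = []
    for cls, subs in subclasses.items():
        # Count only the subclasses that are annotated with SMILES.
        n_annotated = len(subs & has_smiles)
        if n_annotated >= threshold:
            selected.append(cls)
    return sorted(selected)


if __name__ == "__main__":
    # Toy hierarchy; real thresholds would be e.g. 50 or 100.
    toy_subclasses = {
        "A": {"a1", "a2", "a3"},
        "B": {"b1"},
        "C": {"c1", "c2"},
    }
    toy_has_smiles = {"a1", "a2", "b1", "c1"}
    print(select_classes(toy_subclasses, toy_has_smiles, threshold=2))
```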

Data folder structure

Data is organized within the following directory structure:

data/${dataset_name}/${chebi_version}/raw/
  • The raw dataset contains SMILES strings and class columns with boolean values, stored in .pkl format.
data/${dataset_name}/${chebi_version}/processed/${reader_name}/

The placeholders in these paths are:
  • ${dataset_name} represents the _name attribute of the DataModule used.
  • ${chebi_version} refers to the ChEBI version.
  • ${reader_name} denotes the name attribute of the associated Reader class.

Files in the processed directory use the .pt format instead of .pkl.
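As a sketch of how this layout can be assembled, the helpers below build both directories with pathlib. The function names and the example values ("my_dataset", "my_reader") are hypothetical, and the version is inserted as a plain number, which is an assumption about how ${chebi_version} appears on disk.

```python
from pathlib import Path


def raw_dir(dataset_name: str, chebi_version: int,
            base: str = "data") -> Path:
    """data/${dataset_name}/${chebi_version}/raw/ (holds .pkl files)."""
    return Path(base) / dataset_name / str(chebi_version) / "raw"


def processed_dir(dataset_name: str, chebi_version: int,
                  reader_name: str, base: str = "data") -> Path:
    """data/${dataset_name}/${chebi_version}/processed/${reader_name}/
    (holds .pt files)."""
    return (Path(base) / dataset_name / str(chebi_version)
            / "processed" / reader_name)


if __name__ == "__main__":
    print(raw_dir("my_dataset", 200).as_posix())
    print(processed_dir("my_dataset", 200, "my_reader").as_posix())
```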

For cross-validation, the folds are stored in the raw directory as cv_${n_folds}_fold/fold_${fold_index}_train.pkl and cv_${n_folds}_fold/fold_${fold_index}_validation.pkl.
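The naming scheme for the fold files can be sketched as follows. The helper `fold_files` is hypothetical, and numbering the folds from 0 is an assumption; the page does not state where the fold index starts.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def fold_files(n_folds: int) -> List[Tuple[PurePosixPath, PurePosixPath]]:
    """Return (train, validation) file paths, relative to the raw directory.

    Assumption: fold indices run from 0 to n_folds - 1.
    """
    folder = PurePosixPath(f"cv_{n_folds}_fold")
    return [
        (folder / f"fold_{i}_train.pkl",
         folder / f"fold_{i}_validation.pkl")
        for i in range(n_folds)
    ]


if __name__ == "__main__":
    for train_file, validation_file in fold_files(3):
        print(train_file, validation_file)
```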
