Data Management

Loading ChEBI Ontology Data

ChEBai accesses the ChEBI ontology data at the following URL, where {version} stands for the ChEBI version: http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo
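As a minimal sketch of how the {version} placeholder is filled in, the snippet below formats the URL for one release. The helper name chebi_obo_url is hypothetical and not part of ChEBai; only the URL template comes from the text above.

```python
# URL template as given in this section; {version} is the ChEBI version number.
CHEBI_OBO_URL = "http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo"


def chebi_obo_url(version: int) -> str:
    """Return the ontology URL for one ChEBI version (hypothetical helper)."""
    return CHEBI_OBO_URL.format(version=version)


# Version 200 is the default mentioned under "ChEBI versions".
print(chebi_obo_url(200))
```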

You can find more information on the ChEBI ontology here: https://www.ebi.ac.uk/chebi

ChEBI versions

To change the ChEBI version used for all sets (default: 200), use:

--data.init_args.chebi_version=VERSION

To change the version of only the train and validation sets, independently of the test set, use:

--data.init_args.chebi_version_train=VERSION
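The interplay of the two options can be pictured with the following sketch. It is an illustration only: resolve_versions is a hypothetical function, and the assumption that the train version falls back to chebi_version when it is not set is inferred from the descriptions above, not taken from the ChEBai source.

```python
from typing import Dict, Optional


def resolve_versions(chebi_version: int = 200,
                     chebi_version_train: Optional[int] = None) -> Dict[str, int]:
    """Show which ChEBI version each split would use (hypothetical helper)."""
    # Assumption: without a separate train version, all splits share chebi_version.
    if chebi_version_train is None:
        train_version = chebi_version
    else:
        train_version = chebi_version_train
    return {
        "train": train_version,
        "validation": train_version,
        "test": chebi_version,
    }


# Only chebi_version set: every split uses the same version.
print(resolve_versions(chebi_version=227))
# Both set: train and validation differ from the test set.
print(resolve_versions(chebi_version=227, chebi_version_train=200))
```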

Data Preprocessing

After loading the ontology data, ChEBai preprocesses it: it extracts the class hierarchy and divides the data into train, validation, and test sets. During preprocessing, a filter keeps only those chemical entities that have at least a minimum number of subclasses (e.g., 50 or 100) annotated with SMILES (Simplified Molecular Input Line Entry System) strings.
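The filter can be sketched on a toy hierarchy as follows. This is not ChEBai's implementation: the function names, the input layout (a child-to-parents mapping plus a SMILES dictionary), and the choice to count all transitive subclasses rather than only direct ones are assumptions made for illustration.

```python
from typing import Dict, List, Set


def smiles_subclass_counts(parents: Dict[str, List[str]],
                           smiles: Dict[str, str]) -> Dict[str, int]:
    """Count, per class, the distinct subclasses that carry a SMILES string."""
    # Invert the child -> parents mapping into parent -> children.
    children: Dict[str, Set[str]] = {}
    for child, class_parents in parents.items():
        for parent in class_parents:
            children.setdefault(parent, set()).add(child)

    def descendants(node: str) -> Set[str]:
        # Iterative traversal; the seen set avoids double counting in a DAG.
        seen: Set[str] = set()
        stack = list(children.get(node, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(children.get(current, ()))
        return seen

    nodes = set(parents) | set(children)
    return {n: sum(1 for d in descendants(n) if d in smiles) for n in nodes}


def select_classes(parents: Dict[str, List[str]],
                   smiles: Dict[str, str],
                   threshold: int) -> List[str]:
    """Keep classes with at least `threshold` SMILES-annotated subclasses."""
    counts = smiles_subclass_counts(parents, smiles)
    return sorted(c for c, n in counts.items() if n >= threshold)


# Toy data: B and C are subclasses of A, D of B, E of both B and C.
toy_parents = {"B": ["A"], "C": ["A"], "D": ["B"], "E": ["B", "C"]}
toy_smiles = {"C": "C", "D": "CCO", "E": "CC(=O)O"}
print(select_classes(toy_parents, toy_smiles, threshold=2))
```

With the thresholds named above (50 or 100), the same selection would be applied to the full ontology instead of this toy graph.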

Data folder structure

Data is organized within the following directory structure:

data/${dataset_name}/${chebi_version}/raw/

The raw dataset contains SMILES strings and class columns with boolean values, stored in .pkl format.

data/${dataset_name}/${chebi_version}/processed/${reader_name}/

${dataset_name} represents the _name attribute of the DataModule used.
${chebi_version} refers to the ChEBI version.
${reader_name} denotes the name attribute of the associated Reader class.

In the processed directory, files are stored in .pt format instead of .pkl.

For cross-validation, the folds are stored as cv_${n_folds}_fold/fold_${fold_index}_train.pkl and cv_${n_folds}_fold/fold_${fold_index}_validation.pkl in the raw directory.
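The layout described in this section can be assembled with pathlib, as in the sketch below. The helper names, the root folder argument, and the example values ("my_dataset", "my_reader") are hypothetical. How ChEBai renders the version in the directory name and whether fold indices start at 0 are not stated here, so the plain number and a zero-based index are assumptions.

```python
from pathlib import Path
from typing import Tuple


def raw_dir(dataset_name: str, chebi_version: int, root: str = "data") -> Path:
    """data/${dataset_name}/${chebi_version}/raw/"""
    return Path(root) / dataset_name / str(chebi_version) / "raw"


def processed_dir(dataset_name: str, chebi_version: int, reader_name: str,
                  root: str = "data") -> Path:
    """data/${dataset_name}/${chebi_version}/processed/${reader_name}/"""
    return Path(root) / dataset_name / str(chebi_version) / "processed" / reader_name


def fold_files(dataset_name: str, chebi_version: int, n_folds: int,
               fold_index: int, root: str = "data") -> Tuple[Path, Path]:
    """Return the train and validation files of one cross-validation fold."""
    fold_dir = raw_dir(dataset_name, chebi_version, root) / f"cv_{n_folds}_fold"
    return (
        fold_dir / f"fold_{fold_index}_train.pkl",
        fold_dir / f"fold_{fold_index}_validation.pkl",
    )


# Example values are placeholders, not real dataset or reader names.
print(raw_dir("my_dataset", 200).as_posix())
print(processed_dir("my_dataset", 200, "my_reader").as_posix())
for path in fold_files("my_dataset", 200, n_folds=5, fold_index=0):
    print(path.as_posix())
```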
