 # python-chebi-utils

-Common processing functionality for the ChEBI ontology: download data files, extract classes and relations, extract molecules, and generate stratified train/val/test splits.
+Common processing functionality for the ChEBI ontology: download versioned data files, build an ontology graph, extract molecules, assemble labeled datasets, and generate stratified train/val/test splits.

 ## Installation

@@ -21,22 +21,34 @@ pip install -e ".[dev]"
 ```python
 from chebi_utils import download_chebi_obo, download_chebi_sdf

-obo_path = download_chebi_obo(dest_dir="data/")  # downloads chebi.obo
-sdf_path = download_chebi_sdf(dest_dir="data/")  # downloads chebi.sdf.gz
+obo_path = download_chebi_obo(version=248, dest_dir="data/")  # downloads chebi.obo
+sdf_path = download_chebi_sdf(version=248, dest_dir="data/")  # downloads chebi.sdf.gz
 ```

+A specific ChEBI release `version` (e.g. `230`, `245`, `248`) must be provided.
 Files are fetched from the [EBI FTP server](https://ftp.ebi.ac.uk/pub/databases/chebi/).
+Versions below 245 are automatically fetched from the legacy archive path.

-### Extract ontology classes and relations
+### Build the ChEBI ontology graph

 ```python
-from chebi_utils import extract_classes, extract_relations
+from chebi_utils import build_chebi_graph

-classes = extract_classes("chebi.obo")
-# DataFrame: id, name, definition, is_obsolete
+graph = build_chebi_graph("chebi.obo")
+# networkx.DiGraph: nodes are string ChEBI IDs (e.g. "1" for CHEBI:1)
+# node attributes: name, smiles, subset
+# edge attribute: relation ("is_a", "has_part", …)
+```
+
+Obsolete terms are excluded automatically. `xref:` lines are stripped before
+parsing to work around known fastobo compatibility issues in some ChEBI releases.

-relations = extract_relations("chebi.obo")
-# DataFrame: source_id, target_id, relation_type (is_a, has_role, …)
+To obtain only the `is_a` hierarchy as a subgraph:
+
+```python
+from chebi_utils.obo_extractor import get_hierarchy_subgraph
+
+hierarchy = get_hierarchy_subgraph(graph)
 ```

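As a rough illustration of what restricting the graph to `is_a` edges means, the sketch below filters a plain edge list by relation type. It uses only the standard library rather than networkx, and the edge data, the child-to-parent edge direction, and the helper name `filter_is_a_edges` are all assumptions made for the example. It is not the implementation of `get_hierarchy_subgraph`.

```python
# Hypothetical edge list: (child_id, parent_id, relation), mirroring the
# "relation" edge attribute described in the README. IDs are invented.
EDGES = [
    ("3", "2", "is_a"),
    ("2", "1", "is_a"),
    ("3", "9", "has_part"),
]


def filter_is_a_edges(edges):
    """Keep only the edges whose relation is "is_a" (hypothetical helper)."""
    return [(src, dst) for src, dst, relation in edges if relation == "is_a"]


hierarchy_edges = filter_is_a_edges(EDGES)
print(hierarchy_edges)
```

The `has_part` edge is dropped, so only the two classification edges remain.
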
 ### Extract molecules
@@ -45,28 +57,45 @@ relations = extract_relations("chebi.obo")
 from chebi_utils import extract_molecules

 molecules = extract_molecules("chebi.sdf.gz")
-# DataFrame: chebi_id, name, smiles, inchi, inchikey, formula, charge, mass, …
+# DataFrame columns: chebi_id, name, inchi, inchikey, smiles, charge, mass, mol, …
+# mol column contains RDKit Mol objects (None when parsing fails)
 ```

 Both plain `.sdf` and gzip-compressed `.sdf.gz` files are supported.
+Molecules that cannot be parsed are excluded from the returned DataFrame.

-### Generate train/val/test splits
+### Build a labeled dataset

 ```python
-from chebi_utils import create_splits
+from chebi_utils import build_labeled_dataset

-splits = create_splits(molecules, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1)
-train_df = splits["train"]
-val_df = splits["val"]
-test_df = splits["test"]
+dataset, labels = build_labeled_dataset(graph, molecules, min_molecules=50)
+# dataset: DataFrame with columns chebi_id, mol, <label1>, <label2>, …
+#          one boolean column per selected ontology class
+# labels: sorted list of ChEBI IDs selected as label classes
 ```

-Pass `stratify_col` to preserve class proportions across splits:
+Each molecule is assigned to every label class that it belongs to directly or
+through a chain of `is_a` relationships. Only classes with at least
+`min_molecules` descendant molecules are kept as labels.
+
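The label selection rule described above can be sketched with plain dictionaries and sets. This is a minimal, standard-library illustration under stated assumptions: the edges are invented `(child_id, parent_id)` pairs, a molecule counts only toward its `is_a` ancestors and not toward its own ID, and the names `ancestors` and `select_labels` are hypothetical. The actual `build_labeled_dataset` may work differently.

```python
from collections import defaultdict

# Hypothetical is_a edges as (child_id, parent_id) pairs; IDs are invented.
IS_A_EDGES = [("10", "2"), ("11", "2"), ("12", "3"), ("2", "1"), ("3", "1")]
# Hypothetical set of ChEBI IDs that come with a molecule structure.
MOLECULE_IDS = {"10", "11", "12"}


def ancestors(node, parents):
    """Return every class reachable from `node` by following is_a edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def select_labels(edges, molecule_ids, min_molecules):
    """Map each class with enough descendant molecules to those molecules."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    members = defaultdict(set)
    for mol_id in molecule_ids:
        # Assumption: a molecule counts toward its ancestors, not toward itself.
        for cls in ancestors(mol_id, parents):
            members[cls].add(mol_id)

    return {cls: mols for cls, mols in members.items() if len(mols) >= min_molecules}


labels = select_labels(IS_A_EDGES, MOLECULE_IDS, min_molecules=2)
print(sorted(labels))
```

With `min_molecules=2`, class `"3"` is dropped because only one molecule descends from it, while `"1"` and `"2"` are kept.
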
+### Generate stratified train/val/test splits

 ```python
-splits = create_splits(classes, stratify_col="is_obsolete", seed=42)
+from chebi_utils import create_multilabel_splits
+
+splits = create_multilabel_splits(dataset, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1)
+train_df = splits["train"]
+val_df = splits["val"]
+test_df = splits["test"]
 ```

+Columns 0 and 1 (`chebi_id`, `mol`) are treated as metadata; all remaining
+columns are treated as binary label columns. When multiple label columns are
+present, `MultilabelStratifiedShuffleSplit` from the
+`iterative-stratification` package is used; for a single label column,
+`StratifiedShuffleSplit` from scikit-learn is used.
+
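To make the idea of stratification concrete for the single-label case, here is a small standard-library sketch that shuffles each label group separately and cuts it by the requested ratios. It stands in for the concept only: it is neither scikit-learn's `StratifiedShuffleSplit` nor the code behind `create_multilabel_splits`, and the function name `stratified_split` and the toy label column are assumptions.

```python
import random


def stratified_split(labels, train_ratio=0.8, val_ratio=0.1, seed=0):
    """Split row indices so each label value keeps a similar share per split."""
    rng = random.Random(seed)

    # Group row indices by their label value.
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)

    splits = {"train": [], "val": [], "test": []}
    for indices in groups.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_ratio)
        n_val = round(len(indices) * val_ratio)
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        # Whatever remains in the group goes to the test split.
        splits["test"] += indices[n_train + n_val:]
    return splits


# Hypothetical single binary label column: 20 positives, 80 negatives.
labels = [True] * 20 + [False] * 80
splits = stratified_split(labels)
print({name: len(rows) for name, rows in splits.items()})
```

Because each group is cut independently, the 20/80 class balance is carried into every split; the multilabel case needs an iterative procedure, which is why a dedicated package is used there.
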
 ## Running Tests

 ```bash