Commit f5f06dc

Merge pull request #6 from ChEB-AI/copilot/update-readme-documentation
docs: update README to reflect current API
2 parents 5810e56 + 00ce292 commit f5f06dc

1 file changed: +47 -18 lines changed

README.md

Lines changed: 47 additions & 18 deletions
@@ -1,6 +1,6 @@
 # python-chebi-utils

-Common processing functionality for the ChEBI ontology — download data files, extract classes and relations, extract molecules, and generate stratified train/val/test splits.
+Common processing functionality for the ChEBI ontology — download versioned data files, build an ontology graph, extract molecules, assemble labeled datasets, and generate stratified train/val/test splits.

 ## Installation

@@ -21,22 +21,34 @@ pip install -e ".[dev]"
 ```python
 from chebi_utils import download_chebi_obo, download_chebi_sdf

-obo_path = download_chebi_obo(dest_dir="data/") # downloads chebi.obo
-sdf_path = download_chebi_sdf(dest_dir="data/") # downloads chebi.sdf.gz
+obo_path = download_chebi_obo(version=248, dest_dir="data/") # downloads chebi.obo
+sdf_path = download_chebi_sdf(version=248, dest_dir="data/") # downloads chebi.sdf.gz
 ```

+A specific ChEBI release `version` (e.g. `230`, `245`, `248`) must be provided.
 Files are fetched from the [EBI FTP server](https://ftp.ebi.ac.uk/pub/databases/chebi/).
+Versions below 245 are automatically fetched from the legacy archive path.

-### Extract ontology classes and relations
+### Build the ChEBI ontology graph

 ```python
-from chebi_utils import extract_classes, extract_relations
+from chebi_utils import build_chebi_graph

-classes = extract_classes("chebi.obo")
-# DataFrame: id, name, definition, is_obsolete
+graph = build_chebi_graph("chebi.obo")
+# networkx.DiGraph — nodes are string ChEBI IDs (e.g. "1" for CHEBI:1)
+# node attributes: name, smiles, subset
+# edge attribute: relation ("is_a", "has_part", …)
+```
+
+Obsolete terms are excluded automatically. `xref:` lines are stripped before
+parsing to work around known fastobo compatibility issues in some ChEBI releases.
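Reviewer note: the `xref:` stripping described in the added lines can be pictured with a short sketch. This is not the code from this commit; `strip_xref_lines` and the sample stanza are hypothetical, and the sketch assumes the stripping is a plain line filter applied to the OBO text before it is handed to the parser.

```python
def strip_xref_lines(obo_text: str) -> str:
    """Return OBO text with every line that starts with 'xref:' removed.

    Hypothetical helper for illustration; the package's own
    implementation may differ.
    """
    kept = [
        line
        for line in obo_text.splitlines(keepends=True)
        if not line.lstrip().startswith("xref:")
    ]
    return "".join(kept)


# Made-up stanza, not taken from a real ChEBI release.
sample = (
    "[Term]\n"
    "id: CHEBI:1\n"
    "name: example term\n"
    "xref: EXAMPLE:123\n"
    "is_a: CHEBI:2\n"
)

cleaned = strip_xref_lines(sample)
print(cleaned)
```

Filtering on the line prefix leaves every other tag in the stanza untouched, so identifiers, names, and `is_a` links still reach the parser.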

-relations = extract_relations("chebi.obo")
-# DataFrame: source_id, target_id, relation_type (is_a, has_role, …)
+To obtain only the `is_a` hierarchy as a subgraph:
+
+```python
+from chebi_utils.obo_extractor import get_hierarchy_subgraph
+
+hierarchy = get_hierarchy_subgraph(graph)
 ```

 ### Extract molecules
@@ -45,28 +57,45 @@ relations = extract_relations("chebi.obo")
 from chebi_utils import extract_molecules

 molecules = extract_molecules("chebi.sdf.gz")
-# DataFrame: chebi_id, name, smiles, inchi, inchikey, formula, charge, mass, …
+# DataFrame columns: chebi_id, name, inchi, inchikey, smiles, charge, mass, mol, …
+# mol column contains RDKit Mol objects (None when parsing fails)
 ```

 Both plain `.sdf` and gzip-compressed `.sdf.gz` files are supported.
+Molecules that cannot be parsed are excluded from the returned DataFrame.

-### Generate train/val/test splits
+### Build a labeled dataset

 ```python
-from chebi_utils import create_splits
+from chebi_utils import build_labeled_dataset

-splits = create_splits(molecules, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1)
-train_df = splits["train"]
-val_df = splits["val"]
-test_df = splits["test"]
+dataset, labels = build_labeled_dataset(graph, molecules, min_molecules=50)
+# dataset — DataFrame with columns: chebi_id, mol, <label1>, <label2>, …
+# one boolean column per selected ontology class
+# labels — sorted list of ChEBI IDs selected as label classes
 ```

-Pass `stratify_col` to preserve class proportions across splits:
+Each molecule is assigned to every label class that it belongs to directly or
+through a chain of `is_a` relationships. Only classes with at least
+`min_molecules` descendant molecules are kept as labels.
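Reviewer note: the labeling rule in the added paragraph amounts to a transitive closure over `is_a` followed by a count threshold. The sketch below illustrates that idea on a toy hierarchy using only the standard library. It is not the implementation of `build_labeled_dataset`; the names `ancestors` and `select_labels` are hypothetical, and counting a molecule as a member of its own class is an assumption.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Collect every class reachable from `term` by following is_a edges."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def select_labels(parents, molecule_ids, min_molecules):
    """Return (sorted label classes, class -> set of member molecules)."""
    members = defaultdict(set)
    for mol_id in molecule_ids:
        # Assumption: a molecule also counts toward its own class.
        for cls in ancestors(mol_id, parents) | {mol_id}:
            members[cls].add(mol_id)
    labels = sorted(c for c, mols in members.items() if len(mols) >= min_molecules)
    return labels, members


# Toy hierarchy (child -> is_a parents); the IDs are made up.
parents = {"3": ["2"], "4": ["2"], "2": ["1"], "5": ["1"]}
molecule_ids = ["3", "4", "5"]

labels, members = select_labels(parents, molecule_ids, min_molecules=2)
# One boolean per label class for each molecule, in label order.
rows = {m: [m in members[label] for label in labels] for m in molecule_ids}
print(labels)
print(rows)
```

With `min_molecules=2`, class "1" covers all three toy molecules and class "2" covers two of them, so both become label columns, while the single-molecule classes are dropped.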
+
+### Generate stratified train/val/test splits

 ```python
-splits = create_splits(classes, stratify_col="is_obsolete", seed=42)
+from chebi_utils import create_multilabel_splits
+
+splits = create_multilabel_splits(dataset, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1)
+train_df = splits["train"]
+val_df = splits["val"]
+test_df = splits["test"]
 ```

+Columns 0 and 1 (`chebi_id`, `mol`) are treated as metadata; all remaining
+columns are treated as binary label columns. When multiple label columns are
+present, `MultilabelStratifiedShuffleSplit` from the
+`iterative-stratification` package is used; for a single label column,
+`StratifiedShuffleSplit` from scikit-learn is used.
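Reviewer note: the column convention and splitter choice described in the added paragraph can be summarized as a small dispatch. The sketch below only models the decision and returns the splitter's class name as a string, so it needs neither scikit-learn nor `iterative-stratification`. `plan_split` is a hypothetical name, and raising an error when no label column is present is an assumption, not documented behavior.

```python
def plan_split(columns):
    """Partition column names into metadata and labels, then name the splitter.

    Hypothetical helper: the first two columns are metadata, the rest are
    binary label columns, as the README text describes.
    """
    meta_cols = list(columns[:2])
    label_cols = list(columns[2:])
    if not label_cols:
        # Assumption: stratification needs at least one label column.
        raise ValueError("expected at least one label column after chebi_id and mol")
    if len(label_cols) > 1:
        splitter = "MultilabelStratifiedShuffleSplit"
    else:
        splitter = "StratifiedShuffleSplit"
    return meta_cols, label_cols, splitter


multi = plan_split(["chebi_id", "mol", "1", "2"])
single = plan_split(["chebi_id", "mol", "1"])
print(multi)
print(single)
```

Because the rule is positional, a DataFrame passed to `create_multilabel_splits` presumably needs `chebi_id` and `mol` as its first two columns, which matches the layout that `build_labeled_dataset` is documented to return.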
+
 ## Running Tests

 ```bash
