Should HSPEstimator be a regressor, a classifier, or both? #3

Gnpd (Maintainer) · 2025-08-29T08:36:27Z

Context

In HSPiPy, the HSPEstimator class is implemented as a scikit-learn–compatible estimator.
Currently, its role is somewhat ambiguous:

Classifier perspective:
predict(X) returns binary labels (1 if a solvent is inside the fitted sphere(s), 0 otherwise).
Accuracy is measured using classification metrics.
Regressor perspective:
transform(X) returns continuous RED values (relative energy difference: the Hansen distance Ra divided by the sphere radius R0).
These are closer to regression outputs, and one could imagine fitting/predicting continuous solubility metrics instead of hard inside/outside classification.
Transformer perspective:
It also works inside pipelines as a transformer, converting solvent HSP coordinates (δD, δP, δH) into RED values.
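
To make the classifier and transformer views concrete, here is a minimal sketch of how RED values and inside/outside labels relate for a single sphere. It is not HSPiPy's implementation: the function names, the example centre and radius, and the choice to count RED <= 1 as "inside" are assumptions for illustration. Only the standard Hansen distance, with the factor 4 on the dispersion term, is taken as given.

```python
import numpy as np

def red_values(X, center, radius):
    """Relative energy difference of each solvent to one Hansen sphere.

    X: array of shape (n_solvents, 3) with columns (dD, dP, dH).
    center: (dD, dP, dH) of the sphere; radius: R0.
    """
    X = np.asarray(X, dtype=float)
    diff = X - np.asarray(center, dtype=float)
    # Standard Hansen distance: the dispersion term is weighted by 4.
    ra = np.sqrt(4.0 * diff[:, 0] ** 2 + diff[:, 1] ** 2 + diff[:, 2] ** 2)
    return ra / radius

def inside_labels(X, center, radius):
    # Assumption: RED <= 1 counts as inside the sphere.
    return (red_values(X, center, radius) <= 1.0).astype(int)

# Illustrative sphere and solvents, not taken from any dataset.
center = (18.0, 6.0, 8.0)
radius = 5.0
solvents = np.array([
    [18.0, 6.0, 8.0],   # at the centre
    [18.0, 9.0, 12.0],  # exactly on the boundary
    [15.0, 6.0, 8.0],   # outside, because the dD difference is doubled
])

red = red_values(solvents, center, radius)        # continuous output
labels = inside_labels(solvents, center, radius)  # binary output
```

The binary labels are just a threshold on the continuous values, which is why the same fitted sphere can be read as a classifier or as a transformer.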

Binarization of solvent scores

At present, solvent labels are binarized inside the estimator:

((y <= inside_limit) & (y != 0)).astype(int)

This works well when datasets already classify solvents as “good/bad” (inside/outside).
However, many published datasets provide graded solubility values (e.g. swelling %, absorbance, weight loss, Hansen distances) rather than a simple binary score.
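
As a small illustration of what the expression above does to graded scores, the sketch below wraps it in a helper and applies it with two thresholds. The helper name `binarize_scores`, the sample scores, and the two `inside_limit` values are assumptions chosen for the example.

```python
import numpy as np

def binarize_scores(y, inside_limit=1):
    """Apply the binarization expression quoted above to raw scores."""
    y = np.asarray(y)
    # A score of 0 is always mapped to 0; other scores are 1 when <= inside_limit.
    return ((y <= inside_limit) & (y != 0)).astype(int)

# Hypothetical graded scores for five solvents.
scores = np.array([1, 2, 3, 0, 6])

strict = binarize_scores(scores, inside_limit=1)  # only score 1 is "inside"
loose = binarize_scores(scores, inside_limit=2)   # scores 1 and 2 are "inside"
```

With `inside_limit=2`, scores 1 and 2 collapse into the same label, and scores 3 and 6 become indistinguishable. That is the information a regressor-style design could keep.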

Possible implications:

A classifier design enforces binarization, possibly losing information in graded datasets.
A regressor design could predict continuous solubility scores directly.
A hybrid/dual approach (e.g. HSPClassifier and HSPRegressor) might allow both.

Why this matters

Some applications (polymer screening, formulation) only need good/bad classification.
Others (quantitative solubility, miscibility ranges) would benefit from regression on the raw scores.
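Choosing "classifier" vs "regressor" determines the default metric behind score() (accuracy_score for ClassifierMixin, r2_score for RegressorMixin) and how scikit-learn tools such as cross_val_score and GridSearchCV treat HSPiPy estimators in a pipeline.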
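
The effect on scoring can be sketched with two toy estimators that share the same sphere logic and differ only in their scikit-learn mixin. The classes `_FixedSphere`, `FixedSphereClassifier` and `FixedSphereRegressor` are hypothetical and do no real fitting. They are not part of HSPiPy and only show how a split design would interact with `score()`.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin
from sklearn.metrics import accuracy_score, r2_score

class _FixedSphere(BaseEstimator):
    """Hypothetical base: a sphere with fixed centre and radius."""

    def __init__(self, center=(18.0, 6.0, 8.0), radius=5.0):
        self.center = center
        self.radius = radius

    def fit(self, X, y=None):
        # Placeholder: a real estimator would optimise centre and radius here.
        self.is_fitted_ = True
        return self

    def _red(self, X):
        diff = np.asarray(X, dtype=float) - np.asarray(self.center, dtype=float)
        ra = np.sqrt(4.0 * diff[:, 0] ** 2 + diff[:, 1] ** 2 + diff[:, 2] ** 2)
        return ra / self.radius

class FixedSphereClassifier(ClassifierMixin, _FixedSphere):
    def predict(self, X):
        # Assumption: RED <= 1 counts as inside.
        return (self._red(X) <= 1.0).astype(int)

class FixedSphereRegressor(RegressorMixin, _FixedSphere):
    def predict(self, X):
        # Here the continuous RED value itself is the prediction target.
        return self._red(X)

# Illustrative data only.
X = np.array([[18.0, 6.0, 8.0], [18.0, 9.0, 12.0], [15.0, 6.0, 8.0]])
y_labels = np.array([1, 0, 0])
y_red = np.array([0.1, 0.9, 1.3])

clf = FixedSphereClassifier().fit(X, y_labels)
reg = FixedSphereRegressor().fit(X, y_red)

clf_score = clf.score(X, y_labels)  # accuracy, inherited from ClassifierMixin
reg_score = reg.score(X, y_red)     # R^2, inherited from RegressorMixin
```

Both metrics can still be called explicitly on either estimator. The mixin only changes what `score()` and default scorers report.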

Questions for the community

Do you mainly work with binary good/bad solvent datasets, or with continuous solubility ranges?
Would splitting into HSPClassifier and HSPRegressor make sense, or should we keep a single flexible class?
Are there published workflows where regression on raw scores gives better HSP fits than classification on binarized labels?

💬 Feedback is very welcome! Please share your preferences, experiences, or links to relevant publications on handling continuous vs binary solubility data.

Replies: 0 comments