In HSPiPy, the HSPEstimator class is implemented as a scikit-learn–compatible estimator.
Currently, its role is somewhat ambiguous:
Classifier perspective: predict(X) returns binary labels (1 if a solvent is inside the fitted sphere(s), 0 otherwise).
Accuracy is measured using classification metrics.
Regressor perspective: transform(X) returns continuous RED values (Relative Energy Difference, i.e. the Hansen distance to the sphere center divided by the sphere radius).
These are closer to regression outputs, and one could imagine fitting/predicting continuous solubility metrics instead of hard inside/outside classification.
Transformer perspective:
It also works inside pipelines as a transformer, converting solvent coordinates into RED values.
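To make the three perspectives concrete, here is a minimal NumPy sketch of how a continuous RED value and a binary inside/outside label relate for a single sphere. It is not HSPiPy's implementation: the function names red and inside, the (δD, δP, δH) column order, the example sphere, and the choice to count RED = 1 as inside are all assumptions made for illustration.

```python
import numpy as np


def red(X, center, radius):
    """RED of each solvent relative to one Hansen sphere (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    diff = X - np.asarray(center, dtype=float)
    # Standard Hansen distance: the dispersion term carries a factor of 4.
    ra = np.sqrt(4.0 * diff[:, 0] ** 2 + diff[:, 1] ** 2 + diff[:, 2] ** 2)
    return ra / radius


def inside(X, center, radius):
    """Binary view of the same quantity: 1 if RED <= 1 (assumed convention), else 0."""
    return (red(X, center, radius) <= 1.0).astype(int)


# Made-up sphere and solvents, columns assumed to be (dD, dP, dH).
center = [18.0, 10.0, 8.0]
radius = 8.0
solvents = np.array([
    [18.0, 10.0, 8.0],  # exactly at the center
    [16.0, 10.0, 8.0],  # 2 units away in dD only
    [18.0, 0.0, 0.0],   # far away in dP and dH
])

red_values = red(solvents, center, radius)  # continuous, transform-like output
labels = inside(solvents, center, radius)   # binary, predict-like output
print(red_values, labels)
```

The point of the sketch is that the label is just a thresholded RED, so the classifier and regressor views share the same underlying geometry.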
Binarization of solvent scores
At present, solvent labels are binarized inside the estimator:
((y<=inside_limit) & (y!=0)).astype(int)
This works well when datasets already classify solvents as “good/bad” (inside/outside).
However, many published datasets provide graded solubility values (e.g. swelling %, absorbance, weight loss, or Hansen distances) rather than a simple binary score.
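As a quick illustration of what the binarization expression does to graded data, the snippet below applies it to a made-up score vector. The wrapper function binarize and the sample scores are hypothetical; only the boolean expression itself comes from the estimator code quoted above.

```python
import numpy as np


def binarize(y, inside_limit=1):
    """Apply the binarization expression to an array of solvent scores."""
    y = np.asarray(y)
    # Scores at or below the limit count as inside, except a score of 0.
    return ((y <= inside_limit) & (y != 0)).astype(int)


# Made-up graded scores where lower means better, as the <= comparison implies.
scores = np.array([1, 2, 3, 0, 6])

strict = binarize(scores, inside_limit=1)   # only a score of 1 is "inside"
lenient = binarize(scores, inside_limit=2)  # scores 1 and 2 are "inside"
print(strict, lenient)
```

With inside_limit=1, the scores 2, 3 and 6 all collapse to the same label 0, which is the information loss on graded datasets that this discussion is about. Note also that a score of 0 always maps to 0, whatever the limit.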
Possible implications:
A classifier design enforces binarization, possibly losing information in graded datasets.
A regressor design could predict continuous solubility scores directly.
A hybrid/dual approach (e.g. HSPClassifier and HSPRegressor) might allow both.
Why this matters
Some applications (polymer screening, formulation) only need good/bad classification.
Others (quantitative solubility, miscibility ranges) would benefit from regression on the raw scores.
Choosing "classifier" vs "regressor" affects which metrics (accuracy_score vs r2_score) are available and how pipelines integrate HSPiPy.
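For reference, here is a small sketch of how the two metric families differ in what they expect. It uses scikit-learn's accuracy_score and r2_score directly on made-up arrays; none of the numbers come from HSPiPy or from a real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score

# Classifier view: hypothetical binary labels (1 = good solvent).
y_true_labels = np.array([1, 1, 0, 0])
y_pred_labels = np.array([1, 0, 0, 0])
acc = accuracy_score(y_true_labels, y_pred_labels)  # fraction of exact matches

# Regressor view: hypothetical continuous scores and predictions.
y_true_scores = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_scores = np.array([1.0, 2.0, 3.0, 5.0])
r2 = r2_score(y_true_scores, y_pred_scores)  # 1 - SS_res / SS_tot

print(f"accuracy = {acc:.2f}, R^2 = {r2:.2f}")
```

accuracy_score only rewards exact label matches, while r2_score measures how much of the variance in the continuous target is explained, so the choice of estimator type decides which of these is the natural default for scoring and model selection.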
Questions for the community
Do you mainly work with binary good/bad solvent datasets, or with continuous solubility ranges?
Would splitting into HSPClassifier and HSPRegressor make sense, or should we keep a single flexible class?
Are there published workflows where regression on raw scores gives better HSP fits than classification on binarized labels?
💬 Feedback is very welcome! Please share your preferences, experiences, or links to relevant publications on handling continuous vs binary solubility data.