ENH Batch GeneralizingEstimator's estimator and scoring#13909
Draft
mathias-sm wants to merge 9 commits into
Draft
ENH Batch GeneralizingEstimator's estimator and scoring#13909mathias-sm wants to merge 9 commits into
GeneralizingEstimator's estimator and scoring#13909mathias-sm wants to merge 9 commits into
Conversation
`_gl_score` invoked the scorer's response method (`predict` / `predict_proba` / `decision_function`) `n_estimators * n_slices` times per fold. Stack X across slices and call the response method once per estimator, then apply the metric per slice on the resulting predictions. The batching saves on overhead and best utilises vectorized operations. Scorers without a recognised `_response_method` (e.g. `scoring=None` or custom callables) fall back to the original nested loop.
When the scorer is `accuracy_score` with default kwargs, 1d-output (but can be multi-class), and `response_method == "predict"`, replace the per-slice `accuracy_score(y, y_pred[:, jj])` calls with one numpy reduction per estimator: `(y_pred == y[:, None]).mean(axis=0)`. Other scorers, multi-output `y`, etc. keep nested-loop behavior.
Replace the `fast_accuracy` flag with a `batched_score` which is either a callable if we recognized the scorer, or otherwise set to `None`. The scoring loop then branches on `batched_score`: call `batched_score(y_pred)` if `batched_score` is set, otherwise fall back to the nested loop.
Adds `balanced_accuracy_score` to the `batched_score` dispatch by manually estimating accuracy per class and then averaging.
When `scoring=None`, sklearn wraps `estimator.score` in a scorer with no `_score_func` so previous code did not batch. But for stock classifiers, this is just `accuracy_score(y, predict(X))`: we now detect this and promote `scoring` to the named "accuracy" scorer which uses the existing batched path.
Add `roc_auc_score` to the `batched_score` dispatch via the Mann-Whitney U identity with average-rank tie correction (`scipy.stats.rankdata(..., method="average", axis=0)` ranks all slices at once). Binary classification only: multi-class, or non-default kwargs, revert to nested loops. The rank identity implemented when batching gives the same AUC as sklearn within floating point precision, but implements it with different operations. A bit-exact match would need a loop, defeating the batching.
for more information, see https://pre-commit.ci
Author
|
My plan for nice, self-contained commits thrown off by |
Member
|
We squash+merge in the end so don't worry about it! No need to run with |
GeneralizingEstimator's estimator and scoring [circle full]GeneralizingEstimator's estimator and scoring
Author
|
I tried to remove |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reference issue (if any)
First proposal to implement batching in
GeneralizingEstimatoras mentioned in #13906What does this implement/fix?
When possible, this turn the nested loop over time of
GeneralizingEstimatorinto one loop to fit, and then a single batched transform+score by reshaping. When not possible, this conservatively falls back to the original implementation.In implementing this, I realized that the biggest gains are from the
scoringpart rather than thetransformpart, and unfortunately scoring cannot be batched straightforwardly "from the outside". Therefore, commits 2-6 (see below for the overall logic behind the commits) succesively refactor the scoring loop and add batch scoring functions for some candidates: accuracy, balanced_accuracy and roc_auc. Anything that is not clearly identified ends up being scored by the original nested loops implementation.It's a bit more verbose than I was hopping for, sorry about this. This also forced me to learn more about
__qualname__and things likegetattr(scoring, "_score_func", None)than I was hopping for 😅Happy to take comments and iterate, I hope this is a good fit!
Additional information
I broke the problem into changes that I considered meaningful to review in isolation, with the idea that maybe only a subset would be considered useful to merge. The logic is as follows.
b2cbd649fBatch estimator and nothing else (the longest / most annoying to review)3cef4a52aIf score is accuracy, also batch that, if not rever to nested loops7a66f157cRefactor the scoring logic to be able to add batched scorer when desired32028955bAdds batched balanced_accuracy7a30425fcIdentifies whenscoring=Nonecan be interpreted as "accuracy" and batchedb23e72dfbAdds batched roc_auc (the only one that adds an import and touches tests)I also used a little script that compares the result and timing
_gl_scorefor each commit against the starting point: each step until 5 (included) give the same result exactly; 6 implements roc_auc differently from sklearn and isallclosebut not identical. My script is here and its output is here.Highlights: for
X.shapeof(100, 272, 200)(trials, sensors, timepoints), no cross-validation, on my laptop: accuracy goes from 7893.3ms to 182.1ms, balanced accuracy goes from 14055.4ms to 186.4ms, roc_auc from 18103.3ms to 282.7ms. To test unimplemented scoring functions I also measuredf1: it goes from 21571.0ms to 17009.8ms, the gain being from the batching of the estimator but not the scoring.Because of the roc_auc change, I made two changes to existing tests: in
mne/decoding/tests/test_search_light.pyI replaced twoassert_array_equalwithassert_allclose. I also timed all tests that useGeneralizingEstimatorand compared the duration against the starting point ; there are virtually no gains there (I presume because by design the tests are short and there is a lot of overhead), still e.g.test_verbose_arg[1-False]improved like 20%...Note on AI use
I broke the problem into the steps described above; then each step's first draft was obtained by prompting Opus 4.7 through Claude Code. I refined the proposal, added documentation or changed the code where things were not immediately obvious to me, etc. until I was satisfied. I similarly prompted it to generate the first draft of the script to do check each steps correctness and measure speed difference (not commited, reviewed and and updated to my needs as the project was going along).