Summary
Measure how much each model contributes to ensemble performance, following Kim et al. (2024).
Three methods:
- LOMO: remove each model, measure score difference
- Shapley: marginal contributions across all subsets
- PAR: replace each model with a baseline, measure score difference
API
```r
# Create a method object (per-method constructor)
method <- importance_lomo()
method <- importance_shapley()
method <- importance_par(baseline = "naive")

# Modular: three steps, same method object passed throughout
ensembles <- ensemble_permutations(forecast, method = method, ensemble_fn = mean_quantile)
ensemble_scores <- score(ensembles)
importance <- ensemble_importance(ensemble_scores, method = method, metric = get_metrics(forecast, select = "wis"))

# Convenience wrapper
importance <- ensemble_importance_score(forecast, method = method, metric = get_metrics(forecast, select = "wis"))
```
Per-method constructors (like geom_point() / geom_line() in ggplot2) return a method object containing both the permutation and importance functions.
The same object is passed to ensemble_permutations() and ensemble_importance(), so the method is defined once.
ensemble_permutations() also accepts a bare permutation function for use cases beyond importance.
How it works
- ensemble_permutations() creates ensemble variants as a new forecast object with informative model names (e.g. "ensemble_-_A" = A removed, "ensemble_A->R" = A replaced by R)
- score() scores the ensembles — any metric, any transform
- ensemble_importance() parses model names to reconstruct the mapping and compute importance diffs
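The name-parsing step can be sketched in base R. Here, parse_removed_models is a hypothetical helper, not part of the proposed API:

```r
# Hypothetical helper: recover the removed models from a variant name.
parse_removed_models <- function(name) {
  if (name == "ensemble") {
    return(character(0))  # full ensemble: nothing removed
  }
  # "ensemble_-_A_-_B" -> c("A", "B")
  strsplit(sub("^ensemble_-_", "", name), "_-_", fixed = TRUE)[[1]]
}

parse_removed_models("ensemble_-_A_-_B")  # -> c("A", "B")
```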
Note on memory
The output of ensemble_permutations() holds all ensemble variants in memory.
For Shapley with N models this is 2^N - 1 variants, each containing the full set of forecast rows.
This can be RAM-intensive for large forecast datasets.
The convenience wrapper could mitigate this by scoring and discarding variants incrementally, but the modular API necessarily materialises everything.
Motivation
Beyond ranking models by their individual scores, we want to measure how much each model contributes to ensemble performance.
This is inspired by baseball's Wins Above Replacement (WAR) and discussed in this community post.
Kim et al. (2024) formalise this as model importance metrics for forecast ensembles.
The key finding is that importance correlates with accuracy but reveals additional value: models that are not individually accurate but offer a unique perspective can still play an important role in an ensemble.
A model is rewarded if it is uncorrelated with others, and penalised if the other models are already correlated with each other.
Design detail
Method objects
Per-method constructors return a list containing both functions needed for the pipeline:
```r
importance_lomo <- function() {
  list(
    permutations = permutations_lomo,
    importance = compute_importance_lomo
  )
}

importance_par <- function(baseline = "naive") {
  list(
    permutations = function(data, ensemble_fn) {
      permutations_par(data, ensemble_fn, baseline = baseline)
    },
    importance = compute_importance_par
  )
}
```
Each constructor can accept method-specific arguments (e.g. baseline for PAR).
The method object is just a list — no S3 class overhead needed.
ensemble_permutations() also accepts a bare permutation function instead of a method object.
This supports using the permutation step for purposes beyond importance (e.g. ensemble diagnostics, subset analysis) without needing to define a full method object.
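A bare permutation function for subset analysis might enumerate all two-model subsets. The sketch below is runnable on its own; the subset-naming scheme is illustrative, not part of the proposed API:

```r
# Sketch: enumerate all two-model subsets with informative names.
pairwise_subsets <- function(models) {
  combos <- combn(models, 2, simplify = FALSE)
  names(combos) <- vapply(combos, paste, character(1), collapse = "+")
  combos
}

pairwise_subsets(c("A", "B", "C"))  # -> subsets A+B, A+C, B+C
```

Under the proposed API, a function like this could be passed directly as the method argument to ensemble_permutations().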
Step 1: ensemble_permutations()
Takes a forecast object and returns a new forecast object containing only the ensemble variants (not the original models).
Beyond importance, ensemble_permutations() is a generally useful building block — generating systematic ensemble variants from model subsets is valuable for ensemble diagnostics, subset sensitivity analysis, and studying how ensemble composition affects performance.
```r
ensembles <- ensemble_permutations(
  forecast,
  method = importance_lomo(),
  ensemble_fn = mean_quantile
)
```
- forecast: a forecast object (e.g. from as_forecast_quantile()).
- method: a method object (from a constructor like importance_lomo()) or a bare permutation function (for use cases beyond importance).
- ensemble_fn(forecast_subset) -> forecast: constructs an ensemble from a subset of models. Default is an equally weighted mean at each quantile level. User-supplied for flexibility.
Ensemble variants are named with informative names encoding the construction:
- Removal: "ensemble_-_A" means model A was dropped (LOMO, Shapley)
- Replacement: "ensemble_A->R" means model A was replaced by R (PAR)
- Multiple removals: "ensemble_-_A_-_B" (Shapley)
- Full ensemble: "ensemble"
The function validates that generated names do not collide with existing model names.
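The default ensemble_fn (an equally weighted mean at each quantile level) can be sketched with base R. Column names follow scoringutils quantile forecasts; for simplicity this sketch groups by quantile level only, whereas a real implementation would also group by the remaining forecast unit columns:

```r
# Toy quantile forecasts from two models.
fc <- data.frame(
  model          = rep(c("A", "B"), each = 2),
  quantile_level = rep(c(0.25, 0.75), times = 2),
  predicted      = c(10, 20, 14, 26)
)

# Equally weighted mean of predictions at each quantile level.
mean_quantile <- function(forecast_subset) {
  aggregate(predicted ~ quantile_level, data = forecast_subset, FUN = mean)
}

mean_quantile(fc)
#   quantile_level predicted
# 1           0.25        12
# 2           0.75        23
```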
Step 2: score() and optionally summarise_scores()
Standard scoringutils pipeline.
Any metric, any transform_forecasts(), any scoring perspective works.
This connects naturally to the aggregation-and-transformation framework (#1120).
No extra metadata columns are added, so there is nothing special to preserve through the pipeline.
Step 3: ensemble_importance()
```r
importance <- ensemble_importance(ensemble_scores, method = method, metric = get_metrics(forecast, select = "wis"))
```
method: the same method object passed to step 1.
metric: a named list of metric functions, as returned by get_metrics() (e.g. get_metrics(forecast, select = "wis")).
The importance function uses model names to identify variants and reconstruct the mapping.
Can sit downstream of either score() or summarise_scores().
Convenience wrapper
```r
importance <- ensemble_importance_score(
  forecast,
  method = importance_lomo(),
  ensemble_fn = mean_quantile,
  metric = get_metrics(forecast, select = "wis")
)
```
Takes a forecast object. Wraps all three steps.
Could score and discard variants incrementally to reduce memory usage.
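A minimal sketch of that incremental pattern, assuming hypothetical internals (score_incrementally, build_ensemble, and score_fn are stand-ins, not the proposed API):

```r
# Build, score, and discard one variant at a time instead of holding
# all 2^N - 1 variants in memory at once.
score_incrementally <- function(subsets, build_ensemble, score_fn) {
  rows <- lapply(names(subsets), function(nm) {
    variant <- build_ensemble(subsets[[nm]])  # only one variant alive here
    data.frame(model = nm, score = score_fn(variant))
  })
  do.call(rbind, rows)
}

# Toy stand-ins: the "ensemble" is the subset itself, the "score" its size.
subsets <- list("ensemble_-_A" = c("B", "C"), "ensemble_-_B" = c("A", "C"))
score_incrementally(subsets, identity, length)
```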
Model naming convention
The full ensemble is "ensemble".
Two syntaxes cover all methods:
- Removal: "ensemble_-_X" — model X was dropped (LOMO, Shapley)
- Replacement: "ensemble_X->R" — model X was replaced by R (PAR)
LOMO (3 models A, B, C)
Produces N+1 = 4 ensemble variants.
model | ...
--- | ---
"A" | (original)
"B" |
"C" |
"ensemble" | (full ensemble)
"ensemble_-_A" | (A removed)
"ensemble_-_B" | (B removed)
"ensemble_-_C" | (C removed)
Importance(A) = score(ensemble_-_A) - score(ensemble).
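With toy WIS values (lower is better; the numbers are made up for illustration), the LOMO computation reduces to a score difference per variant:

```r
# Toy scored output for the LOMO variants.
scores <- data.frame(
  model = c("ensemble", "ensemble_-_A", "ensemble_-_B", "ensemble_-_C"),
  wis   = c(10, 13, 9.5, 11)
)

full     <- scores$wis[scores$model == "ensemble"]
variants <- scores[scores$model != "ensemble", ]
data.frame(
  model      = sub("^ensemble_-_", "", variants$model),
  importance = variants$wis - full
)
# A has positive importance (removing it hurts the ensemble);
# B has negative importance (removing it actually helps).
```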
Shapley (3 models A, B, C)
Produces 2^N - 1 = 7 non-empty subset variants.
Only practical for ~10 models or fewer.
model | ...
--- | ---
"A" | (original)
"B" |
"C" |
"ensemble" | (full: {A,B,C})
"ensemble_-_A" | ({B,C})
"ensemble_-_B" | ({A,C})
"ensemble_-_C" | ({A,B})
"ensemble_-_A_-_B" | ({C})
"ensemble_-_A_-_C" | ({B})
"ensemble_-_B_-_C" | ({A})
For each model i, compute Shapley-weighted marginal contributions across all pairs (S, S∪{i}) with weights |S|!(n-|S|-1)!/n!.
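The weight term can be written directly; a quick check for n = 3 confirms that the weights over each model's marginal contributions sum to one:

```r
# Shapley weight for a subset S of size s, out of n models:
# w(s) = s! * (n - s - 1)! / n!
shapley_weight <- function(s, n) {
  factorial(s) * factorial(n - s - 1) / factorial(n)
}

n <- 3
w <- sapply(0:(n - 1), shapley_weight, n = n)  # = 1/3, 1/6, 1/3
# Each model i has choose(n - 1, s) subsets of size s not containing i,
# so the weights over all its marginal contributions sum to one:
sum(choose(n - 1, 0:(n - 1)) * w)  # sums to 1
```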
PAR (3 models A, B, C, replacement R)
Produces N+1 = 4 variants.
Uses replacement syntax (->) rather than removal syntax (-).
model | ...
--- | ---
"A" | (original)
"B" |
"C" |
"ensemble" | (full ensemble)
"ensemble_A->R" | (A replaced by R)
"ensemble_B->R" | (B replaced by R)
"ensemble_C->R" | (C replaced by R)
Importance(A) = score(ensemble_A->R) - score(ensemble).
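Parsing the replacement syntax is analogous to the removal case (parse_replaced is a hypothetical helper, not part of the proposed API):

```r
# Hypothetical helper: "ensemble_A->R" -> c(replaced = "A", by = "R").
parse_replaced <- function(name) {
  parts <- strsplit(sub("^ensemble_", "", name), "->", fixed = TRUE)[[1]]
  c(replaced = parts[1], by = parts[2])
}

parse_replaced("ensemble_A->R")  # -> c(replaced = "A", by = "R")
```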
Design notes
Why method objects
We explored several approaches to avoid passing method information separately to the upstream and downstream functions:
- Infer from model names: LOMO uses _-_ (removal) and PAR uses -> (replacement), so they are distinguishable. But inferring method from name parsing feels brittle.
- Column in the data (e.g. .importance_method): the downstream function still needs to know which importance calculation to use, so it is effectively a second method pass — just implicit.
- Attribute on the data: fragile in R, silently dropped by rbind(), merge(), and other common operations.
- Pass method twice as separate functions: works but verbose and easy to mismatch.
A method object from a per-method constructor bundles permutation and importance functions together.
The user defines the method once and passes the same object throughout.
Method-specific configuration (e.g. PAR baseline) lives in the constructor.
Why two naming syntaxes
Removal (_-_) and replacement (->) are the two fundamental operations across all methods.
LOMO and Shapley remove models; PAR replaces them.
Using distinct syntax for each operation makes names unambiguous and human-readable.
Connection to multivariate scoring
For multivariate importance, the ensemble function needs to produce joint samples (not just marginal quantiles) so that energy/variogram scores can be computed in step 2.
The importance computation in step 3 stays the same.
Scope
Start with:
- importance_lomo() method constructor
- ensemble_permutations() and ensemble_importance()
- ensemble_importance_score() convenience wrapper
- importance_par() method constructor

Later extensions:
- importance_shapley() method constructor
- Multivariate ensemble creation (joint samples)
- Memory-efficient incremental scoring in the convenience wrapper
Since ensemble variants are just models with distinct names, get_pairwise_comparisons() works on the scored output without any special handling.
Packaging options
This functionality could live in different places:
- Part of scoringutils: natural home given the tight integration with score() and forecast objects. Keeps the dependency graph simple.
- Standalone package depending on scoringutils: same modular API (ensemble_permutations() → score() → ensemble_importance()) plus the method constructors and convenience wrapper. The package would depend on scoringutils for score() and forecast objects but have its own release cycle.
- Collaborate with Kim et al.'s modelimportance package: an existing R package by our collaborators that implements LOMO and LASOMO (their name for all-subsets with equal or permutation-based/Shapley weights). It supports multiple metrics (WIS, log score, AE, RSE) selected per output type, and provides useful missing-forecast handling in model_importance_summary(). It is designed for the hubverse ecosystem (hubUtils, hubEnsembles, hubEvals; input is model_out_tbl), with ensemble building, scoring, and importance computed together in model_importance(). The main difference from our proposal is architectural: modelimportance wraps the full pipeline internally, whereas we propose separating the steps so that score() sits in the middle and any metric, transform, or scoring perspective can be applied. There may be an opportunity to collaborate — either by having modelimportance use our modular functions under the hood, or by providing a shared core that both packages build on.
No decision needed now — worth considering once the API design is settled.
This was opened by a bot. Please ping @seabbs for any questions.