Model importance metrics: LOMO, Shapley values, and PAR for ensemble contribution #1121

@seabbs-bot

Description

Summary

Measure how much each model contributes to ensemble performance, following Kim et al. (2024).

Three methods:

  • LOMO: remove each model, measure score difference
  • Shapley: marginal contributions across all subsets
  • PAR: replace each model with a baseline, measure score difference

API

# Create a method object (per-method constructor)
method <- importance_lomo()
method <- importance_shapley()
method <- importance_par(baseline = "naive")

# Modular: three steps, same method object passed throughout
ensembles <- ensemble_permutations(forecast, method = method, ensemble_fn = mean_quantile)
ensemble_scores <- score(ensembles)
importance <- ensemble_importance(ensemble_scores, method = method, metric = get_metrics(forecast, select = "wis"))

# Convenience wrapper
importance <- ensemble_importance_score(forecast, method = method, metric = get_metrics(forecast, select = "wis"))

Per-method constructors (like geom_point() / geom_line() in ggplot2) return a method object containing both the permutation and importance functions.
The same object is passed to ensemble_permutations() and ensemble_importance(), so the method is defined once.
ensemble_permutations() also accepts a bare permutation function for use cases beyond importance.

How it works

  1. ensemble_permutations() creates ensemble variants as a new forecast object with informative model names (e.g. "ensemble_-_A" = A removed, "ensemble_A->R" = A replaced by R)
  2. score() scores the ensembles — any metric, any transform
  3. ensemble_importance() parses model names to reconstruct the mapping and compute importance diffs

Note on memory

The output of ensemble_permutations() holds all ensemble variants in memory.
For Shapley with N models this is 2^N - 1 variants (e.g. over a million variants for N = 20), each containing the full set of forecast rows.
This can be RAM-intensive for large forecast datasets.
The convenience wrapper could mitigate this by scoring and discarding variants incrementally, but the modular API necessarily materialises everything.


Motivation

Beyond ranking models by their individual scores, we want to measure how much each model contributes to ensemble performance.
This is inspired by baseball's Wins Above Replacement (WAR) and discussed in this community post.

Kim et al. (2024) formalise this as model importance metrics for forecast ensembles.
The key finding is that importance correlates with accuracy but reveals additional value: models that are not individually accurate but offer a unique perspective can still play an important role in an ensemble.
A model is rewarded if it is uncorrelated with others, and penalised if the other models are already correlated with each other.

Design detail

Method objects

Per-method constructors return a list containing both functions needed for the pipeline:

importance_lomo <- function() {
  list(
    permutations = permutations_lomo,
    importance = compute_importance_lomo
  )
}

importance_par <- function(baseline = "naive") {
  list(
    permutations = function(data, ensemble_fn) {
      permutations_par(data, ensemble_fn, baseline = baseline)
    },
    importance = compute_importance_par
  )
}

Each constructor can accept method-specific arguments (e.g. baseline for PAR).
The method object is just a list — no S3 class overhead needed.

ensemble_permutations() also accepts a bare permutation function instead of a method object.
This supports using the permutation step for purposes beyond importance (e.g. ensemble diagnostics, subset analysis) without needing to define a full method object.
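As an illustration of the bare-function interface, here is a hypothetical permutation function (name and internals are assumptions, not part of the proposal) that builds ensembles from all pairs of models for subset analysis:

```r
# Hypothetical example: a bare permutation function that builds an ensemble
# for every pair of models, useful for subset sensitivity analysis rather
# than importance. Assumes `forecast` behaves like a data frame with a
# `model` column and that `ensemble_fn` combines a subset of models into
# a single ensemble forecast.
pairwise_permutations <- function(forecast, ensemble_fn) {
  models <- unique(forecast$model)
  pairs <- combn(models, 2, simplify = FALSE)
  variants <- lapply(pairs, function(pair) {
    subset <- forecast[forecast$model %in% pair, ]
    ens <- ensemble_fn(subset)
    ens$model <- paste0("ensemble_", paste(pair, collapse = "_"))
    ens
  })
  do.call(rbind, variants)
}

# ensembles <- ensemble_permutations(
#   forecast, method = pairwise_permutations, ensemble_fn = mean_quantile
# )
```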

Step 1: ensemble_permutations()

Takes a forecast object and returns a new forecast object containing only the ensemble variants (not the original models).
Beyond importance, ensemble_permutations() is a generally useful building block — generating systematic ensemble variants from model subsets is valuable for ensemble diagnostics, subset sensitivity analysis, and studying how ensemble composition affects performance.

ensembles <- ensemble_permutations(
  forecast,
  method = importance_lomo(),
  ensemble_fn = mean_quantile
)
  • forecast: a forecast object (e.g. from as_forecast_quantile()).
  • method: a method object (from a constructor like importance_lomo()) or a bare permutation function (for use cases beyond importance).
  • ensemble_fn(forecast_subset) -> forecast: constructs an ensemble from a subset of models. Default is equally weighted mean at each quantile level. User-supplied for flexibility.
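A minimal sketch of what the default ensemble function could look like (the implementation details are assumptions): an equally weighted mean of predicted values at each quantile level, grouped by the forecast unit.

```r
library(data.table)

# Sketch of a default ensemble function for quantile-format forecasts:
# average the predicted value across models at each quantile level,
# within each forecast unit. Column names (`model`, `predicted`) follow
# scoringutils conventions; everything else is treated as the unit.
mean_quantile <- function(forecast_subset) {
  dt <- as.data.table(forecast_subset)
  unit_cols <- setdiff(names(dt), c("model", "predicted"))
  ens <- dt[, .(predicted = mean(predicted)), by = unit_cols]
  ens[, model := "ensemble"]
  ens[]
}
```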

Ensemble variants are named with informative names encoding the construction:

  • Removal: "ensemble_-_A" means model A was dropped (LOMO, Shapley)
  • Replacement: "ensemble_A->R" means model A was replaced by R (PAR)
  • Multiple removals: "ensemble_-_A_-_B" (Shapley)
  • Full ensemble: "ensemble"

The function validates that generated names do not collide with existing model names.

Step 2: score() and optionally summarise_scores()

Standard scoringutils pipeline.
Any metric, any transform_forecasts(), any scoring perspective works.
This connects naturally to the aggregation-and-transformation framework (#1120).

No extra metadata columns are added, so there is nothing special to preserve through the pipeline.
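For example, scoring the ensemble variants on the log scale before computing importance fits straight into the existing pipeline (a sketch using standard scoringutils functions):

```r
# Step 2 is the ordinary scoringutils pipeline: any transform and any
# metric can be applied to the ensemble variants before step 3.
ensemble_scores <- ensembles |>
  transform_forecasts(fun = log_shift, offset = 1) |>
  score() |>
  summarise_scores(by = c("model", "scale"))
```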

Step 3: ensemble_importance()

importance <- ensemble_importance(ensemble_scores, method = method, metric = get_metrics(forecast, select = "wis"))
  • method: the same method object passed to step 1.
  • metric: a named list of metric functions, as returned by get_metrics() (e.g. get_metrics(forecast, select = "wis")).

The importance function uses model names to identify variants and reconstruct the mapping.
Can sit downstream of either score() or summarise_scores().
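A sketch of how the name parsing might work internally (an assumption about the implementation, not a committed API):

```r
# Recover the variant mapping from a model name using the two syntaxes:
# "_-_" marks removals, "->" marks replacements.
parse_variant <- function(name) {
  if (name == "ensemble") {
    list(type = "full")
  } else if (grepl("->", name, fixed = TRUE)) {
    parts <- strsplit(sub("^ensemble_", "", name), "->", fixed = TRUE)[[1]]
    list(type = "replacement", model = parts[1], replacement = parts[2])
  } else {
    removed <- strsplit(
      sub("^ensemble_-_", "", name), "_-_", fixed = TRUE
    )[[1]]
    list(type = "removal", removed = removed)
  }
}

parse_variant("ensemble_-_A_-_B")  # removal of A and B
parse_variant("ensemble_A->R")     # A replaced by R
```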

Convenience wrapper

importance <- ensemble_importance_score(
  forecast,
  method = importance_lomo(),
  ensemble_fn = mean_quantile,
  metric = get_metrics(forecast, select = "wis")
)

Takes a forecast object. Wraps all three steps.
Could score and discard variants incrementally to reduce memory usage.
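One possible shape for that memory-saving strategy (a sketch, not the implementation; `build_variant()` and `variant_specs` are hypothetical):

```r
# Generate, score, and discard one ensemble variant at a time, keeping
# only the summarised scores. This avoids materialising all 2^N - 1
# Shapley variants at once.
incremental_scores <- function(forecast, variant_specs, ensemble_fn) {
  scores <- lapply(variant_specs, function(spec) {
    variant <- build_variant(forecast, spec, ensemble_fn)  # hypothetical
    s <- summarise_scores(score(variant), by = "model")
    rm(variant)
    s
  })
  do.call(rbind, scores)
}
```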

Model naming convention

The full ensemble is "ensemble".
Two syntaxes cover all methods:

  • Removal: "ensemble_-_X" — model X was dropped (LOMO, Shapley)
  • Replacement: "ensemble_X->R" — model X was replaced by R (PAR)

LOMO (3 models A, B, C)

Produces N+1 = 4 ensemble variants.

model          | ...
"A"            | (original)
"B"            |
"C"            |
"ensemble"     | (full ensemble)
"ensemble_-_A" | (A removed)
"ensemble_-_B" | (B removed)
"ensemble_-_C" | (C removed)

Importance(A) = score(ensemble_-_A) - score(ensemble).
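Given summarised scores, the LOMO differences are a simple subtraction (a sketch assuming a data frame with `model` and `wis` columns; since WIS is negatively oriented, positive importance means removing the model worsens the ensemble):

```r
# LOMO importance: score of the ensemble without model i minus the score
# of the full ensemble.
full <- scores$wis[scores$model == "ensemble"]
lomo <- scores[grepl("^ensemble_-_", scores$model), ]
data.frame(
  model = sub("^ensemble_-_", "", lomo$model),
  importance = lomo$wis - full
)
```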

Shapley (3 models A, B, C)

Produces 2^N - 1 = 7 non-empty subset variants.
Only practical for ~10 models or fewer.

model               | ...
"A"                 | (original)
"B"                 |
"C"                 |
"ensemble"          | (full: {A,B,C})
"ensemble_-_A"      | ({B,C})
"ensemble_-_B"      | ({A,C})
"ensemble_-_C"      | ({A,B})
"ensemble_-_A_-_B"  | ({C})
"ensemble_-_A_-_C"  | ({B})
"ensemble_-_B_-_C"  | ({A})

For each model i, compute Shapley-weighted marginal contributions across all pairs (S, S∪{i}) with weights |S|!(n-|S|-1)!/n!.
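The weighting can be sketched directly from the formula (assumptions: `score_of` is a hypothetical lookup from a model subset to its ensemble score, and a convention is needed for the empty subset, e.g. a baseline score, since only non-empty subsets are materialised):

```r
# Shapley weight for a coalition of size s out of n models:
# w(s) = s! * (n - s - 1)! / n!
shapley_weight <- function(s, n) {
  factorial(s) * factorial(n - s - 1) / factorial(n)
}

# Importance of model i: weighted sum of marginal contributions
# score_of(S u {i}) - score_of(S) over all subsets S of the other models.
shapley_importance <- function(i, models, score_of) {
  others <- setdiff(models, i)
  n <- length(models)
  total <- 0
  for (s in seq(0, length(others))) {
    subsets <- if (s == 0) list(character(0)) else
      combn(others, s, simplify = FALSE)
    for (S in subsets) {
      total <- total + shapley_weight(s, n) *
        (score_of(c(S, i)) - score_of(S))  # score_of(∅) needs a convention
    }
  }
  total
}
```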

PAR (3 models A, B, C, replacement R)

Produces N+1 = 4 variants.
Uses replacement syntax (->) rather than removal syntax (_-_).

model            | ...
"A"              | (original)
"B"              |
"C"              |
"ensemble"       | (full ensemble)
"ensemble_A->R"  | (A replaced by R)
"ensemble_B->R"  | (B replaced by R)
"ensemble_C->R"  | (C replaced by R)

Importance(A) = score(ensemble_A->R) - score(ensemble).

Design notes

Why method objects

We explored several approaches to avoid passing method information separately to the upstream and downstream functions:

  • Infer from model names: LOMO uses _-_ (removal) and PAR uses -> (replacement), so they are distinguishable. But inferring method from name parsing feels brittle.
  • Column in the data (e.g. .importance_method): the downstream function still needs to know which importance calculation to use, so it is effectively a second method pass — just implicit.
  • Attribute on the data: fragile in R, silently dropped by rbind(), merge(), and other common operations.
  • Pass method twice as separate functions: works but verbose and easy to mismatch.

A method object from a per-method constructor bundles permutation and importance functions together.
The user defines the method once and passes the same object throughout.
Method-specific configuration (e.g. PAR baseline) lives in the constructor.

Why two naming syntaxes

Removal (_-_) and replacement (->) are the two fundamental operations across all methods.
LOMO and Shapley remove models; PAR replaces them.
Using distinct syntax for each operation makes names unambiguous and human-readable.

Connection to multivariate scoring

For multivariate importance, the ensemble function needs to produce joint samples (not just marginal quantiles) so that energy/variogram scores can be computed in step 2.
The importance computation in step 3 stays the same.

Scope

Start with:

  1. importance_lomo() method constructor
  2. ensemble_permutations() and ensemble_importance()
  3. ensemble_importance_score() convenience wrapper
  4. importance_par() method constructor

Later extensions:

  • importance_shapley() method constructor
  • Multivariate ensemble creation (joint samples)
  • Memory-efficient incremental scoring in the convenience wrapper

Since ensemble variants are just models with distinct names, get_pairwise_comparisons() works on the scored output without any special handling.
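For example, comparing every variant against the full ensemble works out of the box:

```r
# Ensemble variants are ordinary models, so pairwise comparisons need no
# special handling; the full ensemble serves as the baseline.
pairwise <- get_pairwise_comparisons(
  ensemble_scores,
  baseline = "ensemble"
)
```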

Packaging options

This functionality could live in different places:

  • Part of scoringutils: natural home given the tight integration with score() and forecast objects. Keeps the dependency graph simple.
  • Standalone package depending on scoringutils: same modular API (ensemble_permutations() -> score() -> ensemble_importance()) plus the method constructors and convenience wrapper. The package would depend on scoringutils for score() and forecast objects but have its own release cycle.
  • Collaborate with Kim et al.'s modelimportance package: an existing R package by our collaborators that implements LOMO and LASOMO (their name for all-subsets with equal or permutation-based/Shapley weights). It supports multiple metrics (WIS, log score, AE, RSE) selected per output type, and provides useful missing-forecast handling in model_importance_summary(). It is designed for the hubverse ecosystem (hubUtils, hubEnsembles, hubEvals; input is model_out_tbl), with ensemble building, scoring, and importance computed together in model_importance(). The main difference from our proposal is architectural: modelimportance wraps the full pipeline internally, whereas we propose separating the steps so that score() sits in the middle and any metric, transform, or scoring perspective can be applied. There may be an opportunity to collaborate — either by having modelimportance use our modular functions under the hood, or by providing a shared core that both packages build on.

No decision needed now — worth considering once the API design is settled.

This was opened by a bot. Please ping @seabbs for any questions.
