Model importance metrics: LOMO, Shapley values, and PAR for ensemble contribution #1121

@seabbs-bot

Description

Summary

Measure how much each model contributes to ensemble performance, following Kim et al. (2024).

Three methods:

  • LOMO: remove each model, measure score difference
  • Shapley: marginal contributions across all subsets
  • PAR: replace each model with a baseline, measure score difference

API

# Create a method object (per-method constructor)
method <- importance_lomo()
method <- importance_shapley()
method <- importance_par(baseline = "naive")

# Modular: three steps, same method object passed throughout
ensembles <- ensemble_permutations(forecast, method = method, ensemble_fn = mean_quantile)
ensemble_scores <- score(ensembles)
importance <- ensemble_importance(ensemble_scores, method = method, metric = get_metrics(forecast, select = "wis"))

# Convenience wrapper
importance <- ensemble_importance_score(forecast, method = method, metric = get_metrics(forecast, select = "wis"))

Per-method constructors (like geom_point() / geom_line() in ggplot2) return a method object containing both the permutation and importance functions.
The same object is passed to ensemble_permutations() and ensemble_importance(), so the method is defined once.
ensemble_permutations() also accepts a bare permutation function for use cases beyond importance.

How it works

  1. ensemble_permutations() creates ensemble variants as a new forecast object with informative model names (e.g. "ensemble_-_A" = A removed, "ensemble_A->R" = A replaced by R)
  2. score() scores the ensembles — any metric, any transform
  3. ensemble_importance() parses model names to reconstruct the mapping and compute importance diffs

Note on memory

The output of ensemble_permutations() holds all ensemble variants in memory.
For Shapley with N models this is 2^N - 1 variants (e.g. over a million variants for N = 20), each containing the full set of forecast rows.
This can be RAM-intensive for large forecast datasets.
The convenience wrapper could mitigate this by scoring and discarding variants incrementally, but the modular API necessarily materialises everything.


Motivation

Beyond ranking models by their individual scores, we want to measure how much each model contributes to ensemble performance.
This is inspired by baseball's Wins Above Replacement (WAR) and discussed in this community post.

Kim et al. (2024) formalise this as model importance metrics for forecast ensembles.
The key finding is that importance correlates with accuracy but reveals additional value: models that are not individually accurate but offer a unique perspective can still play an important role in an ensemble.
A model is rewarded if it is uncorrelated with others, and penalised if the other models are already correlated with each other.

Design detail

Method objects

Per-method constructors return a list containing both functions needed for the pipeline:

importance_lomo <- function() {
  list(
    permutations = permutations_lomo,
    importance = compute_importance_lomo
  )
}

importance_par <- function(baseline = "naive") {
  list(
    permutations = function(data, ensemble_fn) {
      permutations_par(data, ensemble_fn, baseline = baseline)
    },
    importance = compute_importance_par
  )
}

Each constructor can accept method-specific arguments (e.g. baseline for PAR).
The method object is just a list — no S3 class overhead needed.

ensemble_permutations() also accepts a bare permutation function instead of a method object.
This supports using the permutation step for purposes beyond importance (e.g. ensemble diagnostics, subset analysis) without needing to define a full method object.
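As an illustration of the bare-function interface, here is a hypothetical permutation function (name and internals are assumptions, not part of the proposal) that builds ensembles from all pairs of models for subset analysis:

```r
# Hypothetical example: a bare permutation function that builds an ensemble
# for every pair of models, useful for subset sensitivity analysis rather
# than importance. Assumes `forecast` behaves like a data frame with a
# `model` column and that `ensemble_fn` combines a subset of models into
# a single ensemble forecast.
pairwise_permutations <- function(forecast, ensemble_fn) {
  models <- unique(forecast$model)
  pairs <- combn(models, 2, simplify = FALSE)
  variants <- lapply(pairs, function(pair) {
    subset <- forecast[forecast$model %in% pair, ]
    ens <- ensemble_fn(subset)
    ens$model <- paste0("ensemble_", paste(pair, collapse = "_"))
    ens
  })
  do.call(rbind, variants)
}

# ensembles <- ensemble_permutations(
#   forecast, method = pairwise_permutations, ensemble_fn = mean_quantile
# )
```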

Step 1: ensemble_permutations()

Takes a forecast object and returns a new forecast object containing only the ensemble variants (not the original models).
Beyond importance, ensemble_permutations() is a generally useful building block — generating systematic ensemble variants from model subsets is valuable for ensemble diagnostics, subset sensitivity analysis, and studying how ensemble composition affects performance.

ensembles <- ensemble_permutations(
  forecast,
  method = importance_lomo(),
  ensemble_fn = mean_quantile
)
  • forecast: a forecast object (e.g. from as_forecast_quantile()).
  • method: a method object (from a constructor like importance_lomo()) or a bare permutation function (for use cases beyond importance).
  • ensemble_fn(forecast_subset) -> forecast: constructs an ensemble from a subset of models. Default is equally weighted mean at each quantile level. User-supplied for flexibility.
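A minimal sketch of what the default ensemble function could look like (the implementation details are assumptions): an equally weighted mean of predicted values at each quantile level, grouped by the forecast unit.

```r
library(data.table)

# Sketch of a default ensemble function for quantile-format forecasts:
# average the predicted value across models at each quantile level,
# within each forecast unit. Column names (`model`, `predicted`) follow
# scoringutils conventions; everything else is treated as the unit.
mean_quantile <- function(forecast_subset) {
  dt <- as.data.table(forecast_subset)
  unit_cols <- setdiff(names(dt), c("model", "predicted"))
  ens <- dt[, .(predicted = mean(predicted)), by = unit_cols]
  ens[, model := "ensemble"]
  ens[]
}
```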

Ensemble variants are named with informative names encoding the construction:

  • Removal: "ensemble_-_A" means model A was dropped (LOMO, Shapley)
  • Replacement: "ensemble_A->R" means model A was replaced by R (PAR)
  • Multiple removals: "ensemble_-_A_-_B" (Shapley)
  • Full ensemble: "ensemble"

The function validates that generated names do not collide with existing model names.

Step 2: score() and optionally summarise_scores()

Standard scoringutils pipeline.
Any metric, any transform_forecasts(), any scoring perspective works.
This connects naturally to the aggregation-and-transformation framework (#1120).

No extra metadata columns are added, so there is nothing special to preserve through the pipeline.
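For example, scoring the ensemble variants on the log scale before computing importance fits straight into the existing pipeline (a sketch using standard scoringutils functions):

```r
# Step 2 is the ordinary scoringutils pipeline: any transform and any
# metric can be applied to the ensemble variants before step 3.
ensemble_scores <- ensembles |>
  transform_forecasts(fun = log_shift, offset = 1) |>
  score() |>
  summarise_scores(by = c("model", "scale"))
```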

Step 3: ensemble_importance()

importance <- ensemble_importance(ensemble_scores, method = method, metric = get_metrics(forecast, select = "wis"))
  • method: the same method object passed to step 1.
  • metric: a named list of metric functions, as returned by get_metrics() (e.g. get_metrics(forecast, select = "wis")).

The importance function uses model names to identify variants and reconstruct the mapping.
Can sit downstream of either score() or summarise_scores().
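A sketch of how the name parsing might work internally (an assumption about the implementation, not a committed API):

```r
# Recover the variant mapping from a model name using the two syntaxes:
# "_-_" marks removals, "->" marks replacements.
parse_variant <- function(name) {
  if (name == "ensemble") {
    list(type = "full")
  } else if (grepl("->", name, fixed = TRUE)) {
    parts <- strsplit(sub("^ensemble_", "", name), "->", fixed = TRUE)[[1]]
    list(type = "replacement", model = parts[1], replacement = parts[2])
  } else {
    removed <- strsplit(
      sub("^ensemble_-_", "", name), "_-_", fixed = TRUE
    )[[1]]
    list(type = "removal", removed = removed)
  }
}

parse_variant("ensemble_-_A_-_B")  # removal of A and B
parse_variant("ensemble_A->R")     # A replaced by R
```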

Convenience wrapper

importance <- ensemble_importance_score(
  forecast,
  method = importance_lomo(),
  ensemble_fn = mean_quantile,
  metric = get_metrics(forecast, select = "wis")
)

Takes a forecast object. Wraps all three steps.
Could score and discard variants incrementally to reduce memory usage.
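One possible shape for that memory-saving strategy (a sketch, not the implementation; `build_variant()` and `variant_specs` are hypothetical):

```r
# Generate, score, and discard one ensemble variant at a time, keeping
# only the summarised scores. This avoids materialising all 2^N - 1
# Shapley variants at once.
incremental_scores <- function(forecast, variant_specs, ensemble_fn) {
  scores <- lapply(variant_specs, function(spec) {
    variant <- build_variant(forecast, spec, ensemble_fn)  # hypothetical
    s <- summarise_scores(score(variant), by = "model")
    rm(variant)
    s
  })
  do.call(rbind, scores)
}
```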

Model naming convention

The full ensemble is "ensemble".
Two syntaxes cover all methods:

  • Removal: "ensemble_-_X" — model X was dropped (LOMO, Shapley)
  • Replacement: "ensemble_X->R" — model X was replaced by R (PAR)

LOMO (3 models A, B, C)

Produces N+1 = 4 ensemble variants.

model          | ...
"A"            | (original)
"B"            |
"C"            |
"ensemble"     | (full ensemble)
"ensemble_-_A" | (A removed)
"ensemble_-_B" | (B removed)
"ensemble_-_C" | (C removed)

Importance(A) = score(ensemble_-_A) - score(ensemble).
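Given summarised scores, the LOMO differences are a simple subtraction (a sketch assuming a data frame with `model` and `wis` columns; since WIS is negatively oriented, positive importance means removing the model worsens the ensemble):

```r
# LOMO importance: score of the ensemble without model i minus the score
# of the full ensemble.
full <- scores$wis[scores$model == "ensemble"]
lomo <- scores[grepl("^ensemble_-_", scores$model), ]
data.frame(
  model = sub("^ensemble_-_", "", lomo$model),
  importance = lomo$wis - full
)
```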

Shapley (3 models A, B, C)

Produces 2^N - 1 = 7 non-empty subset variants.
Only practical for ~10 models or fewer.

model               | ...
"A"                 | (original)
"B"                 |
"C"                 |
"ensemble"          | (full: {A,B,C})
"ensemble_-_A"      | ({B,C})
"ensemble_-_B"      | ({A,C})
"ensemble_-_C"      | ({A,B})
"ensemble_-_A_-_B"  | ({C})
"ensemble_-_A_-_C"  | ({B})
"ensemble_-_B_-_C"  | ({A})

For each model i, compute Shapley-weighted marginal contributions across all pairs (S, S∪{i}) with weights |S|!(n-|S|-1)!/n!.
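The weighting can be sketched directly from the formula (assumptions: `score_of` is a hypothetical lookup from a model subset to its ensemble score, and a convention is needed for the empty subset, e.g. a baseline score, since only non-empty subsets are materialised):

```r
# Shapley weight for a coalition of size s out of n models:
# w(s) = s! * (n - s - 1)! / n!
shapley_weight <- function(s, n) {
  factorial(s) * factorial(n - s - 1) / factorial(n)
}

# Importance of model i: weighted sum of marginal contributions
# score_of(S u {i}) - score_of(S) over all subsets S of the other models.
shapley_importance <- function(i, models, score_of) {
  others <- setdiff(models, i)
  n <- length(models)
  total <- 0
  for (s in seq(0, length(others))) {
    subsets <- if (s == 0) list(character(0)) else
      combn(others, s, simplify = FALSE)
    for (S in subsets) {
      total <- total + shapley_weight(s, n) *
        (score_of(c(S, i)) - score_of(S))  # score_of(∅) needs a convention
    }
  }
  total
}
```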

PAR (3 models A, B, C, replacement R)

Produces N+1 = 4 variants.
Uses replacement syntax (->) rather than removal syntax (_-_).

model            | ...
"A"              | (original)
"B"              |
"C"              |
"ensemble"       | (full ensemble)
"ensemble_A->R"  | (A replaced by R)
"ensemble_B->R"  | (B replaced by R)
"ensemble_C->R"  | (C replaced by R)

Importance(A) = score(ensemble_A->R) - score(ensemble).

Design notes

Why method objects

We explored several approaches to avoid passing method information separately to the upstream and downstream functions:

  • Infer from model names: LOMO uses _-_ (removal) and PAR uses -> (replacement), so they are distinguishable. But inferring method from name parsing feels brittle.
  • Column in the data (e.g. .importance_method): the downstream function still needs to know which importance calculation to use, so it is effectively a second method pass — just implicit.
  • Attribute on the data: fragile in R, silently dropped by rbind(), merge(), and other common operations.
  • Pass method twice as separate functions: works but verbose and easy to mismatch.

A method object from a per-method constructor bundles permutation and importance functions together.
The user defines the method once and passes the same object throughout.
Method-specific configuration (e.g. PAR baseline) lives in the constructor.

Why two naming syntaxes

Removal (_-_) and replacement (->) are the two fundamental operations across all methods.
LOMO and Shapley remove models; PAR replaces them.
Using distinct syntax for each operation makes names unambiguous and human-readable.

Connection to multivariate scoring

For multivariate importance, the ensemble function needs to produce joint samples (not just marginal quantiles) so that energy/variogram scores can be computed in step 2.
The importance computation in step 3 stays the same.

Scope

Start with:

  1. importance_lomo() method constructor
  2. ensemble_permutations() and ensemble_importance()
  3. ensemble_importance_score() convenience wrapper
  4. importance_par() method constructor

Later extensions:

  • importance_shapley() method constructor
  • Multivariate ensemble creation (joint samples)
  • Memory-efficient incremental scoring in the convenience wrapper

Since ensemble variants are just models with distinct names, get_pairwise_comparisons() works on the scored output without any special handling.
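For example, comparing every variant against the full ensemble works out of the box:

```r
# Ensemble variants are ordinary models, so pairwise comparisons need no
# special handling; the full ensemble serves as the baseline.
pairwise <- get_pairwise_comparisons(
  ensemble_scores,
  baseline = "ensemble"
)
```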

Packaging options

This functionality could live in different places:

  • Part of scoringutils: natural home given the tight integration with score() and forecast objects. Keeps the dependency graph simple.
  • Standalone package depending on scoringutils: same modular API (ensemble_permutations() -> score() -> ensemble_importance()) plus the method constructors and convenience wrapper. The package would depend on scoringutils for score() and forecast objects but have its own release cycle.
  • Collaborate with Kim et al.'s modelimportance package: an existing R package by our collaborators that implements LOMO and LASOMO (their name for all-subsets with equal or permutation-based/Shapley weights). It supports multiple metrics (WIS, log score, AE, RSE) selected per output type, and provides useful missing-forecast handling in model_importance_summary(). It is designed for the hubverse ecosystem (hubUtils, hubEnsembles, hubEvals; input is model_out_tbl), with ensemble building, scoring, and importance computed together in model_importance(). The main difference from our proposal is architectural: modelimportance wraps the full pipeline internally, whereas we propose separating the steps so that score() sits in the middle and any metric, transform, or scoring perspective can be applied. There may be an opportunity to collaborate — either by having modelimportance use our modular functions under the hood, or by providing a shared core that both packages build on.

No decision needed now — worth considering once the API design is settled.

This was opened by a bot. Please ping @seabbs for any questions.
