Explainability and Clinical Trust of Deep Learning in Glioblastoma Treatment Efficacy Prediction: A Comprehensive Framework
A comprehensive framework for auditing and interpreting deep learning vision models in neuro-oncology.
Romain Andres
- Medical Physics Department, Centre François Baclesse, Caen, France
- Artificial Intelligence Department, Centre François Baclesse, Caen, France
- Université de Caen Normandie, CNRS, Normandie Univ, ISTCT UMR6030, CYCERON, Caen, France
- Radiation Oncology Department, Centre François Baclesse, Caen, France
- Radiology Department, Centre François Baclesse, Caen, France
- UMR GREYC, Normandie Univ, UNICAEN, ENSICAEN, CNRS, Caen, France
Corresponding Author: Dr. Aurélien Corroyer-Dulmont (a.corroyer-dulmont@baclesse.unicancer.fr)
Background: Glioblastoma (GBM) treatment response is highly heterogeneous. Predicting outcomes from pre-treatment MRI is a critical unmet need, yet the clinical adoption of “black box” deep learning (DL) models is hindered by a lack of transparency and trust.
Methods: A multi-layered eXplainable AI (XAI) pipeline was employed to dissect a deep learning model (ResNet-51q) trained to predict GBM treatment response. We combined layer-wise visual quantification, feature ablation, mechanistic linear probing, and causal activation patching. Finally, the clinical impact was assessed in a multi-reader, multi-case study.
Results: An unexpected, two-stage reasoning process was uncovered. The model first utilises the skull/surgical site as a low-level anatomical anchor (intensity heuristic) before attending to the contrast-enhancing tumour in intermediate layers. In the clinical study, AI assistance significantly improved expert accuracy (OR 1.598).
The model employs a dichotomous strategy. It focuses on the tumour (GTV) for poor prognosis predictions (Non-Responders) but relies on the skull/surgical site for good prognosis (Responders).
Dynamic Layer-wise Evolution (GIFs): The animations below visualize how Grad-CAM heatmaps evolve as the image propagates through the network layers. Note the distinct spatial shifts for the two classes.
- Non-Responder (True Negative): focus shifts from the skull to the tumour.
- Responder (True Positive): focus remains anchored on the skull/surgical site.
Below is a static comparison of the final layer's focus using three complementary XAI methods (Grad-CAM, LRP, LIME) for representative cases.
Qualitative visualisation of the model’s dichotomous reasoning strategy.
Columns display the original T1-Gd MRI, followed by overlays of Grad-CAM (class-discriminative region), Layer-wise Relevance Propagation (LRP, pixel-wise detail), and LIME (perturbation-based features). Rows illustrate representative cases for each prediction outcome:
True Negative (TN): The model correctly predicts ‘Non-Responder’ by focusing on the Gross Tumour Volume (GTV). LRP and LIME refine this focus, pinpointing the high-intensity contrast-enhancing rim.
False Negative (FN): The model similarly directs its focus to the tumoral enhancement, leading to a negative prediction. This supports the hypothesis that in the absence of a detected surgical marker, the model defaults to a tumour-focused, poor-prognosis classification.
True Positive (TP): The model correctly predicts ‘Responder’. Focus shifts distinctively towards the periphery, with LRP and Grad-CAM anchoring on the skull and specifically identifying the surgical site.
False Positive (FP): The model incorrectly predicts ‘Responder’ by attending to skull features, likely misinterpreting them as the high-intensity surgical heuristic required for a positive classification.
Before quantifying the anatomical focus, we validated the reliability of our XAI methods using standard metrics (Fidelity, Robustness, Sparsity) and checked for intensity bias.
Metric Performance by Region (Radar Charts): The charts below assess the quality of Grad-CAM, LRP, and LIME across five anatomical regions. A larger area indicates better trade-offs between fidelity and robustness.
Regions assessed: GTV, Enhancement, Necrosis, Skull, Other Brain.
Intensity Correlation Check: We computed the correlation between pixel intensity (MRI signal) and attribution scores. High correlations (specifically for LIME and LRP) confirm that the model relies heavily on high-intensity signals (Intensity Heuristic), distinct from random noise.
Figure 2: Pearson correlation between input MRI intensity and attribution scores for each method.
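This check reduces to a simple per-image computation; a minimal sketch (function name illustrative):

```python
import numpy as np

def intensity_attribution_correlation(image: np.ndarray, attribution: np.ndarray) -> float:
    """Pearson correlation between MRI pixel intensities and attribution scores.

    A high positive value indicates the explanation tracks bright (high-signal)
    regions, i.e. an intensity heuristic, rather than random noise.
    """
    return float(np.corrcoef(image.ravel(), attribution.ravel())[0, 1])
```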
The panels below quantify the model's focus on key anatomical regions (GTV, Skull, Brain) across the network's depth. We report both the Mean Intersection over Union (mIoU), i.e. the spatial overlap between each ROI and the saliency map binarised at its 95th percentile, and the Relevance Mass (continuous attribution density), defined as follows:
1. Intersection over Union (Jaccard) $$\text{IoU} = \frac{|\text{Mask}_{CAM} \cap \text{Mask}_{ROI}|}{|\text{Mask}_{CAM} \cup \text{Mask}_{ROI}|}$$
2. Relevance Mass (Energy Ratio) $$\text{RM}_{ROI} = \frac{\sum_{p \in ROI} S(p)}{\sum_{p \in Image} S(p)}$$
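The two metrics above can be computed directly from their definitions; a minimal NumPy sketch (names illustrative, assuming 2-D saliency maps and binary ROI masks):

```python
import numpy as np

def iou(saliency: np.ndarray, roi_mask: np.ndarray, pct: float = 95.0) -> float:
    """Jaccard overlap between the binarised saliency map and an ROI mask.

    The saliency map is binarised at its `pct`-th percentile (95th in the text).
    """
    cam_mask = saliency >= np.percentile(saliency, pct)
    inter = np.logical_and(cam_mask, roi_mask).sum()
    union = np.logical_or(cam_mask, roi_mask).sum()
    return float(inter / union) if union else 0.0

def relevance_mass(saliency: np.ndarray, roi_mask: np.ndarray) -> float:
    """Fraction of total (non-negative) attribution energy inside the ROI."""
    s = np.clip(saliency, 0.0, None)
    total = s.sum()
    return float(s[roi_mask.astype(bool)].sum() / total) if total else 0.0
```

Relevance Mass is threshold-independent, which is why it complements the percentile-thresholded mIoU in the panels above.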
Panels: A. mIoU Evolution (Responders) | B. Relevance Mass (Non-Responders) | C. Relevance Mass (Responders)
Figure 3: Quantitative evolution of the model’s anatomical focus.
A. Mean Intersection over Union (mIoU), Responders: this metric quantifies the spatial overlap between the model’s focus (binary masks derived from Grad-CAM heatmaps thresholded at the 95th percentile) and three anatomical Regions of Interest (ROIs): Gross Tumour Volume (GTV), Skull, and whole Brain. The Non-Responder mIoU panel is excluded.
B & C. Relevance Mass: this threshold-independent metric measures the density of attribution energy falling within each ROI, for Non-Responders (B) and Responders (C).
The curves illustrate a dichotomous strategy: Non-Responders exhibit a spatial shift from skull (early layers) to tumour (final layers), whereas Responders show sustained focus on the surgical site (Skull). Statistical significance for the comparison between GTV and Skull attribution is indicated as follows: ∗𝑝<0.05, ∗∗𝑝<0.01, and ∗∗∗𝑝<0.001 (Wilcoxon signed-rank test).
Finally, we quantified the spatial precision of pixel-wise attribution (LRP) versus coarse activation maps (Grad-CAM). The analysis measures the "Hit Rate", i.e. the ability of the explanation to specifically target the contrast-enhancing tumour without spilling over into surrounding tissues.
Non-Responders (Tumour Focus): The chart below demonstrates that LRP consistently achieves a significantly higher hit rate within the enhancing rim compared to Grad-CAM. This confirms that pixel-level propagation is required to accurately isolate small pathological markers.
Responders (Skull/Surgery Focus): A similar trend is observed in the Responder class, where LRP provides sharper localization of the relevant features compared to the interpolated upsampling of Grad-CAM.
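The hit rate can be formalised in several ways; one plausible sketch, taking it as the precision of the top-k attribution pixels with respect to the target ROI (a hypothetical reading, not necessarily the paper's exact definition):

```python
import numpy as np

def hit_rate(attribution: np.ndarray, roi_mask: np.ndarray, k: int = 100) -> float:
    """Fraction of the top-k attribution pixels that fall inside the ROI.

    Sharper, pixel-level maps (LRP) concentrate their top pixels inside small
    structures such as the enhancing rim, while interpolated Grad-CAM maps
    spill over into surrounding tissue and score lower.
    """
    flat = attribution.ravel()
    top = np.argpartition(flat, -k)[-k:]          # indices of k largest values
    return float(roi_mask.ravel().astype(bool)[top].mean())
```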
To map the exact moment the decision emerges, we employed Linear Probing (Alain & Bengio). By training simple linear classifiers on the intermediate activations of the frozen network, we measure the linear separability of the features at each depth.
To differentiate genuine semantic encoding from probe overfitting, we adhered to a strict evaluation protocol:
- Feature Extraction (Residual Stream): We probed the output of each residual block. Unlike activation patching, which targets local computations ($f(x)$), probing targets the Residual Stream ($y = f(x) + x$), representing the accumulated knowledge state of the network at depth $l$.
- Dimensionality Reduction: To ensure the probe learns semantic features rather than spatial memorization, we applied Global Average Pooling (GAP) to the feature maps $A \in \mathbb{R}^{C \times H \times W}$ before training: $$z_c = \text{GAP}(A_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} A_c^{(i, j)}$$
- Anti-Leakage Optimization: We performed a grid search for the regularization parameter ($C$) solely on the Validation Set. The Test Set was locked and used only for the final reporting of metrics.
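A minimal version of this protocol might look as follows, assuming GAP-pooled activations have already been extracted for one layer (shapes and names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss

def gap(features: np.ndarray) -> np.ndarray:
    """Global Average Pooling over spatial dims: (N, C, H, W) -> (N, C)."""
    return features.mean(axis=(2, 3))

def train_probe(acts_tr, y_tr, acts_val, y_val, acts_te, y_te,
                c_grid=(0.01, 0.1, 1.0, 10.0)):
    """Fit a linear probe on one layer's activations.

    The regularization parameter C is selected on the validation set only;
    the locked test set is used once, for the final AUC and loss report.
    """
    z_tr, z_val, z_te = gap(acts_tr), gap(acts_val), gap(acts_te)
    best_c, best_auc = None, -1.0
    for c in c_grid:                      # anti-leakage grid search
        clf = LogisticRegression(C=c, max_iter=1000).fit(z_tr, y_tr)
        auc = roc_auc_score(y_val, clf.predict_proba(z_val)[:, 1])
        if auc > best_auc:
            best_c, best_auc = c, auc
    clf = LogisticRegression(C=best_c, max_iter=1000).fit(z_tr, y_tr)
    p_te = clf.predict_proba(z_te)[:, 1]
    return roc_auc_score(y_te, p_te), log_loss(y_te, p_te)
```

Repeating this per residual block yields the layer-wise AUC and probe-loss curves shown below.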
Layer-wise evolution of linear separability and probe loss.
Linear probes were trained on feature representations extracted from the initial stem convolutions and the output of each subsequent residual block (post-addition f(x)+x).
Left Axis: Classification performance metrics, including Test AUC (solid line) and Test Accuracy.
Right Axis: Probe Loss (cross-entropy, dotted line), reflecting the entropy of the decision.
Trends: The trajectories illustrate the three-phase chronology described in the results: (1) a plateau with stagnant loss across Stages 1 and 2; (2) a sharp transition in performance at the entry of Stage 3; and (3) a temporary rebound in loss at Layer 3.3 before final convergence.
Supplementary Figure: ResNet-51q Architecture
Note: This analysis complements the linear probing presented in the manuscript.
While Linear Probing measures information correlation, Activation Patching measures information causality. We intervene on the network's internal state to identify which features are necessary and sufficient to flip a prediction.
1. Experimental Protocol (Test Set Only):
Unlike probing (which uses the residual stream), patching targets local layer outputs ($f(x)$). To ensure validity, we selected only high-confidence, correctly classified pairs from the Test Set.
Causal Intervention: To prove causality, we process two patients in parallel: a Responder (Source) and a Non-Responder (Target). We intercept the internal activation of the Source at a specific layer (e.g., Stage 2) and "graft" it into the Target's network during its forward pass. We then observe whether the Target's final prediction flips.
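Using PyTorch forward hooks, the graft can be sketched as below (the model and layer handles are illustrative; a forward hook that returns a value replaces the module's output):

```python
import torch

@torch.no_grad()
def patch_and_predict(model, layer, source_x, target_x):
    """Graft the Source's activation at `layer` into the Target's forward pass.

    Returns (baseline_prob, patched_prob) for the Target input; the causal
    effect of the patch is patched_prob - baseline_prob.
    """
    cache = {}
    # Pass 1: run the Source and record its activation at the chosen layer.
    h = layer.register_forward_hook(lambda m, i, o: cache.update(act=o))
    model(source_x)
    h.remove()

    # Pass 2: run the Target, replacing the layer output with the cached one.
    h = layer.register_forward_hook(lambda m, i, o: cache["act"])
    patched = torch.sigmoid(model(target_x))
    h.remove()

    # Clean pass for reference.
    baseline = torch.sigmoid(model(target_x))
    return baseline.item(), patched.item()
```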
2. Causal Impact of Heuristic Features (Results):
We quantified the causal weight of the surgical site heuristic by patching residual bottlenecks in Stage 2. This analysis revealed a decisive asymmetry in the model’s processing:
- Injecting 'Responder' features (surgical site) into 'Non-Responder' inference sequences induced a massive positive probability shift.
- The reverse operation (injecting 'Non-Responder' features into 'Responder' inference) resulted in a significantly weaker suppression.
Causal quantification of the surgical heuristic via Activation Patching.
The plot displays the mean causal effect (Δ Probability) of swapping intermediate activations between 'Responder' and 'Non-Responder' pairs across network depth.
Injecting Heuristic (Blue): Patching 'Responder' activations (containing surgical features) into a 'Non-Responder' input induces a massive positive shift in prediction probability, peaking at the end of Stage 2 (mean impact: +0.623).
Removing Heuristic (Orange): The inverse operation, patching 'Non-Responder' features into a 'Responder' input, results in a significantly weaker suppression (mean impact at the end of Stage 2: -0.238).
Mechanistic Insight: This decisive asymmetry confirms a "proof-of-presence" logic: the decision is primarily driven by the active detection of the surgical marker (Stage 2), rather than the evaluation of tumoral features.
3. Mechanistic Validation: The “Proof-of-Presence” Logic:
This asymmetry validates a “proof-of-presence” logic. The fact that adding surgical features yields an effect magnitude nearly three times greater than removing them confirms that the model does not predict "non-response" by evaluating tumour features. Instead, it predicts "response" by detecting the surgical site. When this specific marker is absent, the model defaults to a negative prognosis.
📐 Deep Dive: Latent Space Geometry (Cosine Distance Analysis)
To understand why the model defaults to a negative prognosis, we analysed the latent-space topography (cosine distances) within the 16×16 spatial feature maps, restricting features to each region by multiplying activations with ROI masks interpolated to the feature-map resolution.
Global Feature Dominance:
- In Responders: The model constructs a representation where 'Brain' and 'Skull' features are extremely similar (distance = 0.224 ± 0.303). This indicates that the high-intensity surgical signal “bleeds” into the global representation, effectively contextualising the entire input as “post-operative”.
- In Non-Responders: A collapse in separability is observed. The distance between 'Brain' and 'Tumour' drops to 0.381 ± 0.311. Without the global high-intensity driver (surgical anchor), the diffuse pathological signal of the tumour fails to overcome the activation threshold, leading to a classification based on low signal-to-noise ratio rather than true semantic understanding.
Cosine Distance Formula $$d_{cos}(\mathbf{z}_{img}, \mathbf{z}_{roi}) = 1 - \frac{\mathbf{z}_{img} \cdot \mathbf{z}_{roi}}{\|\mathbf{z}_{img}\| \, \|\mathbf{z}_{roi}\|}$$
Measures the angular divergence between the global image representation and specific anatomical regions.
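A sketch of this measurement, assuming the ROI restriction is a simple elementwise mask on the (C, H, W) activation map:

```python
import numpy as np

def cosine_distance(z_img: np.ndarray, z_roi: np.ndarray) -> float:
    """1 - cosine similarity between two flattened feature vectors."""
    num = float(np.dot(z_img.ravel(), z_roi.ravel()))
    den = float(np.linalg.norm(z_img) * np.linalg.norm(z_roi))
    return 1.0 - num / den if den else 1.0

def roi_distance(feat: np.ndarray, roi_mask: np.ndarray) -> float:
    """Distance between the global map and the map restricted to an ROI.

    `feat` is a (C, H, W) activation map; `roi_mask` is an (H, W) mask
    interpolated to the feature-map resolution (16x16 in the text).
    """
    return cosine_distance(feat, feat * roi_mask[None, :, :])
```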
To pinpoint the features driving the prediction, we performed targeted input ablations. The visual examples below illustrate how the input MRI is modified before being fed to the model for each condition:
Ablation conditions: 1. Full Image | 2. No Skull | 3. No GTV | 4. No Necrosis | 5. No Enhancement | 6. No GTV+Skull | 7. No 5% Brightest
Ablation studies reveal the model's critical dependence on high-intensity pixels. While removing the skull or the tumour individually has a limited impact, removing the top 5% brightest pixels causes a collapse in predictive performance, confirming the "Surgical Site Heuristic".
Figure 4: Impact of anatomical and intensity-based ablation on predictive performance (ROC Curves).
Impact on Prediction Confidence Distributions: The histograms below show how the model's probability scores (0=Non-Responder, 1=Responder) shift under each ablation condition. Note the complete loss of class separability in the final condition (No 5% Brightest).
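The ablation conditions themselves are simple input transforms; a minimal sketch of conditions such as 'No Skull' and 'No 5% Brightest' (fill value and names are assumptions):

```python
import numpy as np

def ablate_brightest(image: np.ndarray, frac: float = 0.05,
                     fill: float = 0.0) -> np.ndarray:
    """Remove the top `frac` brightest pixels (5% in condition 7)."""
    out = image.copy()
    thr = np.quantile(image, 1.0 - frac)
    out[image >= thr] = fill
    return out

def ablate_roi(image: np.ndarray, roi_mask: np.ndarray,
               fill: float = 0.0) -> np.ndarray:
    """Mask an anatomical ROI (skull, GTV, necrosis, enhancement, ...)."""
    out = image.copy()
    out[roi_mask.astype(bool)] = fill
    return out
```

Each ablated image is then fed to the frozen model, and the shift in predicted probability quantifies that feature's contribution.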
In a study involving 4 experts (Radiation Oncologists & Radiologists), the AI + XAI strategy significantly improved diagnostic accuracy without artificially inflating confidence.
Clinical Utility (Decision Curve Analysis): The AI+XAI synergy (Green curve) maintains high net benefit at higher decision thresholds, where unassisted experts typically underperform.
Assessment of clinical utility and reliability.
Calibration Plots (Reliability Diagrams): These plots assess the agreement between predicted probabilities (x-axis) and observed outcome frequencies (y-axis). The diagonal dotted line represents perfect calibration. The proximity of the AI-assisted curves (Phases 2 & 3) to the diagonal indicates that algorithmic support reduces expert over/under-confidence compared to the baseline.
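A reliability diagram is built by binning predictions and comparing each bin's mean predicted probability with the observed outcome frequency; a minimal sketch (the equal-width binning scheme is an assumption):

```python
import numpy as np

def reliability_curve(y_true, y_prob, n_bins: int = 10):
    """Return (mean predicted probability, observed frequency) per bin.

    Points on the diagonal indicate perfect calibration; points below it
    indicate over-confidence, points above it under-confidence.
    """
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    which = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    mean_pred, obs_freq = [], []
    for b in range(n_bins):
        m = which == b
        if m.any():
            mean_pred.append(y_prob[m].mean())
            obs_freq.append(y_true[m].mean())
    return np.array(mean_pred), np.array(obs_freq)
```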
Evolution of Diagnostic Outcomes: The chart below tracks the evolution of expert diagnoses across the three phases. It illustrates how the addition of AI predictions (Phase 2) and XAI explanations (Phase 3) progressively reclassifies diagnoses.
Figure 6: Evolution of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) counts across the study phases.
Clinical performance metrics.
Comparative Receiver Operating Characteristic (ROC) curves for the standalone AI model and the three phases of the multi-reader, multi-case study (N=444 observations). Performance is reported as Mean, 95% Confidence Intervals [95% CI], and Standard Deviation (Std Dev). Strategies: ‘AI Model Alone’ (standalone algorithm); ‘Phase 1’ (Expert baseline, MRI only); ‘Phase 2’ (Expert + AI numerical prediction); ‘Phase 3’ (Expert + AI + XAI heatmaps). Estimates were generated using a clustered bootstrap procedure (1,000 iterations), resampling by patient identifier to account for intra-cluster correlation.
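The clustered bootstrap can be sketched as follows, resampling patients rather than individual observations so that repeated readings of the same patient stay together (helper name is illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def clustered_bootstrap_auc(y_true, y_score, patient_ids,
                            n_boot: int = 1000, seed: int = 0):
    """Patient-level bootstrap of the AUC, respecting intra-cluster correlation.

    Returns (mean AUC, [2.5th, 97.5th] percentile CI, standard deviation).
    """
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    patient_ids = np.asarray(patient_ids)
    patients = np.unique(patient_ids)
    idx_by_pat = {p: np.flatnonzero(patient_ids == p) for p in patients}
    aucs = []
    for _ in range(n_boot):
        sample = rng.choice(patients, size=len(patients), replace=True)
        idx = np.concatenate([idx_by_pat[p] for p in sample])
        if len(np.unique(y_true[idx])) < 2:   # replicate needs both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    aucs = np.asarray(aucs)
    return aucs.mean(), np.percentile(aucs, [2.5, 97.5]), aucs.std()
```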
- Python 3.10+
- PyTorch
- Pandas, NumPy, Scikit-Learn, Matplotlib
- antspyx https://github.com/ANTsX
- Zennit https://github.com/chr5tphr/zennit
- lime https://github.com/marcotcr/lime
- pytorch-gradcam https://github.com/jacobgil/pytorch-grad-cam
git clone https://github.com/YOUR_USERNAME/GBM-Treatment-Response-XAI.git
cd GBM-Treatment-Response-XAI
pip install -r requirements.txt

The pipeline is modular. You can run specific stages of the interpretability framework:
# 1. Generate Saliency Maps (GradCAM, LRP)
jupyter nbconvert --to notebook --execute notebooks/0_heatmaps_calculation.ipynb
# 2. Run Linear Probing
jupyter nbconvert --to notebook --execute notebooks/5_linear_probing_calculation.ipynb
# 3. Run Clinical Analysis Stats (GEE & Bootstrap)
jupyter nbconvert --to notebook --execute notebooks/8_clinical_multireader_study_xai_analysis.ipynb

Note: Due to patient privacy regulations, the original MRI DICOM/NIfTI files are not provided in this repository. The code expects pre-processed tensors or feature embeddings. The dataset will soon be made available on The Cancer Imaging Archive (TCIA).
├── data/ # (Excluded from git) Place your dataset here
├── src/ # Source code for the XAI pipeline
│ ├── models/ # ResNet-51q architecture definition
│ ├── xai_methods/ # Implementations of LRP, GradCAM, LIME
│ └── clinical/ # Scripts for the multi-reader study analysis
├── outputs/ # Generated figures and stats (as seen above)
│ ├── 0_heatmaps_calculation/
│ ├── 4_ablation_study/
│ ├── 5_linear_probing_calculation/
│ ├── 6_activation_patching_calculation/
│ └── 8_clinical_multireader_study_xai_analysis/
├── requirements.txt
└── README.md
This study was funded by the Région Normandie through the “Booster IA” grant. N.M. was supported by the Région Normandie. This work also received financial support from the “Fonds Amgen France pour la Science et l’Humain”.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this code or framework in your research, please cite our paper (currently under review):
@article{andres2025gbm,
title={Explainability and Clinical Trust of Deep Learning in Glioblastoma Treatment Efficacy Prediction: A Comprehensive Framework},
author={Andres, Romain and Moreau, Noémie and others},
journal={Submitted to Radiotherapy and Oncology (The Green Journal)},
year={2025}
}