Make sure GP model are stored as CPU-only models in MLflow #420

Open
RemiLehe wants to merge 2 commits into BLAST-AI-ML:main from RemiLehe:fix-gp-cuda-serialization

RemiLehe commented Apr 8, 2026

Summary

  • GP models trained on a GPU failed to load on CPU-only machines (e.g. the dashboard's synapse-gui environment) with the error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False
  • Root cause: during calibration, posterior() sets a prediction_strategy on each sub-GP, and this object holds cached CUDA tensors (Cholesky factors). These are plain Python attributes rather than registered parameters or buffers, so gpytorch's _apply() does not move them when .cpu() is called, and they are serialized as CUDA tensors into MLflow.
  • Fix: clear prediction_strategy on each sub-GP before calling model.cpu() in build_lume_model. The NN and ensemble_NN models are unaffected, as they have no such cache.
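
The fix described above can be sketched roughly as follows. This is a minimal illustration, not the code in ml/train_model.py: the helper name clear_prediction_caches is hypothetical, and it assumes the sub-GPs are exposed through a BoTorch-style `models` attribute. Stand-in objects are used so the sketch runs without torch or a GPU.

```python
from types import SimpleNamespace


def clear_prediction_caches(model):
    """Drop the cached prediction strategy from a GP, or from each sub-GP.

    Hypothetical helper. Assumes a BoTorch-style container that exposes its
    sub-GPs as `model.models`; a single GP without that attribute is treated
    as its own only sub-model.
    """
    sub_models = getattr(model, "models", None) or [model]
    for gp in sub_models:
        # The cache is a plain attribute, so resetting it is a plain assignment.
        if getattr(gp, "prediction_strategy", None) is not None:
            gp.prediction_strategy = None
    return model


# Stand-in objects (not real GPs): each one pretends to hold a stale cache.
sub_gps = [SimpleNamespace(prediction_strategy="cached CUDA tensors") for _ in range(2)]
stub_model = SimpleNamespace(models=sub_gps)
clear_prediction_caches(stub_model)
```

With a real model, the call would come before the device move, for example `clear_prediction_caches(model).cpu()`. As far as I understand gpytorch's behavior, the cache is rebuilt on the next prediction in eval mode, so clearing it should cost only a recomputation.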

During calibration, posterior() sets a prediction_strategy on each
sub-GP that holds cached CUDA tensors (Cholesky factors). These are
plain Python attributes, so gpytorch's _apply() does not move them
when .cpu() is called, causing deserialization to fail on machines
without CUDA. Fix by clearing prediction_strategy before .cpu().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RemiLehe changed the title from "Make sure GP model are stored as CPU-only models in MLflow" to "[WIP] Make sure GP model are stored as CPU-only models in MLflow" on Apr 8, 2026
Comment thread ml/train_model.py Outdated
Co-authored-by: Remi Lehe <remi.lehe@normalesup.org>
RemiLehe changed the title from "[WIP] Make sure GP model are stored as CPU-only models in MLflow" to "Make sure GP model are stored as CPU-only models in MLflow" on May 2, 2026