
Fix trust_remote_code and gradient checkpointing for custom models#696

Open
RobotSail wants to merge 11 commits into main from fix/v0.7.0-model-validation

Conversation

@RobotSail
Member

@RobotSail RobotSail commented Mar 27, 2026

Summary

Fixes model loading and training issues for models with custom code (Nemotron, Ministral, Qwen3.5, etc.) that require trust_remote_code=True.

This is part of the v0.7.0 model validation effort.

Changes

1. Add trust_remote_code=True to all AutoConfig calls

Files Modified:

  • src/instructlab/training/data_process.py
  • src/instructlab/training/main_ds.py
  • src/instructlab/training/tokenizer_utils.py
  • src/instructlab/training/gpt_oss_utils_correct.py

Why: Models like Nemotron, Ministral, and Qwen3.5 have custom modeling code that requires trust_remote_code=True. Without this flag, config loading fails with KeyError or architecture detection errors.

Example Error Fixed:

KeyError: '-' in hybrid_override_pattern

2. Fix torchrun path resolution in main_ds.py

What: Use shutil.which("torchrun") with fallback to sys.executable directory

Why: Subprocess calls couldn't find torchrun in PATH. Using shutil.which() provides robust path resolution.

import os
import shutil
import sys

# Resolve torchrun from PATH; fall back to the current interpreter's bin directory.
torchrun_path = shutil.which("torchrun")
if not torchrun_path:
    torchrun_path = os.path.join(os.path.dirname(sys.executable), "torchrun")

3. Handle models without gradient checkpointing support in model.py

What: Wrap gradient_checkpointing_enable() in try/except

Why: Models like NemotronH raise ValueError when gradient checkpointing is enabled, as their hybrid Mamba/MoE/Attention architecture doesn't support it.

try:
    self.model.gradient_checkpointing_enable()
except ValueError as e:
    logger.warning(f"Gradient checkpointing not supported: {e}")

4. Add trust_remote_code=True to base_model_args in model.py

What: Include in default model loading arguments

Why: Ensures all model loading (including quantized models) respects custom code requirements.
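As a rough illustration of this change, the flag can be threaded into a shared kwargs dict that every loading path reuses. This is a hypothetical sketch, not the actual instructlab API; the helper name `build_base_model_args` is invented for illustration:

```python
# Hypothetical sketch: threading a trust_remote_code flag into the kwargs
# shared by all model-loading paths. The function name and dict layout are
# illustrative, not the real instructlab code.
def build_base_model_args(model_path: str, trust_remote_code: bool = False) -> dict:
    """Assemble keyword arguments shared by all model-loading paths."""
    args = {
        "pretrained_model_name_or_path": model_path,
        "torch_dtype": "bfloat16",
    }
    if trust_remote_code:
        # Opt-in only: executing repository-provided code is a security risk.
        args["trust_remote_code"] = True
    return args

# Quantized and unquantized loads can then both unpack the same dict.
args = build_base_model_args("some/custom-model", trust_remote_code=True)
```

Centralizing the kwargs this way avoids the original bug class, where one call site was updated and another was missed.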

Testing

Validated with models:

  • ✅ Nemotron SFT/OSFT - passes preprocessing (was failing with KeyError)
  • ✅ Ministral SFT/OSFT - passes with trust_remote_code
  • ✅ Qwen3.5 models - load correctly with custom configs

Related PRs

Summary by CodeRabbit

  • New Features

    • Added a training option (CLI flag + env var) to enable trusting remote model/tokenizer code during loads.
  • Bug Fixes

    • Resolved launcher resolution so the distributed training launcher path is validated and used.
    • Improved warnings when model loading or gradient checkpointing cannot be enabled.
  • Improvements

    • Tokenizer and model loading are more compatible with external implementations.
  • Documentation

    • README updated to document the new option and environment variable.
  • Tests

    • Unit tests updated to reflect explicit remote-trust behavior when loading tokenizers.

claude and others added 2 commits March 27, 2026 20:01
- Add trust_remote_code=True to all AutoConfig/AutoTokenizer.from_pretrained() calls
- Add torchrun path resolution (shutil.which with sys.executable fallback)
- Pass trust_remote_code=True to base_model_args and VLM helper functions

This fixes training failures for models like Nemotron that use remote code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Wrap gradient_checkpointing_enable() in try/except to handle models
like NemotronH that don't support gradient checkpointing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

coderabbitai bot commented Mar 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a trust_remote_code opt-in (CLI/env) and threads it into HF AutoConfig/AutoTokenizer loads, updates model constructors and gradient-checkpoint enabling, resolves the distributed launcher path for the training subprocess, and updates tests and README accordingly.

Changes

  • Training data & gpt-oss helpers (src/instructlab/training/data_process.py, src/instructlab/training/gpt_oss_utils_correct.py): Introduce _trust_remote_code() and pass a trust_remote_code boolean (derived from TRUST_REMOTE_CODE) into AutoConfig.from_pretrained(...) and AutoTokenizer.from_pretrained(...).
  • Tokenizer setup (src/instructlab/training/tokenizer_utils.py): Derive trust_remote_code from the environment and pass it into AutoTokenizer.from_pretrained(...) in the tokenizer setup flow; removes the prior fast-tokenizer-only call pattern.
  • Model surface & safety (src/instructlab/training/model.py): Add trust_remote_code: bool = False to the Model, LigerModel, and CausalLMModel constructors and propagate it into base-model loads; add an _enable_gradient_checkpointing_if_supported() helper and replace direct gradient_checkpointing_enable() calls.
  • Distributed launcher & CLI (src/instructlab/training/main_ds.py): Add a --trust_remote_code CLI flag and env handling; resolve torchrun via shutil.which("torchrun") and use the resolved path (or python -m torch.distributed.run) when building the subprocess command; forward trust_remote_code into config/model loads and the subprocess invocation.
  • Config & docs (src/instructlab/training/config.py, README.md): Add trust_remote_code: bool = False to TrainingArgs and document the TRUST_REMOTE_CODE env var and CLI behavior in the README.
  • Tests (tests/unit/test_pretraining_data_process.py, tests/unit/test_pretraining_mode.py): Update unit tests to assert that AutoTokenizer.from_pretrained/AutoConfig.from_pretrained calls include the trust_remote_code keyword (env-derived boolean).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble bytes beneath the moon,
I trust a patch and hum a tune.
Torchrun found, remote code checked,
Checkpoints tucked—no crash detected.
A carrot hop for every patch and prune.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 62.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main changes: adding trust_remote_code support for custom models and making gradient checkpointing robust for models that don't support it.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/instructlab/training/main_ds.py (1)

632-644: ⚠️ Potential issue | 🟠 Major

Validate torchrun fallback before command construction.

At Line 636, the fallback path can be non-existent/non-executable. That will fail at subprocess launch time. Prefer a guaranteed fallback (python -m torch.distributed.run) instead of assuming a sibling torchrun script exists.

🛠️ Proposed robust fallback
-    # Find torchrun executable - try PATH first, then sys.executable's bin dir
-    torchrun_path = shutil.which("torchrun")
-    if not torchrun_path:
-        # Fall back to same directory as current Python executable
-        torchrun_path = os.path.join(os.path.dirname(sys.executable), "torchrun")
-
-    command = [
-        torchrun_path,
+    # Find torchrun executable. If unavailable, fall back to module invocation.
+    torchrun_path = shutil.which("torchrun")
+    if torchrun_path:
+        command = [torchrun_path]
+    else:
+        command = [sys.executable, "-m", "torch.distributed.run"]
+
+    command += [
         f"--nproc-per-node={torch_args.nproc_per_node}",
         f"--nnodes={torch_args.nnodes}",
         f"--node-rank={torch_args.node_rank}",
         f"--rdzv-id={torch_args.rdzv_id}",
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/main_ds.py` around lines 632 - 644, The fallback
logic for locating torchrun (torchrun_path) can produce a non-existent or
non-executable path; instead validate the found path and, if it's missing or not
executable, use a guaranteed fallback by invoking the Python module runner
(e.g., "sys.executable -m torch.distributed.run") when building the command
list. Update the logic around torchrun_path and command construction: check
shutil.which("torchrun") and os.access(torchrun_path, os.X_OK), and if invalid,
set the command to start with sys.executable and "-m", "torch.distributed.run"
(while still appending the torch_args flags like
f"--nproc-per-node={torch_args.nproc_per_node}",
f"--nnodes={torch_args.nnodes}", f"--node-rank={torch_args.node_rank}",
f"--rdzv-id={torch_args.rdzv_id}") so subprocess launches reliably.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/instructlab/training/data_process.py`:
- Around line 1300-1303: The tokenizer is inconsistently loaded:
configure_tokenizer() uses AutoTokenizer.from_pretrained(model_path,
trust_remote_code=True) but process_documents_for_pretraining() calls
AutoTokenizer.from_pretrained(model_path) without trust_remote_code=True; update
the call inside process_documents_for_pretraining to pass trust_remote_code=True
(mirror configure_tokenizer) so custom-tokenizers that require remote code will
load correctly, and ensure any other direct AutoTokenizer.from_pretrained usages
in that function use the same argument.
- Around line 48-53: The try/except around AutoConfig.from_pretrained in
data_process.py should stop catching all Exceptions and instead catch and handle
specific known failures (e.g., OSError, ValueError, HFValidationError,
JSONDecodeError) while logging the error so failures aren’t silently treated as
“not GPT-OSS”; update the block around
AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True) to catch
those specific exception types, call the module logger (or processLogger) with
the exception details, return False for those handled cases, and let any other
unexpected exceptions propagate so unmask_message_content and unmask_messages
aren’t misled by silent failures.

In `@src/instructlab/training/model.py`:
- Around line 607-611: The unguarded gradient_checkpointing_enable() call in
LigerModel can crash for models that don't support it and the current except
only catches ValueError; create a small shared helper (e.g.,
safe_enable_gradient_checkpointing or _try_enable_gradient_checkpointing) that
calls model.gradient_checkpointing_enable() inside a try/except catching
ValueError, NotImplementedError, and AttributeError and logs a warning; replace
the direct calls in both LigerModel and CausalLMModel with this helper so both
use identical, robust error handling.

In `@src/instructlab/training/tokenizer_utils.py`:
- Around line 98-100: The call to AutoTokenizer.from_pretrained in
tokenizer_utils.py uses the invalid keyword fast_tokenizer; replace that keyword
with use_fast so the call becomes
AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True,
trust_remote_code=True) (locate the tokenizer =
AutoTokenizer.from_pretrained(...) invocation to update the parameter name).

---

Outside diff comments:
In `@src/instructlab/training/main_ds.py`:
- Around line 632-644: The fallback logic for locating torchrun (torchrun_path)
can produce a non-existent or non-executable path; instead validate the found
path and, if it's missing or not executable, use a guaranteed fallback by
invoking the Python module runner (e.g., "sys.executable -m
torch.distributed.run") when building the command list. Update the logic around
torchrun_path and command construction: check shutil.which("torchrun") and
os.access(torchrun_path, os.X_OK), and if invalid, set the command to start with
sys.executable and "-m", "torch.distributed.run" (while still appending the
torch_args flags like f"--nproc-per-node={torch_args.nproc_per_node}",
f"--nnodes={torch_args.nnodes}", f"--node-rank={torch_args.node_rank}",
f"--rdzv-id={torch_args.rdzv_id}") so subprocess launches reliably.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 153c56c3-511f-48d8-994b-a96a507f4c42

📥 Commits

Reviewing files that changed from the base of the PR and between 75425ca and 919848d.

📒 Files selected for processing (5)
  • src/instructlab/training/data_process.py
  • src/instructlab/training/gpt_oss_utils_correct.py
  • src/instructlab/training/main_ds.py
  • src/instructlab/training/model.py
  • src/instructlab/training/tokenizer_utils.py

- Narrow exception handling in is_gpt_oss_model to catch specific
  exceptions (OSError, ValueError) instead of bare Exception, and
  log the failure details for debugging
- Add trust_remote_code=True to process_documents_for_pretraining()
  tokenizer loading for consistency with configure_tokenizer()
- Replace invalid fast_tokenizer kwarg with use_fast in
  tokenizer_utils.py setup_tokenizer()
- Create shared _enable_gradient_checkpointing_if_supported() helper
  on Model base class, catching ValueError, NotImplementedError, and
  AttributeError; use it in both LigerModel and CausalLMModel
- Improve torchrun fallback to use sys.executable -m
  torch.distributed.run instead of assuming a sibling script exists
- Fix ruff formatting for AutoConfig.from_pretrained call

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
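The shared helper described in this commit could look roughly like the following. This is a minimal sketch based on the commit message (the real code lives on the Model base class in model.py); the stub model class is invented to mimic NemotronH's behavior:

```python
import logging

logger = logging.getLogger(__name__)

class Model:
    """Sketch of the base class carrying the shared checkpointing helper."""

    def __init__(self, model):
        self.model = model

    def _enable_gradient_checkpointing_if_supported(self) -> bool:
        # Catch the three failure modes named in the commit message:
        # unsupported architectures (ValueError), unimplemented hooks
        # (NotImplementedError), and models lacking the method entirely
        # (AttributeError).
        try:
            self.model.gradient_checkpointing_enable()
            return True
        except (ValueError, NotImplementedError, AttributeError) as e:
            logger.warning("Gradient checkpointing not supported: %s", e)
            return False

# Stand-in for a hybrid Mamba/MoE/Attention model that rejects checkpointing,
# as NemotronH does; purely illustrative.
class HybridMambaStub:
    def gradient_checkpointing_enable(self):
        raise ValueError("gradient checkpointing is not supported for this architecture")

Model(HybridMambaStub())._enable_gradient_checkpointing_if_supported()  # logs a warning, no crash
```

Both LigerModel and CausalLMModel can then call the one helper, so the two paths cannot drift apart in which exceptions they tolerate.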
@mergify mergify bot added ci-failure and removed ci-failure labels Mar 27, 2026
Update test assertions to expect trust_remote_code=True parameter
in AutoTokenizer.from_pretrained calls after adding this parameter
to process_documents_for_pretraining.
@mergify mergify bot added the testing Relates to testing label Mar 27, 2026
@mergify mergify bot removed the ci-failure label Mar 27, 2026
Instead of hardcoding trust_remote_code=True everywhere:

1. Add trust_remote_code field to TrainingArgs (default: False)
2. Add --trust_remote_code argparse flag to subprocess CLI
3. Support TRUST_REMOTE_CODE=1 environment variable
4. Thread the setting through Model, tokenizer, and config calls
5. Remove torchrun fallback — error if torchrun is not found
6. Remove unnecessary try/except in is_gpt_oss_model
7. Remove redundant use_fast=True from tokenizer_utils

The env var is exported by main() when the flag is set, so
downstream calls (data_process, tokenizer_utils, gpt_oss_utils)
automatically pick it up without needing explicit parameter threading.
@mergify mergify bot added the ci-failure label Mar 27, 2026
Add trust_remote_code to the TrainingArgs table and document the
TRUST_REMOTE_CODE environment variable in the environment variables
section.
@mergify mergify bot added the documentation Improvements or additions to documentation label Mar 27, 2026
NemotronH has Mamba layers just like GraniteMoeHybrid and needs
the same _use_local_mamba_kernels() call to avoid causal_conv1d_cuda
import failures in torchrun subprocesses.
@RobotSail RobotSail force-pushed the fix/v0.7.0-model-validation branch from 482b3fd to fa53865 Compare March 28, 2026 00:23

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Line 247: Update the README entries for the trust_remote_code option to be a
complete sentence and add a brief security caution: clearly state that
trust_remote_code controls whether repository-provided code from HuggingFace Hub
is executed when loading models and tokenizers (e.g., required for models like
Nemotron, Mistral, Qwen3.5), note it can be enabled via the TRUST_REMOTE_CODE=1
environment variable, that it defaults to False, and add one short warning
sentence that enabling it will execute remote repository code and may pose
security risks so only enable for trusted sources (apply the same edit to the
duplicate occurrence of trust_remote_code).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ad51800c-1673-4ab3-9913-5d5a66d77d90

📥 Commits

Reviewing files that changed from the base of the PR and between ab3c038 and fa53865.

📒 Files selected for processing (9)
  • README.md
  • src/instructlab/training/config.py
  • src/instructlab/training/data_process.py
  • src/instructlab/training/gpt_oss_utils_correct.py
  • src/instructlab/training/main_ds.py
  • src/instructlab/training/model.py
  • src/instructlab/training/tokenizer_utils.py
  • tests/unit/test_pretraining_data_process.py
  • tests/unit/test_pretraining_mode.py
✅ Files skipped from review due to trivial changes (2)
  • tests/unit/test_pretraining_mode.py
  • tests/unit/test_pretraining_data_process.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/instructlab/training/gpt_oss_utils_correct.py
  • src/instructlab/training/tokenizer_utils.py
  • src/instructlab/training/data_process.py
  • src/instructlab/training/model.py
  • src/instructlab/training/main_ds.py

@mergify mergify bot removed the ci-failure label Mar 28, 2026
…ormat

The torchrun-not-found issue was caused by the venv not being
activated, not an installation problem. Revert to plain 'torchrun'
command. Remove now-unused shutil and sys imports. Run ruff format
on all modified files.
@RobotSail RobotSail force-pushed the fix/v0.7.0-model-validation branch from fa6c212 to 981ec4a Compare March 28, 2026 01:13
claude and others added 2 commits March 28, 2026 01:43
Ministral-3-3B ships with FP8 quantized weights that include scalar
parameters (weight_scale_inv, activation_scale) which FSDP rejects.
This change dequantizes FP8 weights to bf16 after VLM extraction for
training compatibility, preserves the original scales, and requantizes
back to FP8 at checkpoint save time so saved checkpoints match the
original FP8 format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
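The FP8 round trip this commit describes can be pictured with plain floats. This is a conceptual sketch only: the parameter name weight_scale_inv follows the commit message, but the exact scale convention (multiply on dequantize, divide on requantize) and the use of Python lists instead of real FP8 tensors are assumptions:

```python
def dequantize(quantized: list[float], weight_scale_inv: float) -> list[float]:
    # Recover bf16-style weights from FP8 storage using the stored scale.
    # The multiply-on-dequantize convention is an assumption in this sketch.
    return [q * weight_scale_inv for q in quantized]

def requantize(weights: list[float], weight_scale_inv: float) -> list[float]:
    # Reuse the preserved scale at checkpoint-save time so the saved
    # checkpoint matches the original FP8 format.
    return [w / weight_scale_inv for w in weights]

scale_inv = 4.0            # preserved from the original checkpoint
w_q = [1.0, 2.0, -3.0]     # FP8-stored values
w_train = dequantize(w_q, scale_inv)      # train in bf16-like precision
w_saved = requantize(w_train, scale_inv)  # round-trips back to the FP8 values
```

The key property is that the original scales are kept alongside the dequantized weights, so saving is a lossless inverse of loading (up to FP8 rounding in the real tensor case).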
@mergify mergify bot added the ci-failure label Mar 30, 2026