
Fix trust_remote_code and gradient checkpointing for custom models#696

Open
RobotSail wants to merge 11 commits into main from fix/v0.7.0-model-validation

Conversation

@RobotSail
Member

@RobotSail RobotSail commented Mar 27, 2026

Summary

Fixes model loading and training issues for models with custom code (Nemotron, Ministral, Qwen3.5, etc.) that require trust_remote_code=True.

This is part of the v0.7.0 model validation effort.

Changes

1. Add trust_remote_code=True to all AutoConfig calls

Files Modified:

  • src/instructlab/training/data_process.py
  • src/instructlab/training/main_ds.py
  • src/instructlab/training/tokenizer_utils.py
  • src/instructlab/training/gpt_oss_utils_correct.py

Why: Models like Nemotron, Ministral, and Qwen3.5 have custom modeling code that requires trust_remote_code=True. Without this flag, config loading fails with KeyError or architecture detection errors.

Example Error Fixed:

KeyError: '-' in hybrid_override_pattern

2. Fix torchrun path resolution in main_ds.py

What: Use shutil.which("torchrun") with fallback to sys.executable directory

Why: Subprocess calls couldn't find torchrun in PATH. Using shutil.which() provides robust path resolution.

import os
import shutil
import sys

# Resolve torchrun from PATH; fall back to the current interpreter's bin directory.
torchrun_path = shutil.which("torchrun")
if not torchrun_path:
    torchrun_path = os.path.join(os.path.dirname(sys.executable), "torchrun")

3. Handle models without gradient checkpointing support in model.py

What: Wrap gradient_checkpointing_enable() in try/except

Why: Models like NemotronH raise ValueError when gradient checkpointing is enabled, as their hybrid Mamba/MoE/Attention architecture doesn't support it.

try:
    self.model.gradient_checkpointing_enable()
except ValueError as e:
    logger.warning(f"Gradient checkpointing not supported: {e}")

4. Add trust_remote_code=True to base_model_args in model.py

What: Include in default model loading arguments

Why: Ensures all model loading (including quantized models) respects custom code requirements.
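As a rough illustration of this change, the flag can be threaded into a shared kwargs dict that every loading path reuses. This is a hypothetical sketch, not the actual instructlab API; the helper name `build_base_model_args` is invented for illustration:

```python
# Hypothetical sketch: threading a trust_remote_code flag into the kwargs
# shared by all model-loading paths. The function name and dict layout are
# illustrative, not the real instructlab code.
def build_base_model_args(model_path: str, trust_remote_code: bool = False) -> dict:
    """Assemble keyword arguments shared by all model-loading paths."""
    args = {
        "pretrained_model_name_or_path": model_path,
        "torch_dtype": "bfloat16",
    }
    if trust_remote_code:
        # Opt-in only: executing repository-provided code is a security risk.
        args["trust_remote_code"] = True
    return args

# Quantized and unquantized loads can then both unpack the same dict.
args = build_base_model_args("some/custom-model", trust_remote_code=True)
```

Centralizing the kwargs this way avoids the original bug class, where one call site was updated and another was missed.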

Testing

Validated with models:

  • ✅ Nemotron SFT/OSFT - passes preprocessing (was failing with KeyError)
  • ✅ Ministral SFT/OSFT - passes with trust_remote_code
  • ✅ Qwen3.5 models - load correctly with custom configs

Related PRs

Summary by CodeRabbit

  • New Features

    • Added a training option (CLI flag + env var) to enable trusting remote model/tokenizer code during loads.
  • Bug Fixes

    • Resolved launcher resolution so the distributed training launcher path is validated and used.
    • Improved warnings when model loading or gradient checkpointing cannot be enabled.
  • Improvements

    • Tokenizer and model loading are more compatible with external implementations.
  • Documentation

    • README updated to document the new option and environment variable.
  • Tests

    • Unit tests updated to reflect explicit remote-trust behavior when loading tokenizers.

claude and others added 2 commits March 27, 2026 20:01
- Add trust_remote_code=True to all AutoConfig/AutoTokenizer.from_pretrained() calls
- Add torchrun path resolution (shutil.which with sys.executable fallback)
- Pass trust_remote_code=True to base_model_args and VLM helper functions

This fixes training failures for models like Nemotron that use remote code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Wrap gradient_checkpointing_enable() in try/except to handle models
like NemotronH that don't support gradient checkpointing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

coderabbitai bot commented Mar 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a trust_remote_code opt-in (CLI/env) and threads it into HF AutoConfig/AutoTokenizer loads, updates model constructors and gradient-checkpoint enabling, resolves the distributed launcher path for the training subprocess, and updates tests and README accordingly.

Changes

  • Training data & gpt-oss helpers (src/instructlab/training/data_process.py, src/instructlab/training/gpt_oss_utils_correct.py): Introduce _trust_remote_code() and pass a trust_remote_code boolean (derived from TRUST_REMOTE_CODE) into AutoConfig.from_pretrained(...) and AutoTokenizer.from_pretrained(...).
  • Tokenizer setup (src/instructlab/training/tokenizer_utils.py): Derive trust_remote_code from the environment and pass it into AutoTokenizer.from_pretrained(...) in the tokenizer setup flow; removes the prior fast-tokenizer-only call pattern.
  • Model surface & safety (src/instructlab/training/model.py): Add trust_remote_code: bool = False to the Model, LigerModel, and CausalLMModel constructors and propagate it into base-model loads; add an _enable_gradient_checkpointing_if_supported() helper and replace direct gradient_checkpointing_enable() calls.
  • Distributed launcher & CLI (src/instructlab/training/main_ds.py): Add a --trust_remote_code CLI flag and env handling; resolve torchrun via shutil.which("torchrun") and use the resolved path (or python -m torch.distributed.run) when building the subprocess command; forward trust_remote_code into config/model loads and the subprocess invocation.
  • Config & docs (src/instructlab/training/config.py, README.md): Add trust_remote_code: bool = False to TrainingArgs and document the TRUST_REMOTE_CODE env var and CLI behavior in the README.
  • Tests (tests/unit/test_pretraining_data_process.py, tests/unit/test_pretraining_mode.py): Update unit tests to assert that AutoTokenizer.from_pretrained/AutoConfig.from_pretrained calls include the trust_remote_code keyword (env-derived boolean).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble bytes beneath the moon,
I trust a patch and hum a tune.
Torchrun found, remote code checked,
Checkpoints tucked—no crash detected.
A carrot hop for every patch and prune.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 62.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main changes: adding trust_remote_code support for custom models and making gradient checkpointing robust for models that don't support it.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/instructlab/training/main_ds.py (1)

632-644: ⚠️ Potential issue | 🟠 Major

Validate torchrun fallback before command construction.

At Line 636, the fallback path can be non-existent/non-executable. That will fail at subprocess launch time. Prefer a guaranteed fallback (python -m torch.distributed.run) instead of assuming a sibling torchrun script exists.

🛠️ Proposed robust fallback
-    # Find torchrun executable - try PATH first, then sys.executable's bin dir
-    torchrun_path = shutil.which("torchrun")
-    if not torchrun_path:
-        # Fall back to same directory as current Python executable
-        torchrun_path = os.path.join(os.path.dirname(sys.executable), "torchrun")
-
-    command = [
-        torchrun_path,
+    # Find torchrun executable. If unavailable, fall back to module invocation.
+    torchrun_path = shutil.which("torchrun")
+    if torchrun_path:
+        command = [torchrun_path]
+    else:
+        command = [sys.executable, "-m", "torch.distributed.run"]
+
+    command += [
         f"--nproc-per-node={torch_args.nproc_per_node}",
         f"--nnodes={torch_args.nnodes}",
         f"--node-rank={torch_args.node_rank}",
         f"--rdzv-id={torch_args.rdzv_id}",
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/main_ds.py` around lines 632 - 644, The fallback
logic for locating torchrun (torchrun_path) can produce a non-existent or
non-executable path; instead validate the found path and, if it's missing or not
executable, use a guaranteed fallback by invoking the Python module runner
(e.g., "sys.executable -m torch.distributed.run") when building the command
list. Update the logic around torchrun_path and command construction: check
shutil.which("torchrun") and os.access(torchrun_path, os.X_OK), and if invalid,
set the command to start with sys.executable and "-m", "torch.distributed.run"
(while still appending the torch_args flags like
f"--nproc-per-node={torch_args.nproc_per_node}",
f"--nnodes={torch_args.nnodes}", f"--node-rank={torch_args.node_rank}",
f"--rdzv-id={torch_args.rdzv_id}") so subprocess launches reliably.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/instructlab/training/data_process.py`:
- Around line 1300-1303: The tokenizer is inconsistently loaded:
configure_tokenizer() uses AutoTokenizer.from_pretrained(model_path,
trust_remote_code=True) but process_documents_for_pretraining() calls
AutoTokenizer.from_pretrained(model_path) without trust_remote_code=True; update
the call inside process_documents_for_pretraining to pass trust_remote_code=True
(mirror configure_tokenizer) so custom-tokenizers that require remote code will
load correctly, and ensure any other direct AutoTokenizer.from_pretrained usages
in that function use the same argument.
- Around line 48-53: The try/except around AutoConfig.from_pretrained in
data_process.py should stop catching all Exceptions and instead catch and handle
specific known failures (e.g., OSError, ValueError, HFValidationError,
JSONDecodeError) while logging the error so failures aren’t silently treated as
“not GPT-OSS”; update the block around
AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True) to catch
those specific exception types, call the module logger (or processLogger) with
the exception details, return False for those handled cases, and let any other
unexpected exceptions propagate so unmask_message_content and unmask_messages
aren’t misled by silent failures.

In `@src/instructlab/training/model.py`:
- Around line 607-611: The unguarded gradient_checkpointing_enable() call in
LigerModel can crash for models that don't support it and the current except
only catches ValueError; create a small shared helper (e.g.,
safe_enable_gradient_checkpointing or _try_enable_gradient_checkpointing) that
calls model.gradient_checkpointing_enable() inside a try/except catching
ValueError, NotImplementedError, and AttributeError and logs a warning; replace
the direct calls in both LigerModel and CausalLMModel with this helper so both
use identical, robust error handling.

In `@src/instructlab/training/tokenizer_utils.py`:
- Around line 98-100: The call to AutoTokenizer.from_pretrained in
tokenizer_utils.py uses the invalid keyword fast_tokenizer; replace that keyword
with use_fast so the call becomes
AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True,
trust_remote_code=True) (locate the tokenizer =
AutoTokenizer.from_pretrained(...) invocation to update the parameter name).

---

Outside diff comments:
In `@src/instructlab/training/main_ds.py`:
- Around line 632-644: The fallback logic for locating torchrun (torchrun_path)
can produce a non-existent or non-executable path; instead validate the found
path and, if it's missing or not executable, use a guaranteed fallback by
invoking the Python module runner (e.g., "sys.executable -m
torch.distributed.run") when building the command list. Update the logic around
torchrun_path and command construction: check shutil.which("torchrun") and
os.access(torchrun_path, os.X_OK), and if invalid, set the command to start with
sys.executable and "-m", "torch.distributed.run" (while still appending the
torch_args flags like f"--nproc-per-node={torch_args.nproc_per_node}",
f"--nnodes={torch_args.nnodes}", f"--node-rank={torch_args.node_rank}",
f"--rdzv-id={torch_args.rdzv_id}") so subprocess launches reliably.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 153c56c3-511f-48d8-994b-a96a507f4c42

📥 Commits

Reviewing files that changed from the base of the PR and between 75425ca and 919848d.

📒 Files selected for processing (5)
  • src/instructlab/training/data_process.py
  • src/instructlab/training/gpt_oss_utils_correct.py
  • src/instructlab/training/main_ds.py
  • src/instructlab/training/model.py
  • src/instructlab/training/tokenizer_utils.py

- Narrow exception handling in is_gpt_oss_model to catch specific
  exceptions (OSError, ValueError) instead of bare Exception, and
  log the failure details for debugging
- Add trust_remote_code=True to process_documents_for_pretraining()
  tokenizer loading for consistency with configure_tokenizer()
- Replace invalid fast_tokenizer kwarg with use_fast in
  tokenizer_utils.py setup_tokenizer()
- Create shared _enable_gradient_checkpointing_if_supported() helper
  on Model base class, catching ValueError, NotImplementedError, and
  AttributeError; use it in both LigerModel and CausalLMModel
- Improve torchrun fallback to use sys.executable -m
  torch.distributed.run instead of assuming a sibling script exists
- Fix ruff formatting for AutoConfig.from_pretrained call

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
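The shared helper described in this commit could look roughly like the following. This is a minimal sketch based on the commit message (the real code lives on the Model base class in model.py); the stub model class is invented to mimic NemotronH's behavior:

```python
import logging

logger = logging.getLogger(__name__)

class Model:
    """Sketch of the base class carrying the shared checkpointing helper."""

    def __init__(self, model):
        self.model = model

    def _enable_gradient_checkpointing_if_supported(self) -> bool:
        # Catch the three failure modes named in the commit message:
        # unsupported architectures (ValueError), unimplemented hooks
        # (NotImplementedError), and models lacking the method entirely
        # (AttributeError).
        try:
            self.model.gradient_checkpointing_enable()
            return True
        except (ValueError, NotImplementedError, AttributeError) as e:
            logger.warning("Gradient checkpointing not supported: %s", e)
            return False

# Stand-in for a hybrid Mamba/MoE/Attention model that rejects checkpointing,
# as NemotronH does; purely illustrative.
class HybridMambaStub:
    def gradient_checkpointing_enable(self):
        raise ValueError("gradient checkpointing is not supported for this architecture")

Model(HybridMambaStub())._enable_gradient_checkpointing_if_supported()  # logs a warning, no crash
```

Both LigerModel and CausalLMModel can then call the one helper, so the two paths cannot drift apart in which exceptions they tolerate.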
@mergify mergify bot added ci-failure and removed ci-failure labels Mar 27, 2026
Update test assertions to expect trust_remote_code=True parameter
in AutoTokenizer.from_pretrained calls after adding this parameter
to process_documents_for_pretraining.
@mergify mergify bot added the testing Relates to testing label Mar 27, 2026
@mergify mergify bot removed the ci-failure label Mar 27, 2026
Instead of hardcoding trust_remote_code=True everywhere:

1. Add trust_remote_code field to TrainingArgs (default: False)
2. Add --trust_remote_code argparse flag to subprocess CLI
3. Support TRUST_REMOTE_CODE=1 environment variable
4. Thread the setting through Model, tokenizer, and config calls
5. Remove torchrun fallback — error if torchrun is not found
6. Remove unnecessary try/except in is_gpt_oss_model
7. Remove redundant use_fast=True from tokenizer_utils

The env var is exported by main() when the flag is set, so
downstream calls (data_process, tokenizer_utils, gpt_oss_utils)
automatically pick it up without needing explicit parameter threading.
@mergify mergify bot added the ci-failure label Mar 27, 2026
Add trust_remote_code to the TrainingArgs table and document the
TRUST_REMOTE_CODE environment variable in the environment variables
section.
@mergify mergify bot added the documentation Improvements or additions to documentation label Mar 27, 2026
NemotronH has Mamba layers just like GraniteMoeHybrid and needs
the same _use_local_mamba_kernels() call to avoid causal_conv1d_cuda
import failures in torchrun subprocesses.
@RobotSail RobotSail force-pushed the fix/v0.7.0-model-validation branch from 482b3fd to fa53865 Compare March 28, 2026 00:23

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Line 247: Update the README entries for the trust_remote_code option to be a
complete sentence and add a brief security caution: clearly state that
trust_remote_code controls whether repository-provided code from HuggingFace Hub
is executed when loading models and tokenizers (e.g., required for models like
Nemotron, Mistral, Qwen3.5), note it can be enabled via the TRUST_REMOTE_CODE=1
environment variable, that it defaults to False, and add one short warning
sentence that enabling it will execute remote repository code and may pose
security risks so only enable for trusted sources (apply the same edit to the
duplicate occurrence of trust_remote_code).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ad51800c-1673-4ab3-9913-5d5a66d77d90

📥 Commits

Reviewing files that changed from the base of the PR and between ab3c038 and fa53865.

📒 Files selected for processing (9)
  • README.md
  • src/instructlab/training/config.py
  • src/instructlab/training/data_process.py
  • src/instructlab/training/gpt_oss_utils_correct.py
  • src/instructlab/training/main_ds.py
  • src/instructlab/training/model.py
  • src/instructlab/training/tokenizer_utils.py
  • tests/unit/test_pretraining_data_process.py
  • tests/unit/test_pretraining_mode.py
✅ Files skipped from review due to trivial changes (2)
  • tests/unit/test_pretraining_mode.py
  • tests/unit/test_pretraining_data_process.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/instructlab/training/gpt_oss_utils_correct.py
  • src/instructlab/training/tokenizer_utils.py
  • src/instructlab/training/data_process.py
  • src/instructlab/training/model.py
  • src/instructlab/training/main_ds.py

@mergify mergify bot removed the ci-failure label Mar 28, 2026
…ormat

The torchrun-not-found issue was caused by the venv not being
activated, not an installation problem. Revert to plain 'torchrun'
command. Remove now-unused shutil and sys imports. Run ruff format
on all modified files.
@RobotSail RobotSail force-pushed the fix/v0.7.0-model-validation branch from fa6c212 to 981ec4a Compare March 28, 2026 01:13
claude and others added 2 commits March 28, 2026 01:43
Ministral-3-3B ships with FP8 quantized weights that include scalar
parameters (weight_scale_inv, activation_scale) which FSDP rejects.
This change dequantizes FP8 weights to bf16 after VLM extraction for
training compatibility, preserves the original scales, and requantizes
back to FP8 at checkpoint save time so saved checkpoints match the
original FP8 format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
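The FP8 round trip this commit describes can be pictured with plain floats. This is a conceptual sketch only: the parameter name weight_scale_inv follows the commit message, but the exact scale convention (multiply on dequantize, divide on requantize) and the use of Python lists instead of real FP8 tensors are assumptions:

```python
def dequantize(quantized: list[float], weight_scale_inv: float) -> list[float]:
    # Recover bf16-style weights from FP8 storage using the stored scale.
    # The multiply-on-dequantize convention is an assumption in this sketch.
    return [q * weight_scale_inv for q in quantized]

def requantize(weights: list[float], weight_scale_inv: float) -> list[float]:
    # Reuse the preserved scale at checkpoint-save time so the saved
    # checkpoint matches the original FP8 format.
    return [w / weight_scale_inv for w in weights]

scale_inv = 4.0            # preserved from the original checkpoint
w_q = [1.0, 2.0, -3.0]     # FP8-stored values
w_train = dequantize(w_q, scale_inv)      # train in bf16-like precision
w_saved = requantize(w_train, scale_inv)  # round-trips back to the FP8 values
```

The key property is that the original scales are kept alongside the dequantized weights, so saving is a lossless inverse of loading (up to FP8 rounding in the real tensor case).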
@mergify mergify bot added the ci-failure label Mar 30, 2026