Fix trust_remote_code and gradient checkpointing for custom models#696
Conversation
- Add trust_remote_code=True to all AutoConfig/AutoTokenizer.from_pretrained() calls
- Add torchrun path resolution (shutil.which with sys.executable fallback)
- Pass trust_remote_code=True to base_model_args and VLM helper functions

This fixes training failures for models like Nemotron that use remote code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Wrap gradient_checkpointing_enable() in try/except to handle models like NemotronH that don't support gradient checkpointing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
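The guard this commit describes can be sketched as follows. This is an illustrative helper, not the exact code in model.py; the exception set matches what the review later converged on (ValueError, NotImplementedError, AttributeError), and the dummy model in the usage note stands in for a real transformers model.

```python
import logging

logger = logging.getLogger(__name__)


def safe_enable_gradient_checkpointing(model):
    """Try to enable gradient checkpointing; return True on success.

    Models such as NemotronH raise ValueError here because their hybrid
    Mamba/MoE/Attention architecture does not support checkpointing, so
    the failure is logged and training continues without it.
    """
    try:
        model.gradient_checkpointing_enable()
        return True
    except (ValueError, NotImplementedError, AttributeError) as e:
        logger.warning("Gradient checkpointing unavailable: %s", e)
        return False
```

Calling this on a model whose `gradient_checkpointing_enable()` raises ValueError returns False instead of crashing the training run.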
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/instructlab/training/main_ds.py (1)
632-644: ⚠️ Potential issue | 🟠 Major

Validate torchrun fallback before command construction.

At line 636, the fallback path can be non-existent or non-executable. That will fail at subprocess launch time. Prefer a guaranteed fallback (`python -m torch.distributed.run`) instead of assuming a sibling `torchrun` script exists.

🛠️ Proposed robust fallback
```diff
-    # Find torchrun executable - try PATH first, then sys.executable's bin dir
-    torchrun_path = shutil.which("torchrun")
-    if not torchrun_path:
-        # Fall back to same directory as current Python executable
-        torchrun_path = os.path.join(os.path.dirname(sys.executable), "torchrun")
-
-    command = [
-        torchrun_path,
+    # Find torchrun executable. If unavailable, fall back to module invocation.
+    torchrun_path = shutil.which("torchrun")
+    if torchrun_path:
+        command = [torchrun_path]
+    else:
+        command = [sys.executable, "-m", "torch.distributed.run"]
+
+    command += [
         f"--nproc-per-node={torch_args.nproc_per_node}",
         f"--nnodes={torch_args.nnodes}",
         f"--node-rank={torch_args.node_rank}",
         f"--rdzv-id={torch_args.rdzv_id}",
     ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/instructlab/training/main_ds.py` around lines 632 - 644, The fallback logic for locating torchrun (torchrun_path) can produce a non-existent or non-executable path; instead validate the found path and, if it's missing or not executable, use a guaranteed fallback by invoking the Python module runner (e.g., "sys.executable -m torch.distributed.run") when building the command list. Update the logic around torchrun_path and command construction: check shutil.which("torchrun") and os.access(torchrun_path, os.X_OK), and if invalid, set the command to start with sys.executable and "-m", "torch.distributed.run" (while still appending the torch_args flags like f"--nproc-per-node={torch_args.nproc_per_node}", f"--nnodes={torch_args.nnodes}", f"--node-rank={torch_args.node_rank}", f"--rdzv-id={torch_args.rdzv_id}") so subprocess launches reliably.
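A runnable sketch of the validated fallback this comment asks for; the function name is illustrative, and `shutil.which` plus `os.access(..., os.X_OK)` implement the executability check before falling back to the module runner:

```python
import os
import shutil
import sys


def resolve_torchrun_command():
    """Return the argv prefix for launching distributed training.

    Prefer a torchrun executable found on PATH; if it is missing or not
    executable, fall back to `python -m torch.distributed.run`, which is
    available wherever torch itself is installed.
    """
    torchrun_path = shutil.which("torchrun")
    if torchrun_path and os.access(torchrun_path, os.X_OK):
        return [torchrun_path]
    return [sys.executable, "-m", "torch.distributed.run"]
```

The torch_args flags (`--nproc-per-node=...`, `--nnodes=...`, etc.) are then appended to whichever prefix is returned, so subprocess launch succeeds in both cases.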
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/instructlab/training/data_process.py`:
- Around line 1300-1303: The tokenizer is inconsistently loaded:
configure_tokenizer() uses AutoTokenizer.from_pretrained(model_path,
trust_remote_code=True) but process_documents_for_pretraining() calls
AutoTokenizer.from_pretrained(model_path) without trust_remote_code=True; update
the call inside process_documents_for_pretraining to pass trust_remote_code=True
(mirror configure_tokenizer) so custom tokenizers that require remote code will
load correctly, and ensure any other direct AutoTokenizer.from_pretrained usages
in that function use the same argument.
- Around line 48-53: The try/except around AutoConfig.from_pretrained in
data_process.py should stop catching all Exceptions and instead catch and handle
specific known failures (e.g., OSError, ValueError, HFValidationError,
JSONDecodeError) while logging the error so failures aren’t silently treated as
“not GPT-OSS”; update the block around
AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True) to catch
those specific exception types, call the module logger (or processLogger) with
the exception details, return False for those handled cases, and let any other
unexpected exceptions propagate so unmask_message_content and unmask_messages
aren’t misled by silent failures.
In `@src/instructlab/training/model.py`:
- Around line 607-611: The unguarded gradient_checkpointing_enable() call in
LigerModel can crash for models that don't support it and the current except
only catches ValueError; create a small shared helper (e.g.,
safe_enable_gradient_checkpointing or _try_enable_gradient_checkpointing) that
calls model.gradient_checkpointing_enable() inside a try/except catching
ValueError, NotImplementedError, and AttributeError and logs a warning; replace
the direct calls in both LigerModel and CausalLMModel with this helper so both
use identical, robust error handling.
In `@src/instructlab/training/tokenizer_utils.py`:
- Around line 98-100: The call to AutoTokenizer.from_pretrained in
tokenizer_utils.py uses the invalid keyword fast_tokenizer; replace that keyword
with use_fast so the call becomes
AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True,
trust_remote_code=True) (locate the tokenizer =
AutoTokenizer.from_pretrained(...) invocation to update the parameter name).
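The tokenizer-consistency finding above could be addressed with a single wrapper. This is a sketch, not the code in data_process.py; the `loader` parameter is an illustrative injection point standing in for `AutoTokenizer.from_pretrained`, which keeps the example self-contained when transformers is not installed.

```python
def load_tokenizer(model_path, loader=None, trust_remote_code=True):
    """Load a tokenizer, always forwarding trust_remote_code.

    Routing configure_tokenizer() and process_documents_for_pretraining()
    through one helper keeps the flag consistent at every call site.
    """
    if loader is None:
        # Deferred import: only needed when no loader is injected.
        from transformers import AutoTokenizer
        loader = AutoTokenizer.from_pretrained
    return loader(model_path, trust_remote_code=trust_remote_code)
```

With a helper like this, a call site cannot silently drop the flag the way the second `from_pretrained` call did.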
---
Outside diff comments:
In `@src/instructlab/training/main_ds.py`:
- Around line 632-644: The fallback logic for locating torchrun (torchrun_path)
can produce a non-existent or non-executable path; instead validate the found
path and, if it's missing or not executable, use a guaranteed fallback by
invoking the Python module runner (e.g., "sys.executable -m
torch.distributed.run") when building the command list. Update the logic around
torchrun_path and command construction: check shutil.which("torchrun") and
os.access(torchrun_path, os.X_OK), and if invalid, set the command to start with
sys.executable and "-m", "torch.distributed.run" (while still appending the
torch_args flags like f"--nproc-per-node={torch_args.nproc_per_node}",
f"--nnodes={torch_args.nnodes}", f"--node-rank={torch_args.node_rank}",
f"--rdzv-id={torch_args.rdzv_id}") so subprocess launches reliably.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 153c56c3-511f-48d8-994b-a96a507f4c42
📒 Files selected for processing (5)
- src/instructlab/training/data_process.py
- src/instructlab/training/gpt_oss_utils_correct.py
- src/instructlab/training/main_ds.py
- src/instructlab/training/model.py
- src/instructlab/training/tokenizer_utils.py
- Narrow exception handling in is_gpt_oss_model to catch specific exceptions (OSError, ValueError) instead of bare Exception, and log the failure details for debugging
- Add trust_remote_code=True to process_documents_for_pretraining() tokenizer loading for consistency with configure_tokenizer()
- Replace invalid fast_tokenizer kwarg with use_fast in tokenizer_utils.py setup_tokenizer()
- Create shared _enable_gradient_checkpointing_if_supported() helper on Model base class, catching ValueError, NotImplementedError, and AttributeError; use it in both LigerModel and CausalLMModel
- Improve torchrun fallback to use sys.executable -m torch.distributed.run instead of assuming a sibling script exists
- Fix ruff formatting for AutoConfig.from_pretrained call

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
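The narrowed is_gpt_oss_model handling described in the commit above might look roughly like this. `load_config` is an injectable stand-in for `AutoConfig.from_pretrained` so the sketch stays runnable without transformers; the real function takes only the model path.

```python
import logging

logger = logging.getLogger(__name__)


def is_gpt_oss_model(model_name_or_path, load_config):
    """Detect GPT-OSS models, treating only known load failures as 'no'.

    OSError / ValueError (missing repo, malformed config, etc.) are
    logged and mapped to False; anything unexpected propagates instead
    of being silently swallowed by a bare `except Exception`.
    """
    try:
        config = load_config(model_name_or_path, trust_remote_code=True)
    except (OSError, ValueError) as e:
        logger.warning("Config load failed for %s: %s", model_name_or_path, e)
        return False
    return getattr(config, "model_type", None) == "gpt_oss"
```

The key difference from the original: a RuntimeError (or any other unexpected failure) now surfaces to the caller rather than masquerading as "not GPT-OSS".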
Update test assertions to expect trust_remote_code=True parameter in AutoTokenizer.from_pretrained calls after adding this parameter to process_documents_for_pretraining.
Instead of hardcoding trust_remote_code=True everywhere:

1. Add trust_remote_code field to TrainingArgs (default: False)
2. Add --trust_remote_code argparse flag to subprocess CLI
3. Support TRUST_REMOTE_CODE=1 environment variable
4. Thread the setting through Model, tokenizer, and config calls
5. Remove torchrun fallback: error if torchrun is not found
6. Remove unnecessary try/except in is_gpt_oss_model
7. Remove redundant use_fast=True from tokenizer_utils

The env var is exported by main() when the flag is set, so downstream calls (data_process, tokenizer_utils, gpt_oss_utils) automatically pick it up without needing explicit parameter threading.
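The env-var plumbing described in steps 3 and 4 can be sketched as follows; the function names are illustrative, but the mechanism (main() exports, downstream modules read) is the one the commit describes.

```python
import os

TRUST_ENV = "TRUST_REMOTE_CODE"


def export_trust_remote_code(enabled):
    """Producer side (main): export the flag so subprocesses inherit it."""
    if enabled:
        os.environ[TRUST_ENV] = "1"


def trust_remote_code_enabled():
    """Consumer side (data_process, tokenizer_utils, gpt_oss_utils)."""
    return os.environ.get(TRUST_ENV) == "1"
```

Because `os.environ` is inherited by child processes, the torchrun-launched workers see the same setting without any extra CLI plumbing.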
Add trust_remote_code to the TrainingArgs table and document the TRUST_REMOTE_CODE environment variable in the environment variables section.
NemotronH has Mamba layers just like GraniteMoeHybrid and needs the same _use_local_mamba_kernels() call to avoid causal_conv1d_cuda import failures in torchrun subprocesses.
Force-pushed 482b3fd to fa53865
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Line 247: Update the README entries for the trust_remote_code option to be a
complete sentence and add a brief security caution: clearly state that
trust_remote_code controls whether repository-provided code from HuggingFace Hub
is executed when loading models and tokenizers (e.g., required for models like
Nemotron, Mistral, Qwen3.5), note it can be enabled via the TRUST_REMOTE_CODE=1
environment variable, that it defaults to False, and add one short warning
sentence that enabling it will execute remote repository code and may pose
security risks so only enable for trusted sources (apply the same edit to the
duplicate occurrence of trust_remote_code).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ad51800c-1673-4ab3-9913-5d5a66d77d90
📒 Files selected for processing (9)
- README.md
- src/instructlab/training/config.py
- src/instructlab/training/data_process.py
- src/instructlab/training/gpt_oss_utils_correct.py
- src/instructlab/training/main_ds.py
- src/instructlab/training/model.py
- src/instructlab/training/tokenizer_utils.py
- tests/unit/test_pretraining_data_process.py
- tests/unit/test_pretraining_mode.py
✅ Files skipped from review due to trivial changes (2)
- tests/unit/test_pretraining_mode.py
- tests/unit/test_pretraining_data_process.py
🚧 Files skipped from review as they are similar to previous changes (5)
- src/instructlab/training/gpt_oss_utils_correct.py
- src/instructlab/training/tokenizer_utils.py
- src/instructlab/training/data_process.py
- src/instructlab/training/model.py
- src/instructlab/training/main_ds.py
…ormat The torchrun-not-found issue was caused by the venv not being activated, not an installation problem. Revert to plain 'torchrun' command. Remove now-unused shutil and sys imports. Run ruff format on all modified files.
Force-pushed fa6c212 to 981ec4a
Ministral-3-3B ships with FP8 quantized weights that include scalar parameters (weight_scale_inv, activation_scale) which FSDP rejects. This change dequantizes FP8 weights to bf16 after VLM extraction for training compatibility, preserves the original scales, and requantizes back to FP8 at checkpoint save time so saved checkpoints match the original FP8 format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
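The round trip described above can be illustrated with a per-tensor scale. This is a deliberately simplified sketch on plain Python lists: the actual change operates on torch tensors and also preserves the weight_scale_inv and activation_scale parameters the commit mentions.

```python
def dequantize(quantized, scale):
    """Recover higher-precision weights: value = quantized * scale.

    Stands in for the FP8 -> bf16 step done after VLM extraction so
    FSDP never sees scalar scale parameters alongside the weights.
    """
    return [q * scale for q in quantized]


def requantize(weights, scale):
    """Inverse at checkpoint-save time, reusing the preserved scale,
    so saved checkpoints match the original FP8 layout."""
    return [w / scale for w in weights]
```

Because the original scale is preserved and reused, the dequantize/requantize pair is lossless up to floating-point rounding.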
Summary
Fixes model loading and training issues for models with custom code (Nemotron, Ministral, Qwen3.5, etc.) that require trust_remote_code=True. This is part of the v0.7.0 model validation effort.
Changes
1. Add trust_remote_code=True to all AutoConfig calls

Files modified:
- src/instructlab/training/data_process.py
- src/instructlab/training/main_ds.py
- src/instructlab/training/tokenizer_utils.py
- src/instructlab/training/gpt_oss_utils_correct.py

Why: Models like Nemotron, Ministral, and Qwen3.5 have custom modeling code that requires trust_remote_code=True. Without this flag, config loading fails with KeyError or architecture detection errors.

Example Error Fixed:
2. Fix torchrun path resolution in main_ds.py

What: Use shutil.which("torchrun") with fallback to the sys.executable directory
Why: Subprocess calls couldn't find torchrun in PATH. Using shutil.which() provides robust path resolution.
3. Handle models without gradient checkpointing support in model.py

What: Wrap gradient_checkpointing_enable() in try/except
Why: Models like NemotronH raise ValueError when gradient checkpointing is enabled, as their hybrid Mamba/MoE/Attention architecture doesn't support it.
4. Add trust_remote_code=True to base_model_args in model.py

What: Include in default model loading arguments
Why: Ensures all model loading (including quantized models) respects custom code requirements.
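Item 4 amounts to threading the flag into the default loading kwargs. The dict below is an illustrative shape, not the exact fields of base_model_args in model.py:

```python
def build_base_model_args(model_path, trust_remote_code=True):
    """Default kwargs later splatted into from_pretrained(**args).

    Keeping trust_remote_code here means every load path, including the
    quantized-model branches, inherits it without per-site edits.
    """
    return {
        "pretrained_model_name_or_path": model_path,
        "trust_remote_code": trust_remote_code,
    }
```

A load site would then call something like `AutoModelForCausalLM.from_pretrained(**build_base_model_args(path))`, so adding a new load path cannot forget the flag.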
Testing
Validated with models:
Related PRs
Summary by CodeRabbit
New Features
Bug Fixes
Improvements
Documentation
Tests