Conversation
When `allowed_hardware` is specified, always try the first entry (expected to be cheapest) instead of choosing randomly. This makes GPU selection deterministic and cost-optimal — the caller orders the list by preference, and on failure the scheduler removes the failed entry so the next cycle naturally falls through to the next option. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow `requires_vram_gb` to be `None` (meaning "don't filter by VRAM; rely on `allowed_hardware` instead"). This is useful when the caller specifies explicit GPU tiers via `allowed_hardware` and doesn't want the VRAM heuristic to interfere. Changes:
- `Job.requires_vram_gb` and `Jobs.requires_vram_gb` typed as `int | None`
- Org manager sorts and computes max VRAM with an `or 0` fallback
- Worker filters jobs with `or 0` so `None` doesn't crash the comparison

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
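An illustrative sketch of the `or 0` fallback in all three places (the job shape is assumed; the real models are SQL-backed):

```python
# Jobs where requires_vram_gb may be None ("don't filter by VRAM").
jobs = [
    {"id": 1, "requires_vram_gb": 48},
    {"id": 2, "requires_vram_gb": None},
    {"id": 3, "requires_vram_gb": 24},
]

# Sort descending; None falls back to 0 and therefore sorts last.
jobs.sort(key=lambda j: j["requires_vram_gb"] or 0, reverse=True)

# Max VRAM across jobs; `or 0` prevents a TypeError on None.
max_vram = max((j["requires_vram_gb"] or 0) for j in jobs)

# Worker-side filter: a job with None VRAM fits any worker.
worker_vram = 24
eligible = [j for j in jobs if (j["requires_vram_gb"] or 0) <= worker_vram]
```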
Tests verify that determine_gpu_type always picks the first entry in allowed_hardware (not random), parses multi-GPU configs correctly, and falls through to VRAM-based logic when allowed_hardware is None/empty. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests verify that `None` VRAM values are handled correctly in:
- Job sorting (treated as 0, sorted last)
- Max VRAM computation (`None` → 0, no crash)
- Worker hardware filtering (`None` fits any worker)

All tests are pure-Python logic checks; no DB or RunPod needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `rl` module is imported in jobs/__init__.py but doesn't exist on disk, causing an ImportError when the package is loaded. Remove it from both the import statement and __all__. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The health-check thread can set `self.current_process = None` when cancelling a job. If this happens while the main thread is in the log-streaming loop or calling `.wait()`, it raises:

    AttributeError: 'NoneType' object has no attribute 'wait'

Fix: capture `proc = self.current_process` before the loop, so the local reference remains valid regardless of what the health-check thread does to `self.current_process`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three fixes for compatibility with newer library versions:
1. Remove the `tokenizer=` kwarg from `WeightedSFTTrainer`: newer Unsloth patches `Trainer.__init__` and captures the tokenizer via the data collator instead.
2. Handle the `BatchEncoding` dict return from `apply_chat_template`: transformers 5.x returns a `BatchEncoding` with `.input_ids` instead of a plain tensor; extract `input_ids` when present.
3. Compute block length via token-count difference instead of text reconstruction: the old `find_end_of_block` approach fails when the tokenizer splits multi-byte UTF-8 characters across token boundaries, producing U+FFFD replacement characters that don't match the original text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep 3 core tests that verify the actual behavior change (first-entry selection, multi-GPU parsing, None fallback). Remove 4 tests that were either redundant (same logic with different data) or trivially covered. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep 5 core tests that verify None handling in sort, max, and worker filtering. Remove 8 tests that either tested Python builtins (max on ints, integer comparison) or were redundant edge cases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep 2 AST checks (rl not in imports, rl not in __all__) that directly verify the fix. Remove filesystem existence check and unrelated module presence check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep 3 tests that exercise the actual race condition pattern (mid-loop null, crash without fix, threaded scenario). Remove test_local_ref_survives_null which only tests that Python local variables survive reassignment of the original. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep 5 tests that verify the actual code fixes (tokenizer kwarg removal, BatchEncoding handling, len-difference approach). Remove 4 tests: basic arithmetic checks (15-10=5, 5-5=0, 11-10=1) and a redundant AST check for find_end_of_block that is already covered by the len-difference test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: weighted SFT compat with newer Unsloth and transformers 5.x
fix: capture process ref before health-check thread can null it
fix: remove broken rl module import from jobs/__init__.py
fix: support requires_vram_gb=None by treating it as 0
fix: pick first allowed_hardware instead of random choice
…structure
- Dockerfile: add llm_blender, `--no-deps` mergekit, upgrade TRL 0.24→1.0 and vLLM 0.11.2→latest for transformers 5.x compatibility
- training.py: defer DPO/ORPO imports to avoid the pydantic 2.12 + torch.Tensor schema-generation error on SFT jobs
- orpo_ft.py: fall back to importing from trl.experimental.orpo for TRL 1.0
- decorators.py: add an `extra_exceptions` param to `openai_retry`
- temporary_api.py: retry on NotFoundError/BadRequestError during vLLM warmup
- test_integration.py: sequential cookbook runner with fail-fast, resume support (`--skip-until-cookbook`), job detection via DB diff with a subprocess-log fallback, and run-log fetching

All 13 cookbook examples verified passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
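The deferred import with the TRL 1.0 fallback presumably follows a pattern like this (the wrapper name and call site are assumed; the import paths come from the commit message):

```python
def make_orpo_trainer(*args, **kwargs):
    # Import deferred into the function body so merely importing this
    # module (e.g. for an SFT job) never touches the ORPO stack and
    # never triggers the pydantic 2.12 + torch.Tensor schema error.
    try:
        from trl import ORPOTrainer  # pre-1.0 location
    except ImportError:
        from trl.experimental.orpo import ORPOTrainer  # TRL 1.0 moved it
    return ORPOTrainer(*args, **kwargs)
```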
- Dockerfile: use the unsloth/unsloth:2026.3.17-pt2.9.0-vllm-0.16.0 base image with vLLM pre-installed; upgrade transformers to 5.3.0 and TRL to 1.0.0 with `--no-deps`
- jobs.py: update `base_image` to the new image tag
- utils.py: unwrap Qwen3VLProcessor to get the underlying tokenizer (unsloth returns a processor for some models, which lacks `pad()`)
- chat_template_spans.py: use the underlying tokenizer for `apply_chat_template` to avoid a ProcessorMixin bug in transformers 5.2-5.3; add `message_index` to EOS blocks from `apply_eos_token_rule`
- sft.py: `tokenizer` → `processing_class` for `Trainer()` (transformers 5.x API)
- CLAUDE.md: add a RunPod API safety rule

All 13 cookbook examples verified passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove the pinned `base_image` from both examples (v0.8/v0.7 no longer needed)
- sampling_callback.py: work around unsloth #3538 (device corruption) by using `model.eval()` instead of `FastLanguageModel.for_inference()`, fixing `_per_layer_device_index` on all layers, and passing `use_cache=False` to bypass unsloth's buggy fast-inference path
- chat_template_spans.py: use `.get()` for the weight/role keys, since blocks from the logprob path don't have weights
- test_integration.py: add both examples back to the COOKBOOK_EXAMPLES list

All 15 cookbook examples now passing (13 original + 2 previously skipped).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>