V0.9 - qwen3.5, gemma, olmo support #55

Open
nielsrolf wants to merge 25 commits into main from v0.9

Conversation

@nielsrolf
Collaborator

No description provided.

nielsrolf and others added 25 commits March 11, 2026 09:43
When `allowed_hardware` is specified, always try the first entry
(expected to be cheapest) instead of choosing randomly. This makes
GPU selection deterministic and cost-optimal — the caller orders the
list by preference, and on failure the scheduler removes the failed
entry so the next cycle naturally falls through to the next option.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
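The deterministic selection described above can be sketched as follows. This is a minimal illustration, not the repo's actual scheduler: the function name `determine_gpu_type` comes from the later test-commit message, but the signature, the GPU tier names in the fallback, and the VRAM threshold are all assumptions.

```python
# Sketch of first-entry hardware selection. The fallback tiers and the
# 24 GB threshold are illustrative placeholders, not the real values.
def determine_gpu_type(allowed_hardware, requires_vram_gb=None):
    """Pick a GPU config deterministically: first allowed entry wins."""
    if allowed_hardware:
        # The caller orders the list cheapest-first; on failure the
        # scheduler removes the failed entry, so the next cycle naturally
        # falls through to the next option.
        return allowed_hardware[0]
    # No explicit list: fall back to a VRAM-based heuristic.
    vram = requires_vram_gb or 0
    return "A100" if vram > 24 else "RTX4090"
```

With this shape, retrying after a failed provision is just "drop the head of the list and call again"; no random state is involved.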
Allow `requires_vram_gb` to be `None` (meaning "don't filter by VRAM,
rely on allowed_hardware instead"). This is useful when the caller
specifies explicit GPU tiers via `allowed_hardware` and doesn't want
the VRAM heuristic to interfere.

Changes:
- `Job.requires_vram_gb` and `Jobs.requires_vram_gb` typed as `int | None`
- Org manager sorts and computes max VRAM with `or 0` fallback
- Worker filters jobs with `or 0` so None doesn't crash the comparison

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
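The three `or 0` fallbacks listed above can be sketched like this. Job shapes are plain dicts here for illustration; the real code presumably operates on ORM/Pydantic models, so field access will differ.

```python
# Sketch of the `or 0` pattern so requires_vram_gb=None never crashes
# a comparison. Dict-based jobs are an assumption for illustration.
def sort_jobs(jobs):
    # None is treated as 0, so VRAM-agnostic jobs sort last (descending).
    return sorted(jobs, key=lambda j: j["requires_vram_gb"] or 0, reverse=True)

def max_vram(jobs):
    # None -> 0, so max() never compares None against an int.
    return max((j["requires_vram_gb"] or 0 for j in jobs), default=0)

def fits_worker(job, worker_vram_gb):
    # A job with requires_vram_gb=None fits any worker.
    return (job["requires_vram_gb"] or 0) <= worker_vram_gb
```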
Tests verify that determine_gpu_type always picks the first entry in
allowed_hardware (not random), parses multi-GPU configs correctly, and
falls through to VRAM-based logic when allowed_hardware is None/empty.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests verify that None VRAM values are handled correctly in:
- Job sorting (treated as 0, sorted last)
- Max VRAM computation (None → 0, no crash)
- Worker hardware filtering (None fits any worker)

All tests are pure-Python logic checks, no DB or RunPod needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `rl` module is imported in jobs/__init__.py but doesn't exist on
disk, causing an ImportError when the package is loaded. Remove it
from both the import statement and __all__.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The health-check thread can set self.current_process = None when
cancelling a job. If this happens while the main thread is in the
log-streaming loop or calling .wait(), it causes:
  AttributeError: 'NoneType' object has no attribute 'wait'

Fix: capture `proc = self.current_process` before the loop so the
local reference remains valid regardless of what the health-check
thread does to self.current_process.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
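The fix pattern is small enough to show in full. This is a stripped-down sketch, not the worker's real class: the only point is that the loop and `.wait()` go through a local `proc` reference, so another thread setting `self.current_process = None` cannot trigger the `AttributeError`.

```python
import subprocess

class Worker:
    def __init__(self):
        self.current_process = None

    def run(self, cmd):
        self.current_process = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, text=True
        )
        # Capture a local reference BEFORE the loop: the health-check
        # thread may null self.current_process at any time, but the
        # local stays valid.
        proc = self.current_process
        for line in proc.stdout:   # safe even if self.current_process
            print(line, end="")    # becomes None mid-loop
        return proc.wait()         # safe for the same reason
```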
Three fixes for compatibility with newer library versions:

1. Remove tokenizer= kwarg from WeightedSFTTrainer — newer Unsloth
   patches Trainer.__init__ and captures the tokenizer via the data
   collator instead.

2. Handle BatchEncoding dict return from apply_chat_template —
   transformers 5.x returns a BatchEncoding with .input_ids instead
   of a plain tensor; extract input_ids when present.

3. Compute block length via token-count difference instead of text
   reconstruction — the old find_end_of_block approach fails when
   the tokenizer splits multi-byte UTF-8 characters across token
   boundaries, producing U+FFFD replacement characters that don't
   match the original text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
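Fixes 2 and 3 can be sketched together. These helpers are illustrative, not the repo's code: `tokenize` stands in for whatever tokenizer call the training path uses, and the attribute probe is a generic way to accept either return type from `apply_chat_template`.

```python
def extract_input_ids(result):
    # transformers 5.x may return a BatchEncoding-like object exposing
    # .input_ids; older versions return the ids directly. Accept both.
    return result.input_ids if hasattr(result, "input_ids") else result

def block_token_length(tokenize, prefix, prefix_plus_block):
    # Measure the block as a token-count difference instead of decoding
    # and re-matching text. Decoding breaks when multi-byte UTF-8 chars
    # are split across token boundaries (U+FFFD replacement characters
    # don't match the original text); counting tokens sidesteps that.
    return len(tokenize(prefix_plus_block)) - len(tokenize(prefix))
```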
Keep 3 core tests that verify the actual behavior change (first-entry
selection, multi-GPU parsing, None fallback). Remove 4 tests that were
either redundant (same logic with different data) or trivially covered.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep 5 core tests that verify None handling in sort, max, and worker
filtering. Remove 8 tests that either tested Python builtins (max on
ints, integer comparison) or were redundant edge cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep 2 AST checks (rl not in imports, rl not in __all__) that directly
verify the fix. Remove filesystem existence check and unrelated module
presence check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep 3 tests that exercise the actual race condition pattern (mid-loop
null, crash without fix, threaded scenario). Remove test_local_ref_survives_null
which only tests that Python local variables survive reassignment of the original.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep 5 tests that verify the actual code fixes (tokenizer kwarg removal,
BatchEncoding handling, len-difference approach). Remove 4 tests: basic
arithmetic checks (15-10=5, 5-5=0, 11-10=1) and a redundant AST check
for find_end_of_block that is already covered by the len-difference test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: weighted SFT compat with newer Unsloth and transformers 5.x
fix: capture process ref before health-check thread can null it
fix: remove broken rl module import from jobs/__init__.py
fix: support requires_vram_gb=None by treating it as 0
fix: pick first allowed_hardware instead of random choice
…structure

- Dockerfile: add llm_blender, --no-deps mergekit, upgrade TRL 0.24→1.0
  and vLLM 0.11.2→latest for transformers 5.x compatibility
- training.py: defer DPO/ORPO imports to avoid pydantic 2.12 + torch.Tensor
  schema generation error on SFT jobs
- orpo_ft.py: fallback import from trl.experimental.orpo for TRL 1.0
- decorators.py: add extra_exceptions param to openai_retry
- temporary_api.py: retry on NotFoundError/BadRequestError during vLLM warmup
- test_integration.py: sequential cookbook runner with fail-fast, resume
  support (--skip-until-cookbook), job detection via DB diff and subprocess
  log fallback, and run log fetching

All 13 cookbook examples verified passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Dockerfile: use unsloth/unsloth:2026.3.17-pt2.9.0-vllm-0.16.0 base
  image with vLLM pre-installed, upgrade transformers to 5.3.0 and
  TRL to 1.0.0 with --no-deps
- jobs.py: update base_image to new image tag
- utils.py: unwrap Qwen3VLProcessor to get underlying tokenizer
  (unsloth returns processor for some models, which lacks pad())
- chat_template_spans.py: use underlying tokenizer for apply_chat_template
  to avoid ProcessorMixin bug in transformers 5.2-5.3; add message_index
  to EOS blocks from apply_eos_token_rule
- sft.py: tokenizer→processing_class for Trainer() (transformers 5.x API)
- CLAUDE.md: add RunPod API safety rule

All 13 cookbook examples verified passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
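The `tokenizer` → `processing_class` rename mentioned above can be handled defensively when supporting multiple transformers versions. This is a generic version-probing sketch, not the repo's code; it inspects the trainer's signature rather than hard-coding a version check.

```python
import inspect

def trainer_kwargs(trainer_cls, tokenizer):
    # Newer transformers Trainer takes `processing_class`; older
    # versions took `tokenizer`. Pick whichever the class accepts.
    params = inspect.signature(trainer_cls.__init__).parameters
    key = "processing_class" if "processing_class" in params else "tokenizer"
    return {key: tokenizer}
```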
- Remove pinned base_image from both examples (v0.8/v0.7 no longer needed)
- sampling_callback.py: work around unsloth #3538 (device corruption) by
  using model.eval() instead of FastLanguageModel.for_inference(), fixing
  _per_layer_device_index on all layers, and using use_cache=False to
  bypass unsloth's buggy fast inference path
- chat_template_spans.py: use .get() for weight/role keys since blocks
  from logprob path don't have weights
- test_integration.py: add both examples back to COOKBOOK_EXAMPLES list

All 15 cookbook examples now passing (13 original + 2 previously skipped).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
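The `.get()` change in `chat_template_spans.py` amounts to a defensive lookup. A minimal sketch, assuming dict-shaped blocks; the key names follow the commit message, but the defaults here are placeholders, not the repo's actual values.

```python
def block_weight(block):
    # Blocks from the logprob path carry no "weight"; default is assumed.
    return block.get("weight", 1.0)

def block_role(block):
    # Same for "role": absent keys return None instead of raising KeyError.
    return block.get("role")
```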
@nielsrolf nielsrolf changed the title V0.9 V0.9 - qwen3.5, gemma, olmo support Apr 2, 2026