feat(inference): support merging multiple LoRA adapters before vLLM inference #57
Open
Manuscrit wants to merge 1 commit into longtermrisk:v0.9
When `lora_adapters` (List[str]) is supplied, the job merges all adapters into a single combined adapter via a PEFT linear combination on CPU before vLLM is initialised. This keeps the merged rank identical to the input rank, so vLLM's `max_lora_rank` constraint is never violated.

Key changes:

- `InferenceConfig`: new `lora_adapters` field; validated to require ≥ 2 entries (a single adapter stays in `model` as before, preserving compatibility).
- `InferenceJobs.create()`: client-side rank-equality assertion across all adapters, with a clear error before any GPU time is spent.
- `cli.py`: new `download_adapter()` helper (handles org/repo/subfolder paths); new `merge_lora_adapters()` runs PEFT `add_weighted_adapter` (`combination_type="linear"`) on CPU, saves the combined adapter to `/tmp/merged_lora/`, then frees memory before vLLM loads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>