feat(inference): support merging multiple LoRA adapters before vLLM i…#57

Open
Manuscrit wants to merge 1 commit into longtermrisk:v0.9 from slacki-ai:feature/multi-lora-inference

Conversation

@Manuscrit
Collaborator

…nference

When `lora_adapters` (List[str]) is supplied, the job merges all adapters
into a single combined adapter via PEFT linear combination on CPU before
vLLM is initialised. This keeps the merged rank identical to the input
rank so vLLM's max_lora_rank constraint is never violated.

Key changes:
- `InferenceConfig`: new `lora_adapters` field; validated to require ≥ 2
  entries (single adapter stays in `model` as before, preserving compat).
- `InferenceJobs.create()`: client-side rank-equality assertion across all
  adapters, with a clear error before any GPU time is spent.
- `cli.py`: new `download_adapter()` helper (handles org/repo/subfolder
  paths); new `merge_lora_adapters()` runs PEFT `add_weighted_adapter`
  (combination_type="linear") on CPU, saves the combined adapter to
  /tmp/merged_lora/, then frees memory before vLLM loads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
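Conceptually, merging same-rank adapters with a linear combination targets an effective weight update close to the weighted sum of the per-adapter deltas, ΔW ≈ Σᵢ wᵢ·Bᵢ·Aᵢ, without growing the rank. A toy illustration of that weighted sum using plain Python lists (no PEFT dependency; `matmul` and `merged_delta` are illustrative names, not the PR's helpers):

```python
def matmul(B, A):
    """Multiply rank-r factors B (d_out x r) and A (r x d_in) into a full delta matrix."""
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def merged_delta(adapters, weights):
    """Weighted sum of per-adapter deltas: dW = sum_i w_i * (B_i @ A_i)."""
    deltas = [matmul(B, A) for B, A in adapters]
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [[sum(w * d[i][j] for w, d in zip(weights, deltas))
             for j in range(cols)] for i in range(rows)]
```

In the actual job this combination runs through PEFT's `add_weighted_adapter` on CPU, so the merged adapter is saved to disk and memory is freed before vLLM ever loads the model.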
