Implement prewarm for MLXLanguageModel #97
Conversation
noorbhatia commented on Jan 29, 2026
- Implements prewarm() for MLXLanguageModel, improving first-response time.
- Prewarms the model with instructions, tools, and promptPrefix; a usage sketch follows.
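A minimal usage sketch, assuming a FoundationModels-style session API around this model. The initializer arguments, model identifier, and session type here are illustrative assumptions, not code from this PR:

```swift
// Illustrative only: the initializer and session API are assumptions.
let model = MLXLanguageModel(modelId: "mlx-community/Llama-3.2-1B-Instruct-4bit")
let session = LanguageModelSession(model: model, instructions: "Answer concisely.")

// Fire-and-forget: loads the ModelContext and primes tokenization so the
// first respond() call doesn't pay the full warm-up cost.
session.prewarm(promptPrefix: "Summarize the following text:")
```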
Force-pushed from ede4b54 to 6d7cbd5
Force-pushed from 6d7cbd5 to f024896
Pull request overview

Implements prewarm(for:promptPrefix:) for MLXLanguageModel to reduce first-response latency by loading the model context and priming the MLX processor with session instructions, tools, and an optional prompt prefix.

Changes:
- Add MLXLanguageModel.prewarm(for:promptPrefix:) implementation.
- Prewarm loads/caches ModelContext and calls context.processor.prepare(input:) with a minimal chat + tool specs.
- Include session instructions and optional prompt prefix in the prewarm input.
```swift
Task {
    let context = try await loadContext(modelId: modelId, hub: hub, directory: directory)

    // Build chat history similar to respond() to prime the cache effectively
```
Copilot AI commented on Jan 29, 2026
Task { ... } inherits the caller’s actor context. If prewarm() is called from the main actor/UI, the model load + tokenization work inside this task can end up running on the main actor and cause UI hitches. Prefer running this as a detached/background task (e.g., Task.detached or explicitly hopping off the main actor) and consider setting an appropriate priority for prewarming work.
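A minimal sketch of that suggestion, assuming loadContext is a method on the model (hence the explicit self capture); the elided body is the same chat-building code quoted below:

```swift
// Sketch: Task.detached does not inherit the caller's actor, so a prewarm()
// call from the main actor won't run model loading on the UI thread.
Task.detached(priority: .utility) { [self] in
    do {
        let context = try await loadContext(modelId: modelId, hub: hub, directory: directory)
        // ... build the chat + tool specs and call context.processor.prepare(input:) ...
    } catch {
        // Fire-and-forget: prewarm failures are non-fatal; the model loads on demand.
    }
}
```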
```swift
let context = try await loadContext(modelId: modelId, hub: hub, directory: directory)

// Build chat history similar to respond() to prime the cache effectively
var chat: [MLXLMCommon.Chat.Message] = []

// Add system instructions if present
if let instructions, !instructions.isEmpty {
    chat.append(.init(role: .system, content: instructions))
}

// Add prompt prefix or minimal user message
let promptText = promptPrefix?.description ?? "."
chat.append(.init(role: .user, content: promptText))

// Convert tools to MLX format
let toolSpecs: [ToolSpec]? =
    tools.isEmpty
        ? nil
        : tools.map { convertToolToMLXSpec($0) }

let userInput = MLXLMCommon.UserInput(
    chat: chat,
    processing: .init(resize: .init(width: 512, height: 512)),
    tools: toolSpecs
)

// Prepare input - triggers tokenization and processor initialization
_ = try await context.processor.prepare(input: userInput)
```
Copilot AI commented on Jan 29, 2026
The Task body can throw (loadContext / prepare), but the error is never handled or surfaced. Since prewarm is fire-and-forget, it should catch and intentionally ignore (or otherwise report) failures so prewarm errors don’t get silently lost in a failed task.
Suggested change (wrap the prewarm body above in a do/catch so failures are handled explicitly):

```swift
do {
    let context = try await loadContext(modelId: modelId, hub: hub, directory: directory)

    // Build chat history similar to respond() to prime the cache effectively
    var chat: [MLXLMCommon.Chat.Message] = []

    // Add system instructions if present
    if let instructions, !instructions.isEmpty {
        chat.append(.init(role: .system, content: instructions))
    }

    // Add prompt prefix or minimal user message
    let promptText = promptPrefix?.description ?? "."
    chat.append(.init(role: .user, content: promptText))

    // Convert tools to MLX format
    let toolSpecs: [ToolSpec]? =
        tools.isEmpty
            ? nil
            : tools.map { convertToolToMLXSpec($0) }

    let userInput = MLXLMCommon.UserInput(
        chat: chat,
        processing: .init(resize: .init(width: 512, height: 512)),
        tools: toolSpecs
    )

    // Prepare input - triggers tokenization and processor initialization
    _ = try await context.processor.prepare(input: userInput)
} catch {
    // Intentionally ignore prewarm failures (model will be loaded on demand)
    // You may replace this with a more sophisticated logging mechanism if desired.
    print("MLXLanguageModel prewarm failed for modelId \(modelId): \(error)")
}
```
```swift
// Add prompt prefix or minimal user message
let promptText = promptPrefix?.description ?? "."
chat.append(.init(role: .user, content: promptText))
```
Unless "." has special significance in MLX, this makes me think that promptPrefix should be non-optional (and maybe non-empty?)
What do you think?
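For illustration, a non-optional variant might validate the prefix up front. This is a hypothetical sketch; the parameter type is simplified to String and the body is elided:

```swift
// Hypothetical sketch: require a real prefix instead of falling back to ".".
func prewarm(promptPrefix: String) {
    precondition(!promptPrefix.isEmpty, "promptPrefix must be non-empty")
    // ... append promptPrefix as the user message and prepare input, as in the diff ...
}
```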
```swift
let userInput = MLXLMCommon.UserInput(
    chat: chat,
    processing: .init(resize: .init(width: 512, height: 512)),
```
This seems like the kind of thing that we'd want to parameterize in the method, rather than hard-code.
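For example, the processing options could become a defaulted parameter. Hypothetical sketch; the Processing type name is inferred from the call above, and the body is elided:

```swift
// Hypothetical sketch: let callers override the image-processing options
// rather than hard-coding 512x512 inside prewarm.
func prewarm(
    promptPrefix: String? = nil,
    processing: MLXLMCommon.UserInput.Processing = .init(resize: .init(width: 512, height: 512))
) {
    // ... pass `processing` through to MLXLMCommon.UserInput(chat:processing:tools:) ...
}
```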
Thanks for opening this PR, @noorbhatia! I think this kind of functionality gets into the realm of KV cache management, which so far this implementation hasn't attempted to support. At a high level, I'd expect an API that has some concept of prewarming a common prefix of tokens, caching that, and then reusing it for various suffixes. Most likely, cache selection and management would be automatic; I'm not sure yet what controls we'd want to expose. Can you say more about how you understand the problem?
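For illustration only, the kind of prefix-reuse API described above might be shaped roughly like this. This is entirely hypothetical; none of these types exist in the implementation:

```swift
// Entirely hypothetical sketch of the prefix-caching idea described above.
protocol PrefixCaching {
    associatedtype PrefixHandle

    /// Tokenizes a common prefix once and caches its KV state.
    func prewarmPrefix(_ prefix: String) async throws -> PrefixHandle

    /// Generates from a cached prefix plus a per-request suffix,
    /// reusing the prefix's KV cache instead of re-encoding it.
    func respond(reusing handle: PrefixHandle, suffix: String) async throws -> String
}
```

Under that shape, cache selection could be keyed automatically off the tokenized prefix, matching the automatic-management point above.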