Problem
I am using a quantized version of Pixtral Large, and I can't load the vision modules of a smaller variant alongside it. As a result, I can only run inference on text, not on images.
I imagine this will be a much-needed feature, as multimodal inference is generally less performant than text-only inference.
Solution
Add a config option that enables this. I strongly suspect this is low-hanging fruit.
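To make the request concrete, here is a rough sketch of what such an option might look like. Everything in it is hypothetical: `ModelLoadConfig`, `vision_model_path`, and `resolve_component_sources` are names invented for illustration and are not part of any existing API. It only shows the intended behavior, assuming the vision modules could be sourced from a separate checkpoint.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelLoadConfig:
    """Hypothetical load options; not an existing API."""
    # Path to the quantized checkpoint used for the language model.
    model_path: str
    # Optional separate checkpoint to take the vision modules from.
    vision_model_path: Optional[str] = None


def resolve_component_sources(config: ModelLoadConfig) -> Dict[str, str]:
    """Return which checkpoint each component would be loaded from."""
    # Fall back to the main checkpoint when no override is given.
    vision_source = config.vision_model_path or config.model_path
    return {
        "language_model": config.model_path,
        "vision_modules": vision_source,
    }


# Placeholder paths, for illustration only.
config = ModelLoadConfig(
    model_path="path/to/quantized-checkpoint",
    vision_model_path="path/to/smaller-variant",
)
sources = resolve_component_sources(config)
```

With the override left unset, both components would resolve to the main checkpoint, so existing setups would be unaffected.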
Alternatives
No response
Explanation
I imagine this will be a much-needed feature, as multimodal inference is generally less performant than text-only inference.
Examples
No response
Additional context
No response
Acknowledgements