Describe the Issue
I have successfully used --usemmap to load a GGUF model about 110% the size of free/available RAM. However, loading a model about 3x the size of free RAM failed (that model consists of several GGUF files; does that matter for the problem below?).
Terminal output (numbers rounded):
done getting tensors: ... moved from CPU_REPACK, using CPU instead
ggml_aligned_malloc: insufficient memory (attempted to allocate 103 000 MB)
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 108 000 000 000
alloc_tensor_range: failed to allocate CPU_REPACK buffer of size 108 000 000 000
llama_model_load: error loading model: unable to allocate CPU_REPACK buffer
llama_model_load_from_file_impl: failed to load model
I used --usemmap, so why did the engine try to allocate roughly the total size of the GGUF? Could it be a bug? If not, does the need for such a huge allocation depend on the model architecture? Can some 120 GB models be loaded with 40 GB of free RAM while others cannot? If so, what does it depend on?
From the wiki (https://github.com/LostRuins/koboldcpp/wiki):
mmap, or memory-mapped file I/O, maps files or devices into memory. It is a method of reducing the amount of RAM needed for loading the model, as parts can be read from disk into RAM on demand. You can enable it with --usemmap
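To make clear what I expected from mmap, here is a minimal Python sketch of on-demand reading from a mapped file. It is not koboldcpp code: the helper names `make_demo_file` and `read_slice_via_mmap` and the 1 MiB demo file are made up for illustration. It only shows the general technique of mapping a file and touching a small part of it, as opposed to the explicit allocation that ggml_aligned_malloc attempts in the log above.

```python
import mmap
import os
import tempfile


def make_demo_file(size=1 << 20):
    """Create a temporary file of `size` bytes with a marker at offset 4096.

    Hypothetical stand-in for a model file; a real GGUF is far larger.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\x00" * size)
        f.seek(4096)
        f.write(b"tensor-data")
    return path


def read_slice_via_mmap(path, offset, length):
    """Return `length` bytes at `offset` through a read-only memory mapping.

    The file is mapped rather than read into a buffer up front; only the
    pages actually touched need to be brought in from disk by the OS.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return bytes(mapped[offset:offset + length])


demo_path = make_demo_file()
chunk = read_slice_via_mmap(demo_path, 4096, 11)
os.remove(demo_path)
print(chunk)
```

With a mapping like this, the file can be larger than free RAM because nothing is copied into a separately allocated buffer. My question is why the load path here ends up in an explicit allocation instead.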
Additional Information:
v1.112 Linux nocuda