
Update For Qwen3.5 Support#1356

Open
martindevans wants to merge 1 commit into SciSharp:master from martindevans:binary_update_qwen3.5

Conversation

@martindevans
Member

@martindevans martindevans commented Mar 14, 2026

  • Updated binaries to 73c9eb8ceda397b651dbb6661b2935f0283a2b1d (Qwen3.5 support)
  • Removed the deprecated native function `llama_adapter_lora_free` and the related managed method `LoraAdapter.Unload`

Testing:

  • Windows CUDA
  • Windows Vulkan
  • Linux CUDA
  • Linux Vulkan

@zsogitbe
Contributor

Great work on this, Martin! Thank you!

I’ve done some testing on Windows with the MtmdInteractiveModeExecute example and Qwen 3.5. Here are my findings:

  1. Strange Layer Distribution: The default settings don't seem to work correctly. I’m seeing a very odd distribution of layers across the CPU and two GPUs (instead of just one):
[llama Debug]: llama_kv_cache: layer   0: filtered
[llama Debug]: llama_kv_cache: layer   1: filtered
[llama Debug]: llama_kv_cache: layer   2: filtered
[llama Debug]: llama_kv_cache: layer   3: dev = CPU
[llama Debug]: llama_kv_cache: layer   4: filtered
[llama Debug]: llama_kv_cache: layer   5: filtered
[llama Debug]: llama_kv_cache: layer   6: filtered
[llama Debug]: llama_kv_cache: layer   7: dev = CPU
[llama Debug]: llama_kv_cache: layer   8: filtered
[llama Debug]: llama_kv_cache: layer   9: filtered
[llama Debug]: llama_kv_cache: layer  10: filtered
[llama Debug]: llama_kv_cache: layer  11: dev = CPU
[llama Debug]: llama_kv_cache: layer  12: filtered
[llama Debug]: llama_kv_cache: layer  13: filtered
[llama Debug]: llama_kv_cache: layer  14: filtered
[llama Debug]: llama_kv_cache: layer  15: dev = CUDA0
[llama Debug]: llama_kv_cache: layer  16: filtered
[llama Debug]: llama_kv_cache: layer  17: filtered
[llama Debug]: llama_kv_cache: layer  18: filtered
[llama Debug]: llama_kv_cache: layer  19: dev = CUDA0
[llama Debug]: llama_kv_cache: layer  20: filtered
[llama Debug]: llama_kv_cache: layer  21: filtered
[llama Debug]: llama_kv_cache: layer  22: filtered
[llama Debug]: llama_kv_cache: layer  23: dev = CUDA0
[llama Debug]: llama_kv_cache: layer  24: filtered
[llama Debug]: llama_kv_cache: layer  25: filtered
[llama Debug]: llama_kv_cache: layer  26: filtered
[llama Debug]: llama_kv_cache: layer  27: dev = CUDA0
[llama Debug]: llama_kv_cache: layer  28: filtered
[llama Debug]: llama_kv_cache: layer  29: filtered
[llama Debug]: llama_kv_cache: layer  30: filtered
[llama Debug]: llama_kv_cache: layer  31: dev = CUDA1
  2. Corrupted Output: With the default settings, the model produces "endless garbage thinking" (distorted output).
  3. CPU Issues: When forcing CPU-only mode, I get the same "garbage" output, and performance is very slow.
  4. GPU Success: When forcing GPU 0 only (n_gpu_layers = -1), the output is correct and works as expected.

So, it seems the GPU implementation is working fine, but there may be an issue with the CPU implementation or the layer partitioning logic.
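For reference, here is a minimal sketch of the GPU-only work-around described above. The model path is illustrative, and I'm assuming the usual LLamaSharp `ModelParams` properties (`GpuLayerCount` maps to `n_gpu_layers`, `MainGpu` selects the device); adjust to your actual setup.

```csharp
using LLama;
using LLama.Common;

// Hypothetical model path; substitute your own GGUF file.
var parameters = new ModelParams("Qwen3.5-model.gguf")
{
    // Work-around from the findings above: offload every layer to a
    // single GPU (-1 = all layers) instead of relying on the default
    // CPU/multi-GPU partitioning, which currently produces garbage output.
    GpuLayerCount = -1,
    MainGpu = 0,
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
```

With this configuration the odd CPU/CUDA0/CUDA1 layer split from the log above should disappear and all 28 KV-cache layers should land on CUDA0.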

@aropb

aropb commented Mar 15, 2026

@martindevans

GPU RTX 3090
Windows 11

Qwen3.5 model seems to be working:
https://huggingface.co/bartowski/Qwen_Qwen3.5-9B-GGUF
Qwen_Qwen3.5-9B-Q8_0.gguf
mmproj-Qwen_Qwen3.5-9B-bf16.gguf

Actually, no: I found a problem!

The same code works fine on 0.26.0 with Qwen3-Embedding-0.6B-F16.gguf:
Context = weights.CreateContext(modelParams, logger);

Error:
D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml.c:3214: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

https://github.com/ggml-org/llama.cpp/blob/ceef6b5233c3b31f454632c48fb42af16944bc5b/ggml/src/ggml.c#L3214

That assert is where it fails, but then how are embedders supposed to work? The context is created with:
modelParams.PoolingType = LLamaPoolingType.Mean;
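For context, a minimal sketch of the embedding setup that hits the assert (the model path and the `Embeddings` flag are my assumptions; the `PoolingType` assignment and `CreateContext` call are the lines quoted above):

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

// Hypothetical path; the actual model is Qwen3-Embedding-0.6B-F16.gguf.
var modelParams = new ModelParams("Qwen3-Embedding-0.6B-F16.gguf")
{
    // Overriding the model's default pooling type with mean pooling,
    // which is what triggers the pooling_type [3] vs [1] warning below.
    PoolingType = LLamaPoolingType.Mean,
    Embeddings = true,
};

using var weights = LLamaWeights.LoadFromFile(modelParams);
using var context = weights.CreateContext(modelParams);
```

On 0.26.0 this initializes normally; on this PR's binaries it aborts during graph reserve with the GGML_ASSERT shown in the log.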

Full error log:

warn: llama_init_from_model: model default pooling_type is [3], but [1] was specified
info: llama_context: constructing llama_context
info: llama_context: n_seq_max = 1
info: llama_context: n_ctx = 1024
info: llama_context: n_ctx_seq = 1024
info: llama_context: n_batch = 256
info: llama_context: n_ubatch = 256
info: llama_context: causal_attn = 1
info: llama_context: flash_attn = enabled
info: llama_context: kv_unified = true
info: llama_context: freq_base = 1000000.0
info: llama_context: freq_scale = 1
warn: llama_context: n_ctx_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
dbug: set_abort_callback: call
info: llama_context: CUDA_Host output buffer size = 0.58 MiB
dbug: llama_kv_cache: layer 0: dev = CUDA0
dbug: llama_kv_cache: layer 1: dev = CUDA0
dbug: llama_kv_cache: layer 2: dev = CUDA0
dbug: llama_kv_cache: layer 3: dev = CUDA0
dbug: llama_kv_cache: layer 4: dev = CUDA0
dbug: llama_kv_cache: layer 5: dev = CUDA0
dbug: llama_kv_cache: layer 6: dev = CUDA0
dbug: llama_kv_cache: layer 7: dev = CUDA0
dbug: llama_kv_cache: layer 8: dev = CUDA0
dbug: llama_kv_cache: layer 9: dev = CUDA0
dbug: llama_kv_cache: layer 10: dev = CUDA0
dbug: llama_kv_cache: layer 11: dev = CUDA0
dbug: llama_kv_cache: layer 12: dev = CUDA0
dbug: llama_kv_cache: layer 13: dev = CUDA0
dbug: llama_kv_cache: layer 14: dev = CUDA0
dbug: llama_kv_cache: layer 15: dev = CUDA0
dbug: llama_kv_cache: layer 16: dev = CUDA0
dbug: llama_kv_cache: layer 17: dev = CUDA0
dbug: llama_kv_cache: layer 18: dev = CUDA0
dbug: llama_kv_cache: layer 19: dev = CUDA0
dbug: llama_kv_cache: layer 20: dev = CUDA0
dbug: llama_kv_cache: layer 21: dev = CUDA0
dbug: llama_kv_cache: layer 22: dev = CUDA0
dbug: llama_kv_cache: layer 23: dev = CUDA0
dbug: llama_kv_cache: layer 24: dev = CUDA0
dbug: llama_kv_cache: layer 25: dev = CUDA0
dbug: llama_kv_cache: layer 26: dev = CUDA0
dbug: llama_kv_cache: layer 27: dev = CUDA0
info: llama_kv_cache: CUDA0 KV buffer size = 112.00 MiB
info: llama_kv_cache: size = 112.00 MiB ( 1024 cells, 28 layers, 1/1 seqs), K (f16): 56.00 MiB, V (f16): 56.00 MiB
dbug: llama_context: enumerating backends
dbug: llama_context: backend_ptrs.size() = 2
info: sched_reserve: reserving ...
dbug: sched_reserve: max_nodes = 2488
dbug: sched_reserve: reserving full memory module
dbug: sched_reserve: worst-case: n_tokens = 256, n_seqs = 1, n_outputs = 1
info: sched_reserve: resolving fused Gated Delta Net support:
dbug: graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
info: sched_reserve: fused Gated Delta Net (autoregressive) enabled
dbug: graph_reserve: reserving a graph for ubatch with n_tokens = 16, n_seqs = 1, n_outputs = 1

D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml.c:3214: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

This is the working log from LLamaSharp 0.26.0:

warn: llama_init_from_model: model default pooling_type is [3], but [1] was specified
info: llama_context: constructing llama_context
info: llama_context: n_seq_max = 64
info: llama_context: n_ctx = 1024
info: llama_context: n_ctx_seq = 1024
info: llama_context: n_batch = 256
info: llama_context: n_ubatch = 256
info: llama_context: causal_attn = 1
info: llama_context: flash_attn = enabled
info: llama_context: kv_unified = true
info: llama_context: freq_base = 1000000.0
info: llama_context: freq_scale = 1
warn: llama_context: n_ctx_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
dbug: set_abort_callback: call
info: llama_context: CUDA_Host output buffer size = 37.28 MiB
dbug: llama_kv_cache: layer 0: dev = CUDA0
dbug: llama_kv_cache: layer 1: dev = CUDA0
dbug: llama_kv_cache: layer 2: dev = CUDA0
dbug: llama_kv_cache: layer 3: dev = CUDA0
dbug: llama_kv_cache: layer 4: dev = CUDA0
dbug: llama_kv_cache: layer 5: dev = CUDA0
dbug: llama_kv_cache: layer 6: dev = CUDA0
dbug: llama_kv_cache: layer 7: dev = CUDA0
dbug: llama_kv_cache: layer 8: dev = CUDA0
dbug: llama_kv_cache: layer 9: dev = CUDA0
dbug: llama_kv_cache: layer 10: dev = CUDA0
dbug: llama_kv_cache: layer 11: dev = CUDA0
dbug: llama_kv_cache: layer 12: dev = CUDA0
dbug: llama_kv_cache: layer 13: dev = CUDA0
dbug: llama_kv_cache: layer 14: dev = CUDA0
dbug: llama_kv_cache: layer 15: dev = CUDA0
dbug: llama_kv_cache: layer 16: dev = CUDA0
dbug: llama_kv_cache: layer 17: dev = CUDA0
dbug: llama_kv_cache: layer 18: dev = CUDA0
dbug: llama_kv_cache: layer 19: dev = CUDA0
dbug: llama_kv_cache: layer 20: dev = CUDA0
dbug: llama_kv_cache: layer 21: dev = CUDA0
dbug: llama_kv_cache: layer 22: dev = CUDA0
dbug: llama_kv_cache: layer 23: dev = CUDA0
dbug: llama_kv_cache: layer 24: dev = CUDA0
dbug: llama_kv_cache: layer 25: dev = CUDA0
dbug: llama_kv_cache: layer 26: dev = CUDA0
dbug: llama_kv_cache: layer 27: dev = CUDA0
info: llama_kv_cache: CUDA0 KV buffer size = 112.00 MiB
info: llama_kv_cache: size = 112.00 MiB ( 1024 cells, 28 layers, 64/1 seqs), K (f16): 56.00 MiB, V (f16): 56.00 MiB
dbug: llama_context: enumerating backends
dbug: llama_context: backend_ptrs.size() = 2
dbug: llama_context: max_nodes = 2488
dbug: llama_context: reserving full memory module
dbug: llama_context: worst-case: n_tokens = 256, n_seqs = 64, n_outputs = 64
dbug: graph_reserve: reserving a graph for ubatch with n_tokens = 256, n_seqs = 64, n_outputs = 256
dbug: graph_reserve: reserving a graph for ubatch with n_tokens = 64, n_seqs = 64, n_outputs = 64
dbug: graph_reserve: reserving a graph for ubatch with n_tokens = 256, n_seqs = 64, n_outputs = 256
info: llama_context: CUDA0 compute buffer size = 150.43 MiB
info: llama_context: CUDA_Host compute buffer size = 2.07 MiB
info: llama_context: graph nodes = 990
info: llama_context: graph splits = 2
