Eval bug: 0.3.2 compiled from latest source crashes #61

@QBANIN

Name and Version

llama-cli --version
version: 10144 (9f1bcc9)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 4000 ADA 20GB

Models

No response

Problem description & steps to reproduce

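llama-server aborts right after prompt processing finishes. With --spec-type "dflash,ngram-mod", a prompt of about 32k tokens is processed to completion and two context checkpoints are created; the server then aborts at server-context.cpp:5124 with "speculative recurrent rollback requires backup sequences when bounded snapshots are unavailable". The full log and launch arguments are below.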

First Bad Commit

No response

Relevant log output

llama-bee    | 0.54.850.645 I srv          init: init: chat template, thinking = 1
llama-bee    | 0.54.850.829 I srv  llama_server: model loaded
llama-bee    | 0.54.850.856 I srv  llama_server: server is listening on http://0.0.0.0:8001
llama-bee    | 0.54.850.915 I srv  update_slots: all slots are idle
llama-bee    | 1.46.472.187 I srv  params_from_: Chat format: peg-native
llama-bee    | 1.46.659.935 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
llama-bee    | 1.46.660.105 I slot get_availabl: id  0 | task -1 | adaptive dm: reset state for LRU slot selection
llama-bee    | 1.46.660.140 I srv  get_availabl: updating prompt cache
llama-bee    | 1.46.660.221 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
llama-bee    | 1.46.660.266 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 0.000 MiB, 100096 tokens, 100096 est)
llama-bee    | 1.46.660.354 I srv  get_availabl: prompt cache update took 0.14 ms
llama-bee    | 1.46.664.695 I reasoning-budget: activated, budget=4096 tokens
llama-bee    | 1.46.664.838 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
llama-bee    | 1.50.895.420 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   4096, progress = 0.13, t =   4.23 s / 968.23 tokens per second
llama-bee    | 1.53.062.929 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   6144, progress = 0.19, t =   6.40 s / 960.32 tokens per second
llama-bee    | 1.55.270.468 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   8192, progress = 0.25, t =   8.61 s / 951.96 tokens per second
llama-bee    | 1.57.532.053 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  10240, progress = 0.32, t =  10.87 s / 942.30 tokens per second
llama-bee    | 1.59.853.100 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  12288, progress = 0.38, t =  13.19 s / 931.75 tokens per second
llama-bee    | 2.02.226.026 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  14336, progress = 0.44, t =  15.56 s / 921.28 tokens per second
llama-bee    | 2.04.660.271 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  16384, progress = 0.50, t =  18.00 s / 910.46 tokens per second
llama-bee    | 2.07.136.469 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  18432, progress = 0.57, t =  20.47 s / 900.38 tokens per second
llama-bee    | 2.09.678.344 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  20480, progress = 0.63, t =  23.01 s / 889.92 tokens per second
llama-bee    | 2.12.277.242 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  22528, progress = 0.69, t =  25.61 s / 879.58 tokens per second
llama-bee    | 2.14.923.241 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  24576, progress = 0.76, t =  28.26 s / 869.69 tokens per second
llama-bee    | 2.17.637.824 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  26624, progress = 0.82, t =  30.97 s / 859.59 tokens per second
llama-bee    | 2.20.397.841 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  28672, progress = 0.88, t =  33.73 s / 849.97 tokens per second
llama-bee    | 2.23.206.951 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  30720, progress = 0.95, t =  36.54 s / 840.68 tokens per second
llama-bee    | 2.24.921.990 I dflash: drafter K/V projection cache enabled (1024-token window)
llama-bee    | 2.24.923.988 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  31930, progress = 0.98, t =  38.26 s / 834.58 tokens per second
llama-bee    | 2.25.672.119 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  32421, progress = 1.00, t =  39.01 s / 831.16 tokens per second
llama-bee    | 2.25.782.860 I slot create_check: id  0 | task 0 | created context checkpoint 1 of 128 (pos_min = 32420, pos_max = 32420, n_tokens = 32421, size = 247.185 MiB)
llama-bee    | 2.25.871.815 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  32442, progress = 1.00, t =  39.21 s / 827.46 tokens per second
llama-bee    | 2.28.425.830 I slot create_check: id  0 | task 0 | created context checkpoint 2 of 128 (pos_min = 32441, pos_max = 32441, n_tokens = 32442, size = 249.236 MiB)
llama-bee    | 2.28.782.907 I slot   operator(): id  0 | task 0 | adaptive dm profit: cur=0 recommended=4 score=12.0 action=apply
llama-bee    | /src/tools/server/server-context.cpp:5124: speculative recurrent rollback requires backup sequences when bounded snapshots are unavailable
llama-bee    | 
llama-bee    | /usr/local/lib/libggml-base.so.0(+0x1d44b)[0x7f499970d44b]
llama-bee    | /usr/local/lib/libggml-base.so.0(ggml_print_backtrace+0x21c)[0x7f499970d8cc]
llama-bee    | /usr/local/lib/libggml-base.so.0(ggml_abort+0x15b)[0x7f499970daab]
llama-bee    | /usr/local/lib/libllama-server-impl.so(_ZN19server_context_impl12update_slotsEv+0x103de)[0x7f499a7ca03e]
llama-bee    | /usr/local/lib/libllama-server-impl.so(_ZN12server_queue10start_loopEl+0x221)[0x7f499a864541]
llama-bee    | /usr/local/lib/libllama-server-impl.so(_Z12llama_serveriPPc+0x250c)[0x7f499a703c9c]
llama-bee    | /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7f499a1901ca]
llama-bee    | /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7f499a19028b]
llama-bee    | llama-server(+0x12a5)[0x5620ec79e2a5]
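
For triage, the frames in the backtrace above can be pulled out of the container log mechanically. The following is a minimal sketch, not part of the server tooling: `parse_frames` and `FRAME_RE` are hypothetical names, and the regex assumes the glibc-style `module(symbol+0xOFFSET)[0xADDRESS]` frame format seen in this log. The mangled names it returns can then be passed to a demangler such as `c++filt`; for example, `_ZN19server_context_impl12update_slotsEv` is `server_context_impl::update_slots()`.

```python
import re

# Assumed frame format: module(symbol+0xOFFSET)[0xADDRESS]; the symbol may be empty.
FRAME_RE = re.compile(
    r"(?P<module>\S+)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<addr>0x[0-9a-f]+)\]"
)


def parse_frames(log_text):
    """Return (module, symbol, offset) for every backtrace frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match["module"], match["symbol"], match["offset"]))
    return frames


# Three frames copied from the log above.
SAMPLE = """\
llama-bee    | /usr/local/lib/libggml-base.so.0(ggml_abort+0x15b)[0x7f499970daab]
llama-bee    | /usr/local/lib/libllama-server-impl.so(_ZN19server_context_impl12update_slotsEv+0x103de)[0x7f499a7ca03e]
llama-bee    | llama-server(+0x12a5)[0x5620ec79e2a5]
"""

if __name__ == "__main__":
    for module, symbol, offset in parse_frames(SAMPLE):
        print(module, symbol or "<no symbol>", offset)
```

Frames with an empty symbol (such as the `llama-server(+0x12a5)` entry) carry only a module-relative offset, so they would need `addr2line` against a binary with debug info to resolve further.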



    command:
      # === Model ===
      - -m
      - /cache/Qwen3.6-27B-NEO-CODE-HERE-2T-OT-IQ4_XS.gguf
      - --alias
      - "qwen"
      - --mmproj
      - /cache/mmproj-Qwen-Qwen3.6-27B-Q6_K.gguf

      - --no-mmproj-offload


      - --spec-ngram-mod-n-match
      - "24"
      - --spec-ngram-mod-n-min
      - "48"
      - --spec-ngram-mod-n-max
      - "64"

      - --spec-draft-model
      - /cache/Qwen3.6-27B-DFlash-IQ4_XS.gguf
      - --spec-type
      - "dflash,ngram-mod"
      - --spec-dflash-cross-ctx
      - "1024"
      - --spec-draft-ngl
      - "all"


      - --cache-ram
      - "-1"

      - --host
      - "0.0.0.0"
      - --port
      - "8001"

      - -ngl
      - "all"

      - --ctx-size
      - "100000" 
      - --fit
      - "off"
      - --no-context-shift
      - --checkpoint-min-step
      - "0"
      - --ctx-checkpoints
      - "128"

      - --no-warmup
      - --swa-full
      - --temp
      - "1.0" #"0.7" #"0.7" "1.0"
      - --top-p
      - "0.6" #"0.8" #"0.8" "0.95"
      - --top-k
      - "20"
      - --min-p
      - "0.1"


      - --batch-size
      - "2048"       
      - --ubatch-size
      - "512"       
      - --threads
      - "10"
      - --threads-batch
      - "14"
      - --no-host

      - -ctk
      - "q5_0"
      - -ctv
      - "q4_1"

      - --flash-attn
      - "on"
      - --kv-unified
      - --cache-reuse
      - "512" 
      - --perf
      - --slot-prompt-similarity
      - "0.1"


      - --no-mmap
      - --mlock
      - --parallel
      - "1"
      - --prio
      - "2"

      - --jinja
      - --chat-template-file
      - /cache/qwen-3.6-chat-template-thinking.jinja
      - --chat-template-kwargs
      - '{"preserve_thinking": true}'
      - --reasoning-budget
      - "4096"
      - --reasoning-budget-message
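      - "Budżet myślenia wyczerpany. Prawdopodobnie utkwiłem w pętli albo zbyt komplikuję. Muszę natychmiast przestać i przejść do odpowiedzi."
      # English: "Thinking budget exhausted. I am probably stuck in a loop or overcomplicating things. I must stop immediately and move on to the answer."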
      - --reasoning
      - "on"
