Ace Step 1.5 noise #2224

Generating music with Ace Step 1.5 in KoboldCpp 1.113.2 produces a WAV file that contains only noise. The same configuration works fine on 1.112.2.

Using the musiclowvram flag (low-VRAM mode).
Using the koboldcpp-linux-x64-nocuda binary with the Vulkan backend on GNU/Linux.
GGUF models from Hugging Face: https://huggingface.co/koboldcpp/music/tree/main
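
To compare the output of the two versions numerically rather than by ear, a minimal sketch like the following may help. It uses only the Python standard library and assumes the saved file is 16-bit PCM WAV (the sample format koboldcpp writes has not been checked here). The function name `wav_stats`, the `seconds` parameter, and the use of zero-crossing rate as a rough noise indicator are illustrative choices, not part of koboldcpp.

```python
import math
import struct
import wave


def wav_stats(path, seconds=10.0):
    """Return (rms, zero_crossing_rate) for the first channel of a WAV file.

    Assumption: the file is 16-bit PCM. Only the first `seconds` of audio
    are read, to keep the check fast on long files.
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wf.getnchannels()
        n_frames = min(wf.getnframes(), int(wf.getframerate() * seconds))
        frames = wf.readframes(n_frames)

    # Interleaved little-endian signed 16-bit samples.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    mono = samples[::channels]  # keep the first channel only
    if len(mono) < 2:
        raise ValueError("not enough samples")

    # RMS level normalised to the 16-bit full scale.
    rms = math.sqrt(sum(s * s for s in mono) / len(mono)) / 32768.0

    # Fraction of adjacent sample pairs whose sign differs.
    crossings = sum(1 for a, b in zip(mono, mono[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(mono) - 1)
    return rms, zcr
```

Running `wav_stats` on the file from 1.112.2 and on the file from 1.113.2 gives two pairs of numbers to put side by side. White noise has a zero-crossing rate near 0.5, while music is usually well below that, so a large jump between versions would support the "only noise" observation.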


I don't think the log is useful, but here it is:

MusicGen LowVRAM mode, will swap models at runtime
Loading Music Gen LLM Model: acestep-5Hz-lm-1.7B-Q8_0.gguf
Loading Music Gen Embed Model: Ace-Qwen3-Embedding-0.6B-BF16.gguf
Loading Music Gen Diffusion Model: acestep-v15-sftturbo50-Q8_0.gguf
Loading Music Gen VAE Model: ace-vae-BF16.gguf
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 570 Series (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
[Load] LM backend: Vulkan0 (CPU threads: 6)
[GGUF] acestep-5Hz-lm-1.7B-Q8_0.gguf: 310 tensors, data at offset 5339520
[LM-Config] 28L, H=2048, V=217204, Nh=16, Nkv=8, D=128, tied=1
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 1879.2 MB into backend
[LM-Load] CPU embed lookup: type=q8_0, row=2176 bytes
[LM-KV] Allocated 2 sets x 28 layers, 1792.0 MB
Unload Music LM model...
[Load] DiT backend: Vulkan0 (CPU threads: 6)
[Load] Backend init: 3130.9 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[DiT] Self-attn: Q+K+V fused
[DiT] Cross-attn: Q+K+V fused
[DiT] MLP: gate+up fused
[Load] null_condition_emb found (CFG available)
[WeightCtx] Loaded 478 tensors, 1600.7 MB into backend
[Load] DiT: 24 layers, H=2048, Nh=16/8, D=128
[Load] DiT weight load: 1996.8 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[Load] silence_latent: [15000, 64] from GGUF
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
[Load] BPE tokenizer: 65.1 ms
[Load] TextEncoder backend: Vulkan0 (CPU threads: 6)
[GGUF] Ace-Qwen3-Embedding-0.6B-BF16.gguf: 310 tensors, data at offset 5337568
[Load] TextEncoder: 28L, H=1024, Nh=16/8
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 1136.5 MB into backend
[Load] TextEncoder: 1384.5 ms
[GGUF] Ace-Qwen3-Embedding-0.6B-BF16.gguf: 310 tensors, data at offset 5337568
[Load] CondEncoder backend: Vulkan0 (CPU threads: 6)
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[Load] LyricEncoder: 8L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[Load] TimbreEncoder: 4L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 140 tensors, 616.6 MB into backend
[Load] CondEncoder: lyric(8L), timbre(4L), text_proj, null_cond
[Load] ConditionEncoder: 633.7 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[WeightCtx] Loaded 30 tensors, 106.5 MB into backend
[Load] Detokenizer: FSQ(6->2048) + 2L encoder(S=5, 2048->64)
[Load] Detokenizer: 135.8 ms
Unload music diffusion model...
Unload music tokenizer and conditioner model...
[GGUF] ace-vae-BF16.gguf: 365 tensors, data at offset 30048
[Load] VAE-Enc backend: Vulkan0 (CPU threads: 6)
[VAE-Enc] Backend: Vulkan0, Weight buffer: 160.8 MB
[VAE-Enc] Loaded: 5 blocks, downsample=1920x, F32 activations
[Load] VAE Enc weights: 680.5 ms
Unload music VAE enc model...
[GGUF] ace-vae-BF16.gguf: 365 tensors, data at offset 30048
[Load] VAE backend: Vulkan0 (CPU threads: 6)
[VAE] Backend: Vulkan0, Weight buffer: 255.7 MB
[VAE] Loaded: 5 blocks, upsample=1920x
[Load] VAE weights: 569.9 ms
Unload music VAE dec model...

Music Gen Load Complete.
Load Music Models OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.
Embedded MusicUI loaded.
======
Active Modules: MusicGen
Inactive Modules: TextGeneration ImageGeneration VoiceRecognition MultimodalVision MultimodalAudio NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl MCPBridge RouterMode
Enabled APIs: KoboldCppApi
Note: For third party Ollama API Emulation, you should set the port to 11434.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/
MusicUI is available at http://localhost:5001/musicui/
======
Please connect to custom endpoint at http://localhost:5001

Runtime reload Music LM model...
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
[Load] LM backend: Vulkan0 (CPU threads: 6)
[GGUF] acestep-5Hz-lm-1.7B-Q8_0.gguf: 310 tensors, data at offset 5339520
[LM-Config] 28L, H=2048, V=217204, Nh=16, Nkv=8, D=128, tied=1
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 1879.2 MB into backend
[LM-Load] CPU embed lookup: type=q8_0, row=2176 bytes
[LM-KV] Allocated 2 sets x 28 layers, 1792.0 MB
[FSM] Prefix trees: bpm=185, dur=451, key=654, lang=18, tsig=5 nodes
[Request] parsed json (21 fields)
[Simple] Inspiration
[Simple] 42 tokens, N=1, seeds: 251203..251203
[Phase1] Prefill 92ms, 42 tokens, N=1, CFG=1.00
[Phase1] Decode 1040ms
[Phase1 Batch0] seed=251203, 57 tokens
[Simple Batch0] seed=251203:
bpm:143
caption: An upbeat and cheerful instrumental track driven by a bright, staccato piano
duration:194
keyscale:G major
language:en
timesignature:2
</think>

# Lyric
[Instrumental]

[Skip] thinking=false, no code generation
Unload Music LM model...
[Request] parsed json (28 fields)

Runtime reload Music DiT model...
[Load] DiT backend: Vulkan0 (CPU threads: 6)
[Load] Backend init: 89168.8 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[DiT] Self-attn: Q+K+V fused
[DiT] Cross-attn: Q+K+V fused
[DiT] MLP: gate+up fused
[Load] null_condition_emb found (CFG available)
[WeightCtx] Loaded 478 tensors, 1600.7 MB into backend
[Load] DiT: 24 layers, H=2048, Nh=16/8, D=128
[Load] DiT weight load: 375.8 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[Load] silence_latent: [15000, 64] from GGUF
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
[Load] BPE tokenizer: 62.0 ms
[Load] TextEncoder backend: Vulkan0 (CPU threads: 6)
[GGUF] Ace-Qwen3-Embedding-0.6B-BF16.gguf: 310 tensors, data at offset 5337568
[Load] TextEncoder: 28L, H=1024, Nh=16/8
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 1136.5 MB into backend
[Load] TextEncoder: 283.4 ms
[GGUF] Ace-Qwen3-Embedding-0.6B-BF16.gguf: 310 tensors, data at offset 5337568
[Load] CondEncoder backend: Vulkan0 (CPU threads: 6)
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[Load] LyricEncoder: 8L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[Load] TimbreEncoder: 4L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 140 tensors, 616.6 MB into backend
[Load] CondEncoder: lyric(8L), timbre(4L), text_proj, null_cond
[Load] ConditionEncoder: 123.2 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[WeightCtx] Loaded 30 tensors, 106.5 MB into backend
[Load] Detokenizer: FSQ(6->2048) + 2L encoder(S=5, 2048->64)
[Load] Detokenizer: 42.6 ms
[Pipeline] T=3000, S=1500
[Pipeline] seed=251203, steps=8, guidance=1.0, shift=3.0, duration=120.0s
[Pipeline] caption: 71 tokens, lyrics: 15 tokens
[Encode] TextEncoder (71 tokens): 179.1 ms
[Encode] Lyric vocab lookup (15 tokens): 0.1 ms
[CondEnc] Lyric sliding mask: 15x15, window=128
[CondEnc] Timbre sliding mask: 750x750, window=128
[Encode] Packed: lyric=15 + timbre=1 + text=71 = 87 tokens
[Encode] ConditionEncoder: 151.7 ms, enc_S=87
[Context Batch0] noise seed=251203
Unload music tokenizer and conditioner model...
[DiT] Starting: T=3000, S=1500, enc_S=87, steps=8, batch=1
[DiT] Batch N=1, T=3000, S=1500, enc_S=87
[DiT] Graph: 2129 nodes
[DiT] step 1/8 t=1.000
[DiT] step 2/8 t=0.955
[DiT] step 3/8 t=0.900
[DiT] step 4/8 t=0.833
[DiT] step 5/8 t=0.750
[DiT] step 6/8 t=0.643
[DiT] step 7/8 t=0.500
[DiT] step 8/8 t=0.300
[DiT] Total generation: 15263.6 ms (15263.6 ms/sample)
Unload music diffusion model...

Runtime reload Music VAE dec model...
[GGUF] ace-vae-BF16.gguf: 365 tensors, data at offset 30048
[Load] VAE backend: Vulkan0 (CPU threads: 6)
[VAE] Backend: Vulkan0, Weight buffer: 255.7 MB
[VAE] Loaded: 5 blocks, upsample=1920x
[Load] VAE weights: 575.8 ms
[VAE] Tiled decode: 8 tiles (chunk=512, overlap=64, stride=384)
[VAE] Graph: 474 nodes, T_latent=448
[VAE] Upsample factor: 1920.00 (expected ~1920)
[VAE] Graph: 474 nodes, T_latent=512
[VAE] Graph: 474 nodes, T_latent=376
[VAE] Tiled decode done: 8 tiles -> T_audio=5760000 (120.00s @ 48kHz)
[VAE] Decode: 83583.6 ms
[Save Audio] Save as Stereo WAV...
Unload music VAE dec model...
[Request Done: Music Length 120.00s]
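
The numbers printed in the log above can be cross-checked with a short sketch. It assumes two things inferred only from the printed values, not from the koboldcpp source: that the DiT timesteps are a linear schedule warped by the common flow-matching shift `t' = shift*t / (1 + (shift-1)*t)`, and that each VAE tile is a core window of `stride` latent frames padded by `overlap` frames on each side. The function names are hypothetical.

```python
import math


def shifted_timesteps(steps, shift):
    """Linear t = 1, 1 - 1/steps, ... warped by shift*t / (1 + (shift-1)*t).

    Assumption: this formula is inferred from the logged t values.
    """
    out = []
    for i in range(steps):
        t = 1.0 - i / steps
        out.append(shift * t / (1.0 + (shift - 1.0) * t))
    return out


def tile_lengths(total, stride, overlap):
    """Length of each tile when every core window of `stride` frames is
    padded by `overlap` frames on both sides and clipped to [0, total].

    Assumption: this tiling rule is inferred from the logged T_latent values.
    """
    n_tiles = math.ceil(total / stride)
    lengths = []
    for i in range(n_tiles):
        start = max(0, i * stride - overlap)
        end = min(total, (i + 1) * stride + overlap)
        lengths.append(end - start)
    return lengths


if __name__ == "__main__":
    # steps=8, shift=3.0 from the [Pipeline] line
    print([round(t, 3) for t in shifted_timesteps(8, 3.0)])
    # T=3000, stride=384, overlap=64 from the [Pipeline] and [VAE] lines
    print(tile_lengths(3000, 384, 64))
    # latent frames * upsample factor vs. duration * sample rate
    print(3000 * 1920, 120 * 48000)
```

Under these assumptions the sketch reproduces the logged timesteps (1.000 down to 0.300), the 8 tiles with T_latent of 448, 512 and 376, and T_audio=5760000. That suggests the schedule and tile bookkeeping in the log are internally consistent, so the noise may originate in the computed values themselves rather than in these parameters.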


