Ace Step 1.5 noise #2224

@jabarrera

Generating music with Ace Step 1.5 in 1.113.2 produces a WAV file that contains only noise. The same configuration works fine on 1.112.2.

Using musiclowram flag.
Using koboldcpp-linux-x64-nocuda Vulkan on GNU/Linux.
GGUF models from Hugging Face: https://huggingface.co/koboldcpp/music/tree/main

I think the log is not useful, but here it is:

MusicGen LowVRAM mode, will swap models at runtime
Loading Music Gen LLM Model: acestep-5Hz-lm-1.7B-Q8_0.gguf
Loading Music Gen Embed Model: Ace-Qwen3-Embedding-0.6B-BF16.gguf
Loading Music Gen Diffusion Model: acestep-v15-sftturbo50-Q8_0.gguf
Loading Music Gen VAE Model: ace-vae-BF16.gguf
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 570 Series (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
[Load] LM backend: Vulkan0 (CPU threads: 6)
[GGUF] acestep-5Hz-lm-1.7B-Q8_0.gguf: 310 tensors, data at offset 5339520
[LM-Config] 28L, H=2048, V=217204, Nh=16, Nkv=8, D=128, tied=1
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 1879.2 MB into backend
[LM-Load] CPU embed lookup: type=q8_0, row=2176 bytes
[LM-KV] Allocated 2 sets x 28 layers, 1792.0 MB
Unload Music LM model...
[Load] DiT backend: Vulkan0 (CPU threads: 6)
[Load] Backend init: 3130.9 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[DiT] Self-attn: Q+K+V fused
[DiT] Cross-attn: Q+K+V fused
[DiT] MLP: gate+up fused
[Load] null_condition_emb found (CFG available)
[WeightCtx] Loaded 478 tensors, 1600.7 MB into backend
[Load] DiT: 24 layers, H=2048, Nh=16/8, D=128
[Load] DiT weight load: 1996.8 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[Load] silence_latent: [15000, 64] from GGUF
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
[Load] BPE tokenizer: 65.1 ms
[Load] TextEncoder backend: Vulkan0 (CPU threads: 6)
[GGUF] Ace-Qwen3-Embedding-0.6B-BF16.gguf: 310 tensors, data at offset 5337568
[Load] TextEncoder: 28L, H=1024, Nh=16/8
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 1136.5 MB into backend
[Load] TextEncoder: 1384.5 ms
[GGUF] Ace-Qwen3-Embedding-0.6B-BF16.gguf: 310 tensors, data at offset 5337568
[Load] CondEncoder backend: Vulkan0 (CPU threads: 6)
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[Load] LyricEncoder: 8L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[Load] TimbreEncoder: 4L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 140 tensors, 616.6 MB into backend
[Load] CondEncoder: lyric(8L), timbre(4L), text_proj, null_cond
[Load] ConditionEncoder: 633.7 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[WeightCtx] Loaded 30 tensors, 106.5 MB into backend
[Load] Detokenizer: FSQ(6->2048) + 2L encoder(S=5, 2048->64)
[Load] Detokenizer: 135.8 ms
Unload music diffusion model...
Unload music tokenizer and conditioner model...
[GGUF] ace-vae-BF16.gguf: 365 tensors, data at offset 30048
[Load] VAE-Enc backend: Vulkan0 (CPU threads: 6)
[VAE-Enc] Backend: Vulkan0, Weight buffer: 160.8 MB
[VAE-Enc] Loaded: 5 blocks, downsample=1920x, F32 activations
[Load] VAE Enc weights: 680.5 ms
Unload music VAE enc model...
[GGUF] ace-vae-BF16.gguf: 365 tensors, data at offset 30048
[Load] VAE backend: Vulkan0 (CPU threads: 6)
[VAE] Backend: Vulkan0, Weight buffer: 255.7 MB
[VAE] Loaded: 5 blocks, upsample=1920x
[Load] VAE weights: 569.9 ms
Unload music VAE dec model...

Music Gen Load Complete.
Load Music Models OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.
Embedded MusicUI loaded.

Active Modules: MusicGen
Inactive Modules: TextGeneration ImageGeneration VoiceRecognition MultimodalVision MultimodalAudio NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl MCPBridge RouterMode
Enabled APIs: KoboldCppApi
Note: For third party Ollama API Emulation, you should set the port to 11434.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/
MusicUI is available at http://localhost:5001/musicui/

Please connect to custom endpoint at http://localhost:5001

Runtime reload Music LM model...
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
[Load] LM backend: Vulkan0 (CPU threads: 6)
[GGUF] acestep-5Hz-lm-1.7B-Q8_0.gguf: 310 tensors, data at offset 5339520
[LM-Config] 28L, H=2048, V=217204, Nh=16, Nkv=8, D=128, tied=1
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 1879.2 MB into backend
[LM-Load] CPU embed lookup: type=q8_0, row=2176 bytes
[LM-KV] Allocated 2 sets x 28 layers, 1792.0 MB
[FSM] Prefix trees: bpm=185, dur=451, key=654, lang=18, tsig=5 nodes
[Request] parsed json (21 fields)
[Simple] Inspiration
[Simple] 42 tokens, N=1, seeds: 251203..251203
[Phase1] Prefill 92ms, 42 tokens, N=1, CFG=1.00
[Phase1] Decode 1040ms
[Phase1 Batch0] seed=251203, 57 tokens
[Simple Batch0] seed=251203:
bpm:143
caption: An upbeat and cheerful instrumental track driven by a bright, staccato piano
duration:194
keyscale:G major
language:en
timesignature:2

Lyric

[Instrumental]

[Skip] thinking=false, no code generation
Unload Music LM model...
[Request] parsed json (28 fields)

Runtime reload Music DiT model...
[Load] DiT backend: Vulkan0 (CPU threads: 6)
[Load] Backend init: 89168.8 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[DiT] Self-attn: Q+K+V fused
[DiT] Cross-attn: Q+K+V fused
[DiT] MLP: gate+up fused
[Load] null_condition_emb found (CFG available)
[WeightCtx] Loaded 478 tensors, 1600.7 MB into backend
[Load] DiT: 24 layers, H=2048, Nh=16/8, D=128
[Load] DiT weight load: 375.8 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[Load] silence_latent: [15000, 64] from GGUF
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
[Load] BPE tokenizer: 62.0 ms
[Load] TextEncoder backend: Vulkan0 (CPU threads: 6)
[GGUF] Ace-Qwen3-Embedding-0.6B-BF16.gguf: 310 tensors, data at offset 5337568
[Load] TextEncoder: 28L, H=1024, Nh=16/8
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 1136.5 MB into backend
[Load] TextEncoder: 283.4 ms
[GGUF] Ace-Qwen3-Embedding-0.6B-BF16.gguf: 310 tensors, data at offset 5337568
[Load] CondEncoder backend: Vulkan0 (CPU threads: 6)
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[Load] LyricEncoder: 8L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[Load] TimbreEncoder: 4L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 140 tensors, 616.6 MB into backend
[Load] CondEncoder: lyric(8L), timbre(4L), text_proj, null_cond
[Load] ConditionEncoder: 123.2 ms
[GGUF] acestep-v15-sftturbo50-Q8_0.gguf: 678 tensors, data at offset 56800
[WeightCtx] Loaded 30 tensors, 106.5 MB into backend
[Load] Detokenizer: FSQ(6->2048) + 2L encoder(S=5, 2048->64)
[Load] Detokenizer: 42.6 ms
[Pipeline] T=3000, S=1500
[Pipeline] seed=251203, steps=8, guidance=1.0, shift=3.0, duration=120.0s
[Pipeline] caption: 71 tokens, lyrics: 15 tokens
[Encode] TextEncoder (71 tokens): 179.1 ms
[Encode] Lyric vocab lookup (15 tokens): 0.1 ms
[CondEnc] Lyric sliding mask: 15x15, window=128
[CondEnc] Timbre sliding mask: 750x750, window=128
[Encode] Packed: lyric=15 + timbre=1 + text=71 = 87 tokens
[Encode] ConditionEncoder: 151.7 ms, enc_S=87
[Context Batch0] noise seed=251203
Unload music tokenizer and conditioner model...
[DiT] Starting: T=3000, S=1500, enc_S=87, steps=8, batch=1
[DiT] Batch N=1, T=3000, S=1500, enc_S=87
[DiT] Graph: 2129 nodes
[DiT] step 1/8 t=1.000
[DiT] step 2/8 t=0.955
[DiT] step 3/8 t=0.900
[DiT] step 4/8 t=0.833
[DiT] step 5/8 t=0.750
[DiT] step 6/8 t=0.643
[DiT] step 7/8 t=0.500
[DiT] step 8/8 t=0.300
[DiT] Total generation: 15263.6 ms (15263.6 ms/sample)
Unload music diffusion model...

Runtime reload Music VAE dec model...
[GGUF] ace-vae-BF16.gguf: 365 tensors, data at offset 30048
[Load] VAE backend: Vulkan0 (CPU threads: 6)
[VAE] Backend: Vulkan0, Weight buffer: 255.7 MB
[VAE] Loaded: 5 blocks, upsample=1920x
[Load] VAE weights: 575.8 ms
[VAE] Tiled decode: 8 tiles (chunk=512, overlap=64, stride=384)
[VAE] Graph: 474 nodes, T_latent=448
[VAE] Upsample factor: 1920.00 (expected ~1920)
[VAE] Graph: 474 nodes, T_latent=512
[VAE] Graph: 474 nodes, T_latent=376
[VAE] Tiled decode done: 8 tiles -> T_audio=5760000 (120.00s @ 48kHz)
[VAE] Decode: 83583.6 ms
[Save Audio] Save as Stereo WAV...
Unload music VAE dec model...
[Request Done: Music Length 120.00s]
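
As a sanity check on the numbers in the log above, here is a minimal Python sketch that reproduces the logged DiT timesteps, VAE tile sizes, and output length. It is not taken from the koboldcpp source: the tiling rule (a `stride`-wide core plus up to `overlap` frames of context on each side) and the timestep shift formula are assumptions inferred from the logged values, and the function names are hypothetical. If these assumptions hold, the shapes and schedule look consistent, which would suggest the noise comes from the computation itself rather than from the pipeline setup.

```python
import math


def tile_lengths(t_latent, stride=384, overlap=64):
    """Latent length of each tile under the ASSUMED tiling rule:
    a stride-wide core plus up to `overlap` frames on each side."""
    lengths = []
    for i in range(math.ceil(t_latent / stride)):
        core_start = i * stride
        core_end = min(core_start + stride, t_latent)
        start = max(core_start - overlap, 0)
        end = min(core_end + overlap, t_latent)
        lengths.append(end - start)
    return lengths


def shifted_timesteps(steps=8, shift=3.0):
    """Linear timesteps 1, 1 - 1/steps, ... remapped with the ASSUMED
    shift formula t' = shift * t / (1 + (shift - 1) * t)."""
    result = []
    for i in range(steps):
        t = 1.0 - i / steps
        result.append(shift * t / (1.0 + (shift - 1.0) * t))
    return result


# Values copied from the log: T=3000, upsample=1920x, 48 kHz output.
T_LATENT = 3000
UPSAMPLE = 1920
SAMPLE_RATE = 48000

tiles = tile_lengths(T_LATENT)
samples = T_LATENT * UPSAMPLE

print(len(tiles), tiles)              # log: 8 tiles, T_latent=448 / 512 / 376
print(samples, samples / SAMPLE_RATE)  # log: T_audio=5760000 (120.00s @ 48kHz)
print([f"{t:.3f}" for t in shifted_timesteps()])  # log: 1.000 ... 0.300
```

Under these assumptions the sketch yields 8 tiles of lengths 448, 512 (six times), and 376, an output of 5760000 samples (120 s), and the timesteps 1.000, 0.955, 0.900, 0.833, 0.750, 0.643, 0.500, 0.300, all matching the log.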
