Portable C++17 implementation of ACE-Step 1.5 music generation using GGML. Text + lyrics in, stereo 48kHz WAV out. Runs on CPU, CUDA, ROCm, Metal, Vulkan.
git submodule update --init
mkdir build && cd build
# macOS (Metal + Accelerate BLAS auto-enabled)
cmake ..
# Linux with NVIDIA GPU
cmake .. -DGGML_CUDA=ON
# Linux with AMD GPU (ROCm)
cmake .. -DGGML_HIP=ON
# Linux with Vulkan
cmake .. -DGGML_VULKAN=ON
# CPU with OpenBLAS (recommended for CPU-only machines)
apt install pkg-config libopenblas-dev # Debian/Ubuntu
cmake .. -DGGML_BLAS=ON
# Combine as needed
cmake .. -DGGML_CUDA=ON -DGGML_BLAS=ON
cmake --build . --config Release -j$(nproc)
Builds three binaries: ace-qwen3 (LLM), dit-vae (DiT + VAE), and neural-codec (VAE encode/decode).
Pre-quantized GGUFs on Hugging Face.
pip install huggingface_hub
./models.sh # Q8_0 turbo essentials (~7.7 GB)
./models.sh --all # every model, every quant (~97 GB)
./models.sh --quant Q6_K # pick a specific quant (Q4_K_M, Q5_K_M, Q6_K, Q8_0, BF16)
./models.sh --sft # add SFT DiT variant
./models.sh --shifts # add shift1/shift3/continuous variants
Default downloads 4 files into models/:
| GGUF | Arch | Size |
|---|---|---|
| Qwen3-Embedding-0.6B-Q8_0.gguf | text encoder (28L, H=1024) | 748 MB |
| acestep-5Hz-lm-4B-Q8_0.gguf | Qwen3 causal LM | 4.2 GB |
| acestep-v15-turbo-Q8_0.gguf | DiT + CondEncoder (24L, H=2048) | 2.4 GB |
| vae-BF16.gguf | AutoencoderOobleck | 322 MB |
Three LM sizes: 0.6B (fast), 1.7B, 4B (best quality). VAE is always BF16 (small, bandwidth-bound, quality-critical).
Building GGUFs from source (checkpoints + convert)
If you want to convert from the original safetensors yourself:
pip install gguf huggingface_hub
./checkpoints.sh # download raw HF checkpoints (turbo + 4B LM)
./checkpoints.sh --all # all variants (SFT, shift1/3, 0.6B/1.7B LM)
python3 convert.py # convert all checkpoints to GGUF (models/)
./quantize.sh # quantize BF16 -> Q4_K_M/Q5_K_M/Q6_K/Q8_0
checkpoints.sh downloads safetensors, config.json, and tokenizer files
into checkpoints/. convert.py packs everything into self-contained
GGUF files in models/, bundling the BPE tokenizer, silence_latent, and
config metadata so no external file is needed at runtime.
ace-qwen3 generates lyrics and audio codes; dit-vae synthesizes audio.
The input JSON is never modified. Output is always numbered: request0.json.
cat > /tmp/request.json << 'EOF'
{
"caption": "Upbeat pop rock with driving guitars and catchy hooks",
"inference_steps": 8,
"shift": 3.0,
"vocal_language": "fr"
}
EOF
# LLM: request.json -> request0.json (enriched with lyrics + codes)
./build/ace-qwen3 \
--request /tmp/request.json \
--model models/acestep-5Hz-lm-4B-Q8_0.gguf
# DiT+VAE: request0.json -> request00.wav
./build/dit-vae \
--request /tmp/request0.json \
--text-encoder models/Qwen3-Embedding-0.6B-Q8_0.gguf \
--dit models/acestep-v15-turbo-Q8_0.gguf \
--vae models/vae-BF16.gguf
Generate multiple songs at once with --batch:
# LLM: 2 LM variations x 2 DiT variations = 4 WAVs total
# -> request0.json, request1.json (different lyrics/codes, seeds auto+0, auto+1)
./build/ace-qwen3 \
--request /tmp/request.json \
--model models/acestep-5Hz-lm-4B-Q8_0.gguf \
--batch 2
# DiT+VAE: 2 DiT variations of each LM output
# -> request0.json -> request00.wav, request01.wav
# -> request1.json -> request10.wav, request11.wav
./build/dit-vae \
--request /tmp/request0.json /tmp/request1.json \
--text-encoder models/Qwen3-Embedding-0.6B-Q8_0.gguf \
--dit models/acestep-v15-turbo-Q8_0.gguf \
--vae models/vae-BF16.gguf \
--batch 2
The LM decides song structure (lyrics, melody, rhythm via audio codes), so LM batch variations produce genuinely different songs. DiT batch variations differ only in initial noise, producing subtle variations of the same piece (slightly different timbres, minor rhythmic shifts). Use LM batching for diversity and DiT batching for cherry-picking the best render.
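The seed fan-out described above can be sketched as follows. This is an illustrative sketch with hypothetical helper names (`resolve_seed`, `batch_seeds`); the real binaries resolve seeds internally:

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Resolve the request seed once at startup: -1 means "pick a random seed".
int64_t resolve_seed(int64_t requested) {
    if (requested != -1) return requested;
    std::random_device rd;
    return static_cast<int64_t>(rd()) & 0x7fffffff;
}

// Batch element i runs with seed+i, so every variation is reproducible
// once the base seed is known.
std::vector<int64_t> batch_seeds(int64_t base, int n) {
    std::vector<int64_t> seeds;
    for (int i = 0; i < n; ++i) seeds.push_back(base + i);
    return seeds;
}
```

With `--batch 2` and an auto-resolved base seed of 42, the two LM variations would run with seeds 42 and 43.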
Ready-made examples in examples/:
cd examples
./simple.sh # caption only, LLM fills everything
./partial.sh # caption + lyrics + duration
./full.sh # all metadata provided
./dit-only.sh # skip LLM, DiT from noise
Each example has a -sft variant (SFT model, 50 steps, CFG 7.0)
alongside the turbo default (8 steps, no CFG).
The LLM fills what's missing in the JSON and generates audio codes.
Empty field = "fill it". Filled = "don't touch".
All modes always output numbered files (request0.json .. requestN-1.json).
The input JSON is never modified.
Caption only (lyrics=""): two LLM passes. Phase 1 uses the "Expand"
prompt to generate lyrics and metadata (bpm, keyscale, timesignature,
duration) via CoT. Phase 2 reinjects the CoT and generates audio codes using
the "Generate tokens" prompt. CFG is forced to 1.0 in phase 1 (free
sampling); lm_cfg_scale only applies in phase 2. With --batch N, each
element runs its own phase 1 from a different seed, producing N completely
different songs. See examples/simple.json.
Caption + lyrics (+ optional metadata): single LLM pass. The "Generate
tokens" prompt is used directly. Missing metadata is filled via CoT, then
audio codes are generated. User-provided fields are never overwritten.
lm_cfg_scale applies to both CoT and code generation. See
examples/partial.json.
Everything provided (caption, lyrics, bpm, duration, keyscale,
timesignature): the LLM skips CoT and generates audio codes directly.
With --batch N, all elements share the same prompt (single prefill,
KV cache copied). See examples/full.json.
Instrumental (lyrics="[Instrumental]"): treated as "lyrics provided",
so the single-pass "Generate tokens" path is used. No lyrics generation.
The DiT was trained with this exact string as the no-vocal condition.
Passthrough (audio_codes present): LLM is skipped entirely.
Run dit-vae to decode existing codes. See examples/dit-only.json.
Only caption is required. All other fields default to "unset", meaning
the LLM fills them or a sensible runtime default is applied.
{
"caption": "",
"lyrics": "",
"bpm": 0,
"duration": 0,
"keyscale": "",
"timesignature": "",
"vocal_language": "unknown",
"seed": -1,
"lm_temperature": 0.85,
"lm_cfg_scale": 2.0,
"lm_top_p": 0.9,
"lm_top_k": 0,
"lm_negative_prompt": "",
"audio_codes": "",
"inference_steps": 8,
"guidance_scale": 0.0,
"shift": 3.0
}
caption (string, required)
Natural language description of the music style, mood, instruments, etc.
Fed to both the LLM and the DiT text encoder.
lyrics (string, default "")
Controls vocal generation. Three valid states:
- "": LLM generates lyrics from the caption (phase 1 "Expand" prompt).
- "[Instrumental]": no vocals. Passed directly to the DiT; the LLM skips lyrics generation.
- Any other string: user-provided lyrics used as-is; the LLM only fills missing metadata.
There is no instrumental flag. This field is the single source of truth for
vocal content.
bpm (int, default 0 = unset)
Beats per minute. LLM generates one if 0.
duration (float seconds, default 0 = unset)
Target audio duration in seconds. 0 means the LLM picks it; any positive
value, including 1, is taken literally. Clamped to [1, 600] s after generation.
keyscale (string, default "" = unset)
Musical key and scale, e.g. "C major", "F# minor". LLM fills if empty.
timesignature (string, default "" = unset)
Time signature numerator as a string, e.g. "4" for 4/4, "3" for 3/4.
LLM fills if empty.
vocal_language (string, default "unknown")
BCP-47 language code for lyrics, e.g. "en", "fr", "ja". When set and
lyrics are being generated, the FSM constrains the LLM output to that language.
"unknown" lets the LLM decide.
seed (int64, default -1 = random)
RNG seed. Resolved once at startup to a random value if -1. Batch elements
use seed+0, seed+1, ... seed+N-1.
audio_codes (string, default "")
Comma-separated FSQ token IDs produced by ace-qwen3. When non-empty, the
entire LLM pass is skipped and dit-vae decodes these codes directly
(passthrough / cover mode).
lm_temperature (float, default 0.85)
Sampling temperature for both phase 1 (lyrics/metadata) and phase 2 (audio
codes). Lower = more deterministic.
lm_cfg_scale (float, default 2.0)
Classifier-Free Guidance scale for the LM. Only active in phase 2 (audio
code generation) and in phase 1 when lyrics are already provided. When
lyrics is empty, phase 1 always runs with cfg=1.0 (free sampling).
1.0 disables CFG.
lm_top_p (float, default 0.9)
Nucleus sampling cutoff. 1.0 disables. When top_k=0, an internal
pre-filter of 256 tokens is applied before top_p for performance.
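A minimal sketch of that filtering order, assuming the behavior described above (sort by logit, optional hard top-K or 256-token pre-filter, then nucleus cutoff); the function name is illustrative, not the actual API:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Return the candidate token IDs that survive filtering, best-first.
// top_k == 0 enables the fixed 256-token pre-filter before top_p.
std::vector<int> nucleus_candidates(std::vector<float> logits,
                                    float top_p, int top_k) {
    std::vector<int> idx(logits.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return logits[a] > logits[b]; });

    // hard top-K, or the 256-token pre-filter when top_k == 0
    size_t keep = top_k > 0 ? (size_t)top_k
                            : std::min<size_t>(256, idx.size());
    idx.resize(keep);

    // softmax over the kept candidates only
    float mx = logits[idx[0]], sum = 0.f;
    std::vector<float> p(keep);
    for (size_t i = 0; i < keep; ++i) {
        p[i] = std::exp(logits[idx[i]] - mx);
        sum += p[i];
    }

    // nucleus cutoff: smallest prefix whose mass reaches top_p
    float acc = 0.f;
    size_t cut = keep;
    for (size_t i = 0; i < keep; ++i) {
        acc += p[i] / sum;
        if (acc >= top_p) { cut = i + 1; break; }
    }
    idx.resize(cut);
    return idx;
}
```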
lm_top_k (int, default 0 = disabled)
Top-K sampling. 0 disables hard top-K (top_p still applies).
lm_negative_prompt (string, default "")
Negative caption for CFG in phase 2. Empty string falls back to a
caption-less unconditional prompt.
inference_steps (int, default 8)
Number of diffusion denoising steps. Turbo preset: 8. SFT preset: 50.
guidance_scale (float, default 0.0 = auto)
CFG scale for the DiT. 0.0 is resolved at runtime:
- Turbo models: forced to 1.0 (CFG disabled; turbo was trained without it).
- SFT/base models: 7.0.
Any value > 1.0 on a turbo model is overridden to 1.0 with a warning.
shift (float, default 3.0)
Flow-matching schedule shift. Controls the timestep distribution.
shift = s*t / (1 + (s-1)*t). Turbo preset: 3.0. SFT preset: 6.0.
Turbo preset: inference_steps=8, shift=3.0 (guidance_scale auto-resolved to 1.0).
SFT preset: inference_steps=50, guidance_scale=7.0, shift=6.0.
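The shift formula can be checked numerically; `shift_time` and `shifted_schedule` are illustrative names, not the actual API:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Flow-matching timestep shift: t' = s*t / (1 + (s-1)*t).
// Maps [0,1] onto [0,1]; s > 1 pushes more steps toward high noise levels.
float shift_time(float t, float s) {
    return s * t / (1.0f + (s - 1.0f) * t);
}

// A uniform grid of `steps` intervals, shifted with scale s.
std::vector<float> shifted_schedule(int steps, float s) {
    std::vector<float> ts;
    for (int i = 0; i <= steps; ++i)
        ts.push_back(shift_time((float)i / steps, s));
    return ts;
}
```

With the turbo preset (s = 3), the midpoint t = 0.5 maps to 0.75, i.e. three quarters of the 8 steps are spent in the noisier half of the schedule.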
Usage: ace-qwen3 --request <json> --model <gguf> [options]
Required:
--request <json> Input request JSON
--model <gguf> Model GGUF file
Batch:
--batch <N> Batch N sequences (default: 1)
Output naming: input.json -> input0.json, input1.json, ... (last digit = batch index)
Debug:
--max-seq <N> KV cache size (default: 8192)
--no-fsm Disable FSM constrained decoding
--no-fa Disable flash attention
--dump-logits <path> Dump prefill logits (binary f32)
--dump-tokens <path> Dump prompt token IDs (CSV)
Three LLM sizes: 0.6B (fast), 1.7B, 4B (best quality).
Batching is always active (default N=1). Model weights are read once per decode step for all N sequences. Phase 1 (CoT) and Phase 2 (audio codes) are both batched with independent seeds (seed+0 .. seed+N-1).
Usage: dit-vae --request <json...> --text-encoder <gguf> --dit <gguf> --vae <gguf> [options]
Required:
--request <json...> One or more request JSONs (from ace-qwen3 --request)
--text-encoder <gguf> Text encoder GGUF file
--dit <gguf> DiT GGUF file
--vae <gguf> VAE GGUF file
Batch:
--batch <N> DiT variations per request (default: 1, max 9)
Output naming: input.json -> input0.wav, input1.wav, ... (last digit = batch index)
VAE tiling (memory control):
--vae-chunk <N> Latent frames per tile (default: 256)
--vae-overlap <N> Overlap frames per side (default: 64)
Debug:
--no-fa Disable flash attention
--dump <dir> Dump intermediate tensors
Models are loaded once and reused across all requests.
GGML-native neural audio codec based on the Oobleck VAE encoder and decoder. Serves two purposes: validating the precision of the full VAE chain (encode + decode roundtrip), and compressing music at ~850 B/s with no perceptible difference from the original.
Usage: neural-codec --vae <gguf> --encode|--decode -i <input> [-o <o>] [--q8|--q4]
Required:
--vae <path> VAE GGUF file
--encode | --decode Encode WAV to latent, or decode latent to WAV
-i <path> Input (WAV for encode, latent for decode)
Output:
-o <path> Output file (auto-named if omitted)
--q8 Quantize latent to int8 (~13 kbit/s)
--q4 Quantize latent to int4 (~6.8 kbit/s)
Output naming: song.wav -> song.latent (f32) or song.nac8 (Q8) or song.nac4 (Q4)
song.latent -> song.wav
VAE tiling (memory control):
--vae-chunk <N> Latent frames per tile (default: 256)
--vae-overlap <N> Overlap frames per side (default: 64)
Latent formats (decode auto-detects):
f32: flat [T, 64] f32, no header. ~51 kbit/s.
NAC8: header + per-frame Q8. ~13 kbit/s.
NAC4: header + per-frame Q4. ~6.8 kbit/s.
The encoder is the symmetric mirror of the decoder: same snake activations, same residual units, strided conv1d for downsampling instead of transposed conv1d for upsampling. No new GGML ops. Downsample 2x4x4x6x10 = 1920x.
48kHz stereo audio is compressed to 64-dimensional latent frames at 25 Hz. Three output formats, decode auto-detects from file content:
| Format | Frame size | Bitrate | 3 min song | vs f32 (cossim) |
|---|---|---|---|---|
| f32 | 256B | 51 kbit/s | 1.1 MB | baseline |
| NAC8 | 66B | 13 kbit/s | 290 KB | 0.9999 |
| NAC4 | 34B | 6.8 kbit/s | 150 KB | 0.989 |
NAC = Neural Audio Codec. The NAC8 and NAC4 headers are minimal: a 4-byte
magic (NAC8 or NAC4) and a uint32 frame count, nothing else.
Q8 quantization error is 39 dB below the VAE reconstruction error (free).
Q4 quantization error is 16 dB below the VAE reconstruction error (inaudible
on most material).
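The table's bitrates and file sizes are plain arithmetic on the 25 Hz latent frame rate (frame sizes taken from the table; an f32 frame is 64 dims x 4 bytes = 256 B):

```cpp
#include <cassert>
#include <cmath>

constexpr int FRAME_RATE = 25;  // latent frames per second

// bitrate in kbit/s for a given per-frame size in bytes
constexpr float kbps(int frame_bytes) {
    return frame_bytes * FRAME_RATE * 8 / 1000.0f;
}

// raw payload size in bytes for a song of the given duration
constexpr int song_bytes(int frame_bytes, int seconds) {
    return frame_bytes * FRAME_RATE * seconds;
}
```

NAC4 works out to 34 x 25 = 850 B/s, matching the "~850 B/s" figure quoted above, and a 3-minute f32 latent is 256 x 25 x 180 = 1,152,000 B (~1.1 MB).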
# encode (Q4: 6.8 kbit/s, ~150 KB for 3 minutes)
neural-codec --vae models/vae-BF16.gguf --encode --q4 -i song.wav -o song.nac4
# encode (Q8: 13 kbit/s, ~290 KB for 3 minutes)
neural-codec --vae models/vae-BF16.gguf --encode --q8 -i song.wav -o song.nac8
# decode (auto-detects format)
neural-codec --vae models/vae-BF16.gguf --decode -i song.nac4 -o song_decoded.wav
# roundtrip validation: compare song.wav and song_decoded.wav with your ears
ace-qwen3 (Qwen3 causal LM, 0.6B/1.7B/4B)
Phase 1 (if needed): CoT generates bpm, keyscale, timesignature, lyrics
Phase 2: audio codes (5Hz tokens, FSQ vocabulary)
Both phases batched: N sequences per forward, weights read once
CFG with dual KV cache per batch element (cond + uncond)
Output: request0.json .. requestN-1.json
dit-vae
BPE tokenize
Qwen3-Embedding (28L text encoder)
CondEncoder (lyric 8L + timbre 4L + text_proj)
FSQ detokenizer (audio codes -> flow matching source latents)
DiT (24L flow matching, Euler steps)
VAE (AutoencoderOobleck, tiled decode)
WAV stereo 48kHz
This project started from a simple idea: a Telegram bot using llama.cpp to prompt a music generator, and the desire to make GGML sing. No more, no less. No cloud, no black box, scriptable and nothing between you and the model.
- Remaining modes: Understand, Rewrite (single-pass, no audio codes)
- Reference audio input: repaint and cover tasks (src_audio + cover_strength)
Current: raw PCM f32 WAV via hand-rolled writer, no external deps. Trade-off to document:
- Keep as-is: zero dependencies, clean licensing, works everywhere
- ffmpeg pipe: trivial bash wrapper handles any codec/format, no C++ codec hell
- pro: MP3/FLAC/OGG out of the box, input resampling for reference audio
- con: runtime dependency, not embedded
Conclusion pending. Likely ffmpeg as an optional external pipe, documented in the README.
- JSON HTTP server (minimal, well-documented, stable contract)
- Web interface on top - vibecodeable by anyone, API stays simple
Goal: document the internals and how the model actually works, not reproduce the Python spaghetti. Expert-first, no commercial fluff.
Current README is technical study + API reference, intentional.
- Split when a user-facing interface exists: README (user) + ARCHITECTURE.md (internals)
- ACE-Step 2.0: evaluate the architecture delta
No commitment; easy to adapt by adding headers or new compilation units as needed.
ace-qwen3 is not a general-purpose chat engine. It is a two-phase autoregressive pipeline specialized for ACE-Step music generation.
Phase 1 (CoT) generates structured metadata (bpm, keyscale, timesignature, caption, duration, language) and optionally lyrics via chain-of-thought reasoning. An FSM (finite state machine) built from a prefix tree enforces valid field names and values at every decode step, hard-masking invalid tokens before sampling.
Phase 2 (audio codes) generates 5Hz FSQ tokens from a 65535-code vocabulary appended
to the base Qwen3 tokenizer. A partial LM head projects only the audio code subrange
of the embedding matrix, cutting the output GEMM by 70% compared to full-vocab
projection. Classifier-free guidance (CFG) is fused into the batch dimension: N
conditional and N unconditional sequences are packed into a single forward pass
(2*N tokens, one weight read), then combined as
logits = uncond + scale * (cond - uncond). The KV cache is a single 4D tensor
[D, max_seq, Nkv, n_sets] shared across all batch elements and CFG paths. Shared
prompts are prefilled once and cloned to other KV sets via copy, avoiding redundant
prefills.
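The logits combination is straightforward to sketch (illustrative helper operating on one pair of logit rows, not the actual batched code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CFG combine as described above: for each batch element, the conditional
// and unconditional logit rows produced by the fused 2*N forward pass are
// merged as uncond + scale * (cond - uncond).
std::vector<float> cfg_combine(const std::vector<float>& cond,
                               const std::vector<float>& uncond,
                               float scale) {
    std::vector<float> out(cond.size());
    for (size_t i = 0; i < cond.size(); ++i)
        out[i] = uncond[i] + scale * (cond[i] - uncond[i]);
    return out;
}
```

Note that scale = 1.0 reduces to the conditional logits exactly, which is why 1.0 disables CFG.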
Test logs (turbo + SFT, seed 42, Philox noise, multiple quantizations):
tests/
Each script compares GGML C++ output against the Python reference
(cosine similarity per intermediate tensor). Requires the original
ACE-Step-1.5 repo cloned alongside acestep.cpp (../ACE-Step-1.5).
cd tests
python3 debug-lm-logits.py # Qwen3 LM: first-token logits GGML vs PyTorch (0.6B/1.7B/4B)
python3 debug-detok-cossim.py # FSQ detokenizer: step-by-step cossim C++ vs Python
python3 debug-dit-cossim.py # DiT: per-layer cossim GGML vs Python (turbo/SFT, BF16/quantized)
Uses a patched GGML fork (submodule) with two new ops, a Metal im2col optimization, and a CUDA bugfix for the Oobleck VAE decoder. All backends: CPU, CUDA, ROCm, Metal, Vulkan. F32/F16/BF16 data types. The DiT uses only standard GGML ops and needs no patches.
The VAE reconstructs audio from latent space through 5 upsampling blocks (total 1920x), each running a transposed convolution followed by 3 WaveNet-style residual units with dilated convolutions and Snake activations. A single tile builds a graph of 36 snake activations, 5 transposed convolutions, and 32 regular convolutions. At the final blocks, sequence lengths reach 491520 timesteps, which stresses GGML ops designed for short NLP sequences.
Computes y = x + sin^2(a * x) * inv_b in a single kernel. The Oobleck VAE calls this 36 times per tile. Without a fused op, each activation requires 5 separate GGML kernels (mul, sin, sqr, mul, add), causing 5x the memory traffic. The fused kernel reads x once and writes y once. BF16 cast nodes before/after each snake call halve memory bandwidth at the cost of negligible precision loss (cossim > 0.999 vs F32 baseline).
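A scalar reference of the fused computation (the real op runs as a single GPU kernel over the whole tensor):

```cpp
#include <cassert>
#include <cmath>

// Fused snake activation: y = x + sin^2(a*x) * inv_b, one read and one
// write per element instead of the five-kernel chain mul, sin, sqr, mul, add.
float snake(float x, float a, float inv_b) {
    float s = std::sin(a * x);
    return x + s * s * inv_b;
}
```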
Gather-based reconstruction of a 1D signal from GEMM columns [K*OC, T_in] to
[T_out, OC], with fused padding crop via the p0 parameter.
Upstream ggml_conv_transpose_1d uses a naive kernel (one scalar FMA loop per output
element, no shared memory, no tensor cores). The VAE spends 40% of its FLOP budget on
transposed convolutions. We decompose each as mul_mat + col2im_1d, routing the heavy
GEMM through cuBLAS/BLAS/MPS tensor cores. The col2im_1d gather has a 2-iteration inner
loop and is pure bandwidth. BF16 cast nodes around col2im_1d halve the scatter bandwidth.
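A CPU reference of the gather, under an assumed memory layout (`cols[(k*OC + oc) * T_in + t_in]`) and a symmetric p0 crop; the actual op may differ in layout details:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reconstruct the transposed-conv output [T_out, OC] from GEMM columns
// [K*OC, T_in]: each output sample gathers every (k, t_in) pair satisfying
// t_out + p0 == t_in * S + k.
std::vector<float> col2im_1d(const std::vector<float>& cols,
                             int OC, int T_in, int K, int S, int p0) {
    int T_out = (T_in - 1) * S + K - 2 * p0;  // cropped output length
    std::vector<float> out((size_t)T_out * OC, 0.0f);
    for (int t_out = 0; t_out < T_out; ++t_out) {
        for (int k = 0; k < K; ++k) {
            int pos = t_out + p0 - k;
            if (pos < 0 || pos % S != 0) continue;  // no contribution
            int t_in = pos / S;
            if (t_in >= T_in) continue;
            for (int oc = 0; oc < OC; ++oc)
                out[(size_t)t_out * OC + oc] +=
                    cols[((size_t)k * OC + oc) * T_in + t_in];
        }
    }
    return out;
}
```

With K = 1, S = 1 the gather is the identity; for the VAE's real kernels only a couple of k values contribute per output sample, which is the "2-iteration inner loop" mentioned above.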
The generic Metal kernel_im2col dispatches (IC, 1, OW) threadgroups with K threads
each. For the VAE's 1D convolutions with small kernels (k=1 or k=7), this wastes 78-97%
of SIMD lanes (7 or 1 active threads per 32-wide SIMD group). The dedicated
kernel_im2col_1d uses a flat dispatch identical to snake and col2im_1d:
(total/256, 1, 1) threadgroups with 256 threads, achieving full SIMD utilization.
The dispatch branches on is_2D at runtime; the 2D path and kernel are unchanged.
CUDA and Vulkan already use flat dispatch and are not affected.
VAE decode (M2 Pro 16GB, 86.8s audio @ 48kHz stereo):
| chunk | overlap | im2col | tiles | time |
|---|---|---|---|---|
| 256 | 64 | generic | 17 | 71.2s |
| 1024 | 16 | generic | 3 | 38.9s |
| 256 | 64 | im2col_1d | 17 | 31.8s |
| 1024 | 16 | im2col_1d | 3 | 18.3s |
Upstream im2col_kernel uses OW directly as grid dimension Y, which exceeds the CUDA
65535 gridDim limit on long sequences. The VAE calls ggml_conv_1d (im2col path) 32
times per tile at output widths up to 491520. Fixed with a grid-stride loop on OW and
MIN(OW, MAX_GRIDDIM_Z) clamping.
The GGML submodule diverges from upstream only by the addition of
GGML_OP_SNAKE and GGML_OP_COL2IM_1D. No existing upstream kernel is
modified. These ops are required; the VAE does not work without them.
An earlier approach patched the upstream naive ops instead of adding custom ones. Those patches were dropped. They are documented here in case someone wants to study the naive path:
- conv_transpose_1d: bounded loop replacing the O(T_in) brute force, CUDA and Metal
- im2col: grid-stride loop on OW to fix gridDim.y overflow for large tensors
Independent implementation based on ACE-Step 1.5 by ACE Studio and StepFun. All model weights are theirs; this is just a native backend.
@misc{gong2026acestep,
title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author={Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
year={2026},
note={GitHub repository}
}