138 changes: 138 additions & 0 deletions .agents/dllm-backend.md
@@ -0,0 +1,138 @@
# Working on the dllm Backend

`mudler/dllm.cpp` is a standalone C++/ggml engine for DiffusionGemma
block-diffusion models. LocalAI wraps it with a **pure-Go** backend at
`backend/go/dllm/` that dlopens `libdllm.so` via purego (ebitengine/purego) -
NOT cgo, and NOT a C++ grpc-server fork. The Go side owns chat templating
(gemma4 renderer) and output parsing (gemma4 streaming parser) and implements
the rich gRPC interface (`PredictRich`/`PredictStreamRich`, ChatDelta replies).

> NOTE: github.com/mudler/dllm.cpp is still **private** (publishing is
> planned). Until then the Makefile's anonymous clone fails; use the local-dev
> symlink shortcut documented at the top of `backend/go/dllm/Makefile`
> (symlink an out-of-tree `build/libdllm.so` into the backend dir and skip the
> clone), or a git credential helper with repo access.

## Pin

`backend/go/dllm/Makefile` pins `DLLM_VERSION?=<sha>` at the top
(whisper / parakeet-cpp / ds4 convention). The bump-deps bot
(`.github/workflows/bump_deps.yaml`) tracks `mudler/dllm.cpp` `main` and
rewrites that variable. After a manual bump, run `make -C backend/go/dllm purge &&
make -C backend/go/dllm`: the Makefile skips the clone whenever the source
directory already exists, whatever the sha, so without `purge` the old
checkout is rebuilt.

## C-ABI and the serialization contract

The binding covers the 9-symbol flat C-ABI from dllm.cpp's
`include/dllm_capi.h` (ABI v1; `main.go` hard-fails on a version mismatch):
`abi_version, load, free, last_error, free_string, tokenize_json, generate,
generate_stream, cancel`. Contract points the Go wiring encodes (`capi.go`
header comment has the full list):

- **One ctx = one concurrent generate/tokenize.** A per-model worker
goroutine (`Dllm.jobs` in `dllm.go`) owns ALL C calls, so serialization is
enforced structurally rather than by lock discipline.
- **`dllm_capi_cancel` is the ONE exception**: it only flips an atomic and may
be called from any goroutine mid-generate, so `Dllm.Cancel` bypasses the
worker queue. The flag resets at the start of each generate, so a watchdog
racing a new generate must re-issue cancel.
- **`last_error` is a borrowed pointer** and must only be read AFTER the
failing call returned (never while a generate is in flight on the same ctx).
- **Free vs in-flight requests**: requests hold `genMu.RLock` for their full
duration; `Free` takes the write lock, so it only runs when nothing is in
flight, then drains and closes the worker. Post-Free requests get a clean
"model not loaded" error.
- `tokenize_json`/`generate` return malloc'd `char*` (bound as `uintptr`,
copied into Go memory, then released with `dllm_capi_free_string`); opts/params
JSON must be a FLAT object of scalars (`buildOptsJSON` rejects anything else).
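
The first and fourth contract points above can be sketched as follows. This is a minimal illustration, not the code in `dllm.go`: the `worker` type, `run`, `free` and `maxInFlight` are hypothetical names, and plain Go closures stand in for the purego calls into `libdllm.so`. It assumes only what the list states: one goroutine owns every engine call, requests hold a read lock for their full duration, and free takes the write lock before closing the worker.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNotLoaded = errors.New("model not loaded")

// worker is a hypothetical stand-in for the per-model worker: one goroutine
// drains jobs, so two engine calls can never overlap.
type worker struct {
	genMu  sync.RWMutex  // requests hold RLock; free takes the write lock
	jobs   chan func()   // every engine call is funneled through here
	done   chan struct{} // closed once the worker goroutine has exited
	closed bool          // guarded by genMu
}

func newWorker() *worker {
	w := &worker{jobs: make(chan func()), done: make(chan struct{})}
	go func() {
		defer close(w.done)
		for job := range w.jobs {
			job()
		}
	}()
	return w
}

// run executes fn on the worker goroutine and waits for it to finish.
func (w *worker) run(fn func()) error {
	w.genMu.RLock()
	defer w.genMu.RUnlock()
	if w.closed {
		return errNotLoaded
	}
	finished := make(chan struct{})
	w.jobs <- func() {
		defer close(finished)
		fn()
	}
	<-finished
	return nil
}

// free waits until no request is in flight, then shuts the worker down.
func (w *worker) free() {
	w.genMu.Lock()
	defer w.genMu.Unlock()
	if w.closed {
		return
	}
	w.closed = true
	close(w.jobs)
	<-w.done
}

// maxInFlight submits n jobs from n goroutines and reports the highest number
// of jobs observed running at once. The counters are plain ints on purpose:
// they are only touched on the worker goroutine.
func maxInFlight(w *worker, n int) int {
	var wg sync.WaitGroup
	inFlight, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = w.run(func() {
				inFlight++
				if inFlight > peak {
					peak = inFlight
				}
				inFlight--
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	w := newWorker()
	fmt.Println("peak concurrency:", maxInFlight(w, 50))
	w.free()
	fmt.Println("after free:", w.run(func() {}))
}
```

Because the jobs channel is the only path to the engine, callers cannot forget a lock; the read/write mutex exists only to order `free` against in-flight requests.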

## Wire shape

| RPC | Implementation |
|---|---|
| LoadModel | `dllm_capi_load` (params: `n_gpu_layers`, `n_threads`, `ctx_len`); `Options[]` parsed into per-request gen opts (`eb_*`, `blocks`, `kv_cache`) by `parseModelGenOpts` |
| PredictRich | render (if templated) → `dllm_capi_generate` → parse → ONE Reply with aggregated ChatDeltas + legacy `Message` bytes |
| PredictStreamRich | `dllm_capi_generate_stream`; per committed diffusion block → UTF-8 holdback → parser.Feed → one Reply per non-empty delta batch (channel closed by the CALLER, per `pkg/grpc/interface.go`) |
| Predict / PredictStream | Legacy paths, delegate to the rich pair (legacy stream INVERTS channel ownership: the impl closes) |
| TokenizeString | `dllm_capi_tokenize_json` (C side prepends BOS per `vocab.add_bos`) |
| Cancel | `dllm_capi_cancel`, exposed as the `grpc.Cancellable` capability (`pkg/grpc/interface.go`): the gRPC server arms it via `context.AfterFunc` on the Predict/PredictStream context, so client disconnects/timeouts abort the in-flight generate - llama.cpp `IsCancelled()` parity for Go backends |
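
The Cancel row can be illustrated with the standard library's `context.AfterFunc` (Go 1.21+). The sketch below is an assumption-laden model, not the server code in `pkg/grpc`: `fakeEngine`, `predict` and `predictWithTimeoutMillis` are hypothetical, and a sleep loop stands in for the diffusion steps. It mirrors two properties the document states: cancel only flips an atomic, and the flag resets at the start of each generate.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// fakeEngine is a hypothetical stand-in for one C context.
type fakeEngine struct{ cancelFlag atomic.Bool }

// Cancel only flips an atomic, so it is safe to call from any goroutine.
func (e *fakeEngine) Cancel() { e.cancelFlag.Store(true) }

// Generate runs up to maxSteps simulated steps and polls the flag between
// them. The flag is reset on entry, as the contract describes.
func (e *fakeEngine) Generate(maxSteps int) (steps int, cancelled bool) {
	e.cancelFlag.Store(false)
	for steps = 0; steps < maxSteps; steps++ {
		if e.cancelFlag.Load() {
			return steps, true
		}
		time.Sleep(time.Millisecond)
	}
	return steps, false
}

// predict arms Cancel on ctx and de-registers the hook on normal completion.
func predict(ctx context.Context, e *fakeEngine, maxSteps int) (int, bool) {
	stop := context.AfterFunc(ctx, e.Cancel)
	defer stop()
	return e.Generate(maxSteps)
}

// predictWithTimeoutMillis models a client whose request times out after ms.
func predictWithTimeoutMillis(ms, maxSteps int) (int, bool) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ms)*time.Millisecond)
	defer cancel()
	return predict(ctx, &fakeEngine{}, maxSteps)
}

func main() {
	steps, cancelled := predictWithTimeoutMillis(20, 5000)
	fmt.Println("aborted early:", cancelled && steps < 5000)

	steps, cancelled = predictWithTimeoutMillis(5000, 3)
	fmt.Println("ran to completion:", !cancelled && steps == 3)
}
```

If the context is already done when `predict` is entered, the hook may fire before `Generate` resets the flag and the cancel is lost, which is the race described under Known limitations.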

`n_threads` and `ctx_len` are accepted-but-ignored by the engine at the
current pin (the context bound comes from GGUF `n_ctx_train`); they are sent
for forward compatibility.
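
Load params and per-request opts both cross the C-ABI as a flat JSON object of scalars. A minimal sketch of that validation, assuming a map-based input, might look like this; `buildFlatJSON` is a hypothetical illustration and not the actual `buildOptsJSON`, whose signature is not shown in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildFlatJSON is a hypothetical illustration: it serializes opts only if
// every value is a scalar, rejecting nested objects, arrays and nil.
func buildFlatJSON(opts map[string]any) (string, error) {
	for k, v := range opts {
		switch v.(type) {
		case string, bool, int, int32, int64, float32, float64:
			// scalar: allowed
		default:
			return "", fmt.Errorf("option %q: %T is not a scalar", k, v)
		}
	}
	b, err := json.Marshal(opts) // map keys are emitted in sorted order
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	params, err := buildFlatJSON(map[string]any{"n_gpu_layers": 99, "ctx_len": 4096})
	fmt.Println(params, err)

	_, err = buildFlatJSON(map[string]any{"nested": map[string]any{"a": 1}})
	fmt.Println(err)
}
```

Rejecting non-scalars on the Go side keeps the failure local and descriptive instead of surfacing as an engine-side parse error.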

## Renderer / parser (the templated chat path)

With `use_tokenizer_template` + raw Messages, the backend owns templating and
parsing (the ds4 precedent, but in Go):

- `gemma4_renderer.go` - `RenderGemma4(msgs, toolsJSON, enableThinking,
addGenerationPrompt)`. The file embeds the FULL `tokenizer.chat_template`
jinja (17466 bytes, md5 `8c34cf93c7a7815b3fdb300a009c4c17`) extracted
verbatim from `diffusiongemma-26B-A4B-it-BF16.gguf` via gguf-py - e.g.
`python scripts/dump_gguf.py model.gguf | grep -A400 chat_template` in the
dllm.cpp checkout - as a numbered comment block; every Go rule cites its
"tpl L<n>" line. Re-verify the md5 before blaming the renderer for a
mismatch with a new GGUF. **BOS exception**: the template emits
`{{- bos_token -}}` but the renderer deliberately does NOT - dllm.cpp's
`run_generate` tokenizes with `prepend_bos = vocab.add_bos` (true for
gemma4), so a literal `<bos>` would double it.
- `gemma4_parser.go` - streaming state machine turning raw model text
(fragments can split anywhere, including mid-marker) into ChatDeltas:
thought channels → `reasoning_content`, `<|tool_call>call:name{...}` →
ToolCallDelta, `<turn|>` → done. Marker grammar cross-checked against vLLM
PR #45163's gemma4 tool/reasoning parsers. Malformed payloads are re-emitted
raw as content, never dropped.
- Thinking is **opt-in** for this family (`Metadata["enable_thinking"]`,
default OFF - the inverse of ds4): the template gates every thinking branch
on `enable_thinking`, and the no-thinking render pre-closes an empty thought
channel, so the parser always starts in content state.
- **UTF-8 boundary holdback** (`splitValidUTF8` in `dllm.go`): per-block
detokenization can split a multi-byte character across block boundaries, and
grpc-go refuses to marshal invalid UTF-8 in proto3 strings. An incomplete
trailing sequence (at most 3 bytes) is carried into the next block; genuinely
undecodable bytes become U+FFFD.

Without `use_tokenizer_template`, the prompt passes through verbatim and the
output is NOT gemma4-parsed (plain content, like any non-autoparsing backend).
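
The UTF-8 boundary holdback from the list above can be sketched with the standard `unicode/utf8` package. This is a hedged reconstruction from the description only; `splitValidUTF8Sketch` and the `holdback` type are hypothetical names, and the real `splitValidUTF8` in `dllm.go` may differ in signature and detail.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// splitValidUTF8Sketch returns the text that is safe to emit now and the
// trailing bytes (at most 3) that must wait for the next block.
func splitValidUTF8Sketch(buf []byte) (emit string, carry []byte) {
	cut := len(buf)
	// Scan back at most 3 bytes for the start of the last rune.
	for i := 1; i <= 3 && i <= len(buf); i++ {
		if utf8.RuneStart(buf[len(buf)-i]) {
			// Hold the rune back only if its encoding is still incomplete.
			if !utf8.FullRune(buf[len(buf)-i:]) {
				cut = len(buf) - i
			}
			break
		}
	}
	carry = append([]byte(nil), buf[cut:]...)
	// Anything still invalid is genuinely undecodable: replace with U+FFFD.
	emit = strings.ToValidUTF8(string(buf[:cut]), "\uFFFD")
	return emit, carry
}

// holdback carries the incomplete tail from one block into the next.
type holdback struct{ carry []byte }

// Feed appends block to the carried bytes and returns the emittable text.
func (h *holdback) Feed(block []byte) string {
	buf := append(h.carry, block...)
	var emit string
	emit, h.carry = splitValidUTF8Sketch(buf)
	return emit
}

func main() {
	var h holdback
	// The euro sign (E2 82 AC) is split across two blocks.
	fmt.Printf("%q\n", h.Feed([]byte("h\xe2\x82")))
	fmt.Printf("%q\n", h.Feed([]byte("\xac!")))
}
```

The scan is bounded at three bytes because a UTF-8 sequence is at most four bytes long, so an incomplete tail can never be longer than that.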

## Tests

| Layer | Gate | What |
|---|---|---|
| `backend/go/dllm/*_test.go` (renderer/parser/wiring) | none - run in plain `go test ./backend/go/dllm/...` | Ginkgo specs over a fake `generator` seam; canonical renderer fixtures from transformers' `test_modeling_diffusion_gemma.py`, parser tables from the vLLM gemma4 parsers |
| `backend/go/dllm/dllm_test.go` C-ABI smoke | `DLLM_TEST_LIBRARY` + `DLLM_TEST_TINY_MODEL` (dllm.cpp's `tests/fixtures/tiny_with_vocab.gguf`); Skips when unset | Drives the real `libdllm.so`: ABI check, load, tokenize `[2,18]`, deterministic generate, cancel (incl. mid-stream `Dllm.Cancel` aborting a deliberately slow `eb_max_steps:256` run in ~10ms) |
| `tests/e2e-backends/dllm_test.go` | `BACKEND_TEST_DLLM=1` + `BACKEND_BINARY` (packaged run.sh) + `BACKEND_TEST_MODEL_FILE` (tiny fixture) | Templated chat round trip (Messages + UseTokenizerTemplate) over the real gRPC binary, non-streaming + streaming; plus client-context cancellation mid-stream (proves the `Cancellable` server plumbing end to end) |
| Real-model e2e | `BACKEND_TEST_DLLM_REAL_MODEL_FILE` (26B BF16, ~50 GB) + `BACKEND_TEST_DLLM_REAL_GPU_LAYERS` | CUDA-13-class hardware only |

Tool-call e2e is deliberately absent from the tiny-model spec: the fixture has
random weights and cannot be coaxed into emitting tool markup; the unit tables
carry that coverage.

## Build matrix

`cpu-dllm` (amd64 + arm64), `cuda13-dllm` (amd64), and
`cuda13-nvidia-l4t-arm64-dllm` (arm64 CUDA: Jetson / DGX Spark GB10), via
`.github/backend-matrix.yml`. No darwin/Metal. CUDA builds forward
`-DDLLM_CUDA=ON` (dllm.cpp gates ggml's CUDA behind its own flag - a bare
`-DGGML_CUDA=ON` is overridden by the cache FORCE). `libdllm.so` is
self-contained (ggml statically absorbed, PIC), so `package.sh` only ships
the binary, `run.sh` and that one .so (the parakeet-cpp-style stub layout;
no ldd walk yet).

## Known limitations

- **Cancel granularity**: the C-ABI cancel flag is per-ctx and resets on
every generate entry, so a Cancel racing a NEW generate can be lost, and
with requests queued on the worker it aborts whichever generate is
currently running (acceptable: the server de-registers the hook on normal
completion, one process serves one model).
- **Throughput**: ~0.15 tok/s on the 26B at default settings (GB10) - every
denoise step recomputes the full prompt+canvas. The upstream prefix-KV
cache (dllm.cpp P3) is the fix; `kv_cache:on` errors until it lands
(`auto`/`off` are accepted no-ops).
- **Repo privacy**: see the note at the top - CI clone of dllm.cpp needs the
repo published (or credentials) before the backend images can build.
- Engine spec/validation references: dllm.cpp `docs/validation.md` and
LocalAI `docs/superpowers/specs/2026-06-10-dllm-cpp-design.md`.
55 changes: 55 additions & 0 deletions .github/backend-matrix.yml
@@ -1608,6 +1608,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-dllm'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "dllm"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1647,6 +1660,19 @@ include:
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-dllm'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "dllm"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -3145,6 +3171,35 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# dllm
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-dllm'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "dllm"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-dllm'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "dllm"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
4 changes: 4 additions & 0 deletions .github/workflows/bump_deps.yaml
@@ -38,6 +38,10 @@ jobs:
variable: "PARAKEET_VERSION"
branch: "master"
file: "backend/go/parakeet-cpp/Makefile"
- repository: "mudler/dllm.cpp"
variable: "DLLM_VERSION"
branch: "main"
file: "backend/go/dllm/Makefile"
- repository: "leejet/stable-diffusion.cpp"
variable: "STABLEDIFFUSION_GGML_VERSION"
branch: "master"
1 change: 1 addition & 0 deletions AGENTS.md
@@ -26,6 +26,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
| [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
| [.agents/dllm-backend.md](.agents/dllm-backend.md) | Working on the dllm backend (DiffusionGemma block-diffusion) - purego C-ABI binding, per-ctx serialization contract, gemma4 renderer/parser, gated test layers |
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
6 changes: 5 additions & 1 deletion Makefile
@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/dllm backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio

GOCMD=go
GOTEST=$(GOCMD) test
@@ -1171,6 +1171,9 @@ BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|tr
BACKEND_WHISPER = whisper|golang|.|false|true
BACKEND_CRISPASR = crispasr|golang|.|false|true
BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
# dllm is mudler/dllm.cpp, the DiffusionGemma block-diffusion engine,
# wrapped by the purego backend at backend/go/dllm.
BACKEND_DLLM = dllm|golang|.|false|true
BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
@@ -1260,6 +1263,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
$(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
$(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_DLLM)))
$(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
$(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
$(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
10 changes: 10 additions & 0 deletions backend/go/dllm/.gitignore
@@ -0,0 +1,10 @@
.cache/
sources/
build/
package/
dllm-grpc
# build artifacts staged in-tree by the Makefile (cp from sources/) or
# symlinked for local dev; the real sources live in dllm.cpp upstream.
*.so
*.so.*
compile_commands.json