OpenMOSS · cms42 · Mar 19, 2026 · Mar 16, 2026
diff --git a/docs/moss-tts-firstclass-e2e.md b/docs/moss-tts-firstclass-e2e.md
@@ -24,6 +24,7 @@ Unlike the older `moss_tts_delay/llama_cpp` backend in the `MOSS-TTS` repository
 4. Python packages required by the helper scripts:
    - `numpy`
    - `soundfile`
+   - `tokenizers`
    - `onnxruntime`
 
 ## Build
@@ -55,7 +56,24 @@ You need a first-class MOSS-TTS-Delay GGUF model that already contains:
 
 For example:
 
-- `out/stage1a_moss_delay_firstclass_f16.gguf`
+- `out/moss_delay_firstclass_f16.gguf`
+
+You can generate it directly from the full Hugging Face MOSS-TTS model directory:
+
+```bash
+huggingface-cli download OpenMOSS-Team/MOSS-TTS --local-dir /path/to/MOSS-TTS-hf
+
+python convert_hf_to_gguf.py \
+    /path/to/MOSS-TTS-hf \
+    --outfile /path/to/moss_delay_firstclass_f16.gguf \
+    --outtype f16
+```
+
+Important:
+
+- The `--model-gguf` file used by this e2e pipeline is a **special first-class MOSS-TTS-Delay GGUF** generated from the full `OpenMOSS-Team/MOSS-TTS` Hugging Face model directory with the command above.
+- It is **not** the same thing as a generic GGUF downloaded from `OpenMOSS/MOSS-TTS-GGUF`.
+- Do not point this pipeline at a file from `OpenMOSS/MOSS-TTS-GGUF` unless that file was explicitly produced as a first-class MOSS-TTS-Delay GGUF for this `llama.cpp` implementation.
 
 ### Step 2: Prepare the tokenizer directory
 
@@ -146,7 +164,7 @@ python tools/tts/moss-tts-firstclass-e2e.py \
 | `--onnx-encoder` | path | Audio tokenizer encoder ONNX |
 | `--onnx-decoder` | path | Audio tokenizer decoder ONNX |
 | `--text` / `--text-file` | string / path | Input text, choose exactly one |
-| `--reference-audio` | path | Optional 24 kHz reference audio |
+| `--reference-audio` | path | Optional reference audio; if provided, it must be sampled at 24 kHz |
 | `--language` | `zh` / `en` / tag | Language tag passed to the prompt builder |
 | `--max-new-tokens` | int | Maximum generation steps |
 | `--text-temperature` | float | Text-channel sampling temperature, default `1.5` |
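
The prerequisites hunk above adds `tokenizers` to the Python packages needed by the helper scripts. As a minimal sketch (not part of the PR), the check below reports which of the four listed packages cannot be found in the current environment. The function name `missing_packages` is hypothetical, and the sketch assumes each package's import name matches the name shown in the docs.

```python
import importlib.util

# Import names of the helper-script dependencies listed in the docs above.
# Assumption: each import name equals the package name shown in the list.
REQUIRED_PACKAGES = ("numpy", "soundfile", "tokenizers", "onnxruntime")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All helper-script dependencies are importable.")
```

Running this before the e2e script surfaces a missing `tokenizers` install up front instead of as an import error partway through the pipeline.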

diff --git a/docs/moss-tts-firstclass-e2e_zh.md b/docs/moss-tts-firstclass-e2e_zh.md
@@ -24,6 +24,7 @@
 4. Python packages required by the helper scripts:
    - `numpy`
    - `soundfile`
+   - `tokenizers`
    - `onnxruntime`
 
 ## Build
@@ -55,7 +56,24 @@ cmake --build build --target llama-moss-tts -j
 
 For example:
 
-- `out/stage1a_moss_delay_firstclass_f16.gguf`
+- `out/moss_delay_firstclass_f16.gguf`
+
+You can generate it directly from the full Hugging Face MOSS-TTS model directory:
+
+```bash
+huggingface-cli download OpenMOSS-Team/MOSS-TTS --local-dir /path/to/MOSS-TTS-hf
+
+python convert_hf_to_gguf.py \
+    /path/to/MOSS-TTS-hf \
+    --outfile /path/to/moss_delay_firstclass_f16.gguf \
+    --outtype f16
+```
+
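+Important:
+
+- The `--model-gguf` used here is a **special first-class MOSS-TTS-Delay GGUF**; it must be converted directly from the full `OpenMOSS-Team/MOSS-TTS` Hugging Face model directory, as shown above.
+- It is **not** a generic GGUF file from the `OpenMOSS/MOSS-TTS-GGUF` repository.
+- Unless a file is explicitly documented as a MOSS-TTS-Delay GGUF adapted to this `llama.cpp` first-class implementation, do not feed files from `OpenMOSS/MOSS-TTS-GGUF` directly to this e2e pipeline.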
 
 ### Step 2: Prepare the tokenizer directory
 
@@ -147,7 +165,7 @@ python tools/tts/moss-tts-firstclass-e2e.py \
 | `--onnx-encoder` | path | Audio tokenizer encoder ONNX |
 | `--onnx-decoder` | path | Audio tokenizer decoder ONNX |
 | `--text` / `--text-file` | string / path | Input text; choose one of the two |
-| `--reference-audio` | path | Optional 24 kHz reference audio |
+| `--reference-audio` | path | Optional reference audio; if provided, it must be sampled at 24 kHz |
 | `--language` | `zh` / `en` / tag | Language tag passed to the prompt builder |
 | `--max-new-tokens` | int | Maximum number of generation steps |
 | `--text-temperature` | float | Text-channel sampling temperature, default `1.5` |
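
The updated `--reference-audio` row requires 24 kHz input whenever a reference clip is supplied. The sketch below is one way to verify that before invoking the pipeline. It is not taken from the PR: it assumes the reference clip is an uncompressed PCM WAV file so that the standard-library `wave` module can read its header, and the helper names are hypothetical. Other formats would need a different reader, for example the `soundfile` package already listed as a dependency.

```python
import wave

EXPECTED_SAMPLE_RATE = 24_000  # Hz, per the `--reference-audio` row above


def wav_sample_rate(path):
    """Read the sample rate from a PCM WAV header without decoding the audio."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_valid_reference_audio(path, expected=EXPECTED_SAMPLE_RATE):
    """Return True if the WAV file at `path` has the expected sample rate."""
    return wav_sample_rate(path) == expected
```

A clip that fails this check would have to be resampled to 24 kHz before being passed as `--reference-audio`.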