diff --git a/docs/moss-tts-firstclass-e2e.md b/docs/moss-tts-firstclass-e2e.md
index 5015fd77d..bdf9efd96 100644
--- a/docs/moss-tts-firstclass-e2e.md
+++ b/docs/moss-tts-firstclass-e2e.md
@@ -24,6 +24,7 @@ Unlike the older `moss_tts_delay/llama_cpp` backend in the `MOSS-TTS` repository
 4. Python packages required by the helper scripts:
    - `numpy`
    - `soundfile`
+   - `tokenizers`
    - `onnxruntime`
 
 ## Build
@@ -55,7 +56,24 @@ You need a first-class MOSS-TTS-Delay GGUF model that already contains:
 
 For example:
 
-- `out/stage1a_moss_delay_firstclass_f16.gguf`
+- `out/moss_delay_firstclass_f16.gguf`
+
+You can generate it directly from the full Hugging Face MOSS-TTS model directory:
+
+```bash
+huggingface-cli download OpenMOSS-Team/MOSS-TTS --local-dir /path/to/MOSS-TTS-hf
+
+python convert_hf_to_gguf.py \
+    /path/to/MOSS-TTS-hf \
+    --outfile /path/to/moss_delay_firstclass_f16.gguf \
+    --outtype f16
+```
+
+Important:
+
+- The `--model-gguf` file used by this e2e pipeline is a **special first-class MOSS-TTS-Delay GGUF** generated from the full `OpenMOSS-Team/MOSS-TTS` Hugging Face model directory with the command above.
+- It is **not** the same thing as a generic GGUF downloaded from `OpenMOSS/MOSS-TTS-GGUF`.
+- Do not point this pipeline at a file from `OpenMOSS/MOSS-TTS-GGUF` unless that file was explicitly produced as a first-class MOSS-TTS-Delay GGUF for this `llama.cpp` implementation.
 
 ### Step 2: Prepare the tokenizer directory
 
@@ -146,7 +164,7 @@ python tools/tts/moss-tts-firstclass-e2e.py \
 | `--onnx-encoder` | path | Audio tokenizer encoder ONNX |
 | `--onnx-decoder` | path | Audio tokenizer decoder ONNX |
 | `--text` / `--text-file` | string / path | Input text, choose exactly one |
-| `--reference-audio` | path | Optional 24 kHz reference audio |
+| `--reference-audio` | path | Optional reference audio; if provided, it must be 24 kHz |
 | `--language` | `zh` / `en` / tag | Language tag passed to the prompt builder |
 | `--max-new-tokens` | int | Maximum generation steps |
 | `--text-temperature` | float | Text-channel sampling temperature, default `1.5` |
diff --git a/docs/moss-tts-firstclass-e2e_zh.md b/docs/moss-tts-firstclass-e2e_zh.md
index 345187e3b..644a4bf4c 100644
--- a/docs/moss-tts-firstclass-e2e_zh.md
+++ b/docs/moss-tts-firstclass-e2e_zh.md
@@ -24,6 +24,7 @@
 4. Python packages required by the helper scripts:
    - `numpy`
    - `soundfile`
+   - `tokenizers`
    - `onnxruntime`
 
 ## Build
@@ -55,7 +56,24 @@ cmake --build build --target llama-moss-tts -j
 
 For example:
 
-- `out/stage1a_moss_delay_firstclass_f16.gguf`
+- `out/moss_delay_firstclass_f16.gguf`
+
+You can generate it directly from the full Hugging Face MOSS-TTS model directory:
+
+```bash
+huggingface-cli download OpenMOSS-Team/MOSS-TTS --local-dir /path/to/MOSS-TTS-hf
+
+python convert_hf_to_gguf.py \
+    /path/to/MOSS-TTS-hf \
+    --outfile /path/to/moss_delay_firstclass_f16.gguf \
+    --outtype f16
+```
+
+Important:
+
+- The `--model-gguf` file used here is a **special first-class MOSS-TTS-Delay GGUF**; it must be converted directly from the full `OpenMOSS-Team/MOSS-TTS` Hugging Face model directory, as shown above.
+- It is **not** one of the generic GGUF files in the `OpenMOSS/MOSS-TTS-GGUF` repository.
+- Do not feed a file from `OpenMOSS/MOSS-TTS-GGUF` directly to this e2e pipeline unless it is explicitly documented as a MOSS-TTS-Delay GGUF built for this first-class `llama.cpp` implementation.
 
 ### Step 2: Prepare the tokenizer directory
 
@@ -147,7 +165,7 @@ python tools/tts/moss-tts-firstclass-e2e.py \
 | `--onnx-encoder` | path | Audio tokenizer encoder ONNX |
 | `--onnx-decoder` | path | Audio tokenizer decoder ONNX |
 | `--text` / `--text-file` | string / path | Input text; choose exactly one |
-| `--reference-audio` | path | Optional 24 kHz reference audio |
+| `--reference-audio` | path | Optional reference audio; if provided, it must be 24 kHz |
 | `--language` | `zh` / `en` / tag | Language tag passed to the prompt builder |
 | `--max-new-tokens` | int | Maximum number of generation steps |
 | `--text-temperature` | float | Text-channel sampling temperature; default `1.5` |
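
Note on the `--reference-audio` row: the updated wording says that a supplied reference file must be 24 kHz. A pre-flight check along the following lines could catch a mismatched file before the pipeline starts. This is only a minimal sketch and is not part of this patch. It assumes the reference is a PCM WAV file readable by Python's standard `wave` module (the helper scripts themselves list `soundfile` as a dependency and may accept other formats), and the names `wav_sample_rate`, `check_reference_audio`, and `write_silence` are hypothetical, introduced here for illustration.

```python
import tempfile
import wave
from pathlib import Path

EXPECTED_HZ = 24_000  # sample rate the docs require for --reference-audio


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def check_reference_audio(path, expected_hz=EXPECTED_HZ):
    """Return the sample rate, or raise ValueError if it is not expected_hz."""
    rate = wav_sample_rate(path)
    if rate != expected_hz:
        raise ValueError(
            f"{path}: sample rate is {rate} Hz, expected {expected_hz} Hz; "
            "resample the file before passing it as --reference-audio"
        )
    return rate


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV (demo helper only)."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


if __name__ == "__main__":
    # Demo: create one matching and one mismatched file, then check both.
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "ref_24k.wav"
        bad = Path(tmp) / "ref_16k.wav"
        write_silence(good, 24_000)
        write_silence(bad, 16_000)
        print("accepted:", check_reference_audio(good), "Hz")
        try:
            check_reference_audio(bad)
        except ValueError as exc:
            print("rejected:", exc)
```

Raising an error rather than resampling silently keeps the check side-effect free; whether the e2e script should resample on its own is a separate design decision that this sketch does not take a position on.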