This document details the MOSS-TTS-Realtime architecture.
MOSS-TTS-RealTime is built on a scalable architecture consisting of a 1.7B-parameter backbone and a 200M-parameter local Transformer, designed to deliver high-quality, low-latency speech synthesis. The backbone encodes linguistic and contextual information and exposes its hidden states to the local Transformer, which autoregressively generates RVQ-based audio tokens. Audio reconstruction is performed using the MOSS-Audio-Tokenizer, enabling efficient and high-fidelity waveform synthesis.
The model adopts a hierarchical input design in which text tokens and speech tokens are processed at different levels, allowing the system to naturally support streaming text input and streaming audio output. This design enables incremental generation while preserving long-range context, making it well suited for real-time conversational scenarios.
| Feature | Specification |
|---|---|
| Backbone Model | Initialized from Qwen3-1.7B |
| Depth Transformer | 4 Transformer blocks, architecturally aligned with the Qwen3-1.7B backbone |
| Audio Tokenizer | Cat (Causal Audio Tokenizer) |
| Sampling Rate | 24,000 Hz |
| Frame Rate | 12.5 Hz (1s ≈ 12.5 tokens/blocks) |
| Codebooks | 16 RVQ layers |
| Generation Mode | Purely Autoregressive (AR) |