Add MOSS-TTS-Local model family #19
|
Thank you for the effort. For AR models, I’d suggest holding off for now because I’m working on stabilizing the patterns and creating a reusable template. I'm being strict about new model PRs because each model adds long-term maintenance cost. I’d like to see the following before merge:
Those are the easy parts. The next part is more important:
|
|
Thanks for the detailed review. All five points are addressed in the latest commit (Phase 6: backbone KV cache, batched prefill, hardware-adaptive dtype, and the test/measurement harness). Summary below; the raw runs are reproducible from the scripts under

Headline change: generation was re-forwarding the whole sequence every frame (O(T²)), which is why long text and clones were slower than the reference. That path is now an incremental KV cache (single batched prefill + one cached step per frame), byte-identical to the old greedy output, and every generation-bound path is now faster than Python.

1. Test cases covering the major paths

Added to

2. Similarity vs the Python reference

Two things worth flagging up front: for an autoregressive model, free-running greedy/sampled rollouts diverge on tiny fp differences (a flipped codebook token → a different-but-valid continuation), so raw audio cosine is not a meaningful parity metric here; you can see that in the low
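As a rough illustration of the headline change described above, the following minimal sketch counts how many token positions pass through the backbone under each strategy. The function names and the position-count proxy are assumptions made for this example; they are not taken from the PR's code, and real cost also depends on attention over the cached keys.

```python
def positions_reforward(prefix_len: int, frames: int) -> int:
    """Positions processed when every frame re-runs the whole sequence.

    Frame t (0-indexed) forwards the prefix plus the t frames generated
    so far, so the total grows quadratically with the number of frames.
    """
    return sum(prefix_len + t for t in range(frames))


def positions_cached(prefix_len: int, frames: int) -> int:
    """Positions processed with one batched prefill plus one cached step per frame.

    The prefix is forwarded once to seed the cache; each frame then
    forwards a single new position, so the total grows linearly.
    """
    return prefix_len + frames


if __name__ == "__main__":
    # Hypothetical prefix length; only the growth trend matters here.
    for frames in (10, 100, 1000):
        print(frames, positions_reforward(50, frames), positions_cached(50, frames))
```

The gap between the two counts widens as the number of frames grows, which is consistent with long text and clones being the cases that suffered most.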
3. Performance vs Python (CUDA, bf16; RTF = wall seconds per second of audio, lower is better)
Every generation-bound path is now faster than the reference. The two slow rows are short clones, and it's entirely the MOSS-Audio-Tokenizer-v2 encoder (O(T²) attention over the reference, ~16 s fixed) dominating a few seconds of output; on the long clone that cost amortizes and C++ is ahead again. The encoder is a separate path from the backbone; chunked/banded encoder attention is the logical next step if you want short clones faster than Python too.

4. Long-lived session

The

5. Longform

With the KV cache + batched prefill, longform no longer blows up; it's now RTF-bound. The caveat is that the test text (~1000 words) is ~7 minutes of audio generated as a single prefix with no chunking, so it's inherently a multi-minute run regardless of implementation (the Python reference is comparably RTF-bound). The production path for longform is sentence-level text chunking (short prefix + short generation per chunk, concatenated); the single-prefix case here is the worst-case stress test.

Also included
Happy to split any of this into separate commits or add more cases if that's easier to review. |
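For readers following the RTF figures in section 3 of the comment above, here is a minimal sketch of the metric as defined there (wall seconds per second of audio, lower is better), plus how a fixed per-request cost inflates it for short outputs. The helper names and the numbers in the demo are hypothetical and are not measurements from this PR.

```python
def rtf(wall_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock seconds spent per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


def audio_seconds(num_frames: int, sample_rate: int) -> float:
    """Duration of a clip given its per-channel frame count and sample rate."""
    return num_frames / sample_rate


def rtf_with_fixed_cost(fixed_seconds: float, steady_rtf: float, audio_secs: float) -> float:
    """RTF when a fixed cost is added on top of a steady generation rate."""
    return rtf(fixed_seconds + steady_rtf * audio_secs, audio_secs)


if __name__ == "__main__":
    # Hypothetical values: a 10 s fixed cost and a steady RTF of 0.5.
    for secs in (4.0, 40.0, 400.0):
        print(secs, rtf_with_fixed_cost(10.0, 0.5, secs))
```

With a fixed cost in the numerator, short clips report a much higher RTF than long ones even when the steady generation rate is unchanged, which is the amortization effect the comment describes.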
|
A correction to one claim in my previous comment, after profiling the short-clone rows more carefully. I attributed the short-clone slowness to the codec encoder's attention compute. That was wrong. Decomposing the timings shows the actual fp32 encode on CUDA costs only ~1 s; the ~12.5 s that dominates the first clone request is the one-time lazy load of the encoder weights (the

Evidence from the long-lived session case (fp32, CUDA):
So steady-state cloning is at parity or faster than the reference; the 0.15–0.22× rows in my table are an accounting artifact of where the one-time load lands. The Python numbers don't include the audio tokenizer load either (it happens in

I also tried bf16 encoder weights on CUDA (the codec config's own

If first-clone latency matters for your use cases, the clean fix would be building the codec encoder eagerly at session load (mirroring where Python pays it) rather than lazily on first clone; happy to make that change if you'd prefer it. |
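The lazy-versus-eager trade-off mentioned in the comment above can be sketched as follows. This is a toy model under stated assumptions: `CloneSession`, `make_counting_loader`, and the stand-in encoder are all hypothetical names for illustration and do not reflect the PR's actual session API.

```python
from typing import Callable, List, Optional, Tuple

Encoder = Callable[[List[float]], List[int]]


class CloneSession:
    """Toy session that builds its encoder either eagerly or on first use."""

    def __init__(self, load_encoder: Callable[[], Encoder], eager: bool = False) -> None:
        self._load_encoder = load_encoder
        # Eager: pay the load cost at session construction.
        # Lazy: defer it until the first clone request.
        self._encoder: Optional[Encoder] = load_encoder() if eager else None

    def clone(self, reference_audio: List[float]) -> List[int]:
        if self._encoder is None:
            # One-time cost that lands on the first clone request.
            self._encoder = self._load_encoder()
        return self._encoder(reference_audio)


def make_counting_loader() -> Tuple[Callable[[], Encoder], List[int]]:
    """Return a fake loader and a list that records each load call."""
    loads: List[int] = []

    def load() -> Encoder:
        loads.append(1)
        # Stand-in encoder: returns the input length as a single "code".
        return lambda audio: [len(audio)]

    return load, loads
```

In both modes the load happens exactly once per session; the only difference is whether it is charged to session start or to the first clone request, which is what shifts the per-request timings.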
|
@justinjohn0306 Amazing work! I'll test and get back to you after the holiday. Can you remove the VibeVoice part from this PR? Your PR on VibeVoice will be merged first, and I may (or may not) make changes to that part. |
Introduce the moss_tts_local family (OpenMOSS MOSS-TTS-Local-Transformer-v1.5): downloader package, config/asset loading, and family registration; the Qwen3 backbone (per-head QK-norm, GQA, NEOX RoPE); the 1-layer GPT-2-J depth transformer (fused c_attn, interleaved RoPE) with the 12-codebook generation loop and binary end gate; and the text processor that builds the generation prefix. Verified numerically: the C++ greedy first-frame codes match the fp32 Python reference exactly on all 12 codebooks (harness: moss_tts_local_smoke). Codec decoder and session/CLI wiring (audio output, voice cloning) still to come. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Port the codec decode path: the RLFQ dequantizer (12-of-32 residual quantizers, weight-normed 1x1 projections) turning codes into the 768-dim latent, then the "CNN-free" decoder -- six causal ProjectedTransformers (fused-QKV attention, interleaved RoPE, LayerScale, erf-GELU MLP) interleaved with reshape-based patch upsampling -- and channel de-interleave to stereo. Verified numerically against the fp32 Python model.decode: cosine 1.0 / max-abs-diff 1.6e-5 per channel (dequant alone is exact, cosine 1.0). Harnesses: codec_dequant_parity, codec_decode_parity. Session/CLI wiring and the voice-clone encoder remain (Phase 5). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
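For reference, the two parity figures quoted in the commit message above (cosine and max-abs-diff) can be computed along these lines. This is a dependency-free sketch with hypothetical function names; the actual harnesses (codec_dequant_parity, codec_decode_parity) are not shown in this thread and may differ.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length signals."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for an all-zero signal")
    return dot / (norm_a * norm_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two signals."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))
```

Cosine is insensitive to overall scale, so reporting max-abs-diff alongside it guards against a gain mismatch going unnoticed.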
Wire the MOSS-TTS-Local session end to end: the text processor builds the generation prefix, the generator emits RVQ codes, and the codec decoder renders 48 kHz stereo. Add voice cloning via the MOSS-Audio-Tokenizer-v2 encoder (audio -> RLFQ codes), the structural mirror of the decoder; extract the shared ProjectedTransformer machinery into codec_transformer.h so the encoder and decoder share one implementation. The processor's clone prefix embeds the reference speaker's codes under "- Reference(s):"; the session resamples and loudness-normalizes a --voice-ref clip, encodes it, and seeds generation. Parity-verified against the transformers reference: generation loop 96/96 codes over 8 frames, encoder 300/300 codes, clone input_ids 100x13 exact, decoder cosine 1.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
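Counts such as "96/96 codes over 8 frames" correspond to 8 frames times 12 codebooks compared for exact equality. A minimal sketch of that kind of discrete parity check follows; the function name and the list-of-lists layout are assumptions for this example and do not describe the PR's harness.

```python
from typing import List, Tuple


def count_matching_codes(ours: List[List[int]], reference: List[List[int]]) -> Tuple[int, int]:
    """Return (matched, total) over a frames x codebooks grid of integer codes."""
    if len(ours) != len(reference):
        raise ValueError("frame counts differ")
    matched = 0
    total = 0
    for frame_ours, frame_ref in zip(ours, reference):
        if len(frame_ours) != len(frame_ref):
            raise ValueError("codebook counts differ within a frame")
        for code_ours, code_ref in zip(frame_ours, frame_ref):
            total += 1
            if code_ours == code_ref:
                matched += 1
    return matched, total


if __name__ == "__main__":
    # Synthetic codes: 8 frames, 12 codebooks each.
    reference = [[frame * 12 + cb for cb in range(12)] for frame in range(8)]
    print(count_matching_codes(reference, reference))
```

Because the codes are discrete, exact equality is the natural criterion here, unlike the floating-point tolerance used for decoded audio.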
…st harness

Backbone generation re-forwarded the whole sequence every frame (O(T^2)), which made long text slower than the Python reference and voice clones far slower. This adds an incremental KV-cached generation path plus the tests and tooling the PR review asked for.

- backbone: per-layer KV cache with a reusable single-position step graph (begin_generation/step) and a single batched prefill that seeds the cache in one forward (prefill), replacing the per-frame re-forward. Removes the now-dead forward_prefill path. Greedy output is byte-identical to the previous path.
- session: hardware-adaptive "auto" weight_type (CUDA bf16, CPU f32, other backends native), overridable via moss_tts_local.weight_type.
- tools: audiocpp_cli path cases (text-only, voice clone, sampled, long-lived session) and a longform clone case; reference/parity/perf/session-probe scripts; per-request and session timing in the offline batch runner.

Perf (CUDA bf16, RTF vs Python): text-only 0.59 vs 0.95, sampled 0.63 vs 0.92, long text 0.60 vs 0.74, long clone 0.61 vs 0.75 -- every generation-bound path now faster than Python. Short clones remain encoder-bound (codec encoder O(T^2), a separate path). Teacher-forced codec-decode parity: cosine 0.999999996.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
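The "auto" weight_type rule stated in the commit message above (CUDA bf16, CPU f32, other backends native, with an explicit override) can be sketched like this. The function name, the backend strings, and the "native" return value are placeholders chosen for this example; the real logic lives in the C++ session code and is not reproduced here.

```python
def resolve_weight_type(requested: str, backend: str) -> str:
    """Pick a weight dtype from a requested setting and a backend name.

    Any value other than "auto" is treated as an explicit override and
    returned unchanged. Backend names are compared case-insensitively.
    """
    if requested != "auto":
        return requested
    backend = backend.lower()
    if backend == "cuda":
        return "bf16"
    if backend == "cpu":
        return "f32"
    # Placeholder for "whatever the backend prefers".
    return "native"


if __name__ == "__main__":
    for backend in ("cuda", "cpu", "metal"):
        print(backend, resolve_weight_type("auto", backend))
```

Keeping the override check first means a user-supplied moss_tts_local.weight_type always wins over the hardware default.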
0f33510 to
802b2b2
Done 👍 |
As requested, the VibeVoice commits have been removed from this PR; it's now MOSS-TTS-Local only, rebased on the latest release-0.1 (on top of the merged #14).
MOSS-TTS-Local-Transformer-v1.5: new family moss_tts_local
Full C++/ggml port; no Python at inference or preprocessing.
ProjectedTransformer implementation
--voice-ref): the reference is resampled (torchaudio sinc-Hann parity) + loudness-normalized, encoded to codes, and spliced into the generation prompt, all in C++
Every stage numerically verified against the HuggingFace transformers reference:
input_ids
Usage: