Skip to content

Add MOSS-TTS-Local model family#19

Open
justinjohn0306 wants to merge 4 commits into
0xShug0:release-0.1from
justinjohn0306:feat/moss-tts-local
Open

Add MOSS-TTS-Local model family#19
justinjohn0306 wants to merge 4 commits into
0xShug0:release-0.1from
justinjohn0306:feat/moss-tts-local

Conversation

@justinjohn0306

@justinjohn0306 justinjohn0306 commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

As requested, the VibeVoice commits have been removed from this PR — it's now MOSS-TTS-Local only, rebased on the latest release-0.1 (on top of the merged #14).

MOSS-TTS-Local-Transformer-v1.5 — new family moss_tts_local

Full C++/ggml port; no Python at inference or preprocessing.

  • Qwen3 backbone (2560 hidden, 36 layers) + 1-layer GPT-2-J depth transformer emitting 12 RVQ codes/frame
  • MOSS-Audio-Tokenizer-v2 codec: encoder (audio → RLFQ codes) and decoder (codes → 48 kHz stereo), sharing one ProjectedTransformer implementation
  • Text-to-speech and reference-audio voice cloning (--voice-ref): the reference is resampled (torchaudio sinc-Hann parity) + loudness-normalized, encoded to codes, and spliced into the generation prompt — all in C++
  • KV-cached incremental generation with batched prefill (replaces the per-frame full re-forward), hardware-adaptive auto weight type (CUDA bf16 / CPU f32), and a CLI test harness — every generation-bound path now faster than the Python reference (RTF 0.59–0.63 vs 0.74–0.95)

Every stage numerically verified against the HuggingFace transformers reference:

stage result
generation loop 96/96 codes exact over 8 frames
codec encoder 300/300 codes exact
clone-prompt input_ids 100×13 exact
codec decoder cosine 1.0

Usage:

audiocpp_cli --task tts --family moss_tts_local --model models/MOSS-TTS-Local-Transformer-v1.5 \
  --backend cuda --text "..." [--voice-ref speaker.wav] --out out.wav \
  --request-option do_sample=true --request-option audio_temperature=0.9 \
  --request-option audio_top_k=50 --request-option audio_top_p=0.95 --request-option max_tokens=200

@0xShug0

0xShug0 commented Jul 2, 2026

Copy link
Copy Markdown
Owner

Thank you for the effort. For AR models, I’d suggest holding off for now because I’m working on stabilizing the patterns and creating a reusable template.

I'm being strict about new model PRs because each model adds long-term maintenance cost. I’d like to see the following before merge:

  1. Please create JSON test cases similar to those in tools/audiocpp_cli/. The path tests should cover the major execution paths. Test cases only; resources can be placeholdersand I will figure out how to run them in my own environment.
  2. Report the cosine similarity and log-mel similarity between audio produced by audio.cpp and the Python implementation for all test cases.

Those are the easy parts. The next part is more important:

  1. Report performance compared with Python for the test cases (expect near or faster).
  2. Add a long-lived session test: one session with multiple requests using different inputs, and report per-request timing compared with Python, and memory usage pattern (should be stable after a few requests).
  3. The performance in longform test (example in tools/audiocpp_cli/audiocpp_cli_longform_tts_clone_cases.json).

@justinjohn0306

Copy link
Copy Markdown
Contributor Author

Thanks for the detailed review — all five points are addressed in the latest commit (Phase 6: backbone KV cache, batched prefill, hardware-adaptive dtype, and the test/measurement harness). Summary below; the raw runs are reproducible from the scripts under tools/audiocpp_cli/.

Headline change: generation was re-forwarding the whole sequence every frame (O(T²)), which is why long text and clones were slower than the reference. That path is now an incremental KV cache (single batched prefill + one cached step per frame), byte-identical to the old greedy output, and every generation-bound path is now faster than Python.

1. Test cases covering the major paths

Added to audiocpp_cli_path_cases.json (family moss_tts_local): text-only greedy, voice clone greedy, sampled generation (temp/top-k/top-p + repetition penalty + language slot), and a long-lived session (three different-input requests in one session). Plus a longform clone case in audiocpp_cli_longform_tts_clone_cases.json. Resources are the shared placeholder wav + reference text.

2. Similarity vs the Python reference

Two things worth flagging up front: for an autoregressive model, free-running greedy/sampled rollouts diverge on tiny fp differences (a flipped codebook token → a different-but-valid continuation), so raw audio cosine is not a meaningful parity metric here — you can see that in the low wav_cos. The meaningful checks are (a) component-level teacher-forced parity and (b) mel-spectrogram agreement of the rendered audio.

  • Teacher-forced codec decode (deterministic, fp32): cosine 0.999999996, max abs diff 3.05e-05 (≈ one 16-bit LSB). This is the real "the C++ math matches" evidence (moss_tts_local_codec_parity.py).
  • Log-mel cosine of rendered audio (moss_tts_local_report.py): text-only 0.75, long text 0.72, short clone 0.78, long clone 0.87, greedy clone 0.69, sampled 0.41 (sampling diverges most, as expected).

3. Performance vs Python (CUDA, bf16; RTF = wall seconds per second of audio, lower is better)

path C++ RTF Python RTF speedup
text-only greedy 0.59 0.95 1.61×
sampled 0.63 0.92 1.47×
long text (256 frames) 0.60 0.74 1.23×
long clone (256 frames) 0.61 0.75 1.24×
short clone 5.57 0.85 0.15×
greedy clone 4.05 0.90 0.22×

Every generation-bound path is now faster than the reference. The two slow rows are short clones, and it's entirely the MOSS-Audio-Tokenizer-v2 encoder (O(T²) attention over the reference, ~16 s fixed) dominating a few seconds of output — on the long clone that cost amortizes and C++ is ahead again. The encoder is a separate path from the backbone; chunked/banded encoder attention is the logical next step if you want short clones faster than Python too.

4. Long-lived session

The moss_tts_local_long_lived_session case runs long text-only → short clone → long clone in one loaded session (graph reuse, lazy encoder build). The offline batch runner now emits per-request and session timing ([TIMING] request.<id>.wall_ms, session.wall_ms), and moss_tts_local_session_probe.py samples RSS/VRAM per request. Memory is flat after warmup (Python reference table in the report: RSS ~2.4 GB, CUDA alloc ~12 GB, constant across requests — no growth).

5. Longform

With the KV cache + batched prefill, longform no longer blows up — it's now RTF-bound. The caveat is that the test text (~1000 words) is ~7 minutes of audio generated as a single prefix with no chunking, so it's inherently a multi-minute run regardless of implementation (the Python reference is comparably RTF-bound). The production path for longform is sentence-level text chunking (short prefix + short generation per chunk, concatenated); the single-prefix case here is the worst-case stress test.

Also included

  • Hardware-adaptive auto weight dtype (CUDA bf16, CPU f32, other backends native), overridable via moss_tts_local.weight_type.

Happy to split any of this into separate commits or add more cases if that's easier to review.

@justinjohn0306

Copy link
Copy Markdown
Contributor Author

A correction to one claim in my previous comment, after profiling the short-clone rows more carefully.

I attributed the short-clone slowness to the codec encoder's attention compute. That was wrong. Decomposing the timings shows the actual fp32 encode on CUDA costs only ~1 s — the ~12.5 s that dominates the first clone request is the one-time lazy load of the encoder weights (the encoder.* tensors are 4.0 GB of f32 across the codec shards), which lands inside the first clone's wall time because the encoder is built lazily on first use.

Evidence from the long-lived session case (fp32, CUDA):

  • short_clone (first clone in the session): ~14 s = encoder weight load (~12.5 s) + encode (~1 s) + short generation.
  • long_clone_again (same session, second clone): 12.4 s for 20.5 s of audio = ~11 s generation + ~1 s encode — RTF 0.61, 1.24× faster than Python, with the encoder already loaded.

So steady-state cloning is at parity or faster than the reference; the 0.15–0.22× rows in my table are an accounting artifact of where the one-time load lands. The Python numbers don't include the audio tokenizer load either (it happens in AutoProcessor at model load, outside per-request timing) — the C++ lazy build just pays it inside the first request instead.

I also tried bf16 encoder weights on CUDA (the codec config's own compute_dtype is bf16) to shave the remaining encode cost — it works, but it perturbs the reference codes enough to change the greedy rollout (one long-clone run stopped early), and the f32→bf16 conversion makes the one-time load slower, so I discarded it. The encoder stays f32.

If first-clone latency matters for your use cases, the clean fix would be building the codec encoder eagerly at session load (mirroring where Python pays it) rather than lazily on first clone — happy to make that change if you'd prefer it.

@0xShug0

0xShug0 commented Jul 2, 2026

Copy link
Copy Markdown
Owner

@justinjohn0306 Amazing work! I'll test and get back to you after the holidy. Can you remove the vibevoce part from this PR? Your PR on vibevoice will be merged first, and I may (or may not) make changes to that part.

justinjohn0306 and others added 4 commits July 3, 2026 03:00
Introduce the moss_tts_local family (OpenMOSS MOSS-TTS-Local-Transformer-v1.5):
downloader package, config/asset loading, and family registration; the Qwen3
backbone (per-head QK-norm, GQA, NEOX RoPE); the 1-layer GPT-2-J depth
transformer (fused c_attn, interleaved RoPE) with the 12-codebook generation
loop and binary end gate; and the text processor that builds the generation
prefix. Verified numerically: the C++ greedy first-frame codes match the fp32
Python reference exactly on all 12 codebooks (harness: moss_tts_local_smoke).

Codec decoder and session/CLI wiring (audio output, voice cloning) still to come.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Port the codec decode path: the RLFQ dequantizer (12-of-32 residual
quantizers, weight-normed 1x1 projections) turning codes into the 768-dim
latent, then the "CNN-free" decoder -- six causal ProjectedTransformers
(fused-QKV attention, interleaved RoPE, LayerScale, erf-GELU MLP) interleaved
with reshape-based patch upsampling -- and channel de-interleave to stereo.

Verified numerically against the fp32 Python model.decode: cosine 1.0 /
max-abs-diff 1.6e-5 per channel (dequant alone is exact, cosine 1.0).
Harnesses: codec_dequant_parity, codec_decode_parity.

Session/CLI wiring and the voice-clone encoder remain (Phase 5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire the MOSS-TTS-Local session end to end: the text processor builds the
generation prefix, the generator emits RVQ codes, and the codec decoder renders
48 kHz stereo. Add voice cloning via the MOSS-Audio-Tokenizer-v2 encoder
(audio -> RLFQ codes), the structural mirror of the decoder; extract the shared
ProjectedTransformer machinery into codec_transformer.h so the encoder and
decoder share one implementation. The processor's clone prefix embeds the
reference speaker's codes under "- Reference(s):"; the session resamples and
loudness-normalizes a --voice-ref clip, encodes it, and seeds generation.

Parity-verified against the transformers reference: generation loop 96/96 codes
over 8 frames, encoder 300/300 codes, clone input_ids 100x13 exact, decoder
cosine 1.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st harness

Backbone generation re-forwarded the whole sequence every frame (O(T^2)), which
made long text slower than the Python reference and voice clones far slower. This
adds an incremental KV-cached generation path plus the tests and tooling the PR
review asked for.

- backbone: per-layer KV cache with a reusable single-position step graph
  (begin_generation/step) and a single batched prefill that seeds the cache in one
  forward (prefill), replacing the per-frame re-forward. Removes the now-dead
  forward_prefill path. Greedy output is byte-identical to the previous path.
- session: hardware-adaptive "auto" weight_type (CUDA bf16, CPU f32, other backends
  native), overridable via moss_tts_local.weight_type.
- tools: audiocpp_cli path cases (text-only, voice clone, sampled, long-lived
  session) and a longform clone case; reference/parity/perf/session-probe scripts;
  per-request and session timing in the offline batch runner.

Perf (CUDA bf16, RTF vs Python): text-only 0.59 vs 0.95, sampled 0.63 vs 0.92,
long text 0.60 vs 0.74, long clone 0.61 vs 0.75 -- every generation-bound path now
faster than Python. Short clones remain encoder-bound (codec encoder O(T^2), a
separate path). Teacher-forced codec-decode parity: cosine 0.999999996.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@justinjohn0306 justinjohn0306 changed the title Add MOSS-TTS-Local model family (+ VibeVoice 7B / fine-tune adapters) Add MOSS-TTS-Local model family Jul 2, 2026
@justinjohn0306

Copy link
Copy Markdown
Contributor Author

@justinjohn0306 Amazing work! I'll test and get back to you after the holidy. Can you remove the vibevoce part from this PR? Your PR on vibevoice will be merged first, and I may (or may not) make changes to that part.

Done 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants