Add MOSS-TTS-Local model family by justinjohn0306 · Pull Request #19 · 0xShug0/audio.cpp

justinjohn0306 · 2026-07-02T05:22:39Z

As requested, the VibeVoice commits have been removed from this PR — it's now MOSS-TTS-Local only, rebased on the latest release-0.1 (on top of the merged #14).

MOSS-TTS-Local-Transformer-v1.5 — new family `moss_tts_local`

Full C++/ggml port; no Python at inference or preprocessing.

Qwen3 backbone (2560 hidden, 36 layers) + 1-layer GPT-2-J depth transformer emitting 12 RVQ codes/frame
MOSS-Audio-Tokenizer-v2 codec: encoder (audio → RLFQ codes) and decoder (codes → 48 kHz stereo), sharing one ProjectedTransformer implementation
Text-to-speech and reference-audio voice cloning (--voice-ref): the reference is resampled (torchaudio sinc-Hann parity) + loudness-normalized, encoded to codes, and spliced into the generation prompt — all in C++
KV-cached incremental generation with batched prefill (replaces the per-frame full re-forward), hardware-adaptive auto weight type (CUDA bf16 / CPU f32), and a CLI test harness — every generation-bound path now faster than the Python reference (RTF 0.59–0.63 vs 0.74–0.95)

Every stage numerically verified against the HuggingFace transformers reference:

stage	result
generation loop	96/96 codes exact over 8 frames
codec encoder	300/300 codes exact
clone-prompt `input_ids`	100×13 exact
codec decoder	cosine 1.0

Usage:

audiocpp_cli --task tts --family moss_tts_local --model models/MOSS-TTS-Local-Transformer-v1.5 \
  --backend cuda --text "..." [--voice-ref speaker.wav] --out out.wav \
  --request-option do_sample=true --request-option audio_temperature=0.9 \
  --request-option audio_top_k=50 --request-option audio_top_p=0.95 --request-option max_tokens=200

0xShug0 · 2026-07-02T06:32:34Z

Thank you for the effort. For AR models, I’d suggest holding off for now because I’m working on stabilizing the patterns and creating a reusable template.

I'm being strict about new model PRs because each model adds long-term maintenance cost. I’d like to see the following before merge:

Please create JSON test cases similar to those in tools/audiocpp_cli/. The path tests should cover the major execution paths. Test cases only; resources can be placeholdersand I will figure out how to run them in my own environment.
Report the cosine similarity and log-mel similarity between audio produced by audio.cpp and the Python implementation for all test cases.

Those are the easy parts. The next part is more important:

Report performance compared with Python for the test cases (expect near or faster).
Add a long-lived session test: one session with multiple requests using different inputs, and report per-request timing compared with Python, and memory usage pattern (should be stable after a few requests).
The performance in longform test (example in tools/audiocpp_cli/audiocpp_cli_longform_tts_clone_cases.json).

justinjohn0306 · 2026-07-02T16:24:09Z

Thanks for the detailed review — all five points are addressed in the latest commit (Phase 6: backbone KV cache, batched prefill, hardware-adaptive dtype, and the test/measurement harness). Summary below; the raw runs are reproducible from the scripts under tools/audiocpp_cli/.

Headline change: generation was re-forwarding the whole sequence every frame (O(T²)), which is why long text and clones were slower than the reference. That path is now an incremental KV cache (single batched prefill + one cached step per frame), byte-identical to the old greedy output, and every generation-bound path is now faster than Python.

1. Test cases covering the major paths

Added to audiocpp_cli_path_cases.json (family moss_tts_local): text-only greedy, voice clone greedy, sampled generation (temp/top-k/top-p + repetition penalty + language slot), and a long-lived session (three different-input requests in one session). Plus a longform clone case in audiocpp_cli_longform_tts_clone_cases.json. Resources are the shared placeholder wav + reference text.

2. Similarity vs the Python reference

Two things worth flagging up front: for an autoregressive model, free-running greedy/sampled rollouts diverge on tiny fp differences (a flipped codebook token → a different-but-valid continuation), so raw audio cosine is not a meaningful parity metric here — you can see that in the low wav_cos. The meaningful checks are (a) component-level teacher-forced parity and (b) mel-spectrogram agreement of the rendered audio.

Teacher-forced codec decode (deterministic, fp32): cosine 0.999999996, max abs diff 3.05e-05 (≈ one 16-bit LSB). This is the real "the C++ math matches" evidence (moss_tts_local_codec_parity.py).
Log-mel cosine of rendered audio (moss_tts_local_report.py): text-only 0.75, long text 0.72, short clone 0.78, long clone 0.87, greedy clone 0.69, sampled 0.41 (sampling diverges most, as expected).

3. Performance vs Python (CUDA, bf16; RTF = wall seconds per second of audio, lower is better)

path	C++ RTF	Python RTF	speedup
text-only greedy	0.59	0.95	1.61×
sampled	0.63	0.92	1.47×
long text (256 frames)	0.60	0.74	1.23×
long clone (256 frames)	0.61	0.75	1.24×
short clone	5.57	0.85	0.15×
greedy clone	4.05	0.90	0.22×

Every generation-bound path is now faster than the reference. The two slow rows are short clones, and it's entirely the MOSS-Audio-Tokenizer-v2 encoder (O(T²) attention over the reference, ~16 s fixed) dominating a few seconds of output — on the long clone that cost amortizes and C++ is ahead again. The encoder is a separate path from the backbone; chunked/banded encoder attention is the logical next step if you want short clones faster than Python too.

4. Long-lived session

The moss_tts_local_long_lived_session case runs long text-only → short clone → long clone in one loaded session (graph reuse, lazy encoder build). The offline batch runner now emits per-request and session timing ([TIMING] request.<id>.wall_ms, session.wall_ms), and moss_tts_local_session_probe.py samples RSS/VRAM per request. Memory is flat after warmup (Python reference table in the report: RSS ~2.4 GB, CUDA alloc ~12 GB, constant across requests — no growth).

5. Longform

With the KV cache + batched prefill, longform no longer blows up — it's now RTF-bound. The caveat is that the test text (~1000 words) is ~7 minutes of audio generated as a single prefix with no chunking, so it's inherently a multi-minute run regardless of implementation (the Python reference is comparably RTF-bound). The production path for longform is sentence-level text chunking (short prefix + short generation per chunk, concatenated); the single-prefix case here is the worst-case stress test.

Also included

Hardware-adaptive auto weight dtype (CUDA bf16, CPU f32, other backends native), overridable via moss_tts_local.weight_type.

Happy to split any of this into separate commits or add more cases if that's easier to review.

justinjohn0306 · 2026-07-02T16:43:57Z

A correction to one claim in my previous comment, after profiling the short-clone rows more carefully.

I attributed the short-clone slowness to the codec encoder's attention compute. That was wrong. Decomposing the timings shows the actual fp32 encode on CUDA costs only ~1 s — the ~12.5 s that dominates the first clone request is the one-time lazy load of the encoder weights (the encoder.* tensors are 4.0 GB of f32 across the codec shards), which lands inside the first clone's wall time because the encoder is built lazily on first use.

Evidence from the long-lived session case (fp32, CUDA):

short_clone (first clone in the session): ~14 s = encoder weight load (~12.5 s) + encode (~1 s) + short generation.
long_clone_again (same session, second clone): 12.4 s for 20.5 s of audio = ~11 s generation + ~1 s encode — RTF 0.61, 1.24× faster than Python, with the encoder already loaded.

So steady-state cloning is at parity or faster than the reference; the 0.15–0.22× rows in my table are an accounting artifact of where the one-time load lands. The Python numbers don't include the audio tokenizer load either (it happens in AutoProcessor at model load, outside per-request timing) — the C++ lazy build just pays it inside the first request instead.

I also tried bf16 encoder weights on CUDA (the codec config's own compute_dtype is bf16) to shave the remaining encode cost — it works, but it perturbs the reference codes enough to change the greedy rollout (one long-clone run stopped early), and the f32→bf16 conversion makes the one-time load slower, so I discarded it. The encoder stays f32.

If first-clone latency matters for your use cases, the clean fix would be building the codec encoder eagerly at session load (mirroring where Python pays it) rather than lazily on first clone — happy to make that change if you'd prefer it.

0xShug0 · 2026-07-02T17:31:32Z

@justinjohn0306 Amazing work! I'll test and get back to you after the holidy. Can you remove the vibevoce part from this PR? Your PR on vibevoice will be merged first, and I may (or may not) make changes to that part.

Introduce the moss_tts_local family (OpenMOSS MOSS-TTS-Local-Transformer-v1.5): downloader package, config/asset loading, and family registration; the Qwen3 backbone (per-head QK-norm, GQA, NEOX RoPE); the 1-layer GPT-2-J depth transformer (fused c_attn, interleaved RoPE) with the 12-codebook generation loop and binary end gate; and the text processor that builds the generation prefix. Verified numerically: the C++ greedy first-frame codes match the fp32 Python reference exactly on all 12 codebooks (harness: moss_tts_local_smoke). Codec decoder and session/CLI wiring (audio output, voice cloning) still to come. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Port the codec decode path: the RLFQ dequantizer (12-of-32 residual quantizers, weight-normed 1x1 projections) turning codes into the 768-dim latent, then the "CNN-free" decoder -- six causal ProjectedTransformers (fused-QKV attention, interleaved RoPE, LayerScale, erf-GELU MLP) interleaved with reshape-based patch upsampling -- and channel de-interleave to stereo. Verified numerically against the fp32 Python model.decode: cosine 1.0 / max-abs-diff 1.6e-5 per channel (dequant alone is exact, cosine 1.0). Harnesses: codec_dequant_parity, codec_decode_parity. Session/CLI wiring and the voice-clone encoder remain (Phase 5). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wire the MOSS-TTS-Local session end to end: the text processor builds the generation prefix, the generator emits RVQ codes, and the codec decoder renders 48 kHz stereo. Add voice cloning via the MOSS-Audio-Tokenizer-v2 encoder (audio -> RLFQ codes), the structural mirror of the decoder; extract the shared ProjectedTransformer machinery into codec_transformer.h so the encoder and decoder share one implementation. The processor's clone prefix embeds the reference speaker's codes under "- Reference(s):"; the session resamples and loudness-normalizes a --voice-ref clip, encodes it, and seeds generation. Parity-verified against the transformers reference: generation loop 96/96 codes over 8 frames, encoder 300/300 codes, clone input_ids 100x13 exact, decoder cosine 1.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…st harness Backbone generation re-forwarded the whole sequence every frame (O(T^2)), which made long text slower than the Python reference and voice clones far slower. This adds an incremental KV-cached generation path plus the tests and tooling the PR review asked for. - backbone: per-layer KV cache with a reusable single-position step graph (begin_generation/step) and a single batched prefill that seeds the cache in one forward (prefill), replacing the per-frame re-forward. Removes the now-dead forward_prefill path. Greedy output is byte-identical to the previous path. - session: hardware-adaptive "auto" weight_type (CUDA bf16, CPU f32, other backends native), overridable via moss_tts_local.weight_type. - tools: audiocpp_cli path cases (text-only, voice clone, sampled, long-lived session) and a longform clone case; reference/parity/perf/session-probe scripts; per-request and session timing in the offline batch runner. Perf (CUDA bf16, RTF vs Python): text-only 0.59 vs 0.95, sampled 0.63 vs 0.92, long text 0.60 vs 0.74, long clone 0.61 vs 0.75 -- every generation-bound path now faster than Python. Short clones remain encoder-bound (codec encoder O(T^2), a separate path). Teacher-forced codec-decode parity: cosine 0.999999996. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

justinjohn0306 · 2026-07-02T21:43:48Z

@justinjohn0306 Amazing work! I'll test and get back to you after the holidy. Can you remove the vibevoce part from this PR? Your PR on vibevoice will be merged first, and I may (or may not) make changes to that part.

Done 👍

0xShug0 mentioned this pull request Jul 2, 2026

Add native Higgs Audio v3 TTS support #20

Open

0xShug0 added needs-parity needs-tests labels Jul 2, 2026

justinjohn0306 and others added 4 commits July 3, 2026 03:00

justinjohn0306 force-pushed the feat/moss-tts-local branch from 0f33510 to 802b2b2 Compare July 2, 2026 21:35

justinjohn0306 changed the title ~~Add MOSS-TTS-Local model family (+ VibeVoice 7B / fine-tune adapters)~~ Add MOSS-TTS-Local model family Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add MOSS-TTS-Local model family#19

Add MOSS-TTS-Local model family#19
justinjohn0306 wants to merge 4 commits into
0xShug0:release-0.1from
justinjohn0306:feat/moss-tts-local

justinjohn0306 commented Jul 2, 2026 •

edited

Loading

Uh oh!

0xShug0 commented Jul 2, 2026

Uh oh!

justinjohn0306 commented Jul 2, 2026

Uh oh!

justinjohn0306 commented Jul 2, 2026

Uh oh!

0xShug0 commented Jul 2, 2026

Uh oh!

justinjohn0306 commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

justinjohn0306 commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

MOSS-TTS-Local-Transformer-v1.5 — new family moss_tts_local

Uh oh!

0xShug0 commented Jul 2, 2026

Uh oh!

justinjohn0306 commented Jul 2, 2026

1. Test cases covering the major paths

2. Similarity vs the Python reference

3. Performance vs Python (CUDA, bf16; RTF = wall seconds per second of audio, lower is better)

4. Long-lived session

5. Longform

Also included

Uh oh!

justinjohn0306 commented Jul 2, 2026

Uh oh!

0xShug0 commented Jul 2, 2026

Uh oh!

justinjohn0306 commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

justinjohn0306 commented Jul 2, 2026 •

edited

Loading

MOSS-TTS-Local-Transformer-v1.5 — new family `moss_tts_local`