Add VibeVoice 7B support and decoder LoRA merging by justinjohn0306 · Pull Request #14 · 0xShug0/audio.cpp

justinjohn0306 · 2026-07-01T09:49:56Z

Support the VibeVoice 7B model alongside the 1.5B:

- Add the vibevoice_7b model-manager package, reusing the shared Qwen2.5 tokenizer bundle.
- Handle the 7B config.json, tolerating its upstream "acostic_vae_dim" key and reading top-level tie_word_embeddings.
- Load a separate lm_head.weight when word embeddings are untied and bind the decoder logits head to it.
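
As an illustration of the config handling described above, here is a minimal sketch, assuming config.json has already been parsed into a flat map of integer fields. `IntFields`, `read_int_with_alias`, and `logits_head_tensor` are hypothetical names rather than the PR's actual code, and the correctly spelled key `acoustic_vae_dim` and the embedding tensor name are assumptions; only `acostic_vae_dim`, `tie_word_embeddings`, and `lm_head.weight` come from the description.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed config.json: a flat map of integer fields.
using IntFields = std::map<std::string, long long>;

// Return the value of the first key in `candidates` that is present in `cfg`.
// Listing the misspelled upstream key as an alias lets both spellings load.
std::optional<long long> read_int_with_alias(const IntFields& cfg,
                                             const std::vector<std::string>& candidates) {
    assert(!candidates.empty());
    for (const auto& key : candidates) {
        const auto it = cfg.find(key);
        if (it != cfg.end()) {
            return it->second;
        }
    }
    return std::nullopt;
}

// Pick the tensor that backs the logits head. With tied embeddings the head
// shares the input embedding; otherwise it needs its own lm_head.weight.
// "embed_tokens.weight" is an assumed name used only for illustration.
std::string logits_head_tensor(bool tie_word_embeddings) {
    return tie_word_embeddings ? "embed_tokens.weight" : "lm_head.weight";
}
```

A caller would pass the candidate spellings in order of preference, for example `{"acoustic_vae_dim", "acostic_vae_dim"}`, and treat an empty result as a malformed config.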

Add optional PEFT LoRA merging for the decoder:

- Merge lora_A/lora_B into the targeted linear weights at load time via a tensor-source decorator, so it composes with the weight-type quantization options at no per-step cost.
- Expose it through the vibevoice.lora / vibevoice.lora_scale load options and document it in the README and docs/tts.md.
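
The merge described above amounts to W' = W + scale · (B · A) applied once per targeted linear layer. The following is a minimal sketch of that arithmetic, assuming row-major float weights; the `Matrix` type and `merge_lora` function are hypothetical stand-ins and do not reflect the project's tensor-source decorator. In PEFT the scale is conventionally lora_alpha / r; how vibevoice.lora_scale feeds into it is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix; a stand-in for the project's tensor type.
struct Matrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<float> data;  // rows * cols values

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Merge a LoRA pair into a base weight in place: W += scale * (B * A).
// Shapes: W is [out, in], A is [rank, in], B is [out, rank].
void merge_lora(Matrix& w, const Matrix& a, const Matrix& b, float scale) {
    assert(b.rows == w.rows && a.cols == w.cols && b.cols == a.rows);
    for (std::size_t o = 0; o < w.rows; ++o) {
        for (std::size_t k = 0; k < a.rows; ++k) {
            const float scaled_b = scale * b.at(o, k);
            for (std::size_t i = 0; i < w.cols; ++i) {
                w.at(o, i) += scaled_b * a.at(k, i);
            }
        }
    }
}
```

Because the delta is folded into the weight before it is handed on, later stages such as quantization see an ordinary dense tensor and inference does no extra work per step.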

0xShug0 · 2026-07-01T14:38:54Z

@justinjohn0306 Thank you for your PR! Could you add extra logs to confirm that LoRA loading is working with the 7B model, and share the run log?

Reviewed by Codex:
docs/tts.md:392 and the README news say to use --load-option vibevoice.lora, but the implementation only reads vibevoice.lora from SessionOptions in src/models/vibevoice/session.cpp:34. The CLI stores --load-option separately in load_request.options, while session_options.options only receives --session-option. Since the VibeVoice loader does not consume load_request.options, users following the docs will run without any LoRA merge and get no error.

vibevoice.lora was only read from SessionOptions, so the documented --load-option path silently no-op'd. Consume it at load time via a shared apply_vibevoice_finetune_options helper (still handles --session-option too), guarding against passing it through both. Extend the overlay to apply all four trained components (mirroring infer.py's apply_lora): the language-model LoRA is delta-merged into the decoder linears, and the fine-tuned diffusion head and acoustic/semantic connectors replace their base tensors. Connector/head .bin files are read by a new pure-C++ Torch pickle reader (torch_bin), parity-checked against torch.load. Add torch_bin_parity and vibevoice_finetune_overlay_check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

justinjohn0306 · 2026-07-02T04:13:06Z

LoRA loading confirmed on VibeVoice-7B + `--load-option` fix

Thanks for the review — addressed both points, and extended the adapter support to match infer.py.

1. `--load-option vibevoice.lora` now actually merges (Codex finding)

Previously the adapter was only read from SessionOptions, so the documented --load-option path silently no-op'd. The loader now consumes request.options at load time via a shared apply_vibevoice_finetune_options(...) helper, and the same helper still handles --session-option for backward compatibility. A guard rejects passing it through both at once. So docs/tts.md / README (--load-option vibevoice.lora=<dir>) are now correct.
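
The behavior described above (accept the option from either source, reject it from both) can be sketched as follows. This is an illustration under assumptions, not the body of apply_vibevoice_finetune_options: `Options` and `resolve_option` are hypothetical names, and the options are modeled as plain string maps.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical model of the two option sets as key/value string maps.
using Options = std::map<std::string, std::string>;

// Look up `key` in both option sets. Accept it from either one, but reject
// the ambiguous case where it is supplied through both at once.
std::optional<std::string> resolve_option(const Options& load_options,
                                          const Options& session_options,
                                          const std::string& key) {
    assert(!key.empty());
    const auto load_it = load_options.find(key);
    const auto session_it = session_options.find(key);
    const bool in_load = load_it != load_options.end();
    const bool in_session = session_it != session_options.end();
    if (in_load && in_session) {
        throw std::invalid_argument(
            key + " was passed as both a load option and a session option");
    }
    if (in_load) {
        return load_it->second;
    }
    if (in_session) {
        return session_it->second;
    }
    return std::nullopt;  // Option not set: no adapter is applied.
}
```

Failing loudly on the ambiguous case avoids a repeat of the original problem, where a misrouted option was dropped without any error.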

2. Merge logging + 7B run log

Added --log output confirming each component. Running VibeVoice-7B with the fine-tune via --load-option:

[info][vibevoice] applying fine-tune adapter: ...\finetune_mp1\lora
[info][vibevoice] LM LoRA: merged 196 decoder modules (scale 4.000000)
[info][vibevoice] diffusion head: overrode 26 tensors
[info][vibevoice] acoustic connector: overrode 5 tensors
[info][vibevoice] semantic connector: overrode 5 tensors

3. Confirmed it changes the output toward the fine-tuned speaker

In an A/B run with a fixed seed and identical reference audio (base weights vs. the fine-tune adapter), the two outputs have a correlation of 0.956 (not identical) and the difference RMS is ≈ 30% of the signal: the words and prosody are the same, but the voice timbre shifts onto the target speaker. The result audibly matches the reference and the Python infer.py output with the same adapter.
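
For readers who want to reproduce this kind of A/B comparison, here is one way the two numbers could be computed from decoded samples. The exact method used for the figures above is not stated, so this sketch assumes Pearson correlation and RMS(a − b) / RMS(a) over equal-length mono signals; `AbResult` and `compare_signals` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct AbResult {
    double correlation;   // Pearson correlation of the two signals
    double relative_rms;  // RMS(a - b) / RMS(a)
};

// Compare two equal-length signals. Assumes neither signal is constant and
// `a` is not silent, otherwise the ratios below divide by zero.
AbResult compare_signals(const std::vector<float>& a, const std::vector<float>& b) {
    assert(!a.empty() && a.size() == b.size());
    const double n = static_cast<double>(a.size());
    double mean_a = 0.0;
    double mean_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        mean_a += a[i];
        mean_b += b[i];
    }
    mean_a /= n;
    mean_b /= n;

    double cov = 0.0, var_a = 0.0, var_b = 0.0, diff_sq = 0.0, a_sq = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double da = a[i] - mean_a;
        const double db = b[i] - mean_b;
        cov += da * db;
        var_a += da * da;
        var_b += db * db;
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        diff_sq += d * d;
        a_sq += static_cast<double>(a[i]) * static_cast<double>(a[i]);
    }
    return {cov / std::sqrt(var_a * var_b), std::sqrt(diff_sq / a_sq)};
}
```

As a sanity check on the reported figures: for two zero-mean signals of equal energy, the relative difference RMS equals sqrt(2 · (1 − ρ)), which is about 0.297 for ρ = 0.956, in line with the ≈ 30% quoted above.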

Bonus: full fine-tune overlay (not just the LM LoRA)

To mirror infer.py's apply_lora, the overlay now applies all four trained components when present in the adapter dir:

- language-model LoRA → delta-merged into the decoder linears
- fine-tuned diffusion head → replaces model.prediction_head.*
- acoustic + semantic connectors → replace model.{acoustic,semantic}_connector.*

The connector / head .bin files are read by a new pure-C++ Torch-pickle reader (torch_bin.*) — no Python at inference or preprocessing. It's parity-checked bit-for-bit against torch.load (torch_bin_parity), and an overlay check (vibevoice_finetune_overlay_check) confirms the merged tensors equal the adapter's and that the merge reaches the forward pass (verified: set_backend_tensor → require_tensor → require_tensor_data, which the overlay overrides).

Docs and README updated to describe the full fine-tune behavior.

0xShug0 · 2026-07-02T04:20:49Z

Thanks! I will do a final test tomorrow and then merge it.

0xShug0 · 2026-07-02T21:09:09Z

@justinjohn0306 Code merged! Thanks a lot!

0xShug0 · 2026-07-02T22:19:40Z

@justinjohn0306 I made some changes to improve LoRA performance. Just pushed acde132.

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| LoRA decoder merge | 25,922 ms | 11,693 ms | 54.9% faster, 2.22x |
| Decoder weights load | 39,001 ms | 14,463 ms | 62.9% faster, 2.70x |
| Output WAV | baseline | byte-identical | no change |
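
The two improvement figures per row follow directly from the timings: "percent faster" is the time saved relative to the old time, and the "x" factor is the old time divided by the new one. A small sketch of that arithmetic (`Speedup` and `speedup` are illustrative names, not project code):

```cpp
#include <cassert>

struct Speedup {
    double percent_faster;  // 100 * (before - after) / before
    double factor;          // before / after
};

// Derive both improvement figures from a before/after timing pair.
Speedup speedup(double before_ms, double after_ms) {
    assert(before_ms > 0.0 && after_ms > 0.0);
    return {100.0 * (before_ms - after_ms) / before_ms, before_ms / after_ms};
}
```

Applied to the measurements above, `speedup(25922, 11693)` and `speedup(39001, 14463)` round to the 54.9% / 2.22x and 62.9% / 2.70x figures in the table.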

Merge branch '0xShug0:release-0.1' into release-0.1

c62128a

justinjohn0306 mentioned this pull request Jul 2, 2026

Add MOSS-TTS-Local model family #19

Open

Merge branch 'release-0.1' into release-0.1

2124b0c

0xShug0 merged commit f19168f into 0xShug0:release-0.1 Jul 2, 2026
4 checks passed

0xShug0 merged 4 commits into 0xShug0:release-0.1 from justinjohn0306:release-0.1
