
Add VibeVoice 7B support and decoder LoRA merging #14

Merged
0xShug0 merged 4 commits into 0xShug0:release-0.1 from justinjohn0306:release-0.1 on Jul 2, 2026


@justinjohn0306

Contributor

Support the VibeVoice 7B model alongside the 1.5B:

  • Add the vibevoice_7b model-manager package, reusing the shared Qwen2.5 tokenizer bundle.
  • Handle the 7B config.json, tolerating its upstream "acostic_vae_dim" key and reading top-level tie_word_embeddings.
  • Load a separate lm_head.weight when word embeddings are untied and bind the decoder logits head to it.
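The config handling described above can be sketched roughly as follows. This is a minimal illustration, not the PR's C++ code: the function names, the corrected spelling `acoustic_vae_dim`, the default used when `tie_word_embeddings` is absent, the embedding tensor name, and the sample value 64 are all assumptions made for the example.

```python
import json


def read_vibevoice_config(text):
    """Parse a config.json string, tolerating the misspelled upstream key."""
    cfg = json.loads(text)
    # Accept the upstream spelling "acostic_vae_dim" as well as the
    # (assumed) corrected spelling "acoustic_vae_dim".
    vae_dim = cfg.get("acoustic_vae_dim", cfg.get("acostic_vae_dim"))
    if vae_dim is None:
        raise KeyError("config has no acoustic VAE dimension key")
    # Read tie_word_embeddings from the top level; defaulting to True when
    # the key is missing is an assumption of this sketch.
    tied = bool(cfg.get("tie_word_embeddings", True))
    return {"acoustic_vae_dim": vae_dim, "tie_word_embeddings": tied}


def logits_head_tensor_name(tied):
    """Pick the tensor that backs the decoder logits head.

    "embed_tokens.weight" is a hypothetical name for the shared embedding;
    "lm_head.weight" is the separate head named in the PR description.
    """
    return "embed_tokens.weight" if tied else "lm_head.weight"


# Illustrative config with untied embeddings (values are made up).
config = read_vibevoice_config('{"acostic_vae_dim": 64, "tie_word_embeddings": false}')
head_name = logits_head_tensor_name(config["tie_word_embeddings"])
```

With untied embeddings, `head_name` resolves to the separate `lm_head.weight` tensor rather than the embedding matrix.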

Add optional PEFT LoRA merging for the decoder:

  • Merge lora_A/lora_B into the targeted linear weights at load time via a tensor-source decorator, so it composes with the weight-type quantization options at no per-step cost.
  • Expose it through the vibevoice.lora / vibevoice.lora_scale load options and document it in the README and docs/tts.md.
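For readers unfamiliar with LoRA merging, the load-time merge can be sketched as below. It assumes the usual PEFT layout, where `lora_A` has shape (r, in), `lora_B` has shape (out, r), and the delta `lora_B @ lora_A` is scaled before being added to the base weight. The helper names and toy values are hypothetical, and how `vibevoice.lora_scale` combines with the adapter's own scaling is not stated in this PR.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, scale):
    """Return weight + scale * (lora_b @ lora_a) without mutating the inputs.

    Shapes: weight [out][in], lora_a [r][in], lora_b [out][r].
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x3 weight with a rank-1 adapter and an arbitrary scale.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]  # r x in
lora_b = [[0.5], [-1.0]]    # out x r
merged = merge_lora(weight, lora_a, lora_b, scale=4.0)
```

Because the merged matrix has the same shape as the original weight, inference runs exactly as it would without an adapter, which is why merging at load time adds no per-step cost.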

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@0xShug0

0xShug0 commented Jul 1, 2026

Owner

@justinjohn0306 Thank you for your PR! Could you add extra logs to confirm that LoRA loading works with the 7B model, and share the run log?

Reviewed by Codex:
docs/tts.md:392 and the README news say to use --load-option vibevoice.lora, but the implementation only reads vibevoice.lora from SessionOptions in src/models/vibevoice/session.cpp:34. The CLI stores --load-option separately in load_request.options, while session_options.options only receives --session-option. Since the VibeVoice loader does not consume load_request.options, users following the docs will run without any LoRA merge and get no error.

vibevoice.lora was only read from SessionOptions, so the documented
--load-option path silently no-op'd. Consume it at load time via a shared
apply_vibevoice_finetune_options helper (still handles --session-option too),
guarding against passing it through both.

Extend the overlay to apply all four trained components (mirroring infer.py's
apply_lora): the language-model LoRA is delta-merged into the decoder linears,
and the fine-tuned diffusion head and acoustic/semantic connectors replace
their base tensors. Connector/head .bin files are read by a new pure-C++ Torch
pickle reader (torch_bin), parity-checked against torch.load. Add
torch_bin_parity and vibevoice_finetune_overlay_check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@justinjohn0306

Contributor Author

LoRA loading confirmed on VibeVoice-7B + --load-option fix

Thanks for the review — addressed both points, and extended the adapter support to match infer.py.

1. --load-option vibevoice.lora now actually merges (Codex finding)

Previously the adapter was only read from SessionOptions, so the documented --load-option path silently no-op'd. The loader now consumes request.options at load time via a shared apply_vibevoice_finetune_options(...) helper, and the same helper still handles --session-option for backward compatibility. A guard rejects passing it through both at once. So docs/tts.md / README (--load-option vibevoice.lora=<dir>) are now correct.
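The option routing described here can be sketched as a small resolver. This is only an illustration of the behavior, not the C++ `apply_vibevoice_finetune_options` helper: the function name `resolve_finetune_dir`, the plain-dict option maps, and the error type are assumptions.

```python
def resolve_finetune_dir(load_options, session_options, key="vibevoice.lora"):
    """Return the adapter directory from whichever option map supplies it.

    Raises ValueError when the key is given through both maps, mirroring
    the guard described above. Returns None when no adapter is requested.
    """
    in_load = key in load_options
    in_session = key in session_options
    if in_load and in_session:
        raise ValueError(f"{key} was passed as both a load option and a session option")
    if in_load:
        return load_options[key]
    if in_session:
        return session_options[key]
    return None


# The documented --load-option path and the older --session-option path
# resolve to the same adapter directory (placeholder value).
from_load = resolve_finetune_dir({"vibevoice.lora": "adapter_dir"}, {})
from_session = resolve_finetune_dir({}, {"vibevoice.lora": "adapter_dir"})
```

Failing loudly on the ambiguous case avoids the kind of silent no-op that the review flagged.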

2. Merge logging + 7B run log

Added --log output confirming each component. Running VibeVoice-7B with the fine-tune via --load-option:

[info][vibevoice] applying fine-tune adapter: ...\finetune_mp1\lora
[info][vibevoice] LM LoRA: merged 196 decoder modules (scale 4.000000)
[info][vibevoice] diffusion head: overrode 26 tensors
[info][vibevoice] acoustic connector: overrode 5 tensors
[info][vibevoice] semantic connector: overrode 5 tensors

3. Confirmed it changes the output toward the fine-tuned speaker

In an A/B comparison with a fixed seed and identical reference audio (base weights vs. the fine-tune adapter), the two outputs have a correlation of 0.956 (similar but not identical), and the RMS of their difference is ≈ 30% of the signal RMS. The words and prosody are the same, but the voice timbre shifts toward the target speaker. The result audibly matches the reference and the Python infer.py output with the same adapter.
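The two A/B figures can be computed along these lines. This is a sketch only: the comment does not say how the metrics were actually calculated, so the Pearson correlation, the choice of the base output as the reference for the RMS ratio, and the function names are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sample sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rms(samples):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def relative_difference_rms(base, tuned):
    """RMS of (tuned - base) as a fraction of the base signal's RMS."""
    diff = [t - b for b, t in zip(base, tuned)]
    return rms(diff) / rms(base)


# For two zero-mean signals of equal energy with correlation rho, the
# difference RMS is sqrt(2 * (1 - rho)) times the signal RMS.
expected_ratio = math.sqrt(2 * (1 - 0.956))
```

Under that equal-energy assumption, a correlation of 0.956 implies a ratio of about 0.297, which is consistent with the ≈ 30% reported.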

Bonus: full fine-tune overlay (not just the LM LoRA)

To mirror infer.py's apply_lora, the overlay now applies all four trained components when present in the adapter dir:

  • language-model LoRA → delta-merged into the decoder linears
  • fine-tuned diffusion head → replaces model.prediction_head.*
  • acoustic + semantic connectors → replace model.{acoustic,semantic}_connector.*

The connector / head .bin files are read by a new pure-C++ Torch-pickle reader (torch_bin.*) — no Python at inference or preprocessing. It's parity-checked bit-for-bit against torch.load (torch_bin_parity), and an overlay check (vibevoice_finetune_overlay_check) confirms the merged tensors equal the adapter's and that the merge reaches the forward pass (verified: set_backend_tensor → require_tensor → require_tensor_data, which the overlay overrides).
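The override part of the overlay can be pictured as a wrapper around the base tensor source. The sketch below borrows the method name `require_tensor_data` from the call chain mentioned above; the class names, the dict-backed storage, and the specific tensor names are hypothetical and stand in for the C++ implementation.

```python
class DictTensorSource:
    """Base source: serves tensors from an in-memory mapping."""

    def __init__(self, tensors):
        self._tensors = dict(tensors)

    def require_tensor_data(self, name):
        return self._tensors[name]


class OverlayTensorSource:
    """Wraps another source and substitutes overridden tensors by name."""

    def __init__(self, base, overrides):
        self._base = base
        self._overrides = dict(overrides)

    def require_tensor_data(self, name):
        # Fine-tuned tensors win; everything else falls through to the base.
        if name in self._overrides:
            return self._overrides[name]
        return self._base.require_tensor_data(name)


# Hypothetical tensor names and values, for illustration only.
base = DictTensorSource({
    "model.prediction_head.example.weight": [0.0, 0.0],
    "model.decoder.example.weight": [1.0, 1.0],
})
overlay = OverlayTensorSource(base, {"model.prediction_head.example.weight": [0.5, -0.5]})
head = overlay.require_tensor_data("model.prediction_head.example.weight")
untouched = overlay.require_tensor_data("model.decoder.example.weight")
```

Since consumers only ever call the source interface, a wrapper like this changes what reaches the forward pass without touching the loading code that sits on top of it.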

Docs and README updated to describe the full fine-tune behavior.

@0xShug0

0xShug0 commented Jul 2, 2026

Owner

Thanks! I will do a final test tomorrow and then merge it.

@0xShug0 0xShug0 merged commit f19168f into 0xShug0:release-0.1 Jul 2, 2026
4 checks passed
@0xShug0

0xShug0 commented Jul 2, 2026

Owner

@justinjohn0306 Code merged! Thanks a lot!

@0xShug0

0xShug0 commented Jul 2, 2026

Owner

@justinjohn0306 I made some changes to improve LoRA performance. Just pushed acde132.

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| LoRA decoder merge | 25,922 ms | 11,693 ms | 54.9% faster, 2.22x |
| Decoder weights load | 39,001 ms | 14,463 ms | 62.9% faster, 2.70x |
| Output WAV | baseline | byte-identical | no change |
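The improvement figures follow directly from the timings, reading "% faster" as the reduction in elapsed time. A quick check, with a hypothetical helper name:

```python
def speedup(before_ms, after_ms):
    """Return (percent reduction in elapsed time, speedup factor)."""
    reduction_pct = (1 - after_ms / before_ms) * 100
    factor = before_ms / after_ms
    return round(reduction_pct, 1), round(factor, 2)


merge = speedup(25922, 11693)  # LoRA decoder merge
load = speedup(39001, 14463)   # decoder weights load
```

Both rows reproduce the reported values: 54.9% and 2.22x for the merge, 62.9% and 2.70x for the weights load.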
