
[RNE Rewrite] feat: add voice activity detection pipeline #1298

Draft
msluszniak wants to merge 1 commit into rne-rewrite from @ms/rewrite-vad

Conversation

@msluszniak

Member

Description

Adds a Voice Activity Detection (VAD) task pipeline and a corresponding speech example app. The whole pipeline (feature extraction, chunked inference, segment postprocessing and streaming) runs in TypeScript on top of the core model.execute primitive — no new C++.
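
A minimal sketch of the feature-extraction stage, assuming the framing, pre-emphasis, and Hann-window steps named in the commit message for this PR. The frame length, hop size, and pre-emphasis coefficient are illustrative assumptions rather than values taken from the PR, `hannWindow` and `frameSignal` are hypothetical names, and the step that maps frames to the 512-wide model input is not shown.

```typescript
// Hypothetical helper: symmetric Hann window (0 at both ends, 1 in the middle).
function hannWindow(length: number): Float32Array {
  const w = new Float32Array(length);
  for (let n = 0; n < length; n++) {
    w[n] = 0.5 - 0.5 * Math.cos((2 * Math.PI * n) / (length - 1));
  }
  return w;
}

// Hypothetical helper: splits mono PCM into overlapping frames, then applies
// per-frame pre-emphasis followed by the Hann window.
function frameSignal(
  samples: Float32Array,
  frameLength = 400, // 25 ms at 16 kHz (assumed)
  hop = 160, // 10 ms at 16 kHz (assumed)
  preEmphasis = 0.97 // common default (assumed)
): Float32Array[] {
  const window = hannWindow(frameLength);
  const frames: Float32Array[] = [];
  // Only full frames are emitted; a trailing partial frame is dropped.
  for (let start = 0; start + frameLength <= samples.length; start += hop) {
    const frame = new Float32Array(frameLength);
    for (let n = 0; n < frameLength; n++) {
      const current = samples[start + n];
      // The first sample of a frame has no predecessor inside the frame,
      // so it is pre-emphasized against itself.
      const previous = n > 0 ? samples[start + n - 1] : current;
      frame[n] = (current - preEmphasis * previous) * window[n];
    }
    frames.push(frame);
  }
  return frames;
}
```

With these assumed defaults, one second of 16 kHz audio yields 98 full frames of 400 samples each.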

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

Screenshots

Related issues

Closes #1249

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

  • Depends on the get_dynamic_dims relaxed input validation from the text-embeddings PR ([RNE Rewrite] Add image and text embeddings pipelines #1247): VAD feeds a variable-length [frames, 512] input tensor per chunk. Outputs are still validated exactly, so the output tensor is pre-allocated at the model-declared shape. This PR therefore requires #1247 to land first and the fsmn-vad model to be re-exported with a get_dynamic_dims method.
  • Segments are returned in seconds (the old native path returned raw sample indices).
  • The FSMN output contract is assumed to be [1, frames, classes] with class 0 = non-speech (speech = 1 - p0), matching the current native implementation.
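
The output contract and the seconds-based segments described above can be sketched as follows, together with the thresholding, min-duration, padding, and merge steps named in the commit message. This is a minimal sketch under stated assumptions, not the PR's code: every constant, every function name, and the order of the filtering steps are hypothetical.

```typescript
interface Segment {
  start: number; // seconds
  end: number; // seconds
}

// Assumed tuning values; none of these come from the PR.
const HOP_SECONDS = 0.01; // time covered by one output frame
const THRESHOLD = 0.5; // speech probability cut-off
const MIN_DURATION = 0.25; // drop segments shorter than this
const PADDING = 0.1; // extend each kept segment on both sides

// Flat [1, frames, classes] output with class 0 = non-speech: speech = 1 - p0.
function speechProbabilities(output: Float32Array, frames: number, classes: number): number[] {
  const probs: number[] = [];
  for (let f = 0; f < frames; f++) probs.push(1 - output[f * classes]);
  return probs;
}

// Turns runs of above-threshold frames into segments expressed in seconds.
function thresholdToSegments(probs: number[]): Segment[] {
  const segments: Segment[] = [];
  let runStart = -1;
  // Iterating one past the end closes a run that reaches the last frame.
  for (let f = 0; f <= probs.length; f++) {
    const isSpeech = f < probs.length && probs[f] >= THRESHOLD;
    if (isSpeech && runStart < 0) runStart = f;
    if (!isSpeech && runStart >= 0) {
      segments.push({ start: runStart * HOP_SECONDS, end: f * HOP_SECONDS });
      runStart = -1;
    }
  }
  return segments;
}

// Drops short segments, pads the rest, and merges any that now overlap.
function refineSegments(segments: Segment[], totalSeconds: number): Segment[] {
  const merged: Segment[] = [];
  for (const seg of segments) {
    if (seg.end - seg.start < MIN_DURATION) continue;
    const padded: Segment = {
      start: Math.max(0, seg.start - PADDING),
      end: Math.min(totalSeconds, seg.end + PADDING),
    };
    const last = merged[merged.length - 1];
    if (last !== undefined && padded.start <= last.end) {
      last.end = Math.max(last.end, padded.end);
    } else {
      merged.push(padded);
    }
  }
  return merged;
}
```

Because segments are already in seconds at the thresholding step, the later filters can use duration constants directly instead of sample counts.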

Port the VAD feature to the rewrite as a pure-TypeScript pipeline on top of
the core model.execute primitive (no new C++):

- src/extensions/speech/tasks/vad.ts: createVAD runner replicating the native
  FSMN-VAD algorithm (framing + Hann window + pre-emphasis, chunked inference,
  thresholding / min-duration / padding / merge). Segments are returned in
  seconds. Relies on the get_dynamic_dims relaxed input validation for the
  dynamic frame dimension; the fsmn-vad model is re-exported with it.
- src/extensions/speech/vadStreamer.ts: pure streaming state machine driving
  onSpeechBegin / onSpeechEnd over an accumulating buffer.
- src/hooks/useVAD.ts: hook wrapping createVAD + streamer lifecycle.
- Register models.vad.FSMN_VAD and export the speech extension.
- apps/speech: expo-router demo (mirrors apps/nlp) with a real-time mic VAD
  screen via react-native-audio-api.
msluszniak self-assigned this Jul 2, 2026
msluszniak added the refactoring and feature (PRs that implement a new feature) labels Jul 2, 2026
msluszniak linked an issue Jul 2, 2026 that may be closed by this pull request

Labels

feature (PRs that implement a new feature), refactoring

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RNE Rewrite] Speech - add VAD pipeline implementation

1 participant