medkit.audio.segmentation.webrtc_voice_detector#
Classes#
Voice Activity Detection operation relying on the webrtcvad package.
Module Contents#
- class medkit.audio.segmentation.webrtc_voice_detector.WebRTCVoiceDetector(output_label: str, aggressiveness: typing_extensions.Literal[0, 1, 2, 3] = 2, frame_duration: typing_extensions.Literal[10, 20, 30] = 30, nb_frames_in_window: int = 10, switch_ratio: float = 0.9, uid: str | None = None)#
Bases:
medkit.core.audio.SegmentationOperation
Voice Activity Detection operation relying on the webrtcvad package.
Per-frame VAD results of webrtcvad are aggregated with a switch algorithm considering the percentage of speech/non-speech frames in a wider sliding window.
Input segments must be mono at 8 kHz, 16 kHz, 32 kHz or 48 kHz.
- Parameters:
- output_label : str
Label of output speech segments.
- aggressiveness : {0, 1, 2, 3}, default=2
Aggressiveness parameter passed to webrtcvad (the higher, the more likely to detect speech).
- frame_duration : {10, 20, 30}, default=30
Duration in milliseconds of frames passed to webrtcvad.
- nb_frames_in_window : int, default=10
Number of frames in the sliding window used when aggregating per-frame VAD results.
- switch_ratio : float, default=0.9
Percentage of speech/non-speech frames required to switch the window speech state when aggregating per-frame VAD results.
- uid : str, optional
Identifier of the detector.
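The switch-based aggregation described above can be sketched in plain Python. This is an illustrative re-implementation under assumptions drawn from the parameter descriptions, not medkit's actual code: the state flips to "speech" once the share of speech frames in the sliding window reaches `switch_ratio`, and back to "non-speech" once the share of non-speech frames does.

```python
from collections import deque

def aggregate_vad(frame_flags, nb_frames_in_window=10, switch_ratio=0.9):
    """Illustrative switch-style aggregation of per-frame VAD booleans.

    Hypothetical helper (not part of medkit): returns (start, end) frame
    index ranges detected as speech, where `end` is the index of the frame
    at which the state switched back to non-speech.
    """
    window = deque(maxlen=nb_frames_in_window)
    in_speech = False
    start = None
    ranges = []
    for i, is_speech in enumerate(frame_flags):
        window.append(is_speech)
        nb_speech = sum(window)
        if not in_speech and nb_speech >= switch_ratio * len(window):
            # enough speech frames in the window: switch to speech state,
            # dating the span back to the head of the window
            in_speech = True
            start = i - len(window) + 1
        elif in_speech and (len(window) - nb_speech) >= switch_ratio * len(window):
            # enough non-speech frames in the window: switch back
            in_speech = False
            ranges.append((start, i))
    if in_speech:
        ranges.append((start, len(frame_flags)))
    return ranges
```

With the default window of 10 frames and a ratio of 0.9, isolated mis-classified frames inside a speech run do not end the span; only a sustained run of non-speech frames does, which is the point of aggregating over a wider window.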
- init_args#
- output_label#
- aggressiveness#
- frame_duration#
- nb_frames_in_window#
- switch_ratio#
- _vad#
- run(segments: list[medkit.core.audio.Segment]) → list[medkit.core.audio.Segment]#
Return all speech segments detected for all input segments.
- Parameters:
- segments : list of Segment
Audio segments on which to perform VAD.
- Returns:
- list of Segment
Segments detected as containing speech activity.
- _detect_activity_in_segment(segment: medkit.core.audio.Segment) → Iterator[medkit.core.audio.Segment]#
- _get_aggregated_vad(frames, sample_rate)#
Return index ranges of voiced frames using webrtcvad.
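Since the aggregated result is expressed as frame index ranges while output segments live on the audio timeline, a conversion step is needed. The following is a hypothetical sketch (not medkit's code) of that mapping, assuming each frame lasts `frame_duration` milliseconds:

```python
def frame_ranges_to_times(ranges, frame_duration_ms=30):
    """Hypothetical helper: map (start_frame, end_frame) index ranges to
    (start_seconds, end_seconds) spans, given fixed-duration frames."""
    return [
        (start * frame_duration_ms / 1000, end * frame_duration_ms / 1000)
        for start, end in ranges
    ]
```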