medkit.audio.segmentation.webrtc_voice_detector#

Classes#

WebRTCVoiceDetector

Voice Activity Detection operation relying on the webrtcvad package.

Module Contents#

class medkit.audio.segmentation.webrtc_voice_detector.WebRTCVoiceDetector(output_label: str, aggressiveness: typing_extensions.Literal[0, 1, 2, 3] = 2, frame_duration: typing_extensions.Literal[10, 20, 30] = 30, nb_frames_in_window: int = 10, switch_ratio: float = 0.9, uid: str | None = None)#

Bases: medkit.core.audio.SegmentationOperation

Voice Activity Detection operation relying on the webrtcvad package.

Per-frame VAD results of webrtcvad are aggregated with a switch algorithm considering the percentage of speech/non-speech frames in a wider sliding window.
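The sliding-window switch idea can be pictured with a small, self-contained sketch. This is an illustration of the aggregation principle only, not medkit's actual implementation; the function name `aggregate_vad` is hypothetical, and the parameter names simply mirror the constructor:

```python
from collections import deque

def aggregate_vad(frame_is_speech, nb_frames_in_window=10, switch_ratio=0.9):
    """Smooth noisy per-frame VAD decisions with a sliding window.

    The state starts as non-speech; it switches to speech once the
    fraction of speech frames in the window reaches switch_ratio,
    and back to non-speech once the fraction of non-speech frames
    reaches it.  (Illustrative sketch, not medkit's code.)
    """
    window = deque(maxlen=nb_frames_in_window)
    in_speech = False
    states = []
    for is_speech in frame_is_speech:
        window.append(is_speech)
        nb_speech = sum(window)
        if not in_speech and nb_speech >= switch_ratio * len(window):
            in_speech = True
        elif in_speech and (len(window) - nb_speech) >= switch_ratio * len(window):
            in_speech = False
        states.append(in_speech)
    return states
```

The hysteresis introduced by the two symmetric thresholds prevents short VAD glitches (a single misclassified frame) from toggling the speech state back and forth.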

Input segments must be mono at 8 kHz, 16 kHz, 32 kHz, or 48 kHz.


Parameters:
output_label : str

Label of output speech segments.

aggressiveness : {0, 1, 2, 3}, default=2

Aggressiveness parameter passed to webrtcvad (the higher, the more aggressively non-speech frames are filtered out).

frame_duration : {10, 20, 30}, default=30

Duration in milliseconds of frames passed to webrtcvad.

nb_frames_in_window : int, default=10

Number of frames in the sliding window used when aggregating per-frame VAD results.

switch_ratio : float, default=0.9

Ratio of speech (or non-speech) frames in the sliding window required to switch the window speech state when aggregating per-frame VAD results.

uid : str, optional

Identifier of the detector.
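As a concrete illustration of the sample-rate and frame-duration constraints above, the number of PCM samples per frame follows directly from the two values; webrtcvad only accepts 10, 20, or 30 ms frames at 8, 16, 32, or 48 kHz. The helper below is a hypothetical sketch, not part of medkit's API:

```python
def samples_per_frame(sample_rate, frame_duration_ms):
    """Number of mono PCM samples in one VAD frame.

    webrtcvad only supports 8/16/32/48 kHz and 10/20/30 ms frames,
    so both values are validated before computing the frame length.
    (Illustrative helper, not part of medkit.)
    """
    if sample_rate not in (8000, 16000, 32000, 48000):
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if frame_duration_ms not in (10, 20, 30):
        raise ValueError(f"unsupported frame duration: {frame_duration_ms}")
    return sample_rate * frame_duration_ms // 1000

# e.g. the default frame_duration=30 at 16 kHz gives 480 samples,
# i.e. 960 bytes of 16-bit PCM per frame
```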

init_args#
output_label#
aggressiveness#
frame_duration#
nb_frames_in_window#
switch_ratio#
_vad#
run(segments: list[medkit.core.audio.Segment]) → list[medkit.core.audio.Segment]#

Return all speech segments detected for all input segments.

Parameters:
segments : list of Segment

Audio segments on which to perform VAD.

Returns:
list of Segment

Segments detected as containing speech activity.

_detect_activity_in_segment(segment: medkit.core.audio.Segment) → Iterator[medkit.core.audio.Segment]#
_get_aggregated_vad(frames, sample_rate)#

Return index ranges of voiced frames using webrtcvad.
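Turning a sequence of per-frame speech states into index ranges of voiced frames can be sketched as follows. This is an illustration of the output shape only, not the actual `_get_aggregated_vad` implementation, and the function name `voiced_frame_ranges` is hypothetical:

```python
def voiced_frame_ranges(frame_states):
    """Group consecutive voiced frames into (start, end) index ranges.

    `end` is exclusive, so a range covers frames start..end-1.
    (Illustrative sketch, not medkit's code.)
    """
    ranges = []
    start = None
    for i, voiced in enumerate(frame_states):
        if voiced and start is None:
            start = i  # a voiced run begins
        elif not voiced and start is not None:
            ranges.append((start, i))  # the run ended at frame i-1
            start = None
    if start is not None:
        ranges.append((start, len(frame_states)))  # run reaches the end
    return ranges
```

Each returned frame range can then be mapped back to a time span by multiplying the frame indices by the frame duration.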