medkit.audio.segmentation.pa_speaker_detector#
Classes#
PASpeakerDetector: Speaker diarization operation relying on pyannote.audio.
Module Contents#
- class medkit.audio.segmentation.pa_speaker_detector.PASpeakerDetector(model: str | pathlib.Path, output_label: str, min_nb_speakers: int | None = None, max_nb_speakers: int | None = None, min_duration: float = 0.1, device: int = -1, segmentation_batch_size: int = 1, embedding_batch_size: int = 1, hf_auth_token: str | None = None, uid: str | None = None)#
Bases:
medkit.core.audio.SegmentationOperation
Speaker diarization operation relying on pyannote.audio.
Each input segment will be split into several sub-segments corresponding to speech turns, and an attribute indicating the speaker of the turn will be attached to each of these sub-segments.
PASpeakerDetector uses the SpeakerDiarization pipeline from pyannote.audio, which performs the following steps:
- perform multi-speaker voice activity detection (VAD) with a PyanNet segmentation model and extract voiced segments;
- compute an embedding for each voiced segment with an embedding model (typically a speechbrain ECAPA-TDNN model);
- group voiced segments by speaker using a clustering algorithm (agglomerative clustering, HMM, etc.).
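For reference, here is roughly what these steps look like when driving pyannote.audio's SpeakerDiarization pipeline directly; a minimal sketch, where the model name and authentication token are illustrative placeholders:

```python
from pyannote.audio import Pipeline

# Load a pretrained diarization pipeline from the HuggingFace hub
# (model name and auth token are illustrative placeholders).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="<hf_auth_token>",
)

# Internally, the pipeline chains segmentation (VAD), embedding
# extraction and clustering, constrained by the speaker bounds.
diarization = pipeline("audio.wav", min_speakers=1, max_speakers=3)

# Each track is a speech turn with start/end times and a speaker label.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")
```

PASpeakerDetector wraps this pipeline and converts its output into medkit segments and attributes.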
- Parameters:
- model : str or Path
Name (on the HuggingFace models hub) or path of a pretrained pipeline. If a path, it should point to the .yaml file containing the pipeline configuration.
- output_label : str
Label of the generated turn segments.
- min_nb_speakers : int, optional
Minimum number of speakers expected to be found.
- max_nb_speakers : int, optional
Maximum number of speakers expected to be found.
- min_duration : float, default=0.1
Minimum duration of speech segments, in seconds (shorter segments are discarded).
- device : int, default=-1
Device to use for PyTorch models. Follows the Hugging Face convention (-1 for CPU, otherwise the GPU device number, for instance 0 for "cuda:0").
- segmentation_batch_size : int, default=1
Number of input segments per batch processed by the segmentation model.
- embedding_batch_size : int, default=1
Number of pre-segmented audios per batch processed by the embedding model.
- hf_auth_token : str, optional
HuggingFace authentication token (to access private models on the hub).
- uid : str, optional
Identifier of the detector.
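As a sketch of how these parameters fit together (the model name and token are placeholders, not values prescribed by medkit):

```python
from medkit.audio.segmentation.pa_speaker_detector import PASpeakerDetector

# Illustrative configuration; model name and token are placeholders.
detector = PASpeakerDetector(
    model="pyannote/speaker-diarization",
    output_label="turn",
    min_nb_speakers=1,
    max_nb_speakers=3,
    min_duration=0.1,
    device=0,  # run on "cuda:0"; use -1 to stay on CPU
    hf_auth_token="<hf_auth_token>",
)
```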
- init_args#
- output_label#
- min_nb_speakers#
- max_nb_speakers#
- min_duration#
- torch_device#
- _pipeline#
- segmentation_batch_size#
- embedding_batch_size#
- run(segments: list[medkit.core.audio.Segment]) → list[medkit.core.audio.Segment]#
Return all turn segments detected for all input segments.
- Parameters:
- segmentslist of Segment
Audio segments on which to perform diarization.
- Returns:
- list of Segment
Segments detected as containing speech activity (with speaker attributes).
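A minimal usage sketch, reusing the detector built above and assuming a WAV file on disk; FileAudioBuffer, Span and the "speaker" attribute label are assumptions about medkit's audio API and should be checked against the installed version:

```python
from medkit.core.audio import FileAudioBuffer, Segment, Span

# Wrap a full recording in a single input segment
# (FileAudioBuffer/Span usage is an assumption, not confirmed above).
audio = FileAudioBuffer("consultation.wav")
full_segment = Segment(
    label="raw_audio",
    span=Span(0.0, audio.duration),
    audio=audio,
)

turn_segments = detector.run([full_segment])

for turn in turn_segments:
    # "speaker" is the assumed label of the attached speaker attribute.
    speaker = turn.attrs.get(label="speaker")[0].value
    print(f"{turn.span.start:.1f}s-{turn.span.end:.1f}s: {speaker}")
```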
- _detect_turns_in_segment(segment: medkit.core.audio.Segment) → Iterator[medkit.core.audio.Segment]#