medkit.audio.segmentation.pa_speaker_detector

This module requires extra dependencies that are not installed with the core dependencies of medkit. To install them, run pip install medkit-lib[pa-speaker-detector].

Classes:

PASpeakerDetector(segmentation_model, ...[, ...])

Speaker diarization operation relying on pyannote.audio

class PASpeakerDetector(segmentation_model, embedding_model, clustering, output_label, pipeline_params=None, min_nb_speakers=None, max_nb_speakers=None, segmentation_batch_size=1, embedding_batch_size=1, uid=None)

Speaker diarization operation relying on pyannote.audio

Each input segment is split into sub-segments, one per speech turn, and an attribute identifying the speaker of the turn is attached to each sub-segment.

PASpeakerDetector uses the SpeakerDiarization pipeline from pyannote.audio, which performs the following steps:

  • perform multi-speaker voice activity detection (VAD) with a PyanNet segmentation model and extract the voiced segments;

  • compute an embedding for each voiced segment with an embedding model (typically speechbrain ECAPA-TDNN);

  • group the voiced segments by speaker using a clustering algorithm such as agglomerative clustering, HMM, etc.
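The three steps above can be pictured with a deliberately simplified, pure-Python sketch. It is not pyannote.audio's implementation: the energy threshold stands in for the segmentation model, the mean pitch stands in for a speaker embedding, and the greedy centroid clustering stands in for the real clustering algorithms. All function names and the toy data are hypothetical, and unlike the real pipeline this sketch does not handle overlapping speech.

```python
"""Toy illustration of the three diarization stages (not pyannote's code)."""
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # [start, end) frame indices


def voiced_spans(energies: List[float], threshold: float) -> List[Span]:
    """Stage 1 stand-in: keep runs of frames whose energy exceeds a threshold."""
    spans: List[Span] = []
    start: Optional[int] = None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i
        elif energy <= threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(energies)))
    return spans


def embed(pitches: List[float], span: Span) -> float:
    """Stage 2 stand-in: a one-dimensional 'embedding' (mean pitch of the span)."""
    start, end = span
    return sum(pitches[start:end]) / (end - start)


def cluster(embeddings: List[float], max_distance: float) -> List[int]:
    """Stage 3 stand-in: greedy clustering against running cluster centroids."""
    centroids: List[float] = []
    counts: List[int] = []
    labels: List[int] = []
    for emb in embeddings:
        # Index of the nearest existing centroid, or None if there is none yet.
        best = min(
            range(len(centroids)),
            key=lambda k: abs(centroids[k] - emb),
            default=None,
        )
        if best is not None and abs(centroids[best] - emb) <= max_distance:
            counts[best] += 1
            centroids[best] += (emb - centroids[best]) / counts[best]
            labels.append(best)
        else:
            centroids.append(emb)
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels


def diarize(energies, pitches, energy_threshold=0.5, max_distance=30.0):
    """Chain the three stages and return (start, end, speaker) turns."""
    spans = voiced_spans(energies, energy_threshold)
    embeddings = [embed(pitches, span) for span in spans]
    labels = cluster(embeddings, max_distance)
    return [
        (start, end, f"speaker_{label}")
        for (start, end), label in zip(spans, labels)
    ]


# Made-up frame-level data: two low-pitched turns around one high-pitched turn.
energies = [0.9, 0.8, 0.1, 0.9, 0.9, 0.0, 0.7, 0.8]
pitches = [110, 115, 0, 210, 220, 0, 112, 108]
turns = diarize(energies, pitches)
print(turns)
```

On this toy input the first and third voiced spans end up with the same speaker label, which is the behaviour the clustering step is meant to provide.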

Parameters
  • segmentation_model (Union[str, Path]) – Name (on the HuggingFace models hub) or path of the PyanNet segmentation model. When a path, should point to the .bin file containing the model.

  • embedding_model (Union[str, Path]) – Name (on the HuggingFace models hub) or path to the embedding model. When a path to a speechbrain model, should point to the directory containing the model weights and hyperparameters.

  • clustering (Literal['AgglomerativeClustering', 'FINCHClustering', 'HiddenMarkovModelClustering', 'OracleClustering']) – Clustering method to use.

  • output_label (str) – Label of generated turn segments.

  • pipeline_params (Optional[Dict]) – Dictionary of segmentation and clustering parameters. The dictionary can hold a “segmentation” key and a “clustering” key, each pointing to a sub-dictionary. Refer to the pyannote documentation for the supported segmentation and clustering parameters (clustering parameters depend on the clustering method used).

  • min_nb_speakers (Optional[int]) – Minimum number of speakers expected to be found.

  • max_nb_speakers (Optional[int]) – Maximum number of speakers expected to be found.

  • segmentation_batch_size (int) – Number of input segments per batch processed by the segmentation model.

  • embedding_batch_size (int) – Number of pre-segmented audios per batch processed by the embedding model.

  • uid (str) – Identifier of the detector.
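As an illustration of how these parameters fit together, here is a minimal sketch that assembles and sanity-checks the constructor arguments before creating the detector. The helper names build_detector_kwargs and make_detector are hypothetical. The model names, the output label, and the inner keys and values of pipeline_params are assumptions modelled on publicly available pyannote and speechbrain configurations; check them against the pyannote documentation for your installed version.

```python
from pathlib import Path
from typing import Any, Dict, Optional, Union

# Clustering names copied from the parameter list above.
SUPPORTED_CLUSTERINGS = (
    "AgglomerativeClustering",
    "FINCHClustering",
    "HiddenMarkovModelClustering",
    "OracleClustering",
)


def build_detector_kwargs(
    segmentation_model: Union[str, Path],
    embedding_model: Union[str, Path],
    clustering: str,
    output_label: str,
    pipeline_params: Optional[Dict[str, Dict[str, Any]]] = None,
    min_nb_speakers: Optional[int] = None,
    max_nb_speakers: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate and collect PASpeakerDetector arguments."""
    if clustering not in SUPPORTED_CLUSTERINGS:
        raise ValueError(f"Unsupported clustering method: {clustering!r}")
    if (
        min_nb_speakers is not None
        and max_nb_speakers is not None
        and min_nb_speakers > max_nb_speakers
    ):
        raise ValueError("min_nb_speakers must not exceed max_nb_speakers")
    return {
        "segmentation_model": segmentation_model,
        "embedding_model": embedding_model,
        "clustering": clustering,
        "output_label": output_label,
        "pipeline_params": pipeline_params,
        "min_nb_speakers": min_nb_speakers,
        "max_nb_speakers": max_nb_speakers,
    }


def make_detector(**kwargs: Any):
    """Create the detector; imported lazily because it needs the optional extra."""
    from medkit.audio.segmentation.pa_speaker_detector import PASpeakerDetector

    return PASpeakerDetector(**kwargs)


kwargs = build_detector_kwargs(
    segmentation_model="pyannote/segmentation",  # assumed hub model name
    embedding_model="speechbrain/spkrec-ecapa-voxceleb",  # assumed hub model name
    clustering="AgglomerativeClustering",
    output_label="turn",  # assumed label, any string works
    pipeline_params={
        # Inner keys and values are illustrative assumptions, not defaults.
        "segmentation": {"threshold": 0.45, "min_duration_off": 0.5},
        "clustering": {"method": "centroid", "min_cluster_size": 15, "threshold": 0.7},
    },
    min_nb_speakers=2,
    max_nb_speakers=4,
)

# Uncomment once medkit-lib[pa-speaker-detector] is installed and the models
# are available locally or through the hub:
# detector = make_detector(**kwargs)
```

Validating the clustering name and the speaker bounds up front gives an early, readable error before any model is loaded.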

Methods:

run(segments)

Return all turn segments detected for all input segments.

run(segments)

Return all turn segments detected for all input segments.

Parameters

segments (List[Segment]) – Audio segments on which to perform diarization.

Return type

List[Segment]

Returns

List[Segment] – Segments detected as containing speech activity (with speaker attributes)
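The segments returned by run() can then be post-processed, for example to total the speaking time of each speaker. The sketch below is an assumption-laden illustration: it supposes that each returned segment exposes span.start and span.end (in seconds) and an attrs.get(label=...) accessor, and that the speaker attribute is labelled "speaker". None of this is stated on this page, so verify it against the medkit Segment documentation. The stand-in classes exist only so that the example runs without medkit installed.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable


def speaking_time_per_speaker(
    turn_segments: Iterable, speaker_label: str = "speaker"  # assumed label
) -> Dict[str, float]:
    """Sum turn durations per speaker value (hypothetical helper)."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in turn_segments:
        speaker_attrs = seg.attrs.get(label=speaker_label)
        if not speaker_attrs:
            continue  # skip segments without a speaker attribute
        totals[speaker_attrs[0].value] += seg.span.end - seg.span.start
    return dict(totals)


class _FakeAttrs:
    """Stand-in for an attribute container, for demonstration only."""

    def __init__(self, attrs):
        self._attrs = attrs

    def get(self, label):
        return [attr for attr in self._attrs if attr.label == label]


def _fake_turn(start: float, end: float, speaker: str):
    """Build a stand-in turn segment mimicking the assumed Segment shape."""
    attr = SimpleNamespace(label="speaker", value=speaker)
    return SimpleNamespace(
        span=SimpleNamespace(start=start, end=end), attrs=_FakeAttrs([attr])
    )


# With a real detector this list would come from detector.run(segments).
fake_turns = [
    _fake_turn(0.0, 2.5, "A"),
    _fake_turn(2.5, 4.0, "B"),
    _fake_turn(4.0, 5.0, "A"),
]
totals = speaking_time_per_speaker(fake_turns)
print(totals)
```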