medkit.audio.segmentation.pa_speaker_detector


This module requires extra dependencies that are not installed with the core dependencies of medkit. To install them, run pip install medkit-lib[pa-speaker-detector].

Classes:

PASpeakerDetector(model, output_label[, ...])

Speaker diarization operation relying on pyannote.audio

class PASpeakerDetector(model, output_label, min_nb_speakers=None, max_nb_speakers=None, min_duration=0.1, device=-1, segmentation_batch_size=1, embedding_batch_size=1, hf_auth_token=None, uid=None)

Speaker diarization operation relying on pyannote.audio

Each input segment is split into sub-segments, one per speech turn, and an attribute identifying the speaker of the turn is attached to each sub-segment.

PASpeakerDetector uses the SpeakerDiarization pipeline from pyannote.audio, which performs the following steps:

perform multi-speaker voice activity detection (VAD) with a PyanNet segmentation model and extract voiced segments;
compute an embedding for each voiced segment with an embedding model (typically speechbrain ECAPA-TDNN);
group voiced segments by speaker using a clustering algorithm such as agglomerative clustering, HMM, etc.
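
The last step above (grouping voiced segments by speaker) can be illustrated with a toy example. This is only a minimal sketch of average-linkage agglomerative clustering on hand-made 2D vectors, not the implementation used by pyannote.audio; the function names, the cosine distance, and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, threshold=0.5):
    """Toy average-linkage agglomerative clustering.

    Returns one integer label per embedding. Merging stops when the two
    closest clusters are further apart than `threshold` (hypothetical value).
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Four fake "voiced segment" embeddings: two per speaker
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = cluster_embeddings(embeddings)
```

Segments that receive the same label would be attributed to the same speaker. Real embeddings have far more dimensions, and the actual clustering method and its settings are defined by the pretrained pipeline configuration.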

Parameters:

model (str or Path) – Name (on the HuggingFace models hub) or path of a pretrained pipeline. If a path is given, it should point to the .yaml file containing the pipeline configuration.
output_label (str) – Label of generated turn segments.
min_nb_speakers (int, optional) – Minimum number of speakers expected to be found.
max_nb_speakers (int, optional) – Maximum number of speakers expected to be found.
min_duration (float, default=0.1) – Minimum duration of speech segments, in seconds (short segments will be discarded).
device (int, default=-1) – Device to use for pytorch models. Follows the Hugging Face convention (-1 for cpu and device number for gpu, for instance 0 for “cuda:0”).
segmentation_batch_size (int, default=1) – Number of input segments in each batch processed by the segmentation model.
embedding_batch_size (int, default=1) – Number of pre-segmented audios in each batch processed by the embedding model.
hf_auth_token (str, optional) – HuggingFace authentication token (to access private models on the hub).
uid (str, optional) – Identifier of the detector.
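
As a rough illustration of the parameters above, the following sketch collects constructor arguments and shows how the integer device convention maps to a device string. The helper names device_name and build_detector are hypothetical and not part of medkit; the model name "pyannote/speaker-diarization" and the label "turn" are assumptions chosen for the example, and the model may require an authentication token.

```python
def device_name(device):
    """Translate the documented integer convention into a device string.

    Hypothetical helper for illustration: -1 means CPU, N >= 0 means "cuda:N".
    """
    if device == -1:
        return "cpu"
    if device < 0:
        raise ValueError("device must be -1 (cpu) or a non-negative gpu index")
    return f"cuda:{device}"


# Keyword arguments mirroring the documented constructor signature
detector_kwargs = {
    "model": "pyannote/speaker-diarization",  # assumed pipeline name on the hub
    "output_label": "turn",  # arbitrary label chosen for this example
    "min_nb_speakers": 2,
    "max_nb_speakers": 4,
    "min_duration": 0.1,  # turns shorter than 0.1 s are discarded
    "device": -1,  # run on cpu
    "hf_auth_token": None,  # set to a token string for private models
}


def build_detector(**kwargs):
    """Instantiate the detector; requires the pa-speaker-detector extra."""
    # Imported lazily so the rest of this snippet loads without the extra installed
    from medkit.audio.segmentation.pa_speaker_detector import PASpeakerDetector

    return PASpeakerDetector(**kwargs)
```

Calling build_detector(**detector_kwargs) would create the operation, provided the extra dependencies are installed and the pretrained pipeline can be loaded.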

Methods:

`run`(segments)	Return all turn segments detected for all input segments.
`set_prov_tracer`(prov_tracer)	Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)

Return all turn segments detected for all input segments.

Parameters: segments (list of Segment) – Audio segments on which to perform diarization.
Return type: list[Segment]
Returns: list of Segment – Segments detected as containing speech activity (with speaker attributes).
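
A typical follow-up to run is aggregating the detected turns per speaker. The sketch below works on plain (start, end, speaker) tuples rather than medkit Segment objects, since how those values are read from a segment (span fields, attribute label) is not specified in this section; the function name and the sample data are made up for illustration.

```python
from collections import defaultdict


def speaking_time_per_speaker(turns):
    """Sum the duration (in seconds) of all turns for each speaker.

    `turns` is an iterable of (start, end, speaker) tuples, a stand-in
    for values extracted from the segments returned by run().
    """
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Made-up turns for two speakers
turns = [(0.0, 2.5, "A"), (2.5, 4.0, "B"), (4.0, 6.0, "A")]
totals = speaking_time_per_speaker(turns)
```

Here totals maps each speaker identifier to that speaker's cumulated speaking time.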

property description: OperationDescription

Contains all the operation init parameters.

Return type: OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
