medkit.io

APIs

To access these APIs, import them as follows:

from medkit.io import <api_to_import>

Classes:

BratInputConverter([uid])

Class in charge of converting brat annotations into medkit TextDocuments.

BratOutputConverter([anns_labels, attrs, ...])

Class in charge of converting a list of TextDocuments into a brat collection file

RTTMInputConverter([turn_label, ...])

Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments.

RTTMOutputConverter([turn_label, speaker_label])

Build Rich Transcription Time Marked (.rttm) files containing diarization information from Segment objects.

class BratInputConverter(uid=None)

Class in charge of converting brat annotations into medkit TextDocuments.

Methods:

load(dir_path[, ann_ext, text_ext])

Create a list of TextDocuments from a folder containing text files and associated brat annotation files.

load_annotations(ann_file)

Load a .ann file and return a list of Annotation objects.

load_doc(ann_path, text_path)

Create a TextDocument from a .ann file and its associated .txt file

load(dir_path, ann_ext='.ann', text_ext='.txt')

Create a list of TextDocuments from a folder containing text files and associated brat annotation files.

Parameters
  • dir_path (Union[str, Path]) – The path to the directory containing the text files and the annotation files (.ann)

  • ann_ext (str) – The extension of the brat annotation file (e.g. .ann)

  • text_ext (str) – The extension of the text file (e.g. .txt)

Return type

List[TextDocument]

Returns

List[TextDocument] – The list of TextDocuments

load_doc(ann_path, text_path)

Create a TextDocument from a .ann file and its associated .txt file.

Parameters
  • text_path (Union[str, Path]) – The path to the text document file.

  • ann_path (Union[str, Path]) – The path to the brat annotation file.

Return type

TextDocument

Returns

TextDocument – The document containing the text and the annotations

load_annotations(ann_file)

Load a .ann file and return a list of Annotation objects.

Parameters

ann_file (Union[str, Path]) – Path to the .ann file.

Return type

List[TextAnnotation]
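The loading methods above can be sketched as follows. This is a minimal illustration, not a reference usage: the helper names (write_sample_brat_dir, load_brat_dir), the file name note_1 and the label disease are hypothetical, and the sample .ann line assumes the standard brat text-bound format (ID, tab, label with start and end offsets, tab, covered text). The medkit import is done lazily inside load_brat_dir, so that function only works when medkit is installed.

```python
import tempfile
from pathlib import Path


def write_sample_brat_dir(dir_path):
    """Write one .txt file and its matching brat .ann file into dir_path."""
    dir_path = Path(dir_path)
    text = "The patient has asthma."
    # brat text-bound annotation: ID <TAB> label start end <TAB> covered text
    start = text.index("asthma")
    end = start + len("asthma")
    ann = f"T1\tdisease {start} {end}\tasthma\n"
    # Both files share the same basename so that they can be paired
    (dir_path / "note_1.txt").write_text(text, encoding="utf-8")
    (dir_path / "note_1.ann").write_text(ann, encoding="utf-8")
    return text, ann


def load_brat_dir(dir_path):
    """Load every .txt/.ann pair found in dir_path (requires medkit)."""
    from medkit.io import BratInputConverter  # lazy import on purpose

    converter = BratInputConverter()
    # Signature taken from the documentation above
    return converter.load(dir_path, ann_ext=".ann", text_ext=".txt")


# Build a throwaway directory containing one document and its annotations
sample_dir = Path(tempfile.mkdtemp())
sample_text, sample_ann = write_sample_brat_dir(sample_dir)
```

With medkit installed, load_brat_dir(sample_dir) would be expected to return a list holding a single TextDocument built from note_1.txt and note_1.ann.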

class BratOutputConverter(anns_labels=None, attrs=None, ignore_segments=True, create_config=True, top_values_by_attr=50, uid=None)

Class in charge of converting a list of TextDocuments into a brat collection file.

Initialize the Brat output converter

Parameters
  • anns_labels (Optional[List[str]]) – Labels of medkit annotations to convert into Brat annotations. If None (default), all the annotations will be converted.

  • attrs (Optional[List[str]]) – Labels of medkit attributes to add in the annotations that will be included. If None (default), all medkit attributes found in the segments or relations will be converted to Brat attributes.

  • ignore_segments (bool) – If True, medkit segments will be ignored: only entities, attributes and relations will be converted to Brat annotations. If False, the medkit segments will be converted to Brat annotations as well.

  • create_config (bool) – Whether to create a configuration file for the generated collection. This file defines the types of annotations generated and is necessary for correct visualization in Brat.

  • top_values_by_attr (int) – Number of most common values per attribute to include in the configuration. This is useful when an attribute has a large number of values: only the ‘top’ ones will be in the config. By default, the 50 most common values per attribute are included.

  • uid (Optional[str]) – Identifier of the converter

Methods:

save(docs, dir_path[, doc_names])

Convert and save a collection or list of TextDocuments into a Brat collection.

save(docs, dir_path, doc_names=None)

Convert and save a collection or list of TextDocuments into a Brat collection. For each collection or list of documents, a folder is created with ‘.txt’ and ‘.ann’ files; an ‘annotation.conf’ file is also saved if required.

Parameters
  • docs (List[TextDocument]) – List of medkit doc objects to convert

  • dir_path (Union[str, Path]) – String or path object to save the generated files

  • doc_names (Optional[List[str]]) – Optional list with the names for the generated files. If None, the ‘uid’ of each document will be used as the name: ‘uid.txt’ holds the raw text of the document and ‘uid.ann’ the Brat annotations.
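As a rough sketch of how the constructor options and save() fit together, the snippet below wraps them in a small helper. The helper names (make_doc_names, export_to_brat) and the note_ prefix are hypothetical; only the BratOutputConverter parameters and the save() signature come from the documentation above. The medkit import is lazy, so export_to_brat requires medkit at call time.

```python
from pathlib import Path


def make_doc_names(n_docs, prefix="note"):
    """Build one file basename per document: note_000, note_001, ..."""
    return [f"{prefix}_{i:03d}" for i in range(n_docs)]


def export_to_brat(docs, out_dir):
    """Save medkit TextDocuments as a brat collection (requires medkit)."""
    from medkit.io import BratOutputConverter  # lazy import on purpose

    out_dir = Path(out_dir)
    # Creating the directory up front is a precaution taken by this sketch
    out_dir.mkdir(parents=True, exist_ok=True)
    converter = BratOutputConverter(
        anns_labels=None,      # None: convert all annotations
        ignore_segments=True,  # keep only entities, attributes and relations
        create_config=True,    # also write the brat configuration file
    )
    # doc_names must contain one name per document
    converter.save(docs, dir_path=out_dir, doc_names=make_doc_names(len(docs)))
    return out_dir
```

Passing explicit doc_names gives predictable file names (note_000.txt, note_000.ann, and so on) instead of names derived from document uids.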

class RTTMInputConverter(turn_label='turn', speaker_label='speaker', store=None, converter_id=None)

Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments.

For each turn in a .rttm file, a Segment will be created, with an associated Attribute holding the name of the turn speaker as value. The segments can be retrieved directly or as part of an AudioDocument instance.

If a ProvTracer is set, provenance information will be added for each segment and each attribute (referencing the input converter as the operation).

Parameters
  • turn_label (str) – Label of segments representing turns in the .rttm file.

  • speaker_label (str) – Label of speaker attributes to add to each segment.

  • store (Optional[Store]) – Optional shared store to hold the annotations when adding them to audio documents. If none is provided, an internal store will be used for each document.

  • converter_id (Optional[str]) – Identifier of the converter.
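To make the turn-to-segment mapping described above more concrete, here is a standalone sketch of reading turns from RTTM text. It is not medkit's implementation: the Turn dataclass and parse_rttm function are hypothetical, and the column layout assumed is the usual ten-field RTTM SPEAKER record (type, file id, channel, onset, duration, then speaker name in the eighth field).

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One speaker turn: a time span plus the name of the speaker."""
    file_id: str
    start: float
    end: float
    speaker: str


def parse_rttm(text):
    """Parse SPEAKER records from RTTM text into a list of Turn objects."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any record that is not a SPEAKER turn
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        file_id = fields[1]            # 2nd column: file id
        onset = float(fields[3])       # 4th column: turn onset (seconds)
        duration = float(fields[4])    # 5th column: turn duration (seconds)
        speaker = fields[7]            # 8th column: speaker name
        turns.append(Turn(file_id, onset, onset + duration, speaker))
    return turns


sample_rttm = (
    "SPEAKER call_01 1 0.00 2.50 <NA> <NA> alice <NA> <NA>\n"
    "SPEAKER call_01 1 2.50 1.25 <NA> <NA> bob <NA> <NA>\n"
)
turns = parse_rttm(sample_rttm)
```

In medkit terms, each Turn here plays the role of a Segment labelled with turn_label, and its speaker field corresponds to the Attribute labelled with speaker_label.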

Attributes:

description

Contains all the input converter init parameters.

Methods:

load(rttm_dir[, audio_dir, audio_ext])

Load all .rttm files in a directory into a list of AudioDocument objects.

load_doc(rttm_file, audio_file)

Load a single .rttm file into an AudioDocument.

load_turns(rttm_file, audio_file)

Load a .rttm file and return a list of Segment objects.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

property description: medkit.core.operation_desc.OperationDescription

Contains all the input converter init parameters.

Return type

OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

load(rttm_dir, audio_dir=None, audio_ext='.wav')

Load all .rttm files in a directory into a list of AudioDocument objects.

For each .rttm file, there must be a corresponding audio file with the same basename, either in the same directory or in a separate audio directory.

Parameters
  • rttm_dir (Union[str, Path]) – Directory containing the .rttm files.

  • audio_dir (Union[str, Path, None]) – Directory containing the audio files corresponding to the .rttm files, if they are not in rttm_dir.

  • audio_ext (str) – File extension to use for audio files.

Return type

List[AudioDocument]

Returns

List[AudioDocument] – List of generated documents.
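Because load() expects one audio file per .rttm file with the same basename, it can help to check the pairing before loading. The sketch below does that with the standard library only; find_missing_audio and load_rttm_dir are hypothetical helpers, and the decision to raise FileNotFoundError is a choice made for this example, not documented medkit behavior. The medkit import is lazy, so load_rttm_dir requires medkit at call time.

```python
import tempfile
from pathlib import Path


def find_missing_audio(rttm_dir, audio_dir=None, audio_ext=".wav"):
    """Return names of .rttm files that have no audio file with the same basename."""
    rttm_dir = Path(rttm_dir)
    # Audio files live next to the .rttm files unless audio_dir is given
    audio_dir = Path(audio_dir) if audio_dir is not None else rttm_dir
    missing = []
    for rttm_file in sorted(rttm_dir.glob("*.rttm")):
        audio_file = audio_dir / (rttm_file.stem + audio_ext)
        if not audio_file.exists():
            missing.append(rttm_file.name)
    return missing


def load_rttm_dir(rttm_dir, audio_dir=None, audio_ext=".wav"):
    """Check the file pairing, then load the directory (requires medkit)."""
    from medkit.io import RTTMInputConverter  # lazy import on purpose

    missing = find_missing_audio(rttm_dir, audio_dir, audio_ext)
    if missing:
        raise FileNotFoundError(f"No audio file found for: {missing}")
    converter = RTTMInputConverter()
    return converter.load(rttm_dir, audio_dir=audio_dir, audio_ext=audio_ext)


# Demo on empty placeholder files: b.rttm has no matching b.wav
demo_dir = Path(tempfile.mkdtemp())
for name in ("a.rttm", "a.wav", "b.rttm"):
    (demo_dir / name).touch()
missing = find_missing_audio(demo_dir)
```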

load_doc(rttm_file, audio_file)

Load a single .rttm file into an AudioDocument.

Parameters
  • rttm_file (Union[str, Path]) – Path to the .rttm file.

  • audio_file (Union[str, Path]) – Path to the corresponding audio file.

Return type

AudioDocument

Returns

AudioDocument – Generated document.

load_turns(rttm_file, audio_file)

Load a .rttm file and return a list of Segment objects.

Parameters
  • rttm_file (Union[str, Path]) – Path to the .rttm file.

  • audio_file (Union[str, Path]) – Path to the corresponding audio file.

Return type

List[Segment]

Returns

List[Segment] – Turn segments as found in the .rttm file.

class RTTMOutputConverter(turn_label='turn', speaker_label='speaker')

Build Rich Transcription Time Marked (.rttm) files containing diarization information from Segment objects.

There must be a segment for each turn, with an associated Attribute holding the name of the turn speaker as value. The segments can be passed directly or as part of AudioDocument instances.

Parameters
  • turn_label (str) – Label of segments representing turns in the audio documents.

  • speaker_label (str) – Label of speaker attributes attached to each turn segment.

Methods:

save(docs, rttm_dir[, doc_names])

Save AudioDocument instances as .rttm files in a directory.

save_doc(doc, rttm_file[, rttm_doc_id])

Save a single AudioDocument as a .rttm file.

save_turn_segments(turn_segments, rttm_file, ...)

Save Segment objects into a .rttm file.

save(docs, rttm_dir, doc_names=None)

Save AudioDocument instances as .rttm files in a directory.

Parameters
  • docs (List[AudioDocument]) – List of audio documents to save.

  • rttm_dir (Union[str, Path]) – Directory into which the generated .rttm files will be stored.

  • doc_names (Optional[List[str]]) – Optional list of names to use as basenames and file ids for the generated .rttm files (2nd column). If none are provided, the document ids will be used.

save_doc(doc, rttm_file, rttm_doc_id=None)

Save a single AudioDocument as a .rttm file.

Parameters
  • doc (AudioDocument) – Audio document to save.

  • rttm_file (Union[str, Path]) – Path of the generated .rttm file.

  • rttm_doc_id (Optional[str]) – File uid to use for the generated .rttm file (2nd column). If none is provided, the document uid will be used.

save_turn_segments(turn_segments, rttm_file, rttm_doc_id)

Save Segment objects into a .rttm file.

Parameters
  • turn_segments (List[Segment]) – Turn segments to save.

  • rttm_file (Union[str, Path]) – Path of the generated .rttm file.

  • rttm_doc_id (Optional[str]) – File uid to use for the generated .rttm file (2nd column).
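The role of rttm_doc_id as the second column can be illustrated with a small standalone formatter. This is a sketch of the file format under the usual ten-field SPEAKER record layout, not medkit's own writer: format_rttm_line and export_rttm are hypothetical helpers, and the channel value 1 and three-decimal precision are assumptions. The medkit import is lazy, so export_rttm requires medkit at call time.

```python
def format_rttm_line(rttm_doc_id, start, end, speaker):
    """Format one turn as an RTTM SPEAKER record (times in seconds)."""
    duration = end - start
    # Columns: type, file id, channel, onset, duration,
    # orthography, speaker type, speaker name, confidence, lookahead
    return (
        f"SPEAKER {rttm_doc_id} 1 {start:.3f} {duration:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )


def export_rttm(docs, rttm_dir, doc_names=None):
    """Save AudioDocuments as .rttm files in rttm_dir (requires medkit)."""
    from medkit.io import RTTMOutputConverter  # lazy import on purpose

    converter = RTTMOutputConverter(turn_label="turn", speaker_label="speaker")
    converter.save(docs, rttm_dir, doc_names=doc_names)


line = format_rttm_line("call_01", 2.5, 3.75, "bob")
```

Here the file id call_01 lands in the second whitespace-separated field of the record, which is where doc_names or rttm_doc_id would appear in the generated files.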

Subpackages / Submodules

medkit.io.brat

medkit.io.medkit_json

medkit.io.rttm

medkit.io.spacy

This module needs extra dependencies that are not installed as core dependencies of medkit.