medkit.io

APIs

To access these APIs, import them as follows:

from medkit.io import <api_to_import>

Classes:

BratInputConverter([detect_cuis_in_notes, ...])

Class in charge of converting brat annotations into TextDocuments.

BratOutputConverter([anns_labels, attrs, ...])

Class in charge of converting a list of TextDocuments into a brat collection file.

DoccanoClientConfig([column_text, column_label])

A class representing the configuration in the doccano client.

DoccanoInputConverter(task[, client_config, ...])

Convert doccano files (.JSONL) containing annotations for a given task.

DoccanoOutputConverter(task[, anns_labels, ...])

Convert medkit files to doccano files (.JSONL) for a given task.

DoccanoTask(value[, names, module, ...])

Supported doccano tasks.

RTTMInputConverter([turn_label, ...])

Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments.

RTTMOutputConverter([turn_label, speaker_label])

Build Rich Transcription Time Marked (.rttm) files containing diarization information from Segment objects.

SRTInputConverter([turn_segment_label, ...])

Convert .srt files containing transcription information into turn segments with transcription attributes.

SRTOutputConverter([segment_turn_label, ...])

Build .srt files containing transcription information from Segment objects.

SpacyInputConverter([entities, span_groups, ...])

Class in charge of converting spacy documents into a collection of TextDocuments.

SpacyOutputConverter(nlp[, apply_nlp_spacy, ...])

Class in charge of converting a list of TextDocuments into a list of spacy documents.

class BratInputConverter(detect_cuis_in_notes=True, notes_label='brat_note', uid=None)

Class in charge of converting brat annotations into TextDocuments.

Parameters:
  • detect_cuis_in_notes (bool, default=True) – If True, strings looking like CUIs in annotator notes of entities will be converted to UMLS normalization attributes rather than creating an Attribute with the whole note text as value.

  • notes_label (str, default="brat_note") – Label to use for attributes created from annotator notes.

  • uid (str, optional) – Identifier of the converter.
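As a rough illustration of what detect_cuis_in_notes implies, the sketch below separates notes that look like CUIs from free-text notes. It assumes a CUI is the letter C followed by seven digits (the usual UMLS form); the pattern medkit actually applies is not shown in this page, and split_note is a hypothetical helper, not part of medkit.

```python
import re
from typing import Optional

# Assumption: a UMLS CUI is "C" followed by exactly 7 digits.
CUI_PATTERN = re.compile(r"\bC\d{7}\b")


def split_note(note: str) -> tuple[list[str], Optional[str]]:
    """Return (cuis, leftover_note) for an annotator note (hypothetical helper)."""
    cuis = CUI_PATTERN.findall(note)
    if cuis:
        # CUIs found: these would become UMLS normalization attributes
        return cuis, None
    # No CUI found: the whole note text would be kept as one attribute value
    return [], note


cuis, leftover = split_note("C0011849 C0004096")
```

With detect_cuis_in_notes=False, the equivalent behaviour would be to always take the second branch and keep the note text as is.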

Methods:

load(dir_path[, ann_ext, text_ext])

Create a list of TextDocuments from a folder containing text files and associated brat annotations files.

load_annotations(ann_file)

Load a .ann file and return a list of Annotation objects.

load_doc(ann_path, text_path)

Create a TextDocument from a .ann file and its associated .txt file

load(dir_path, ann_ext='.ann', text_ext='.txt')

Create a list of TextDocuments from a folder containing text files and associated brat annotations files.

Parameters:
  • dir_path (str or Path) – The path to the directory containing the text files and the annotation files (.ann).

  • ann_ext (str, optional) – The extension of the brat annotation files (e.g. .ann).

  • text_ext (str, optional) – The extension of the text files (e.g. .txt).

Return type:

list[TextDocument]

Returns:

list of TextDocument – The list of TextDocuments
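A minimal usage sketch for load, assuming medkit is installed. The sample text, the Disease label and the helper names (make_brat_sample, load_with_medkit) are made up for illustration; the medkit import is kept inside the function so the file-preparation part runs on its own.

```python
import tempfile
from pathlib import Path


def make_brat_sample(dir_path: Path) -> None:
    """Write one text file and its brat standoff annotation file (toy data)."""
    text = "The patient has diabetes."
    (dir_path / "doc1.txt").write_text(text, encoding="utf-8")
    # One entity line: ID, then "label start end", then the covered text
    start = text.index("diabetes")
    end = start + len("diabetes")
    ann = f"T1\tDisease {start} {end}\tdiabetes\n"
    (dir_path / "doc1.ann").write_text(ann, encoding="utf-8")


def load_with_medkit(dir_path: Path):
    """Load the folder with medkit (imported lazily, requires medkit)."""
    from medkit.io import BratInputConverter

    converter = BratInputConverter()
    # ann_ext and text_ext are spelled out here; they match the defaults
    return converter.load(dir_path, ann_ext=".ann", text_ext=".txt")


sample_dir = Path(tempfile.mkdtemp())
make_brat_sample(sample_dir)
sample_files = sorted(p.name for p in sample_dir.iterdir())
```

Calling load_with_medkit(sample_dir) should then return a list of TextDocuments built from the text file and its annotation file.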

load_doc(ann_path, text_path)

Create a TextDocument from a .ann file and its associated .txt file.

Parameters:
  • ann_path (str or Path) – The path to the brat annotation file.

  • text_path (str or Path) – The path to the text document file.

Return type:

TextDocument

Returns:

TextDocument – The document containing the text and the annotations

load_annotations(ann_file)

Load a .ann file and return a list of Annotation objects.

Parameters:

ann_file (str or Path) – Path to the .ann file.

Return type:

list[TextAnnotation]

Returns:

list of TextAnnotation – The list of text annotations

class BratOutputConverter(anns_labels=None, attrs=None, notes_label='brat_note', ignore_segments=True, convert_cuis_to_notes=True, create_config=True, top_values_by_attr=50, uid=None)

Class in charge of converting a list of TextDocuments into a brat collection file.

Hint

BRAT checks the coherence between span and text for each annotation. This converter adjusts the text and spans to get the right visualization and ensure compatibility.

Initialize the Brat output converter

Parameters:
  • anns_labels (list of str, optional) – Labels of medkit annotations to convert into Brat annotations. If None (default), all the annotations will be converted.

  • attrs (list of str, optional) – Labels of medkit attributes to add to the annotations that will be included. If None (default), all medkit attributes found in the segments or relations will be converted to Brat attributes.

  • notes_label (str, default="brat_note") – Label of attributes that will be converted to annotator notes.

  • ignore_segments (bool, default=True) – If True, medkit segments will be ignored and only entities, attributes and relations will be converted to Brat annotations. If False, the medkit segments will be converted to Brat annotations as well.

  • convert_cuis_to_notes (bool, default=True) – If True, UMLS normalization attributes will be converted to annotator notes rather than attributes. For entities with multiple UMLS attributes, CUIs will be separated by spaces (e.g. “C0011849 C0004096”).

  • create_config (bool, default=True) – Whether to create a configuration file for the generated collection. This file defines the types of annotations generated and is necessary for correct visualization in Brat.

  • top_values_by_attr (int, default=50) – Number of most common values per attribute to include in the configuration. This is useful when an attribute has a large number of values: only the most frequent ones are written to the config.

  • uid (str, optional) – Identifier of the converter.
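The two value-shaping options above can be pictured with a short sketch. It assumes that "most common" means a plain frequency ranking; cuis_to_note and top_values are hypothetical helpers written for this illustration, not medkit functions.

```python
from collections import Counter


def cuis_to_note(cuis: list[str]) -> str:
    """Join several CUIs into one note, separated by spaces."""
    return " ".join(cuis)


def top_values(values: list[str], top_n: int = 50) -> list[str]:
    """Keep the top_n most frequent values of one attribute (hypothetical helper)."""
    return [value for value, _ in Counter(values).most_common(top_n)]


observed = ["mild", "severe", "mild", "moderate", "mild", "severe"]
note = cuis_to_note(["C0011849", "C0004096"])
kept = top_values(observed, top_n=2)
```

Here only "mild" and "severe" would be listed in the configuration, while "moderate" would be left out.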

Methods:

save(docs, dir_path[, doc_names])

Convert and save a collection or list of TextDocuments into a Brat collection.

save(docs, dir_path, doc_names=None)

Convert and save a collection or list of TextDocuments into a Brat collection. For each collection or list of documents, a folder is created with ‘.txt’ and ‘.ann’ files; an ‘annotation.conf’ is saved if required.

Parameters:
  • docs (list of TextDocument) – List of medkit doc objects to convert

  • dir_path (str or Path) – String or path object to save the generated files

  • doc_names (list of str, optional) – Optional list of names for the generated files. If None, ‘uid’ will be used as the name, with ‘uid.txt’ holding the raw text of the document and ‘uid.ann’ the Brat annotations.

class DoccanoInputConverter(task, client_config=None, attr_label='doccano_category', uid=None)

Convert doccano files (.JSONL) containing annotations for a given task.

For each line, a TextDocument will be created. The doccano files can be loaded from a directory with zip files or from a jsonl file.

The converter supports custom configuration to define the parameters used by doccano when importing the data (cf. DoccanoClientConfig).

Warning

If the option Count grapheme clusters as one character was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.
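The kind of misalignment this warning refers to can be reproduced in plain Python, under the assumption that offsets counted per grapheme cluster end up being read as code-point offsets. The grapheme count below is deliberately simplified (combining marks only) and is not how doccano or medkit segment text.

```python
import unicodedata

# "é" written as "e" + combining acute accent: two code points, one grapheme
text = "cafe\u0301 noir"

# Offset counted in code points (what Python string indexing uses)
code_point_offset = text.index("noir")

# Simplified grapheme count: combining marks are merged into the previous
# character (real grapheme segmentation has more rules than this)
grapheme_offset = sum(
    1 for ch in text[:code_point_offset] if not unicodedata.combining(ch)
)

# Slicing with the grapheme-based offset starts one character too early
misaligned = text[grapheme_offset:grapheme_offset + len("noir")]
```

The two offsets differ by one, so a span exported with grapheme-based offsets would no longer cover the word it was drawn on.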

Parameters:
  • task (DoccanoTask) – The doccano task for the input converter.

  • client_config (DoccanoClientConfig, optional) – Optional client configuration to define default values in doccano interface. This config can change, for example, the name of the text field or labels.

  • attr_label (str, default="doccano_category") – The label to use for the medkit attribute that represents the doccano category. This is related to TEXT_CLASSIFICATION projects.

  • uid (str, optional) – Identifier of the converter.

Methods:

load_from_directory_zip(dir_path)

Create a list of TextDocuments from zip files in a directory.

load_from_file(input_file)

Create a list of TextDocuments from a doccano JSONL file.

load_from_zip(input_file)

Create a list of TextDocuments from a zip file containing a JSONL file coming from doccano.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the input converter init parameters.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters:

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

property description: OperationDescription

Contains all the input converter init parameters.

Return type:

OperationDescription

load_from_directory_zip(dir_path)

Create a list of TextDocuments from zip files in a directory. The zip files should contain a JSONL file coming from doccano.

Parameters:

dir_path (str or Path) – The path to the directory containing zip files.

Return type:

list[TextDocument]

Returns:

list of TextDocument – A list of TextDocuments

load_from_zip(input_file)

Create a list of TextDocuments from a zip file containing a JSONL file coming from doccano.

Parameters:

input_file (str or Path) – The path to the zip file containing a doccano JSONL file.

Return type:

list[TextDocument]

Returns:

list of TextDocument – A list of TextDocuments

load_from_file(input_file)

Create a list of TextDocuments from a doccano JSONL file.

Parameters:

input_file (str or Path) – The path to the JSONL file containing doccano annotations

Return type:

list[TextDocument]

Returns:

list of TextDocument – A list of TextDocuments

class DoccanoClientConfig(column_text='text', column_label='label')

A class representing the configuration in the doccano client. The default values are the default values used by doccano.

Variables:
  • column_text (str, default="text") – Name or key representing the text

  • column_label (str, default="label") – Name or key representing the label
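To make the role of the two column names concrete, the sketch below writes a small JSONL file whose keys differ from the doccano defaults and reads it back. The key names "content" and "tags", the sample records and both helper functions are assumptions made for this example; the record layout (entities as [start, end, label] tuples) follows the SEQUENCE_LABELING description in this page.

```python
import json
import tempfile
from pathlib import Path

# Toy export whose columns are named "content" and "tags" (assumed names)
records = [
    {"content": "Aspirin relieves pain.", "tags": [[0, 7, "DRUG"]]},
    {"content": "No treatment given.", "tags": []},
]
jsonl_path = Path(tempfile.mkdtemp()) / "export.jsonl"
with jsonl_path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")


def read_entities(path, column_text="text", column_label="label"):
    """Stdlib reader: yield (text, [(start, end, label), ...]) per line."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            yield row[column_text], [tuple(e) for e in row[column_label]]


def load_with_medkit(path):
    """Same file through medkit (imported lazily, requires medkit)."""
    from medkit.io import DoccanoClientConfig, DoccanoInputConverter, DoccanoTask

    config = DoccanoClientConfig(column_text="content", column_label="tags")
    converter = DoccanoInputConverter(
        task=DoccanoTask.SEQUENCE_LABELING, client_config=config
    )
    return converter.load_from_file(path)


parsed = list(read_entities(jsonl_path, column_text="content", column_label="tags"))
```

When a project keeps the doccano defaults, no client_config is needed and the converter can be built from the task alone.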

class DoccanoOutputConverter(task, anns_labels=None, attr_label=None, ignore_segments=True, include_metadata=True, uid=None)

Convert medkit files to doccano files (.JSONL) for a given task.

For each TextDocument, one JSON line will be created.

Parameters:
  • task (DoccanoTask) – The doccano task for the output converter.

  • anns_labels (list of str, optional) – Labels of medkit annotations to convert into doccano annotations. If None (default), all the entities or relations will be converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.

  • attr_label (str, optional) – The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.

  • ignore_segments (bool, default=True) – If True, medkit segments will be ignored and only entities will be converted to Doccano entities. If False, the medkit segments will be converted to Doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.

  • include_metadata (bool, default=True) – Whether to include medkit metadata in the converted documents.

  • uid (str, optional) – Identifier of the converter.

Methods:

save(docs, output_file)

Convert and save a list of TextDocuments into a doccano file (.JSONL).

save(docs, output_file)

Convert and save a list of TextDocuments into a doccano file (.JSONL).

Parameters:
  • docs (list of TextDocument) – List of medkit doc objects to convert

  • output_file (str or Path) – Path of the JSONL file in which to save the converted documents.

class DoccanoTask(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Supported doccano tasks. The task defines the type of document to convert.

Variables:
  • TEXT_CLASSIFICATION – Documents with a category

  • RELATION_EXTRACTION – Documents with entities and relations (including IDs)

  • SEQUENCE_LABELING – Documents with entities in tuples

class RTTMInputConverter(turn_label='turn', speaker_label='speaker', converter_id=None)

Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments.

For each turn in a .rttm file, a Segment will be created, with an associated Attribute holding the name of the turn speaker as value. The segments can be retrieved directly or as part of an AudioDocument instance.

If a ProvTracer is set, provenance information will be added for each segment and each attribute (referencing the input converter as the operation).

Parameters:
  • turn_label (str, default="turn") – Label of segments representing turns in the .rttm file.

  • speaker_label (str, default="speaker") – Label of speaker attributes to add to each segment.

  • converter_id (str, optional) – Identifier of the converter.
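For readers unfamiliar with the format, here is a stdlib sketch of how one turn could be read from a .rttm line. It assumes the common ten-column SPEAKER layout (type, file id, channel, onset, duration, then the speaker name in the 8th column); Turn and parse_rttm_line are hypothetical names, and medkit builds Segment objects with speaker attributes rather than tuples.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    file_id: str
    start: float
    end: float
    speaker: str


def parse_rttm_line(line: str) -> Turn:
    """Parse one SPEAKER line of a .rttm file (hypothetical helper)."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SPEAKER":
        raise ValueError(f"Not a SPEAKER line: {line!r}")
    onset, duration = float(fields[3]), float(fields[4])
    # The speaker name is the 8th column; <NA> fills the unused columns
    return Turn(fields[1], onset, onset + duration, fields[7])


turn = parse_rttm_line("SPEAKER meeting1 1 0.50 2.25 <NA> <NA> alice <NA> <NA>")
```

In medkit terms, the start and end times would define the segment labelled turn_label, and the speaker name would be the value of the attribute labelled speaker_label.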

Attributes:

description

Contains all the input converter init parameters.

Methods:

load(rttm_dir[, audio_dir, audio_ext])

Load all .rttm files in a directory into a list of AudioDocument objects.

load_doc(rttm_file, audio_file)

Load a single .rttm file into an AudioDocument.

load_turns(rttm_file, audio_file)

Load a .rttm file and return a list of Segment objects.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

property description: OperationDescription

Contains all the input converter init parameters.

Return type:

OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters:

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

load(rttm_dir, audio_dir=None, audio_ext='.wav')

Load all .rttm files in a directory into a list of AudioDocument objects.

For each .rttm file, there must be a corresponding audio file with the same basename, either in the same directory or in a separate audio directory.

Parameters:
  • rttm_dir (str or Path) – Directory containing the .rttm files.

  • audio_dir (str or Path, optional) – Directory containing the audio files corresponding to the .rttm files, if they are not in rttm_dir.

  • audio_ext (str, default=".wav") – File extension to use for audio files.

Return type:

list[AudioDocument]

Returns:

list of AudioDocument – List of generated documents.
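The basename rule can be sketched with pathlib as follows. pair_files is a hypothetical helper, and raising FileNotFoundError for a missing audio file is a choice made for this sketch; how medkit reacts in that case is not stated here.

```python
import tempfile
from pathlib import Path


def pair_files(rttm_dir, audio_dir=None, audio_ext=".wav"):
    """Pair each .rttm file with the audio file sharing its basename."""
    rttm_dir = Path(rttm_dir)
    # Audio files are looked up next to the .rttm files unless audio_dir is given
    audio_dir = Path(audio_dir) if audio_dir is not None else rttm_dir
    pairs = []
    for rttm_file in sorted(rttm_dir.glob("*.rttm")):
        audio_file = audio_dir / (rttm_file.stem + audio_ext)
        if not audio_file.exists():
            raise FileNotFoundError(f"No audio file for {rttm_file.name}")
        pairs.append((rttm_file, audio_file))
    return pairs


# Empty placeholder files are enough to exercise the pairing logic
demo_dir = Path(tempfile.mkdtemp())
for name in ("a.rttm", "a.wav", "b.rttm", "b.wav"):
    (demo_dir / name).touch()
pairs = pair_files(demo_dir)
```

The same pairing applies when audio_dir points to a different directory: only the lookup location changes, not the basename.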

load_doc(rttm_file, audio_file)

Load a single .rttm file into an AudioDocument.

Parameters:
  • rttm_file (str or Path) – Path to the .rttm file.

  • audio_file (str or Path) – Path to the corresponding audio file.

Return type:

AudioDocument

Returns:

AudioDocument – Generated document.

load_turns(rttm_file, audio_file)

Load a .rttm file and return a list of Segment objects.

Parameters:
  • rttm_file (str or Path) – Path to the .rttm file.

  • audio_file (str or Path) – Path to the corresponding audio file.

Return type:

list[Segment]

Returns:

list of Segment – Turn segments as found in the .rttm file.

class RTTMOutputConverter(turn_label='turn', speaker_label='speaker')

Build Rich Transcription Time Marked (.rttm) files containing diarization information from Segment objects.

There must be a segment for each turn, with an associated Attribute holding the name of the turn speaker as value. The segments can be passed directly or as part of AudioDocument instances.

Parameters:
  • turn_label (str, default="turn") – Label of segments representing turns in the audio documents.

  • speaker_label (str, default="speaker") – Label of speaker attributes attached to each turn segment.

Methods:

save(docs, rttm_dir[, doc_names])

Save AudioDocument instances as .rttm files in a directory.

save_doc(doc, rttm_file[, rttm_doc_id])

Save a single AudioDocument as a .rttm file.

save_turn_segments(turn_segments, rttm_file, ...)

Save Segment objects into a .rttm file.

save(docs, rttm_dir, doc_names=None)

Save AudioDocument instances as .rttm files in a directory.

Parameters:
  • docs (list of AudioDocument) – List of audio documents to save.

  • rttm_dir (str or Path) – Directory into which the generated .rttm files will be stored.

  • doc_names (list of str, optional) – Optional list of names to use as basenames and file ids for the generated .rttm files (2nd column). If none are provided, the document ids will be used.

save_doc(doc, rttm_file, rttm_doc_id=None)

Save a single AudioDocument as a .rttm file.

Parameters:
  • doc (AudioDocument) – Audio document to save.

  • rttm_file (str or Path) – Path of the generated .rttm file.

  • rttm_doc_id (str, optional) – File uid to use for the generated .rttm file (2nd column). If none is provided, the document uid will be used.

save_turn_segments(turn_segments, rttm_file, rttm_doc_id)

Save Segment objects into a .rttm file.

Parameters:
  • turn_segments (list of Segment) – Turn segments to save.

  • rttm_file (str or Path) – Path of the generated .rttm file.

  • rttm_doc_id (str, optional) – File uid to use for the generated .rttm file (2nd column).
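To show where rttm_doc_id ends up, the sketch below formats one turn as a SPEAKER line. The channel value 1, the three-decimal formatting and the helper name format_rttm_line are assumptions of this example rather than documented medkit behaviour.

```python
def format_rttm_line(doc_id: str, start: float, end: float, speaker: str) -> str:
    """Format one turn as a SPEAKER line (hypothetical helper)."""
    duration = end - start
    # Columns: type, file id (2nd column), channel, onset, duration,
    # then <NA> placeholders around the speaker name
    return (
        f"SPEAKER {doc_id} 1 {start:.3f} {duration:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )


line = format_rttm_line("meeting1", 0.5, 2.75, "alice")
```

The file id is the second whitespace-separated field of every line, which is why it has to be a single token without spaces.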

class SpacyInputConverter(entities=None, span_groups=None, attrs=None, uid=None)

Class in charge of converting spacy documents into a collection of TextDocuments.

Initialize the spacy input converter

Parameters:
  • entities (list of str, optional) – Labels of spacy entities (doc.ents) to convert into medkit entities. If None (default), all spacy entities will be converted and added to the corresponding medkit document.

  • span_groups (list of str, optional) – Names of groups of spacy spans (doc.spans) to convert into medkit segments. If None (default), all groups of spacy spans will be converted and added to the medkit document.

  • attrs (list of str, optional) – Names of span extensions to convert into medkit attributes. If None (default), all non-None extensions will be added for each annotation.

  • uid (str, optional) – Identifier of the converter
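The "None means everything" convention used by entities, span_groups and attrs can be summarized with a small stdlib sketch. Entities are modelled as plain (label, text) tuples instead of spacy spans, and select_by_label is a hypothetical helper; treating an empty list as "select nothing" is an assumption of the sketch.

```python
from typing import Optional


def select_by_label(items, labels: Optional[list[str]] = None):
    """Keep (label, value) pairs whose label is selected; None selects all."""
    if labels is None:
        return list(items)
    wanted = set(labels)
    return [item for item in items if item[0] in wanted]


ents = [("PERSON", "Marie Curie"), ("GPE", "Paris"), ("DATE", "1903")]
everything = select_by_label(ents)
only_places = select_by_label(ents, ["GPE"])
```

The same pattern reads naturally for the output converters, where anns_labels and attrs restrict what is exported.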

Methods:

load(spacy_docs)

Create a list of TextDocuments from a list of spacy Doc objects.

load(spacy_docs)

Create a list of TextDocuments from a list of spacy Doc objects. Depending on the configuration of the converter, the selected annotations and attributes are included in the documents.

Parameters:

spacy_docs (list of Doc) – A list of spacy documents to convert

Return type:

list[TextDocument]

Returns:

list of TextDocument – A list of TextDocuments

class SpacyOutputConverter(nlp, apply_nlp_spacy=False, labels_anns=None, attrs=None, uid=None)

Class in charge of converting a list of TextDocuments into a list of spacy documents.

Initialize the spacy output converter

Parameters:
  • nlp (Language) – Language object with the loaded pipeline from Spacy

  • apply_nlp_spacy (bool, default=False) – If True, each component of the nlp pipeline is applied to the new spacy document. Some features, such as POS tags, are added by a component of the pipeline, so this parameter must be True for such attributes to be added. If False, the nlp pipeline is not applied to the spacy document, which then contains only the annotations and attributes transferred by medkit.

  • labels_anns (list of str, optional) – Labels of medkit annotations to include in the spacy document. If None (default), all the annotations will be included.

  • attrs (list of str, optional) – Labels of medkit attributes to add to the annotations that will be included. If None (default), all the attributes will be added as custom attributes in each annotation included.

  • uid (str, optional) – Identifier of the converter.

Methods:

convert(medkit_docs)

Convert a list of TextDocuments into a list of spacy Doc objects.

convert(medkit_docs)

Convert a list of TextDocuments into a list of spacy Doc objects. Depending on the configuration of the converter, the selected annotations and attributes are included in the documents.

Parameters:

medkit_docs (list of TextDocument) – A list of TextDocuments to convert

Return type:

list[Doc]

Returns:

list of Doc – A list of spacy Doc objects

class SRTInputConverter(turn_segment_label='turn', transcription_attr_label='transcribed_text', converter_id=None)

Convert .srt files containing transcription information into turn segments with transcription attributes.

For each turn in a .srt file, a Segment will be created, with an associated Attribute holding the transcribed text as value. The segments can be retrieved directly or as part of an AudioDocument instance.

If a ProvTracer is set, provenance information will be added for each segment and each attribute (referencing the input converter as the operation).

Parameters:
  • turn_segment_label (str, default="turn") – Label to use for segments representing turns in the .srt file.

  • transcription_attr_label (str, default="transcribed_text") – Label to use for segment attributes containing the transcribed text.

  • converter_id (str, optional) – Identifier of the converter.
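The following stdlib sketch shows what reading turns from a .srt file involves. It assumes the usual SubRip layout (an index line, a "start --> end" line with HH:MM:SS,mmm timestamps, then the text, with blank lines between blocks); to_seconds and parse_srt are hypothetical helpers, and medkit produces Segment objects with a transcription attribute rather than tuples.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(timestamp: str) -> float:
    """Convert an .srt timestamp (HH:MM:SS,mmm) to seconds."""
    match = TIMESTAMP.fullmatch(timestamp.strip())
    if match is None:
        raise ValueError(f"Bad timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> list[tuple[float, float, str]]:
    """Return (start, end, text) for each subtitle block."""
    turns = []
    for block in content.strip().split("\n\n"):
        lines = block.splitlines()
        # lines[0] is the block index, lines[1] the time range, the rest is text
        start, end = lines[1].split(" --> ")
        turns.append((to_seconds(start), to_seconds(end), "\n".join(lines[2:])))
    return turns


sample = """1
00:00:00,500 --> 00:00:02,750
Hello, how are you?

2
00:01:03,000 --> 00:01:05,250
Fine, thank you.
"""
turns = parse_srt(sample)
```

Each tuple corresponds to one turn: the times would delimit the segment labelled turn_segment_label, and the text would be stored under transcription_attr_label.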

Attributes:

description

Contains all the input converter init parameters.

Methods:

load(srt_dir[, audio_dir, audio_ext])

Load all .srt files in a directory into a list of AudioDocument objects.

load_doc(srt_file, audio_file)

Load a single .srt file into an AudioDocument containing turn segments with transcription attributes.

load_segments(srt_file, audio_file)

Load a .srt file and return a list of Segment objects corresponding to turns, with transcription attributes.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

property description: OperationDescription

Contains all the input converter init parameters.

Return type:

OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters:

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

load(srt_dir, audio_dir=None, audio_ext='.wav')

Load all .srt files in a directory into a list of AudioDocument objects.

For each .srt file, there must be a corresponding audio file with the same basename, either in the same directory or in a separate audio directory.

Parameters:
  • srt_dir (str or Path) – Directory containing the .srt files.

  • audio_dir (str or Path, optional) – Directory containing the audio files corresponding to the .srt files, if they are not in srt_dir.

  • audio_ext (str, default=".wav") – File extension to use for audio files.

Return type:

list[AudioDocument]

Returns:

list of AudioDocument – List of generated documents.

load_doc(srt_file, audio_file)

Load a single .srt file into an AudioDocument containing turn segments with transcription attributes.

Parameters:
  • srt_file (str or Path) – Path to the .srt file.

  • audio_file (str or Path) – Path to the corresponding audio file.

Return type:

AudioDocument

Returns:

AudioDocument – Generated document.

load_segments(srt_file, audio_file)

Load a .srt file and return a list of Segment objects corresponding to turns, with transcription attributes.

Parameters:
  • srt_file (str or Path) – Path to the .srt file.

  • audio_file (str or Path) – Path to the corresponding audio file.

Return type:

list[Segment]

Returns:

list of Segment – Turn segments as found in the .srt file, with transcription attributes attached.

class SRTOutputConverter(segment_turn_label='turn', transcription_attr_label='transcribed_text')

Build .srt files containing transcription information from Segment objects.

There must be a segment for each turn, with an associated Attribute holding the transcribed text as value. The segments can be passed directly or as part of AudioDocument instances.

Parameters:
  • segment_turn_label (str, default="turn") – Label of segments representing turns in the audio documents.

  • transcription_attr_label (str, default="transcribed_text") – Label of segment attributes containing the transcribed text.

Methods:

save(docs, srt_dir[, doc_names])

Save AudioDocument instances as .srt files in a directory.

save_doc(doc, srt_file)

Save a single AudioDocument as a .srt file.

save_segments(segments, srt_file)

Save Segment objects representing turns into a .srt file.

save(docs, srt_dir, doc_names=None)

Save AudioDocument instances as .srt files in a directory.

Parameters:
  • docs (list of AudioDocument) – List of audio documents to save.

  • srt_dir (str or Path) – Directory into which the generated .srt files will be stored.

  • doc_names (list of str, optional) – Optional list of names to use as basenames for the generated .srt files.

save_doc(doc, srt_file)

Save a single AudioDocument as a .srt file.

Parameters:
  • doc (AudioDocument) – Audio document to save.

  • srt_file (str or Path) – Path of the generated .srt file.

save_segments(segments, srt_file)

Save Segment objects representing turns into a .srt file.

Parameters:
  • segments (list of Segment) – Turn segments to save.

  • srt_file (str or Path) – Path of the generated .srt file.
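Going the other way, writing turns to a .srt file mostly comes down to formatting timestamps. The sketch below assumes the usual SubRip layout and works on plain (start, end, text) tuples; to_timestamp and format_srt are hypothetical helpers, not the functions medkit uses internally.

```python
def to_timestamp(seconds: float) -> str:
    """Format seconds as an .srt timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def format_srt(turns) -> str:
    """Build .srt content from (start, end, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(turns, start=1):
        time_range = f"{to_timestamp(start)} --> {to_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    # Blocks are separated by one blank line
    return "\n".join(blocks)


srt_content = format_srt([(0.5, 2.75, "Hello."), (63.0, 65.25, "Fine.")])
```

Working in integer milliseconds before splitting into hours, minutes and seconds avoids rounding surprises at the boundaries.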

Subpackages / Submodules

medkit.io.medkit_json