medkit.io
APIs
To access these APIs, import them as follows:
from medkit.io import <api_to_import>
Classes:
| BratInputConverter | Class in charge of converting brat annotations. |
| BratOutputConverter | Class in charge of converting a list of TextDocuments into a brat collection file. |
| DoccanoClientConfig | A class representing the configuration in the doccano client. |
| DoccanoInputConverter | Convert doccano files (.JSONL) containing annotations for a given task. |
| DoccanoOutputConverter | Convert medkit files to doccano files (.JSONL) for a given task. |
| DoccanoTask | Supported doccano tasks. |
| RTTMInputConverter | Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments. |
| RTTMOutputConverter | Build Rich Transcription Time Marked (.rttm) files containing diarization information from Segment objects. |
| SRTInputConverter | Convert .srt files containing transcription information into turn segments with transcription attributes. |
| SRTOutputConverter | Build .srt files containing transcription information from Segment objects. |
| SpacyInputConverter | Class in charge of converting spacy documents into a collection of TextDocuments. |
| SpacyOutputConverter | Class in charge of converting a list of TextDocuments into a list of spacy documents. |
- class BratInputConverter(detect_cuis_in_notes=True, notes_label='brat_note', uid=None)
Class in charge of converting brat annotations
- Parameters:
detect_cuis_in_notes (bool, default=True) – If True, strings looking like CUIs in annotator notes of entities will be converted to UMLS normalization attributes rather than creating an Attribute with the whole note text as value.
notes_label (str, default="brat_note") – Label to use for attributes created from annotator notes.
uid (str, optional) – Identifier of the converter.
Methods:
load(dir_path[, ann_ext, text_ext]) – Create a list of TextDocuments from a folder containing text files and associated brat annotations files.
load_annotations(ann_file) – Load a .ann file and return a list of Annotation objects.
load_doc(ann_path, text_path) – Create a TextDocument from a .ann file and its associated .txt file
- load(dir_path, ann_ext='.ann', text_ext='.txt')
Create a list of TextDocuments from a folder containing text files and associated brat annotations files.
- Parameters:
dir_path (str or Path) – The path to the directory containing the text files and the annotation files (.ann)
ann_ext (str, optional) – The extension of the brat annotation file (e.g. .ann)
text_ext (str, optional) – The extension of the text file (e.g. .txt)
- Return type:
list[TextDocument]
- Returns:
list of TextDocument – The list of TextDocuments
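A minimal sketch of the directory layout that load reads: one text file and one brat standoff annotation file sharing a basename. The file name doc1, the entity label disease and the helper names are illustrative assumptions. The helper load_brat_dir is only a thin wrapper around the documented call and requires medkit to be installed.

```python
import tempfile
from pathlib import Path


def write_brat_example(dir_path):
    """Write a tiny .txt/.ann pair using brat's tab-separated standoff format."""
    dir_path = Path(dir_path)
    text = "The patient has diabetes."
    start = text.index("diabetes")
    end = start + len("diabetes")
    (dir_path / "doc1.txt").write_text(text, encoding="utf-8")
    # Text-bound annotation: ID <TAB> label start end <TAB> covered text
    ann_line = f"T1\tdisease {start} {end}\tdiabetes\n"
    (dir_path / "doc1.ann").write_text(ann_line, encoding="utf-8")
    return text, start, end


def load_brat_dir(dir_path):
    """Hand the directory to medkit (assumes medkit is installed)."""
    from medkit.io import BratInputConverter

    return BratInputConverter().load(dir_path)


with tempfile.TemporaryDirectory() as tmp:
    text, start, end = write_brat_example(tmp)
    ann_path = Path(tmp) / "doc1.ann"
    ann_fields = ann_path.read_text(encoding="utf-8").rstrip("\n").split("\t")
    basenames = sorted(p.stem for p in Path(tmp).iterdir())
```

The offsets in the .ann file index into the raw text of the .txt file, which is why the two files have to be kept together.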
- load_doc(ann_path, text_path)
Create a TextDocument from a .ann file and its associated .txt file
- Parameters:
ann_path (str or Path) – The path to the brat annotation file.
text_path (str or Path) – The path to the text document file.
- Return type:
TextDocument
- Returns:
TextDocument – The document containing the text and the annotations
- load_annotations(ann_file)
Load a .ann file and return a list of Annotation objects.
- Parameters:
ann_file (str or Path) – Path to the .ann file.
- Return type:
list[TextAnnotation]
- Returns:
list of TextAnnotation – The list of text annotations
- class BratOutputConverter(anns_labels=None, attrs=None, notes_label='brat_note', ignore_segments=True, convert_cuis_to_notes=True, create_config=True, top_values_by_attr=50, uid=None)
Class in charge of converting a list of TextDocuments into a brat collection file.
Hint
BRAT checks the coherence between span and text for each annotation. This converter adjusts the text and spans to get the right visualization and ensure compatibility.
Initialize the Brat output converter
- Parameters:
anns_labels (list of str, optional) – Labels of medkit annotations to convert into Brat annotations. If None (default) all the annotations will be converted
attrs (list of str, optional) – Labels of medkit attributes to add in the annotations that will be included. If None (default) all medkit attributes found in the segments or relations will be converted to Brat attributes
notes_label (str, default="brat_note") – Label of attributes that will be converted to annotator notes.
ignore_segments (bool, default=True) – If True medkit segments will be ignored. Only entities, attributes and relations will be converted to Brat annotations. If False the medkit segments will be converted to Brat annotations as well.
convert_cuis_to_notes (bool, default=True) – If True, UMLS normalization attributes will be converted to annotator notes rather than attributes. For entities with multiple UMLS attributes, CUIs will be separated by spaces (ex: “C0011849 C0004096”).
create_config (bool, default=True) – Whether to create a configuration file for the generated collection. This file defines the types of annotations generated, it is necessary for the correct visualization on Brat.
top_values_by_attr (int, default=50) – Defines the number of most common values by attribute to show in the configuration. This is useful when an attribute has a large number of values, only the ‘top’ ones will be in the config. By default, the top 50 of values by attr will be in the config.
uid (str, optional) – Identifier of the converter
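To illustrate what keeping the most common values by attribute means for top_values_by_attr, here is a small standard-library sketch. It is not medkit's implementation; the function name top_values and the sample values are assumptions made for the example.

```python
from collections import Counter


def top_values(values, n=50):
    """Return the n most frequent values, most frequent first."""
    counts = Counter(values)
    return [value for value, _ in counts.most_common(n)]


# Values observed for one hypothetical attribute across a collection
observed = ["left", "right", "left", "bilateral", "left", "right"]
kept = top_values(observed, n=2)
```

With n=2, only the two most frequent values would be listed in the configuration, and the rare value "bilateral" would be left out.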
Methods:
save(docs, dir_path[, doc_names])Convert and save a collection or list of TextDocuments into a Brat collection.
- save(docs, dir_path, doc_names=None)
Convert and save a collection or list of TextDocuments into a Brat collection. For each collection or list of documents, a folder is created with ‘.txt’ and ‘.ann’ files; an ‘annotation.conf’ is saved if required.
- Parameters:
docs (list of TextDocument) – List of medkit doc objects to convert
dir_path (str or Path) – String or path object to save the generated files
doc_names (list of str, optional) – Optional list with the names for the generated files. If ‘None’, ‘uid’ will be used as the name. Where ‘uid.txt’ has the raw text of the document and ‘uid.ann’ the Brat annotation file.
- class DoccanoInputConverter(task, client_config=None, attr_label='doccano_category', uid=None)
Convert doccano files (.JSONL) containing annotations for a given task.
For each line, a TextDocument will be created. The doccano files can be loaded from a directory with zip files or from a jsonl file.
The converter supports custom configuration to define the parameters used by doccano when importing the data (cf. DoccanoClientConfig).
Warning
If the option "Count grapheme clusters as one character" was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.
- Parameters:
task (DoccanoTask) – The doccano task for the input converter
client_config (DoccanoClientConfig, optional) – Optional client configuration to define default values in doccano interface. This config can change, for example, the name of the text field or labels.
attr_label (str, default="doccano_category") – The label to use for the medkit attribute that represents the doccano category. This is related to TEXT_CLASSIFICATION projects.
uid (str, optional) – Identifier of the converter.
Methods:
load_from_directory_zip(dir_path) – Create a list of TextDocuments from zip files in a directory.
load_from_file(input_file) – Create a list of TextDocuments from a doccano JSONL file.
load_from_zip(input_file) – Create a list of TextDocuments from a zip file containing a JSONL file coming from doccano.
set_prov_tracer(prov_tracer) – Enable provenance tracing.
Attributes:
description – Contains all the input converter init parameters.
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters:
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- property description: OperationDescription
Contains all the input converter init parameters.
- Return type:
OperationDescription
- load_from_directory_zip(dir_path)
Create a list of TextDocuments from zip files in a directory. The zip files should contain a JSONL file coming from doccano.
- Parameters:
dir_path (str or Path) – The path to the directory containing zip files.
- Return type:
list[TextDocument]
- Returns:
list of TextDocument – A list of TextDocuments
- load_from_zip(input_file)
Create a list of TextDocuments from a zip file containing a JSONL file coming from doccano.
- Parameters:
input_file (str or Path) – The path to the zip file containing a doccano JSONL file
- Return type:
list[TextDocument]
- Returns:
list of TextDocument – A list of TextDocuments
- load_from_file(input_file)
Create a list of TextDocuments from a doccano JSONL file.
- Parameters:
input_file (str or Path) – The path to the JSONL file containing doccano annotations
- Return type:
list[TextDocument]
- Returns:
list of TextDocument – A list of TextDocuments
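The sketch below writes a small JSONL file in the shape commonly exported by doccano for sequence labeling: one JSON object per line, with the default text and label keys and entities given as [start, end, label] tuples. The record contents, the file name all.jsonl and the helper names are assumptions. load_with_medkit wraps the documented call and needs medkit installed.

```python
import json
import tempfile
from pathlib import Path

# Two hypothetical records; offsets index into the "text" value
records = [
    {"text": "Aspirin relieves headache.",
     "label": [[0, 7, "drug"], [17, 25, "symptom"]]},
    {"text": "No treatment given.", "label": []},
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_with_medkit(path):
    """Convert the file to TextDocuments (assumes medkit is installed)."""
    from medkit.io import DoccanoInputConverter, DoccanoTask

    converter = DoccanoInputConverter(task=DoccanoTask.SEQUENCE_LABELING)
    return converter.load_from_file(path)


with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "all.jsonl"
    write_jsonl(jsonl_path, records)
    lines = jsonl_path.read_text(encoding="utf-8").splitlines()
    spans = [records[0]["text"][s:e] for s, e, _ in records[0]["label"]]
```

Since one TextDocument is created per line, this file would be expected to yield two documents.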
- class DoccanoClientConfig(column_text='text', column_label='label')
A class representing the configuration in the doccano client. The default values are the default values used by doccano.
- Variables:
column_text (str, default="text") – Name or key representing the text
column_label (str, default="label") – Name or key representing the label
- class DoccanoOutputConverter(task, anns_labels=None, attr_label=None, ignore_segments=True, include_metadata=True, uid=None)
Convert medkit files to doccano files (.JSONL) for a given task.
For each TextDocument, a JSON line will be created.
- Parameters:
task (DoccanoTask) – The doccano task for the output converter
anns_labels (list of str, optional) – Labels of medkit annotations to convert into doccano annotations. If None (default) all the entities or relations will be converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
attr_label (str, optional) – The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.
ignore_segments (bool, default=True) – If True medkit segments will be ignored. Only entities will be converted to Doccano entities. If False the medkit segments will be converted to Doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
include_metadata (bool, default=True) – Whether to include medkit metadata in the converted documents
uid (str, optional) – Identifier of the converter.
Methods:
save(docs, output_file) – Convert and save a list of TextDocuments into a doccano file (.JSONL)
- save(docs, output_file)
Convert and save a list of TextDocuments into a doccano file (.JSONL)
- Parameters:
docs (list of TextDocument) – List of medkit doc objects to convert
output_file (str or Path) – Path or string of the JSONL file where to save the converted documents
- class DoccanoTask(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Supported doccano tasks. The task defines the type of document to convert.
- Variables:
TEXT_CLASSIFICATION – Documents with a category
RELATION_EXTRACTION – Documents with entities and relations (including IDs)
SEQUENCE_LABELING – Documents with entities in tuples
- class RTTMInputConverter(turn_label='turn', speaker_label='speaker', converter_id=None)
Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments.
For each turn in a .rttm file, a Segment will be created, with an associated Attribute holding the name of the turn speaker as value. The segments can be retrieved directly or as part of an AudioDocument instance.
If a ProvTracer is set, provenance information will be added for each segment and each attribute (referencing the input converter as the operation).
- Parameters:
turn_label (str, default="turn") – Label of segments representing turns in the .rttm file.
speaker_label (str, default="speaker") – Label of speaker attributes to add to each segment.
converter_id (str, optional) – Identifier of the converter.
Attributes:
description – Contains all the input converter init parameters.
Methods:
load(rttm_dir[, audio_dir, audio_ext]) – Load all .rttm files in a directory into a list of AudioDocument objects.
load_doc(rttm_file, audio_file) – Load a single .rttm file into an AudioDocument.
load_turns(rttm_file, audio_file) – Load a .rttm file and return a list of Segment objects.
set_prov_tracer(prov_tracer) – Enable provenance tracing.
- property description: OperationDescription
Contains all the input converter init parameters.
- Return type:
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters:
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- load(rttm_dir, audio_dir=None, audio_ext='.wav')
Load all .rttm files in a directory into a list of AudioDocument objects.
For each .rttm file, there must be a corresponding audio file with the same basename, either in the same directory or in a separate audio directory.
- Parameters:
rttm_dir (str or Path) – Directory containing the .rttm files.
audio_dir (str or Path, optional) – Directory containing the audio files corresponding to the .rttm files, if they are not in rttm_dir.
audio_ext (str, default=".wav") – File extension to use for audio files.
- Return type:
list[AudioDocument]
- Returns:
list of AudioDocument – List of generated documents.
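The basename rule can be sketched with pathlib as follows. This is an illustration of the pairing described above, not medkit's code. The function name pair_rttm_with_audio is hypothetical, and raising FileNotFoundError for a missing audio file is an assumption about how such a mismatch could be reported.

```python
import tempfile
from pathlib import Path


def pair_rttm_with_audio(rttm_dir, audio_dir=None, audio_ext=".wav"):
    """Pair each .rttm file with the audio file sharing its basename."""
    rttm_dir = Path(rttm_dir)
    # Fall back to the .rttm directory when no audio directory is given
    audio_dir = Path(audio_dir) if audio_dir is not None else rttm_dir
    pairs = []
    for rttm_file in sorted(rttm_dir.glob("*.rttm")):
        audio_file = audio_dir / (rttm_file.stem + audio_ext)
        if not audio_file.exists():
            raise FileNotFoundError(f"No audio file for {rttm_file.name}")
        pairs.append((rttm_file, audio_file))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.rttm", "a.wav", "b.rttm", "b.wav"):
        (Path(tmp) / name).touch()
    pair_names = [(r.name, a.name) for r, a in pair_rttm_with_audio(tmp)]

    # Removing one audio file breaks the pairing
    (Path(tmp) / "b.wav").unlink()
    try:
        pair_rttm_with_audio(tmp)
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
```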
- load_doc(rttm_file, audio_file)
Load a single .rttm file into an AudioDocument.
- Parameters:
rttm_file (str or Path) – Path to the .rttm file.
audio_file (str or Path) – Path to the corresponding audio file.
- Return type:
AudioDocument
- Returns:
AudioDocument – Generated document.
- load_turns(rttm_file, audio_file)
Load a .rttm file and return a list of Segment objects.
- Parameters:
rttm_file (str or Path) – Path to the .rttm file.
audio_file (str or Path) – Path to the corresponding audio file.
- Return type:
list[Segment]
- Returns:
list of Segment – Turn segments as found in the .rttm file.
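For readers unfamiliar with the format, here is a rough standard-library sketch of how the turns in a .rttm file can be read. It assumes the usual space-separated SPEAKER lines, with the file id in the 2nd column, onset and duration in the 4th and 5th, and the speaker name in the 8th. The Turn type, the function name and the sample lines are made up for the example, and medkit's own parsing may differ.

```python
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    file_id: str
    start: float
    end: float
    speaker: str


def parse_rttm_line(line: str) -> Optional[Turn]:
    """Parse one SPEAKER line; return None for other or malformed lines."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        return None
    onset = float(fields[3])
    duration = float(fields[4])
    return Turn(fields[1], onset, onset + duration, fields[7])


rttm_content = """\
SPEAKER rec1 1 0.50 2.25 <NA> <NA> alice <NA> <NA>
SPEAKER rec1 1 3.0 1.5 <NA> <NA> bob <NA> <NA>
"""
parsed = (parse_rttm_line(line) for line in rttm_content.splitlines())
turns = [turn for turn in parsed if turn is not None]
```

Each turn corresponds to one Segment, and the speaker name is what ends up as the value of the speaker attribute.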
- class RTTMOutputConverter(turn_label='turn', speaker_label='speaker')
Build Rich Transcription Time Marked (.rttm) files containing diarization information from Segment objects.
There must be a segment for each turn, with an associated Attribute holding the name of the turn speaker as value. The segments can be passed directly or as part of AudioDocument instances.
- Parameters:
turn_label (str, default="turn") – Label of segments representing turns in the audio documents.
speaker_label (str, default="speaker") – Label of speaker attributes attached to each turn segment.
Methods:
save(docs, rttm_dir[, doc_names]) – Save AudioDocument instances as .rttm files in a directory.
save_doc(doc, rttm_file[, rttm_doc_id]) – Save a single AudioDocument as a .rttm file.
save_turn_segments(turn_segments, rttm_file, ...) – Save Segment objects into a .rttm file.
- save(docs, rttm_dir, doc_names=None)
Save AudioDocument instances as .rttm files in a directory.
- Parameters:
docs (list of AudioDocument) – List of audio documents to save.
rttm_dir (str or Path) – Directory into which the generated .rttm files will be stored.
doc_names (list of str, optional) – Optional list of names to use as basenames and file ids for the generated .rttm files (2nd column). If none are provided, the document ids will be used.
- save_doc(doc, rttm_file, rttm_doc_id=None)
Save a single AudioDocument as a .rttm file.
- Parameters:
doc (AudioDocument) – Audio document to save.
rttm_file (str or Path) – Path of the generated .rttm file.
rttm_doc_id (str, optional) – File uid to use for the generated .rttm file (2nd column). If none is provided, the document uid will be used.
- save_turn_segments(turn_segments, rttm_file, rttm_doc_id)
Save Segment objects into a .rttm file.
- Parameters:
turn_segments (list of Segment) – Turn segments to save.
rttm_file (str or Path) – Path of the generated .rttm file.
rttm_doc_id (str, optional) – File uid to use for the generated .rttm file (2nd column).
- class SpacyInputConverter(entities=None, span_groups=None, attrs=None, uid=None)
Class in charge of converting spacy documents into a collection of TextDocuments.
Initialize the spacy input converter
- Parameters:
entities (list of str, optional) – Labels of spacy entities (doc.ents) to convert into medkit entities. If None (default) all spacy entities will be converted and added to the corresponding medkit document.
span_groups (list of str, optional) – Name of groups of spacy spans (doc.spans) to convert into medkit segments. If None (default) all groups of spacy spans will be converted and added into the medkit document.
attrs (list of str, optional) – Name of span extensions to convert into medkit attributes. If None (default) all non-None extensions will be added for each annotation
uid (str, optional) – Identifier of the converter
Methods:
load(spacy_docs)Create a list of TextDocuments from a list of spacy Doc objects.
- load(spacy_docs)
Create a list of TextDocuments from a list of spacy Doc objects. Depending on the configuration of the converter, the selected annotations and attributes are included in the documents.
- Parameters:
spacy_docs (list of Doc) – A list of spacy documents to convert
- Return type:
list[TextDocument]
- Returns:
list of TextDocument – A list of TextDocuments
- class SpacyOutputConverter(nlp, apply_nlp_spacy=False, labels_anns=None, attrs=None, uid=None)
Class in charge of converting a list of TextDocuments into a list of spacy documents
Initialize the spacy output converter
- Parameters:
nlp (Language) – Language object with the loaded pipeline from Spacy
apply_nlp_spacy (bool, default=False) – If True, each component of the nlp pipeline is applied to the new spacy document. Some features, such as ‘POS TAG’, are added by a pipeline component, so this parameter should be True to add such attributes. If False, the nlp pipeline is not applied to the spacy document, so the document contains only the annotations and attributes transferred by medkit.
labels_anns (list of str, optional) – Labels of medkit annotations to include in the spacy document. If None (default) all the annotations will be included.
attrs (list of str, optional) – Labels of medkit attributes to add in the annotations that will be included. If None (default) all the attributes will be added as custom attributes in each annotation included.
uid (str, optional) – Identifier of the converter
Methods:
convert(medkit_docs) – Convert a list of TextDocuments into a list of spacy Doc objects.
- convert(medkit_docs)
Convert a list of TextDocuments into a list of spacy Doc objects. Depending on the configuration of the converter, the selected annotations and attributes are included in the documents.
- Parameters:
medkit_docs (list of TextDocument) – A list of TextDocuments to convert
- Return type:
list[Doc]
- Returns:
list of Doc – A list of spacy Doc objects
- class SRTInputConverter(turn_segment_label='turn', transcription_attr_label='transcribed_text', converter_id=None)
Convert .srt files containing transcription information into turn segments with transcription attributes.
For each turn in a .srt file, a Segment will be created, with an associated Attribute holding the transcribed text as value. The segments can be retrieved directly or as part of an AudioDocument instance.
If a ProvTracer is set, provenance information will be added for each segment and each attribute (referencing the input converter as the operation).
- Parameters:
turn_segment_label (str, default="turn") – Label to use for segments representing turns in the .srt file.
transcription_attr_label (str, default="transcribed_text") – Label to use for segment attributes containing the transcribed text.
converter_id (str, optional) – Identifier of the converter.
Attributes:
description – Contains all the input converter init parameters.
Methods:
load(srt_dir[, audio_dir, audio_ext]) – Load all .srt files in a directory into a list of AudioDocument objects.
load_doc(srt_file, audio_file) – Load a single .srt file into an AudioDocument containing turn segments with transcription attributes.
load_segments(srt_file, audio_file) – Load a .srt file and return a list of Segment objects corresponding to turns, with transcription attributes.
set_prov_tracer(prov_tracer) – Enable provenance tracing.
- property description: OperationDescription
Contains all the input converter init parameters.
- Return type:
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters:
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- load(srt_dir, audio_dir=None, audio_ext='.wav')
Load all .srt files in a directory into a list of AudioDocument objects.
For each .srt file, there must be a corresponding audio file with the same basename, either in the same directory or in a separate audio directory.
- Parameters:
srt_dir (str or Path) – Directory containing the .srt files.
audio_dir (str or Path, optional) – Directory containing the audio files corresponding to the .srt files, if they are not in srt_dir.
audio_ext (str, default=".wav") – File extension to use for audio files.
- Return type:
list[AudioDocument]
- Returns:
list of AudioDocument – List of generated documents.
- load_doc(srt_file, audio_file)
Load a single .srt file into an AudioDocument containing turn segments with transcription attributes.
- Parameters:
srt_file (str or Path) – Path to the .srt file.
audio_file (str or Path) – Path to the corresponding audio file.
- Return type:
AudioDocument
- Returns:
AudioDocument – Generated document.
- load_segments(srt_file, audio_file)
Load a .srt file and return a list of Segment objects corresponding to turns, with transcription attributes.
- Parameters:
srt_file (str or Path) – Path to the .srt file.
audio_file (str or Path) – Path to the corresponding audio file.
- Return type:
list[Segment]
- Returns:
list of Segment – Turn segments as found in the .srt file, with transcription attributes attached.
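As a rough illustration of the input that load_segments works from, the sketch below parses .srt content with the standard library. It assumes the common layout: blocks separated by blank lines, each with an index line, a "start --> end" time line in HH:MM:SS,mmm form, and one or more text lines. The function names and the sample content are hypothetical, and this is not medkit's parser.

```python
import re

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' into seconds."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"Invalid SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content):
    """Return (start, end, text) tuples, one per subtitle block."""
    entries = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue
        # lines[0] is the block index, lines[1] the time range
        start, end = (srt_time_to_seconds(p) for p in lines[1].split("-->"))
        entries.append((start, end, "\n".join(lines[2:])))
    return entries


example = """1
00:00:00,500 --> 00:00:02,000
Hello doctor.

2
00:00:02,500 --> 00:01:04,250
I have had a headache
since yesterday.
"""
entries = parse_srt(example)
```

Each tuple corresponds to one turn: the time range gives the segment boundaries and the text is what the transcription attribute would hold.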
- class SRTOutputConverter(segment_turn_label='turn', transcription_attr_label='transcribed_text')
Build .srt files containing transcription information from Segment objects.
There must be a segment for each turn, with an associated Attribute holding the transcribed text as value. The segments can be passed directly or as part of AudioDocument instances.
- Parameters:
segment_turn_label (str, default="turn") – Label of segments representing turns in the audio documents.
transcription_attr_label (str, default="transcribed_text") – Label of segment attributes containing the transcribed text.
Methods:
save(docs, srt_dir[, doc_names]) – Save AudioDocument instances as .srt files in a directory.
save_doc(doc, srt_file) – Save a single AudioDocument as a .srt file.
save_segments(segments, srt_file) – Save Segment objects representing turns into a .srt file.
- save(docs, srt_dir, doc_names=None)
Save AudioDocument instances as .srt files in a directory.
- Parameters:
docs (list of AudioDocument) – List of audio documents to save.
srt_dir (str or Path) – Directory into which the generated .srt files will be stored.
doc_names (list of str, optional) – Optional list of names to use as basenames for the generated .srt files.
- save_doc(doc, srt_file)
Save a single AudioDocument as a .srt file.
- Parameters:
doc (AudioDocument) – Audio document to save.
srt_file (str or Path) – Path of the generated .srt file.
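The kind of output a .srt file contains can be sketched as follows: each turn becomes a numbered block with a time range and the transcribed text. This is a standard-library illustration under the usual .srt conventions, not the converter's implementation. The function names and the (start, end, text) tuple shape are assumptions for the example.

```python
def seconds_to_srt_time(seconds):
    """Format a time in seconds as 'HH:MM:SS,mmm'."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def format_srt(turns):
    """Build .srt content from (start, end, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(turns, start=1):
        time_range = f"{seconds_to_srt_time(start)} --> {seconds_to_srt_time(end)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    # Blocks are separated by a blank line
    return "\n".join(blocks)


srt_content = format_srt([(0.0, 1.5, "Hello."), (1.5, 3.0, "Hi.")])
```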