medkit.core#

APIs#

To access these APIs, import them like this:

from medkit.core import <api_to_import>

Classes:

AnnotationContainer(doc_id)

Manage a list of annotations belonging to a document.

Attribute(label[, value, metadata, uid])

Medkit attribute, to be added to an annotation

AttributeContainer(ann_id)

Manage a list of attributes attached to an annotation.

Collection(*[, text_docs, audio_docs])

Collection of documents of any modality (text, audio).

DocOperation([uid, name])

Abstract operation directly executed on text documents.

DocPipeline(pipeline, labels_by_input_key[, uid])

Wrapper around the Pipeline class that runs a pipeline on a list (or collection) of documents, retrieving input annotations from each document and attaching output annotations back to documents.

Document(*args, **kwds)

Base document protocol that must be implemented by document classes of all modalities (text, audio, etc).

GlobalStore()

Global store

InputConverter()

Abstract class for converting external documents to medkit documents

Operation([uid, name])

Abstract class for all annotator modules

OperationDescription(uid, name[, ...])

Description of a specific instance of an operation

OutputConverter()

Abstract class for converting medkit documents to an external format

Pipeline(steps, input_keys, output_keys[, ...])

Graph of processing operations

PipelineStep(operation, input_keys, output_keys)

Pipeline item describing how a processing operation is connected to other operations

Prov(data_item, op_desc, source_data_items, ...)

Provenance information for a specific data item.

ProvTracer([store, _graph])

Provenance tracing component.

Store(*args, **kwds)

Store protocol

class AnnotationContainer(doc_id)[source]#

Manage a list of annotations belonging to a document.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

The annotations will be stored in a Store, which can rely on a simple dict or something more complicated like a database.

The global store may be initialized using GlobalStore. Otherwise, a default one (i.e. a dict store) is used.

Instantiate the annotation container

Parameters

doc_id (str) – The identifier of the document to which the annotations belong.

Methods:

add(ann)

Attach an annotation to the document.

get(*[, label, key])

Return a list of the annotations of the document, optionally filtering by label or key.

get_by_id(uid)

Return the annotation corresponding to a specific identifier.

get_ids(*[, label, key])

Return an iterator of the identifiers of the annotations of the document, optionally filtering by label or key.

add(ann)[source]#

Attach an annotation to the document.

Parameters

ann (~AnnotationType) – Annotation to add.

Raises

ValueError – If the annotation is already attached to the document (based on annotation.uid)

get(*, label=None, key=None)[source]#

Return a list of the annotations of the document, optionally filtering by label or key.

Parameters
  • label (Optional[str]) – Label to use to filter annotations.

  • key (Optional[str]) – Key to use to filter annotations.

Return type

List[~AnnotationType]

get_ids(*, label=None, key=None)[source]#

Return an iterator of the identifiers of the annotations of the document, optionally filtering by label or key.

This method is provided so that additional filtering is easier to implement in subclasses.

Parameters
  • label (Optional[str]) – Label to use to filter annotations.

  • key (Optional[str]) – Key to use to filter annotations.

Return type

Iterator[str]

get_by_id(uid)[source]#

Return the annotation corresponding to a specific identifier.

Parameters

uid (str) – Identifier of the annotation to return.

Return type

~AnnotationType
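
The container behaviour described above can be illustrated with a minimal, self-contained sketch. It is not medkit's implementation: `SketchAnnotation`, `SketchAnnotationContainer` and the plain dict used as a store are hypothetical stand-ins, and the `keys` field exists only so that filtering by key can be shown.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set


@dataclass
class SketchAnnotation:
    """Hypothetical stand-in for an annotation (not a medkit class)."""

    uid: str
    label: str
    keys: Set[str] = field(default_factory=set)


class SketchAnnotationContainer:
    """List-like container that keeps identifiers and delegates storage to a dict."""

    def __init__(self, doc_id: str, store: Optional[Dict[str, SketchAnnotation]] = None):
        self.doc_id = doc_id
        # Assumption: a plain dict plays the role of the default store.
        self._store = store if store is not None else {}
        self._ann_ids: List[str] = []

    def __len__(self) -> int:
        return len(self._ann_ids)

    def __iter__(self) -> Iterator[SketchAnnotation]:
        return iter(self.get())

    def add(self, ann: SketchAnnotation) -> None:
        # Duplicates are detected by identifier, as in the documented add().
        if ann.uid in self._ann_ids:
            raise ValueError(f"Annotation {ann.uid} is already attached to {self.doc_id}")
        self._ann_ids.append(ann.uid)
        self._store[ann.uid] = ann

    def get_ids(self, *, label: Optional[str] = None, key: Optional[str] = None) -> Iterator[str]:
        # Filtering lives here so that get() can be built on top of it.
        for uid in self._ann_ids:
            ann = self._store[uid]
            if label is not None and ann.label != label:
                continue
            if key is not None and key not in ann.keys:
                continue
            yield uid

    def get(self, *, label: Optional[str] = None, key: Optional[str] = None) -> List[SketchAnnotation]:
        return [self.get_by_id(uid) for uid in self.get_ids(label=label, key=key)]

    def get_by_id(self, uid: str) -> SketchAnnotation:
        return self._store[uid]


container = SketchAnnotationContainer(doc_id="doc-1")
container.add(SketchAnnotation(uid="a1", label="disease", keys={"ner"}))
container.add(SketchAnnotation(uid="a2", label="drug"))
disease_ids = [ann.uid for ann in container.get(label="disease")]
```

Building `get()` on top of `get_ids()` mirrors the remark that `get_ids()` exists to make additional filtering easier to implement in subclasses.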

class Attribute(label, value=None, metadata=None, uid=None)[source]#

Medkit attribute, to be added to an annotation

Variables
  • label (str) – The attribute label

  • value (Optional[Any]) – The value of the attribute. Should be either simple built-in types (int, float, bool, str) or collections of these types (list, dict, tuple). If you need structured complex data you should create a subclass of Attribute.

  • metadata (Dict[str, Any]) – The metadata of the attribute

  • uid (str) – The identifier of the attribute

Methods:

copy()

Create a new attribute that is a copy of the current instance, but with a new identifier

from_dict(attribute_dict)

Creates an Attribute from a dict

to_brat()

Return a value compatible with the brat format

to_spacy()

Return a value compatible with spaCy

to_brat()[source]#

Return a value compatible with the brat format

Return type

Optional[Any]

to_spacy()[source]#

Return a value compatible with spaCy

Return type

Optional[Any]

copy()[source]#

Create a new attribute that is a copy of the current instance, but with a new identifier

This is used when we want to duplicate an existing attribute onto a different annotation.

Return type

Attribute

classmethod from_dict(attribute_dict)[source]#

Creates an Attribute from a dict

Parameters

attribute_dict (dict) – A dictionary from a serialized Attribute as generated by to_dict()

Return type

Self
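
As a rough illustration of `copy()` and `from_dict()`, the sketch below uses a hypothetical `SketchAttribute` dataclass rather than medkit's `Attribute`. The dictionary layout produced by its `to_dict()` and the use of `uuid4` for identifiers are assumptions made for the example only.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchAttribute:
    """Hypothetical stand-in for an attribute (not medkit's class)."""

    label: str
    value: Optional[Any] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    # Assumption: identifiers are random UUID strings.
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_dict(self) -> Dict[str, Any]:
        # Assumed serialization layout, chosen for this sketch.
        return {
            "uid": self.uid,
            "label": self.label,
            "value": self.value,
            "metadata": self.metadata,
        }

    @classmethod
    def from_dict(cls, attribute_dict: Dict[str, Any]) -> "SketchAttribute":
        return cls(
            label=attribute_dict["label"],
            value=attribute_dict["value"],
            metadata=attribute_dict["metadata"],
            uid=attribute_dict["uid"],
        )

    def copy(self) -> "SketchAttribute":
        # Same content, but no uid is passed so a new identifier is generated.
        return SketchAttribute(
            label=self.label, value=self.value, metadata=dict(self.metadata)
        )


negation = SketchAttribute(label="is_negated", value=True)
duplicate = negation.copy()
restored = SketchAttribute.from_dict(negation.to_dict())
```

Here `duplicate` carries the same label and value as `negation` under a different identifier, while `restored` keeps the original identifier because it is rebuilt from the serialized form.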

class AttributeContainer(ann_id)[source]#

Manage a list of attributes attached to an annotation.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

The attributes will be stored in a Store, which can rely on a simple dict or something more complicated like a database.

The global store may be initialized using GlobalStore. Otherwise, a default one (i.e. a dict store) is used.

Methods:

add(attr)

Attach an attribute to the annotation.

get(*[, label])

Return a list of the attributes of the annotation, optionally filtering by label.

get(*, label=None)[source]#

Return a list of the attributes of the annotation, optionally filtering by label.

Parameters

label (Optional[str]) – Label to use to filter attributes.

Return type

List[Attribute]

add(attr)[source]#

Attach an attribute to the annotation.

Parameters

attr (Attribute) – Attribute to add.

Raises

ValueError – If the attribute is already attached to the annotation (based on attr.uid).

class Collection(*, text_docs=None, audio_docs=None)[source]#

Collection of documents of any modality (text, audio).

This class makes it possible to group together a set of documents representing a common unit (for instance a patient), even if they don’t belong to the same modality.

This class is still a work-in-progress. In the future it should be possible to attach additional information to a Collection.

Parameters
  • text_docs (Optional[List[TextDocument]]) – List of text documents

  • audio_docs (Optional[List[AudioDocument]]) – List of audio documents

Attributes:

all_docs

List of all the documents belonging to the collection, whatever their modality

property all_docs: List[medkit.core.document.Document]#

List of all the documents belonging to the collection, whatever their modality

Return type

List[Document]
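
A small sketch of the idea behind `all_docs`, using a hypothetical `SketchCollection` with plain strings standing in for documents. The ordering (text documents first, then audio documents) is an assumption of this example, not something the documentation states.

```python
from typing import Any, List, Optional


class SketchCollection:
    """Hypothetical stand-in grouping documents of several modalities."""

    def __init__(
        self,
        *,
        text_docs: Optional[List[Any]] = None,
        audio_docs: Optional[List[Any]] = None,
    ):
        # Keyword-only arguments, each defaulting to an empty list.
        self.text_docs = text_docs if text_docs is not None else []
        self.audio_docs = audio_docs if audio_docs is not None else []

    @property
    def all_docs(self) -> List[Any]:
        # Assumed order: text documents, then audio documents.
        return self.text_docs + self.audio_docs


patient = SketchCollection(text_docs=["report.txt"], audio_docs=["consult.wav"])
patient_docs = patient.all_docs
```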

class InputConverter[source]#

Abstract class for converting external documents to medkit documents

class OutputConverter[source]#

Abstract class for converting medkit documents to an external format

class DocPipeline(pipeline, labels_by_input_key, uid=None)[source]#

Wrapper around the Pipeline class that runs a pipeline on a list (or collection) of documents, retrieving input annotations from each document and attaching output annotations back to documents.

Initialize the pipeline

Parameters
  • pipeline (Pipeline) – Pipeline to execute on documents. Annotations given to pipeline (corresponding to its input_keys) will be retrieved from documents, according to labels_by_input_key. Annotations returned by pipeline (corresponding to its output_keys) will be added to documents.

  • labels_by_input_key (Dict[str, List[str]]) –

    Labels of existing annotations that should be retrieved from documents and passed to the pipeline as input. One list of labels per input key.

    For the typical use case where the pipeline takes a text document raw segment as input with key “full_text”:

    >>> doc_pipeline = DocPipeline(
    ...     pipeline,
    ...     labels_by_input_key={"full_text": [TextDocument.RAW_SEGMENT]},
    ... )
    

    Because the values of labels_by_input_key are lists (one per input), it is possible to use annotations with different labels for the same input key.

Methods:

run(docs)

Run the pipeline on a list of documents, adding the output annotations to each document

run(docs)[source]#

Run the pipeline on a list of documents, adding the output annotations to each document

Parameters

docs (List[Document[~AnnotationType]]) – The documents on which to run the pipeline. Labels to input keys association will be used to retrieve existing annotations from each document, and all output annotations will also be added to each corresponding document.

Return type

None
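
The retrieve, run and attach cycle described for `run()` can be sketched without medkit as follows. Everything here is a hypothetical stand-in: a document is a plain list of `(label, text)` tuples, the pipeline is an ordinary function, and inputs are assumed to be passed in the order of the `labels_by_input_key` entries.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical annotation representation: a (label, text) pair.
Annotation = Tuple[str, str]


def run_on_docs(
    pipeline_fn: Callable[..., List[Annotation]],
    labels_by_input_key: Dict[str, List[str]],
    docs: List[List[Annotation]],
) -> None:
    """Run pipeline_fn on each document and attach its outputs to that document."""
    for doc_anns in docs:
        inputs = []
        for labels in labels_by_input_key.values():
            # One input list per key, built from every annotation whose
            # label is among the labels listed for that key.
            inputs.append([ann for ann in doc_anns if ann[0] in labels])
        outputs = pipeline_fn(*inputs)
        # Output annotations are added back to the document they came from.
        doc_anns.extend(outputs)


def tag_long_segments(segments: List[Annotation]) -> List[Annotation]:
    """Toy pipeline: emit a 'long' annotation for each segment over 10 characters."""
    return [("long", text) for _, text in segments if len(text) > 10]


docs = [
    [("raw", "short"), ("title", "A much longer title")],
    [("raw", "Another long raw segment")],
]
# Two labels feed the same input key.
run_on_docs(tag_long_segments, {"segments": ["raw", "title"]}, docs)
```

Like the documented `run()`, the sketch returns `None` and modifies the documents in place.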

class Document(*args, **kwds)[source]#

Base document protocol that must be implemented by document classes of all modalities (text, audio, etc).

Documents can contain Annotation objects.

class DocOperation(uid=None, name=None, **kwargs)[source]#

Abstract operation directly executed on text documents. It uses a list of documents as input for running the operation and creates annotations that are directly appended to these documents.

Common initialization for all annotators:
  • assigning identifier to operation

  • storing class name, name and config in description

Parameters
  • uid (str) – Operation identifier

  • name – Operation name (defaults to class name)

  • kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)

class Operation(uid=None, name=None, **kwargs)[source]#

Abstract class for all annotator modules

Common initialization for all annotators:
  • assigning identifier to operation

  • storing class name, name and config in description

Parameters
  • uid (str) – Operation identifier

  • name – Operation name (defaults to class name)

  • kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
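
To show the `locals()` pattern in context, here is a self-contained sketch with a hypothetical base class `SketchOperation` standing in for `Operation`; the `config` attribute and the `SketchMatcher` subclass are illustrative names only. The sketch also discards the implicit `__class__` entry that `locals()` can expose in methods using zero-argument `super()`, so that it does not end up in the stored configuration.

```python
import uuid
from typing import Any, Dict, List, Optional


class SketchOperation:
    """Hypothetical base class mimicking the documented initialization."""

    def __init__(self, uid: Optional[str] = None, name: Optional[str] = None, **kwargs: Any):
        # Assign an identifier to the operation.
        self.uid = uid if uid is not None else str(uuid.uuid4())
        # The name defaults to the class name.
        self.name = name if name is not None else type(self).__name__
        # All other init arguments describe the operation's configuration.
        self.config: Dict[str, Any] = kwargs


class SketchMatcher(SketchOperation):
    def __init__(
        self,
        rules: List[str],
        case_sensitive: bool = False,
        uid: Optional[str] = None,
        name: Optional[str] = None,
    ):
        # Capture every init argument before any other local is defined.
        init_args = locals()
        init_args.pop("self")
        init_args.pop("__class__", None)  # only present in some cases
        super().__init__(**init_args)

        self.rules = rules
        self.case_sensitive = case_sensitive


matcher = SketchMatcher(rules=["fever"], case_sensitive=True)
```

Calling `locals()` on the first line matters: any variable assigned earlier in `__init__` would otherwise be forwarded to the parent class as well.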

Methods:

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

set_prov_tracer(prov_tracer)[source]#

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type

OperationDescription

class OperationDescription(uid, name, class_name=None, config=<factory>)[source]#

Description of a specific instance of an operation

Parameters
  • uid (str) – The unique identifier of the instance described

  • name (str) – The name of the operation. Can be the same as class_name or something more specific, for operations with a behavior that can be customized (for instance a rule-based entity matcher with user-provided rules, or a model-based entity matcher with a user-provided model)

  • class_name (Optional[str]) – The name of the class of the operation

  • config (Dict[str, Any]) – The specific configuration of the instance

class Pipeline(steps, input_keys, output_keys, name=None, uid=None)[source]#

Graph of processing operations

A pipeline is made of pipeline steps, connecting together different processing operations by the use of input/output keys. Each operation can be seen as a node and the keys as its edges. Two operations can be chained by using the same string as an output key for the first operation and as an input key for the second.

Steps must be added in the order of execution; there is no dependency detection mechanism.

Initialize the pipeline

Parameters
  • steps (List[PipelineStep]) –

    List of pipeline steps

    Steps will be executed in the order in which they were added, so make sure to first add the steps that generate data used by other steps.

  • input_keys (List[str]) – List of keys corresponding to the inputs passed to run()

  • output_keys (List[str]) – List of keys corresponding to the outputs returned by run()

  • name (Optional[str]) – Name describing the pipeline (defaults to the class name)

  • uid (Optional[str]) – Identifier of the pipeline

Methods:

run(*all_input_data)

Run the pipeline.

run(*all_input_data)[source]#

Run the pipeline.

Parameters

*all_input_data (List[Any]) –

Input data expected by the pipeline. The number of inputs must match the number of pipeline input_keys.

For each input key, the corresponding input data must be a list of items that can be of any type.

Return type

Union[None, List[Any], Tuple[List[Any], …]]

Returns

Union[None, List[Any], Tuple[List[Any], …]] – All output data returned by the pipeline. The number of outputs matches the number of pipeline output_keys.

For each output key, the corresponding output will be a list of items that can be of any type.

If the pipeline has only one output key, then the corresponding output will be directly returned, not wrapped in a tuple. If the pipeline doesn’t have any output key, nothing (i.e. None) will be returned.
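
The key-based chaining and the return conventions of `run()` can be sketched in a few lines. This is a simplified illustration under stated assumptions, not medkit's code: `SketchStep` and `run_sketch_pipeline` are hypothetical names, operations are plain callables taking and returning lists, and an operation with several output keys is assumed to return a tuple.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class SketchStep:
    operation: Callable[..., Any]  # hypothetical: any callable working on lists
    input_keys: List[str]
    output_keys: List[str]


def run_sketch_pipeline(steps, input_keys, output_keys, *all_input_data):
    if len(all_input_data) != len(input_keys):
        raise ValueError("Expected one input list per input key")
    data_by_key: Dict[str, List[Any]] = dict(zip(input_keys, all_input_data))

    # Steps run strictly in the given order: a step can only read keys that
    # were pipeline inputs or outputs of an earlier step.
    for step in steps:
        args = [data_by_key[key] for key in step.input_keys]
        result = step.operation(*args)
        if len(step.output_keys) == 1:
            result = (result,)
        elif not step.output_keys:
            result = ()
        for key, value in zip(step.output_keys, result):
            data_by_key[key] = value

    outputs = tuple(data_by_key[key] for key in output_keys)
    if not outputs:
        return None  # no output key: nothing is returned
    if len(outputs) == 1:
        return outputs[0]  # single output key: not wrapped in a tuple
    return outputs


def split_sentences(texts):
    return [s.strip() for text in texts for s in text.split(".") if s.strip()]


def uppercase(sentences):
    return [s.upper() for s in sentences]


steps = [
    SketchStep(split_sentences, ["full_text"], ["sentences"]),
    SketchStep(uppercase, ["sentences"], ["shouted"]),
]
shouted = run_sketch_pipeline(steps, ["full_text"], ["shouted"], ["No fever. Mild cough."])
```

The two steps are chained only because `"sentences"` appears as an output key of the first and an input key of the second; swapping the steps would raise a `KeyError`, since nothing reorders them.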

class PipelineStep(operation, input_keys, output_keys, aggregate_input_keys=False)[source]#

Pipeline item describing how a processing operation is connected to other operations

Parameters
  • operation (medkit.core.pipeline.PipelineCompatibleOperation) – The operation to use at that step

  • input_keys (List[str]) – For each input of operation, the key to use to retrieve the corresponding annotations (either retrieved from a document or generated by an earlier pipeline step)

  • output_keys (List[str]) – For each output of operation, the key used to pass output annotations to the next Pipeline step. Can be empty if operation doesn’t return new annotations.

  • aggregate_input_keys (bool) – If True, all the annotations from multiple input keys are aggregated in a single list. Defaults to False
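
The effect of `aggregate_input_keys` can be pictured with the following sketch. `gather_step_inputs` is a hypothetical helper written for this example; it only illustrates the difference between passing one list per input key and passing a single merged list.

```python
from typing import Any, Dict, List


def gather_step_inputs(
    data_by_key: Dict[str, List[Any]],
    input_keys: List[str],
    aggregate_input_keys: bool = False,
) -> List[List[Any]]:
    """Return the positional arguments that would be handed to the operation."""
    per_key = [data_by_key[key] for key in input_keys]
    if aggregate_input_keys:
        # All items from every input key end up in one single list.
        return [[item for items in per_key for item in items]]
    # Default: one list per input key, in the order of input_keys.
    return per_key


data = {"sentences": ["s1", "s2"], "titles": ["t1"]}
separate = gather_step_inputs(data, ["sentences", "titles"])
merged = gather_step_inputs(data, ["sentences", "titles"], aggregate_input_keys=True)
```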

class ProvTracer(store=None, _graph=None)[source]#

Provenance tracing component.

ProvTracer is intended to gather provenance information about how all data is generated by medkit. For each data item (for instance an annotation or an attribute), ProvTracer can tell the operation that created it, the data items that were used to create it, and reciprocally, the data items that were derived from it (cf. Prov).

Provenance-compatible operations should inform the provenance tracer of each data item that they create through the add_prov() method.

Users wanting to gather provenance information should instantiate one unique ProvTracer object and provide it to all operations involved in their data processing flow. Once all operations have been executed, they may then retrieve provenance info for specific data items through get_prov(), or for all items with get_provs().

Composite operations relying on inner operations (such as pipelines) shouldn’t call the add_prov() method. Instead, they should instantiate their own internal ProvTracer and provide it to the operations they rely on, then use add_prov_from_sub_tracer() to integrate information from this internal sub-provenance tracer into the main provenance tracer that was provided to them.

This will build sub-provenance information, that can be retrieved later through get_sub_prov_tracer() or get_sub_prov_tracers(). The inner operations of a composite operation can themselves be composite operations, leading to a tree-like structure of nested provenance tracers.

Parameters

store (Optional[ProvStore]) – Store that will contain all traced data items.

Methods:

add_prov(data_item, op_desc, source_data_items)

Append provenance information about a specific data item.

add_prov_from_sub_tracer(data_items, ...)

Append provenance information about data items created by a composite operation relying on inner operations (such as a pipeline) having its own internal sub-provenance tracer.

get_prov(data_item_id)

Return provenance information about a specific data item.

get_provs()

Return all provenance information about all data items known to the tracer.

get_sub_prov_tracer(operation_id)

Return a sub-provenance tracer containing sub-provenance information from a specific composite operation.

get_sub_prov_tracers()

Return all sub-provenance tracers of the provenance tracer.

has_prov(data_item_id)

Check if the provenance tracer has provenance information about a specific data item.

has_sub_prov_tracer(operation_id)

Check if the provenance tracer has a sub-provenance tracer for a specific composite operation (such as a pipeline).

add_prov(data_item, op_desc, source_data_items)[source]#

Append provenance information about a specific data item.

Parameters
  • data_item (IdentifiableDataItem) – Data item that was created.

  • op_desc (OperationDescription) – Description of the operation that created the data item.

  • source_data_items (List[IdentifiableDataItem]) – Data items that were used by the operation to create the data item.

add_prov_from_sub_tracer(data_items, op_desc, sub_tracer)[source]#

Append provenance information about data items created by a composite operation relying on inner operations (such as a pipeline) having its own internal sub-provenance tracer.

Parameters
  • data_items (List[IdentifiableDataItem]) – Data items created by the composite operation. Should not include internal intermediate data items, only the output of the operation.

  • op_desc (OperationDescription) – Description of the composite operation that created the data items.

  • sub_tracer (ProvTracer) – Internal sub-provenance tracer of the composite operation.

has_prov(data_item_id)[source]#

Check if the provenance tracer has provenance information about a specific data item.

Note

This will return False if we have provenance info about a data item but only in a sub-provenance tracer.

Parameters

data_item_id (str) – Id of the data item.

Return type

bool

Returns

bool – True if there is provenance info that can be retrieved with get_prov().

get_prov(data_item_id)[source]#

Return provenance information about a specific data item.

Parameters

data_item_id (str) – Id of the data item.

Return type

Prov

Returns

Prov – Provenance info about the data item.

get_provs()[source]#

Return all provenance information about all data items known to the tracer.

Note

Nested provenance info from sub-provenance tracers will not be returned.

Return type

List[Prov]

Returns

List[Prov] – Provenance info about all known data items.

has_sub_prov_tracer(operation_id)[source]#

Check if the provenance tracer has a sub-provenance tracer for a specific composite operation (such as a pipeline).

Note

This will return False if there is a sub-provenance tracer for the operation but that is not a direct child (i.e. that is deeper in the hierarchy).

Parameters

operation_id (str) – Id of the composite operation.

Return type

bool

Returns

bool – True if there is a sub-provenance tracer for the operation.

get_sub_prov_tracer(operation_id)[source]#

Return a sub-provenance tracer containing sub-provenance information from a specific composite operation.

Parameters

operation_id (str) – Id of the composite operation.

Return type

ProvTracer

Returns

ProvTracer – The sub-provenance tracer containing sub-provenance information from the operation.

get_sub_prov_tracers()[source]#

Return all sub-provenance tracers of the provenance tracer.

Note

This will not return sub-provenance tracers that are not direct children of this tracer (i.e. that are deeper in the hierarchy).

Return type

List[ProvTracer]

Returns

List[ProvTracer] – All sub-provenance tracers of this provenance tracer.
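
The bookkeeping behind `add_prov()`, `has_prov()`, `get_prov()` and `get_provs()` can be approximated with the sketch below. It is a simplified, hypothetical model (`SketchProv`, `SketchProvTracer`): data items and operations are reduced to strings, sub-provenance tracers are left out, and creating a stub entry with no operation for a previously unseen source item is an assumption of this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchProv:
    """Hypothetical provenance record for one data item."""

    data_item_id: str
    op_name: Optional[str]
    source_ids: List[str] = field(default_factory=list)
    derived_ids: List[str] = field(default_factory=list)


class SketchProvTracer:
    def __init__(self) -> None:
        self._provs: Dict[str, SketchProv] = {}

    def add_prov(self, data_item_id: str, op_name: str, source_ids: List[str]) -> None:
        for source_id in source_ids:
            # Assumption: unknown sources get a stub record with no operation.
            source = self._provs.setdefault(source_id, SketchProv(source_id, None))
            # Record the reverse link: the new item is derived from this source.
            source.derived_ids.append(data_item_id)
        prov = self._provs.setdefault(data_item_id, SketchProv(data_item_id, None))
        prov.op_name = op_name
        prov.source_ids = list(source_ids)

    def has_prov(self, data_item_id: str) -> bool:
        return data_item_id in self._provs

    def get_prov(self, data_item_id: str) -> SketchProv:
        return self._provs[data_item_id]

    def get_provs(self) -> List[SketchProv]:
        return list(self._provs.values())


# The same tracer is shared by every (hypothetical) operation in the flow.
tracer = SketchProvTracer()
tracer.add_prov("sentence-1", "SentenceTokenizer", ["raw-1"])
tracer.add_prov("entity-1", "EntityMatcher", ["sentence-1"])
sentence_prov = tracer.get_prov("sentence-1")
```

After these two calls, `sentence_prov` links back to `raw-1` as its source and forward to `entity-1` as an item derived from it, which is the two-way view that `Prov` exposes.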

class Prov(data_item, op_desc, source_data_items, derived_data_items)[source]#

Provenance information for a specific data item.

Parameters
  • data_item (medkit.core.data_item.IdentifiableDataItem) – Data item that was created (for instance an annotation or an attribute).

  • op_desc (Optional[medkit.core.operation_desc.OperationDescription]) – Description of the operation that created the data item.

  • source_data_items (List[medkit.core.data_item.IdentifiableDataItem]) – Data items that were used by the operation to create the data item.

  • derived_data_items (List[medkit.core.data_item.IdentifiableDataItem]) – Data items that were created by other operations using this data item.

class Store(*args, **kwds)[source]#

Store protocol

class GlobalStore[source]#

Global store

Methods:

del_store()

Delete the global store object

get_store()

Returns the global store object

init_store(store)

Initialize the global store for your application

classmethod init_store(store)[source]#

Initialize the global store for your application

Parameters

store (Store) – Store for all the data items

Raises

RuntimeError – If global store is already set

classmethod get_store()[source]#

Returns the global store object

Return type

Store

Returns

Store – the global store

classmethod del_store()[source]#

Delete the global store object
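
The lifecycle of the global store (initialize once, read, delete) can be sketched as follows. `SketchGlobalStore` is a hypothetical stand-in, with a plain dict playing the role of a store; creating the default dict lazily inside `get_store()` is an assumption of this example.

```python
from typing import Any, Dict, Optional


class SketchGlobalStore:
    """Hypothetical class-level holder for a single shared store."""

    _store: Optional[Dict[str, Any]] = None

    @classmethod
    def init_store(cls, store: Dict[str, Any]) -> None:
        # The store can only be set once, as in the documented init_store().
        if cls._store is not None:
            raise RuntimeError("Global store is already set")
        cls._store = store

    @classmethod
    def get_store(cls) -> Dict[str, Any]:
        if cls._store is None:
            # Assumption: fall back to a default dict store on first access.
            cls._store = {}
        return cls._store

    @classmethod
    def del_store(cls) -> None:
        cls._store = None


SketchGlobalStore.init_store({"a1": "first annotation"})
try:
    SketchGlobalStore.init_store({})
    second_init_failed = False
except RuntimeError:
    second_init_failed = True

SketchGlobalStore.del_store()
default_store = SketchGlobalStore.get_store()
```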

Subpackages / Submodules#

medkit.core.annotation

medkit.core.annotation_container

medkit.core.attribute

medkit.core.attribute_container

medkit.core.audio

medkit.core.collection

medkit.core.conversion

medkit.core.data_item

medkit.core.dict_conv

medkit.core.doc_pipeline

medkit.core.document

medkit.core.id

medkit.core.operation

medkit.core.operation_desc

medkit.core.pipeline

medkit.core.prov_store

medkit.core.prov_tracer

medkit.core.store

medkit.core.text

medkit.core.utils