Core components#
This page presents the core concepts of medkit.
Note
For more details about public APIs, refer to medkit.core.
Documents, Annotations & Attributes#
Medkit document classes are used to:
access raw data,
store relevant annotations extracted from the raw data.
The Document and Annotation protocols
are defined inside medkit.core. They define common properties and
methods across all modalities. These protocols are then implemented for each
modality (text, audio, image, etc), with additional logic specific to the
modality.
To facilitate the implementation of the Document protocol,
an AnnotationContainer class is provided. It behaves like a list of
annotations, with additional filtering methods and support for non-memory
storage.
medkit.core also defines the Attribute class, which
can be used directly to attach attributes to annotations of any modality,
along with an AttributeContainer class.
Similarly to AnnotationContainer, the role of this container is to
provide additional methods that facilitate access to the list of attributes
belonging to an annotation.
Fig. 1 Core protocols and classes#
Currently, medkit.core.text implements a
TextDocument class and a corresponding set of
TextAnnotation subclasses, and similarly
medkit.core.audio provides an AudioDocument class
and a corresponding Segment.
Both modalities also subclass AnnotationContainer to add some
modality-specific logic or filtering.
For more details about each modality, refer to the documentation of medkit.core.text and medkit.core.audio.
Document#
The Document protocol provides the minimal data structure
for a medkit document.
For example, each document (whatever the modality) is linked to an annotation
container for the same modality.
The AnnotationContainer class provides a set of methods (e.g., add/get)
to be implemented for each modality.
The goal is to give users a minimal set of common interfaces for accessing document annotations, whatever the modality.
Given a document named doc from any modality:
Users can browse the document annotations:
for ann in doc.anns:
    ...
Users can add a new annotation to the document:
ann = <my annotation>
doc.anns.add(ann)
Users can get the document annotations filtered by label:
anns = doc.anns.get(label="disorder")
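The three access patterns above can be tried without medkit installed by using a small stand-in. In the sketch below, ToyAnnotation, ToyAnnotationContainer and ToyDocument are hypothetical classes written for illustration only; they mimic the iteration, add and get(label=...) calls shown above and are not medkit's implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class ToyAnnotation:
    """Hypothetical stand-in for a medkit annotation (illustration only)."""

    label: str
    text: str


class ToyAnnotationContainer:
    """Hypothetical list-like container mimicking doc.anns."""

    def __init__(self) -> None:
        self._anns: List[ToyAnnotation] = []

    def __iter__(self) -> Iterator[ToyAnnotation]:
        return iter(self._anns)

    def add(self, ann: ToyAnnotation) -> None:
        self._anns.append(ann)

    def get(self, label: Optional[str] = None) -> List[ToyAnnotation]:
        # Without a label, return every annotation; otherwise filter by label
        if label is None:
            return list(self._anns)
        return [ann for ann in self._anns if ann.label == label]


@dataclass
class ToyDocument:
    """Hypothetical document that owns one annotation container."""

    anns: ToyAnnotationContainer = field(default_factory=ToyAnnotationContainer)


doc = ToyDocument()
doc.anns.add(ToyAnnotation(label="disorder", text="asthma"))
doc.anns.add(ToyAnnotation(label="drug", text="salbutamol"))

all_labels = [ann.label for ann in doc.anns]  # browse
disorders = doc.anns.get(label="disorder")    # filter by label
```

Because the calling code only relies on iteration, add and get, the same loop would work unchanged for any modality whose container offers these methods.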
Note
For more details about their implementation, refer to
medkit.core.document.Document and
medkit.core.annotation_container.AnnotationContainer.
Annotation & Attribute#
The Annotation protocol provides the minimal
data structure for a medkit annotation.
For example, each annotation is linked to an attribute container.
The AttributeContainer class provides a set of common interfaces for
accessing the attributes of an annotation, whatever the modality.
Given an annotation ann from any modality:
Users can browse the annotation attributes:
for attr in ann.attrs:
    ...
Users can add a new attribute to an annotation:
attr = <my attribute>
ann.attrs.add(attr)
Users can get the annotation attributes filtered by label:
attrs = ann.attrs.get(label="NORMALIZATION")
Collection#
The Collection class makes it possible to manipulate a set of Document objects.
Warning
This work is still under development. It may be changed in the future.
Operations#
The Operation abstract class groups all the methods an operation needs in order to
be compatible with medkit's processing pipelines and provenance tracking.
We have defined different subclasses depending on the nature of the operation,
including text-specific and audio-specific operations in medkit.core.text
and medkit.core.audio.
For more details about the operations of each modality, refer to the documentation of medkit.core.text and medkit.core.audio.
All operations inheriting from the Operation abstract class
must add the following lines to their __init__ method:
def __init__(self, ..., uid=None):
    ...
    # Pass all arguments to super (remove self)
    init_args = locals()
    init_args.pop("self")
    super().__init__(**init_args)
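To see what these lines do, the sketch below uses a hypothetical ToyOperation base class (not medkit's Operation) that simply records the keyword arguments it receives. Calling locals() before any other local variable is defined captures the constructor parameters, so a subclass can forward all of them to the base class without listing them one by one. How medkit itself uses the forwarded arguments is not shown here.

```python
class ToyOperation:
    """Hypothetical base class that records init arguments (illustration only)."""

    def __init__(self, uid=None, **kwargs):
        # locals() in a method that calls super() may also expose the implicit
        # __class__ cell; it is not a constructor argument, so discard it
        kwargs.pop("__class__", None)
        self.uid = uid
        self.init_args = kwargs


class ToyTokenizer(ToyOperation):
    """Hypothetical operation using the pattern described above."""

    def __init__(self, separator=" ", lowercase=False, uid=None):
        # Capture the arguments first: any local variable created before this
        # call would be forwarded to the base class as well
        init_args = locals()
        init_args.pop("self")
        super().__init__(**init_args)

        self.separator = separator
        self.lowercase = lowercase


op = ToyTokenizer(separator="-", uid="tok-1")
recorded = op.init_args
```

In this sketch the base class ends up knowing the full configuration of the subclass instance, which is the kind of information an operation description needs.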
Each operation is described by an OperationDescription.
Converters#
Two abstract classes are defined to manage the conversion of documents between the medkit format and other formats.
Note
For more details about the public APIs, refer to medkit.core.conversion.
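As an illustration of the idea of a pair of abstract converter classes, here is a minimal sketch with one base class per direction. The names ToyInputConverter, ToyOutputConverter, load and save, as well as the JSON Lines format and the dict-based document, are assumptions chosen for this example; refer to medkit.core.conversion for the actual classes and signatures.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List

# In this sketch a "document" is a plain dict, not a medkit Document
Doc = Dict[str, str]


class ToyInputConverter(ABC):
    """Hypothetical base class: external format -> in-memory documents."""

    @abstractmethod
    def load(self, raw: str) -> List[Doc]:
        ...


class ToyOutputConverter(ABC):
    """Hypothetical base class: in-memory documents -> external format."""

    @abstractmethod
    def save(self, docs: List[Doc]) -> str:
        ...


class JsonLinesInput(ToyInputConverter):
    def load(self, raw: str) -> List[Doc]:
        # One JSON object per non-empty line
        return [json.loads(line) for line in raw.splitlines() if line.strip()]


class JsonLinesOutput(ToyOutputConverter):
    def save(self, docs: List[Doc]) -> str:
        return "\n".join(json.dumps(doc) for doc in docs)


raw = '{"text": "Fever"}\n{"text": "Cough"}'
docs = JsonLinesInput().load(raw)
round_trip = JsonLinesOutput().save(docs)
```

Keeping the two directions in separate classes lets a format be supported for reading only, for writing only, or for both.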
Pipeline#
The Pipeline class makes it possible to chain several operations.
To better understand how to declare and use medkit pipelines, you may refer to the pipeline tutorial.
Note
For more details about the public APIs, refer to medkit.core.pipeline.
The DocPipeline class is a wrapper that makes it possible
to run an annotation pipeline on a list of documents by automatically attaching
the output annotations to these documents.
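The two ideas above (chaining operations, then attaching the outputs to each document) can be sketched with plain functions. Everything in this example is hypothetical: run_chain, run_on_docs, the two toy operations and the dict-based documents are stand-ins, and the sketch does not reproduce how medkit's Pipeline or DocPipeline are declared. The pipeline tutorial remains the reference for the real API.

```python
from typing import Callable, Dict, List

# In this sketch an "operation" is any callable mapping a list of strings
# to a new list of strings
Step = Callable[[List[str]], List[str]]


def run_chain(steps: List[Step], items: List[str]) -> List[str]:
    """Feed the output of each step into the next one."""
    for step in steps:
        items = step(items)
    return items


def run_on_docs(steps: List[Step], docs: List[Dict]) -> None:
    """Run the chain on each document and attach the outputs to it."""
    for doc in docs:
        doc["anns"].extend(run_chain(steps, [doc["text"]]))


def split_sentences(items: List[str]) -> List[str]:
    return [s.strip() for item in items for s in item.split(".") if s.strip()]


def uppercase(items: List[str]) -> List[str]:
    return [item.upper() for item in items]


docs = [
    {"text": "Fever. Cough.", "anns": []},
    {"text": "No symptoms.", "anns": []},
]
run_on_docs([split_sentences, uppercase], docs)
```

The wrapper relieves the caller from extracting the input of each document and from storing the results, which is the convenience described above.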
Store#
A store is an object responsible for keeping the annotations of a document
(through an AnnotationContainer) or the attributes of an
annotation (through an AttributeContainer).
The Store protocol defines the methods that a store
must implement. For now, we only provide a single implementation of this protocol,
based on a dictionary, but in the future we will probably provide other
implementations relying on databases.
Users can also implement their own store based on their needs.
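A minimal sketch of what a custom store could look like is given below. The protocol name ToyStore and its methods put, get and remove are assumptions made for this example; the methods actually required are those of the Store protocol in medkit.core.store.

```python
from typing import Any, Dict, Optional, Protocol, runtime_checkable


@runtime_checkable
class ToyStore(Protocol):
    """Hypothetical store protocol; the method names are assumptions."""

    def put(self, uid: str, item: Any) -> None: ...

    def get(self, uid: str) -> Optional[Any]: ...

    def remove(self, uid: str) -> None: ...


class ToyDictStore:
    """Dictionary-backed store keeping every item in memory."""

    def __init__(self) -> None:
        self._items: Dict[str, Any] = {}

    def put(self, uid: str, item: Any) -> None:
        self._items[uid] = item

    def get(self, uid: str) -> Optional[Any]:
        # Return None rather than raising when the identifier is unknown
        return self._items.get(uid)

    def remove(self, uid: str) -> None:
        self._items.pop(uid, None)


store = ToyDictStore()
store.put("ann-1", {"label": "disorder"})
found = store.get("ann-1")
```

Since ToyStore is a protocol, ToyDictStore does not need to inherit from it: any class offering the same methods, for instance one backed by a database, would be accepted in its place.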
Warning
This work is still under development. It may be changed in the future.
Note
For more details about the public APIs, refer to medkit.core.store.
Global store#
To store all data items in the same location, your application uses a global store. If you have not set your own store, the global store automatically uses the simple internal dict store.
If you implement your own store, we suggest calling
medkit.core.store.GlobalStore.init_store() before initializing any other
medkit component.
GlobalStore provides initialization, access and removal methods
for the global store.
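The sketch below illustrates why initializing the store early matters. ToyGlobalStore is a hypothetical class: the lazily created default store and the error raised on late initialization are assumptions made for this example, not a description of medkit's GlobalStore.

```python
from typing import Any, Dict, Optional


class ToyGlobalStore:
    """Hypothetical process-wide store holder (illustration only)."""

    _store: Optional[Dict[str, Any]] = None

    @classmethod
    def init_store(cls, store: Dict[str, Any]) -> None:
        # Refuse to swap stores once one is in use, to avoid losing items
        if cls._store is not None:
            raise RuntimeError("The global store is already initialized")
        cls._store = store

    @classmethod
    def get_store(cls) -> Dict[str, Any]:
        # Fall back to a plain dict when no store was set explicitly
        if cls._store is None:
            cls._store = {}
        return cls._store

    @classmethod
    def del_store(cls) -> None:
        cls._store = None


custom_store: Dict[str, Any] = {}
ToyGlobalStore.init_store(custom_store)         # before any other component
ToyGlobalStore.get_store()["ann-1"] = "disorder"
```

Under these assumptions, any component that accesses the store before init_store() is called would silently create the default store, and the custom store could no longer be installed.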
Provenance#
Warning
This work is still under development. It may be changed in the future.
Provenance is a medkit concept that makes it possible to track all operations and their role in the extraction of new knowledge.
With this mechanism, we will be able to provide provenance information about generated data. To log this information, a separate provenance store is used.
To better understand this concept, you may follow the provenance tutorial and/or refer to “how to make your own module” to learn what you have to do to enable provenance.
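To make the idea concrete, here is a minimal sketch of provenance tracking in which each generated item records the operation that produced it and the items it was derived from. ToyProv, ToyProvTracer, record and lineage are hypothetical names introduced for this example, and the operation and item identifiers are made up; the actual API is the one documented in medkit.core.prov_tracer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ToyProv:
    """Hypothetical provenance record for one generated data item."""

    item_id: str
    operation: str
    source_ids: Tuple[str, ...]


class ToyProvTracer:
    """Hypothetical tracer keeping its records apart from the data items."""

    def __init__(self) -> None:
        self._records: Dict[str, ToyProv] = {}

    def record(self, item_id: str, operation: str, source_ids: List[str]) -> None:
        self._records[item_id] = ToyProv(item_id, operation, tuple(source_ids))

    def lineage(self, item_id: str) -> List[str]:
        """Return the operations that led to item_id, oldest first."""
        prov = self._records.get(item_id)
        if prov is None:
            # Raw input: no operation produced it
            return []
        ops: List[str] = []
        for source_id in prov.source_ids:
            for op in self.lineage(source_id):
                if op not in ops:
                    ops.append(op)
        return ops + [prov.operation]


tracer = ToyProvTracer()
tracer.record("sentence-1", "sentence_tokenizer", ["raw-text"])
tracer.record("entity-1", "entity_matcher", ["sentence-1"])
tracer.record("entity-1-norm", "normalizer", ["entity-1"])
history = tracer.lineage("entity-1-norm")
```

Walking the records backwards from a generated item recovers the chain of operations behind it, which is the kind of question provenance is meant to answer.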
Note
For more details about the public APIs, refer to medkit.core.prov_tracer.