Text Components#

This page presents the core text concepts of medkit.

For more details about public APIs, please refer to medkit.core.text.

Data Structures#

The TextDocument class implements the Document protocol. It stores instances of TextAnnotation subclasses; TextAnnotation itself implements the Annotation protocol.

classDiagram
    direction TB
    class Document~Annotation~{
        <<protocol>>
    }
    class Annotation{
        <<protocol>>
    }
    class TextDocument{
        uid: str
        anns: TextAnnotationContainer
    }
    class TextAnnotation{
        <<abstract>>
        uid: str
        label: str
        attrs: AttributeContainer
    }
    Document <|.. TextDocument: implements
    Annotation <|.. TextAnnotation: implements
    TextDocument *-- TextAnnotation: contains \n(TextAnnotationContainer)
    

Text document and text annotation#

Document#

TextDocument relies on TextAnnotationContainer to manage the annotations.

Given a text document named doc, one can:

  • browse segments, entities, and relations:

for entity in doc.anns.entities:
    ...

for segment in doc.anns.segments:
    ...

for relation in doc.anns.relations:
    ...

  • get and filter segments, entities, and relations:

sentences_segments = doc.anns.get_segments(label="sentences")
disorder_entities = doc.anns.get_entities(label="disorder")

entity = ...
relations = doc.anns.get_relations(label="before", source_id=entity.uid)

For more details on common interfaces provided by core components, please refer to Document.
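To make the label and source_id filters shown above concrete, here is a minimal sketch of how a container could answer such queries. It is not medkit's implementation: ToyEntity, ToyRelation and ToyContainer are hypothetical stand-ins written with the standard library only, and only the idea of filtering by label and by source identifier is taken from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import uuid


def _new_uid() -> str:
    # Each annotation gets a unique identifier, as suggested by the "uid" field.
    return str(uuid.uuid4())


@dataclass
class ToyEntity:
    label: str
    text: str
    uid: str = field(default_factory=_new_uid)


@dataclass
class ToyRelation:
    label: str
    source_id: str
    target_id: str
    uid: str = field(default_factory=_new_uid)


class ToyContainer:
    """Hypothetical stand-in for an annotation container (not medkit code)."""

    def __init__(self) -> None:
        self.entities: List[ToyEntity] = []
        self.relations: List[ToyRelation] = []

    def add(self, ann) -> None:
        # Route each annotation to the list matching its type.
        if isinstance(ann, ToyRelation):
            self.relations.append(ann)
        else:
            self.entities.append(ann)

    def get_entities(self, label: Optional[str] = None) -> List[ToyEntity]:
        # A filter left to None matches everything.
        return [e for e in self.entities if label is None or e.label == label]

    def get_relations(
        self, label: Optional[str] = None, source_id: Optional[str] = None
    ) -> List[ToyRelation]:
        return [
            r
            for r in self.relations
            if (label is None or r.label == label)
            and (source_id is None or r.source_id == source_id)
        ]


anns = ToyContainer()
surgery = ToyEntity(label="procedure", text="surgery")
fever = ToyEntity(label="disorder", text="fever")
anns.add(surgery)
anns.add(fever)
anns.add(ToyRelation(label="before", source_id=surgery.uid, target_id=fever.uid))

disorders = anns.get_entities(label="disorder")
relations = anns.get_relations(label="before", source_id=surgery.uid)
print([e.text for e in disorders])  # ['fever']
print(len(relations))  # 1
```

Keeping every filter optional lets the same method serve both "browse everything" and "narrow down" use cases.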

Annotations#

For the text modality, a TextDocument can only contain TextAnnotation instances.

Three subclasses are defined: Segment, Entity, and Relation.

classDiagram
    direction TB
    class Annotation{
        <<protocol>>
    }
    class TextAnnotation{
        <<abstract>>
    }
    Annotation <|.. TextAnnotation: implements
    TextAnnotation <|-- Segment
    TextAnnotation <|-- Relation
    Segment <|-- Entity
    

Text annotation hierarchy#

Note

Each text annotation class inherits from the common interfaces provided by the core component (cf. Annotation).

For more details about public APIs, please refer to medkit.core.text.annotation.
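As a rough illustration of the hierarchy above, the sketch below mirrors its shape with standard-library dataclasses. All class names (ToyTextAnnotation, ToySegment, ToyEntity, ToyRelation), their fields, and the snippet() helper are assumptions made for this example; they are not medkit's classes. The point is only that an entity is a segment, so code written for segments also accepts entities, whereas a relation links two annotations by identifier and carries no text of its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyTextAnnotation:
    label: str


@dataclass
class ToySegment(ToyTextAnnotation):
    text: str
    spans: List[Tuple[int, int]]  # (start, end) character offsets, end exclusive


@dataclass
class ToyEntity(ToySegment):
    """An entity is a segment that denotes a concept of interest."""


@dataclass
class ToyRelation(ToyTextAnnotation):
    source_id: str
    target_id: str


def snippet(document_text: str, segment: ToySegment) -> str:
    """Rebuild the text covered by a segment from its offsets."""
    return "".join(document_text[start:end] for start, end in segment.spans)


document_text = "The patient has asthma."
sentence = ToySegment(label="sentence", text=document_text, spans=[(0, 23)])
asthma = ToyEntity(label="disorder", text="asthma", spans=[(16, 22)])

# The helper is typed for segments but works unchanged for entities.
print(snippet(document_text, sentence))  # The patient has asthma.
print(snippet(document_text, asthma))  # asthma
print(isinstance(asthma, ToySegment))  # True
```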

Attributes#

Text annotations can receive attributes, which will be instances of the core Attribute class.

Among these attributes, medkit.core.text provides EntityNormAttribute for normalization results. It gives normalization information a common structure, independent of the operation that produced it.
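The benefit of a common structure can be sketched as follows. ToyNormAttribute, its fields (kb_name, kb_id, term, score), the "toy_kb" knowledge base and its identifiers, and both normalizer functions are hypothetical and chosen for illustration; they are not claimed to match EntityNormAttribute. The sketch shows two different producers emitting the same attribute shape, so that downstream code does not need to know which one ran.

```python
import difflib
from dataclasses import dataclass
from typing import Optional

# Made-up knowledge base: term -> identifier (not real codes).
TOY_KB = {"asthma": "T001", "diabetes": "T002"}


@dataclass(frozen=True)
class ToyNormAttribute:
    kb_name: str
    kb_id: str
    term: Optional[str] = None
    score: Optional[float] = None
    label: str = "NORMALIZATION"

    @property
    def value(self) -> str:
        # One compact representation shared by every producer.
        return f"{self.kb_name}:{self.kb_id}"


def normalize_exact(text: str) -> Optional[ToyNormAttribute]:
    """Producer 1: exact dictionary lookup, no confidence score."""
    kb_id = TOY_KB.get(text.lower())
    if kb_id is None:
        return None
    return ToyNormAttribute(kb_name="toy_kb", kb_id=kb_id, term=text.lower())


def normalize_fuzzy(text: str) -> Optional[ToyNormAttribute]:
    """Producer 2: approximate string matching, with a similarity score."""
    matches = difflib.get_close_matches(text.lower(), list(TOY_KB), n=1, cutoff=0.6)
    if not matches:
        return None
    term = matches[0]
    score = difflib.SequenceMatcher(None, text.lower(), term).ratio()
    return ToyNormAttribute(kb_name="toy_kb", kb_id=TOY_KB[term], term=term, score=score)


exact = normalize_exact("Asthma")
fuzzy = normalize_fuzzy("astma")  # misspelled on purpose

# Downstream code reads both results the same way.
for attr in (exact, fuzzy):
    print(attr.label, attr.value)
```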

Spans#

medkit relies on the concept of spans to track all text modifications made by the different operations.

medkit also provides a set of utilities for manipulating these spans when implementing new operations.

For more details about public APIs, please refer to medkit.core.text.span and medkit.core.text.span_utils.
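The idea of tracking modifications with spans can be illustrated with a small standard-library sketch. ToySpan, ToyModifiedSpan, replace_once() and to_original() are hypothetical names written for this example and are not the medkit.core.text.span or span_utils API. The sketch assumes one simple scheme: unmodified text is described by offsets into the original, and replaced text is described by its new length plus the original offsets it stands for.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class ToySpan:
    start: int  # offset in the ORIGINAL text
    end: int  # exclusive


@dataclass(frozen=True)
class ToyModifiedSpan:
    length: int  # length of the replacement text
    replaced: Tuple[ToySpan, ...]  # original spans the replacement stands for


AnySpan = Union[ToySpan, ToyModifiedSpan]


def replace_once(text: str, old: str, new: str) -> Tuple[str, List[AnySpan]]:
    """Replace the first occurrence of `old` and describe the result with spans."""
    start = text.index(old)
    end = start + len(old)
    spans: List[AnySpan] = []
    if start > 0:
        spans.append(ToySpan(0, start))
    spans.append(ToyModifiedSpan(len(new), (ToySpan(start, end),)))
    if end < len(text):
        spans.append(ToySpan(end, len(text)))
    return text[:start] + new + text[end:], spans


def to_original(spans: List[AnySpan], start: int, end: int) -> List[ToySpan]:
    """Map the range [start, end) of the modified text back to original spans."""
    result: List[ToySpan] = []
    cursor = 0  # current position in the modified text
    for span in spans:
        is_modified = isinstance(span, ToyModifiedSpan)
        length = span.length if is_modified else span.end - span.start
        seg_start, seg_end = cursor, cursor + length
        cursor = seg_end
        lo, hi = max(start, seg_start), min(end, seg_end)
        if lo >= hi:
            continue  # no overlap with this piece
        if is_modified:
            # Any overlap with replaced text maps to the whole replaced region.
            result.extend(span.replaced)
        else:
            offset = span.start - seg_start
            result.append(ToySpan(lo + offset, hi + offset))
    return result


original = "Pt has HTN today"
cleaned, spans = replace_once(original, "HTN", "hypertension")
print(cleaned)  # Pt has hypertension today

# Locate a match in the cleaned text, then recover where it came from.
match_start = cleaned.index("hypertension")
match_end = match_start + len("hypertension")
for s in to_original(spans, match_start, match_end):
    print(original[s.start:s.end])  # HTN
```

This sketch handles a single replacement and does not merge adjacent spans; it only aims to show why an annotation found in cleaned text can still point at the original characters.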

See also

You may also take a look at the spans examples.

Text Utilities#

These utilities provide preconfigured patterns for preprocessing text documents non-destructively. They are not designed to be used directly, but rather inside a cleaning operation.

For more details about public APIs, please refer to medkit.core.text.utils.

See also

medkit provides an EDSCleaner class, which combines all these utilities to clean French documents (EDS documents extracted from PDF files).

Operations#

Abstract subclasses of Operation have been defined for text. They ease the development of text operations by grouping them according to the kind of run() method they implement.

classDiagram
    Operation <|-- ContextOperation
    Operation <|-- DocOperation
    Operation <|-- NEROperation
    Operation <|-- SegmentationOperation
    Operation <|-- _CustomTextOperation
    

Operation hierarchy#

The internal class _CustomTextOperation exists so that users can call create_text_operation() to instantiate custom text operations more easily.

For more details about public APIs, please refer to medkit.core.text.operation.
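The general pattern described above, an abstract base class with a run() method plus a factory that wraps a plain function into an operation, can be sketched as follows. This is only an illustration under stated assumptions: ToyOperation, _ToyFunctionOperation and make_toy_operation() are hypothetical names, segments are reduced to plain strings, and the real signature of create_text_operation() is not reproduced here.

```python
import abc
from typing import Callable, List, Optional


class ToyOperation(abc.ABC):
    """Hypothetical base class: every operation exposes a run() method."""

    @abc.abstractmethod
    def run(self, segments: List[str]) -> List[str]:
        ...


class _ToyFunctionOperation(ToyOperation):
    """Internal wrapper turning a per-segment function into an operation."""

    def __init__(self, function: Callable[[str], List[str]], name: str) -> None:
        self.function = function
        self.name = name

    def run(self, segments: List[str]) -> List[str]:
        outputs: List[str] = []
        for segment in segments:
            # One input segment may yield zero, one, or several outputs.
            outputs.extend(self.function(segment))
        return outputs


def make_toy_operation(
    function: Callable[[str], List[str]], name: Optional[str] = None
) -> ToyOperation:
    """Factory so that users never instantiate the internal class directly."""
    return _ToyFunctionOperation(function, name or function.__name__)


def split_on_semicolon(text: str) -> List[str]:
    return [part.strip() for part in text.split(";") if part.strip()]


splitter = make_toy_operation(split_on_semicolon)
result = splitter.run(["fever; cough", "no allergy"])
print(result)  # ['fever', 'cough', 'no allergy']
```

Hiding the wrapper class behind a factory keeps the user-facing surface to a single function call, which is the convenience the text attributes to create_text_operation().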

See also

Please refer to this example to see how custom operations are written.