Core text components#
This page presents the core text concepts of medkit.
Note
For more details about public APIs, refer to
medkit.core.text.
Document, Annotations & Attributes#
The TextDocument class implements the
Document protocol. It stores instances of subclasses of
TextAnnotation, which implements the
Annotation protocol.
Fig. 2 Text document and text annotation#
Document#
TextDocument relies on TextAnnotationContainer,
a subclass of AnnotationContainer, to manage its annotations.
Given a text document named doc, users can browse its segments, entities, and relations:
for entity in doc.anns.entities:
    ...
for segment in doc.anns.segments:
    ...
for relation in doc.anns.relations:
    ...
Users can also filter segments, entities, and relations:
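# Segments and entities can be filtered by label
sentences_segments = doc.get_segments(label="sentences")
disorder_entities = doc.get_entities(label="disorder")
# Relations can be filtered by label and by source annotation
entity = disorder_entities[0]  # or any other entity of interest
relations = doc.get_relations(label="before", source_id=entity.uid)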
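To picture what such a container does, here is a minimal stand-in. It is not medkit code: `ToyAnnotation` and `ToyContainer` are hypothetical names, and the sketch only assumes that each annotation has a `label` and a `uid`, and that the container keeps a per-label index so that filtering does not scan every annotation.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import count

_uid_counter = count(1)  # stand-in for real unique identifiers


@dataclass
class ToyAnnotation:
    """Hypothetical stand-in for a text annotation (not a medkit class)."""
    label: str
    text: str
    uid: str = field(default_factory=lambda: f"ann-{next(_uid_counter)}")


class ToyContainer:
    """Hypothetical container that indexes annotations by label."""

    def __init__(self):
        self._by_uid = {}
        self._uids_by_label = defaultdict(list)

    def add(self, ann):
        if ann.uid in self._by_uid:
            raise ValueError(f"Annotation {ann.uid} already in container")
        self._by_uid[ann.uid] = ann
        self._uids_by_label[ann.label].append(ann.uid)

    def get(self, label=None):
        # Without a label, return everything in insertion order
        if label is None:
            return list(self._by_uid.values())
        return [self._by_uid[uid] for uid in self._uids_by_label.get(label, [])]


container = ToyContainer()
container.add(ToyAnnotation(label="disorder", text="asthma"))
container.add(ToyAnnotation(label="drug", text="salbutamol"))
container.add(ToyAnnotation(label="disorder", text="fever"))

disorders = container.get(label="disorder")
print([ann.text for ann in disorders])
```

Keeping the index inside the container means callers only ever ask for a label; how annotations are stored stays an internal detail.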
Note
For common interfaces provided by core components, you can refer to Document.
Annotations#
For the text modality, a TextDocument can only contain
TextAnnotation instances.
Note
For more details about public APIs, refer to medkit.core.text.annotation.
Three subclasses are defined:
Segment,
Entity and
Relation
Fig. 3 Text annotation hierarchy#
Note
Each text annotation class inherits from the common interfaces provided by the core component (cf. Annotation).
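The three kinds of annotation can be sketched with plain dataclasses. This is an illustration only: the `Toy*` classes are hypothetical, their fields are assumptions, and modelling an entity as a specialised segment is a choice made for this sketch rather than a statement about medkit's classes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyTextAnnotation:
    """Hypothetical common base: every annotation has a label and an id."""
    label: str
    uid: str


@dataclass
class ToySegment(ToyTextAnnotation):
    """A piece of text with its (start, end) offsets in the document."""
    text: str
    spans: List[Tuple[int, int]]


@dataclass
class ToyEntity(ToySegment):
    """Modelled here as a specialised segment (an assumption of this sketch)."""


@dataclass
class ToyRelation(ToyTextAnnotation):
    """A link between two annotations, referenced by their ids."""
    source_id: str
    target_id: str


text = "Asthma diagnosed before the fever."
asthma = ToyEntity(label="disorder", uid="e1", text="Asthma", spans=[(0, 6)])
start = text.index("fever")
fever = ToyEntity(label="disorder", uid="e2", text="fever", spans=[(start, start + 5)])
relation = ToyRelation(label="before", uid="r1", source_id=asthma.uid, target_id=fever.uid)

print(isinstance(asthma, ToySegment), isinstance(relation, ToySegment))
```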
Attributes#
Text annotations can carry attributes, which are instances of the core
Attribute class.
Among these, medkit.core.text provides
EntityNormAttribute for normalization attributes,
so that normalization information has a common structure
regardless of the operation that created it.
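The benefit of a shared structure can be illustrated with a small sketch. Everything here is hypothetical: `ToyNormAttribute`, its field names (`kb_name`, `kb_id`, `term`), the `"NORMALIZATION"` label and the `TOY:0001` identifier are assumptions made up for the example, not medkit's definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyNormAttribute:
    """Hypothetical normalization attribute (field names are assumptions)."""
    kb_name: str                 # knowledge base the identifier comes from
    kb_id: str                   # identifier of the concept in that knowledge base
    term: Optional[str] = None   # preferred term, when available
    label: str = "NORMALIZATION"


def normalize_with_dictionary(text):
    # Toy operation 1: exact lookup in a made-up dictionary
    lookup = {"asthma": ("toy_kb", "TOY:0001", "Asthma")}
    kb_name, kb_id, term = lookup[text.lower()]
    return ToyNormAttribute(kb_name=kb_name, kb_id=kb_id, term=term)


def normalize_with_prefix_rule(text):
    # Toy operation 2: a different strategy, same output structure
    if text.lower().startswith("asthm"):
        return ToyNormAttribute(kb_name="toy_kb", kb_id="TOY:0001")
    raise ValueError(f"No normalization found for {text!r}")


attrs = [normalize_with_dictionary("Asthma"), normalize_with_prefix_rule("asthmatic")]
# Downstream code reads the same fields whatever operation produced them
print({(attr.kb_name, attr.kb_id) for attr in attrs})
```

Because both toy operations return the same type, code that consumes normalization results does not need to know which operation ran.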
Spans#
medkit relies on the concept of spans to keep track of all text modifications made by the different operations.
Note
For more details about public APIs, refer to
medkit.core.text.span.
medkit also provides a set of utilities for manipulating these spans, which can be useful when implementing a new medkit operation.
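The idea of tracking modifications with spans can be sketched in a few lines. The code below is a simplified illustration, not medkit's span API: `ToySpan`, `ToyModifiedSpan`, `replace_once` and `to_original` are hypothetical names, and the sketch handles a single replacement in an otherwise unmodified text.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class ToySpan:
    """Slice of the original text, kept unchanged."""
    start: int
    end: int

    @property
    def length(self):
        return self.end - self.start


@dataclass(frozen=True)
class ToyModifiedSpan:
    """Text of a given length that replaced one slice of the original."""
    length: int
    replaced: ToySpan


AnySpan = Union[ToySpan, ToyModifiedSpan]


def replace_once(text: str, start: int, end: int, new: str) -> Tuple[str, List[AnySpan]]:
    """Replace text[start:end] in an unmodified text and describe the result."""
    new_text = text[:start] + new + text[end:]
    spans = [
        ToySpan(0, start),
        ToyModifiedSpan(length=len(new), replaced=ToySpan(start, end)),
        ToySpan(end, len(text)),
    ]
    # Drop empty unchanged pieces (replacement touching a text boundary)
    return new_text, [s for s in spans if isinstance(s, ToyModifiedSpan) or s.length > 0]


def to_original(spans: List[AnySpan], start: int, end: int) -> Tuple[int, int]:
    """Map the range [start, end) of the modified text to original offsets."""
    touched = []
    cursor = 0
    for span in spans:
        span_start, span_end = cursor, cursor + span.length
        cursor = span_end
        if span_end <= start or span_start >= end:
            continue  # no overlap with the requested range
        if isinstance(span, ToyModifiedSpan):
            # Offsets inside a replacement have no original counterpart:
            # fall back to the whole replaced slice
            touched.append(span.replaced)
        else:
            offset = span.start - span_start
            touched.append(ToySpan(max(start, span_start) + offset, min(end, span_end) + offset))
    if not touched:
        raise ValueError("Range does not overlap any span")
    return touched[0].start, touched[-1].end


original = "The pt. has asthma"
cleaned, spans = replace_once(original, 4, 7, "patient")

found = cleaned.index("asthma")
orig_start, orig_end = to_original(spans, found, found + len("asthma"))
print(cleaned, "|", original[orig_start:orig_end])
```

An annotation found in the cleaned text can thus still point at the right characters of the original document, which is what makes the preprocessing non-destructive.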
Note
For more details about public APIs, refer to medkit.core.text.span_utils.
See also
You may also take a look at the spans notebook example.
Text utilities#
These utilities provide preconfigured patterns for preprocessing text documents non-destructively. They are not meant to be called directly, but rather from within a cleaning operation.
Note
For more details about public APIs, refer to medkit.core.text.utils.
See also
medkit provides the EDSCleaner class, which combines all these utilities to clean French documents (related to EDS documents coming from PDF files).
Operations#
Abstract subclasses of Operation have been defined for text,
to ease the development of text operations according to the kind of processing their run method performs.
Fig. 4 Operation hierarchy#
Note
For more details about public APIs, refer to medkit.core.text.operation.
The internal class _CustomTextOperation has been implemented so that users can
call create_text_operation() to easily instantiate a custom
text operation.
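The pattern of a factory function that hides an internal operation class can be sketched as follows. This does not reproduce the signature of create_text_operation(): `ToyFilterOperation` and `make_toy_operation` are hypothetical, and the sketch assumes the simplest case of an operation that filters its input with a user-supplied function.

```python
from typing import Callable, List, Optional


class ToyFilterOperation:
    """Hypothetical operation keeping the items accepted by a user function."""

    def __init__(self, function: Callable[[str], bool], name: Optional[str] = None):
        self.function = function
        # Default to the function's own name so the operation stays identifiable
        self.name = name or function.__name__

    def run(self, items: List[str]) -> List[str]:
        return [item for item in items if self.function(item)]


def make_toy_operation(function, name=None):
    """Hypothetical factory hiding the operation class from the caller."""
    return ToyFilterOperation(function, name=name)


def mentions_asthma(sentence: str) -> bool:
    return "asthma" in sentence.lower()


operation = make_toy_operation(mentions_asthma)
kept = operation.run(["Asthma since childhood.", "No known allergies."])
print(operation.name, kept)
```

The caller only writes an ordinary function; the boilerplate of defining a class with a run method is handled by the factory.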
See also
You may refer to this tutorial for an example of how to define a custom operation.