medkit.core.text#

APIs#

To access these APIs, import them as follows:

from medkit.core.text import <api_to_import>

Classes:

`ContextOperation`([uid, name])	Abstract operation for context detection.
`CustomTextOpType`(value)	Enum class listing all supported function types for creating custom text operations
`Entity`(label, text, spans[, attrs, ...])	Text entity referencing part of a `TextDocument`.
`EntityAttributeContainer`(ann_id)	Manage a list of attributes attached to a text entity.
`EntityNormAttribute`(kb_name, kb_id[, ...])	Normalization attribute linking an entity to an ID in a knowledge base
`ModifiedSpan`(length, replaced_spans)	Slice of text not present in the original text
`NEROperation`([uid, name])	Abstract operation for detecting entities.
`Relation`(label, source_id, target_id[, ...])	Relation between two text entities.
`Segment`(label, text, spans[, attrs, ...])	Text segment referencing part of a `TextDocument`.
`SegmentationOperation`([uid, name])	Abstract operation for segmenting text.
`Span`(start, end)	Slice of text extracted from the original text
`TextAnnotation`(label[, attrs, metadata, ...])	Base abstract class for all text annotations
`TextAnnotationContainer`(doc_id, raw_segment)	Manage a list of text annotations belonging to a text document.
`TextDocument`(text[, anns, metadata, uid])	Document holding text annotations

Functions:

create_text_operation(function, function_type)

Function for instantiating a custom text operation from a user-defined function

class TextAnnotation(label, attrs=None, metadata=None, uid=None, attr_container_class=<class 'AttributeContainer'>)[source]#

Base abstract class for all text annotations

Variables

uid (str) – Unique identifier of the annotation.
label (str) – The label for this annotation (e.g., SENTENCE)
attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the annotation. Stored in an AttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – The metadata of the annotation
keys (Set[str]) – Pipeline output keys to which the annotation belongs.

class Segment(label, text, spans, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'AttributeContainer'>)[source]#

Text segment referencing part of a TextDocument.

Variables

uid (str) – The segment identifier.
label (str) – The label for this segment (e.g., SENTENCE)
text (str) – Text of the segment.
spans (List[medkit.core.text.span.AnySpan]) – List of spans indicating which parts of the segment text correspond to which part of the document’s full text.
attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the segment. Stored in an AttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – The metadata of the segment
keys (Set[str]) – Pipeline output keys to which the segment belongs.

Methods:

from_dict(segment_dict)

Creates a Segment from a dict

classmethod from_dict(segment_dict)[source]#

Creates a Segment from a dict

Parameters: segment_dict (dict) – A dictionary from a serialized segment as generated by to_dict()
Return type: Self

class Entity(label, text, spans, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'EntityAttributeContainer'>)[source]#

Text entity referencing part of a TextDocument.

Variables

uid (str) – The entity identifier.
label (str) – The label for this entity (e.g., DISEASE)
text (str) – Text of the entity.
spans (List[medkit.core.text.span.AnySpan]) – List of spans indicating which parts of the entity text correspond to which part of the document’s full text.
attrs (medkit.core.text.entity_attribute_container.EntityAttributeContainer) – Attributes of the entity. Stored in an EntityAttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – The metadata of the entity
keys (Set[str]) – Pipeline output keys to which the entity belongs.

class Relation(label, source_id, target_id, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'AttributeContainer'>)[source]#

Relation between two text entities.

Variables

uid (str) – The identifier of the relation
label (str) – The relation label
source_id (str) – The identifier of the entity from which the relation is defined
target_id (str) – The identifier of the entity to which the relation is defined
attrs (medkit.core.attribute_container.AttributeContainer) – The attributes of the relation
metadata (Dict[str, Any]) – The metadata of the relation
keys (Set[str]) – Pipeline output keys to which the relation belongs.

Methods:

from_dict(relation_dict)

Creates a Relation from a dict

classmethod from_dict(relation_dict)[source]#

Creates a Relation from a dict

Parameters: relation_dict (dict) – A dictionary from a serialized relation as generated by to_dict()
Return type: Self

class TextAnnotationContainer(doc_id, raw_segment)[source]#

Manage a list of text annotations belonging to a text document.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

Also provides retrieval of entities, segments, relations, and handling of raw segment.

Attributes:

`entities`	Return the list of entities
`relations`	Return the list of relations
`segments`	Return the list of segments

Methods:

`get_entities`(*[, label, key])	Return a list of the entities of the document, optionally filtering by label or key.
`get_relations`(*[, label, key, source_id])	Return a list of the relations of the document, optionally filtering by label, key or source entity.
`get_segments`(*[, label, key])	Return a list of the segments of the document (not including entities), optionally filtering by label or key.

property segments: List[medkit.core.text.annotation.Segment]#

Return the list of segments

Return type: List[Segment]

property entities: List[medkit.core.text.annotation.Entity]#

Return the list of entities

Return type: List[Entity]

property relations: List[medkit.core.text.annotation.Relation]#

Return the list of relations

Return type: List[Relation]

get_segments(*, label=None, key=None)[source]#

Return a list of the segments of the document (not including entities), optionally filtering by label or key.

Parameters

label (Optional[str]) – Label to use to filter segments.
key (Optional[str]) – Key to use to filter segments.

Return type

List[Segment]

get_entities(*, label=None, key=None)[source]#

Return a list of the entities of the document, optionally filtering by label or key.

Parameters

label (Optional[str]) – Label to use to filter entities.
key (Optional[str]) – Key to use to filter entities.

Return type

List[Entity]

get_relations(*, label=None, key=None, source_id=None)[source]#

Return a list of the relations of the document, optionally filtering by label, key or source entity.

Parameters

label (Optional[str]) – Label to use to filter relations.
key (Optional[str]) – Key to use to filter relations.
source_id (Optional[str]) – Identifier of the source entity to use to filter relations.

Return type

List[Relation]
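
The keyword-only filters above can be illustrated with a minimal standalone sketch. It does not use medkit: `SketchRelation` and `filter_relations` are hypothetical names, and the assumption that a filter left at None is ignored while the others must all match is ours, not confirmed by this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class SketchRelation:
    # Hypothetical stand-in for Relation, reduced to the fields used for filtering
    label: str
    source_id: str
    target_id: str
    keys: Set[str] = field(default_factory=set)


def filter_relations(
    relations: List[SketchRelation],
    *,
    label: Optional[str] = None,
    key: Optional[str] = None,
    source_id: Optional[str] = None,
) -> List[SketchRelation]:
    # Assumption: a filter left at None is ignored; the others must all match
    return [
        r
        for r in relations
        if (label is None or r.label == label)
        and (key is None or key in r.keys)
        and (source_id is None or r.source_id == source_id)
    ]


relations = [
    SketchRelation("treats", "e1", "e2", {"out"}),
    SketchRelation("causes", "e1", "e3"),
    SketchRelation("treats", "e4", "e2"),
]
treats_from_e1 = filter_relations(relations, label="treats", source_id="e1")
```

Calling `filter_relations(relations)` with no filter returns every relation in the list.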

class TextDocument(text, anns=None, metadata=None, uid=None)[source]#

Document holding text annotations

Annotations must be subclasses of TextAnnotation.

Variables

uid (str) – Unique identifier of the document.
text – Full document text.
anns (medkit.core.text.annotation_container.TextAnnotationContainer) – Annotations of the document. Stored in a TextAnnotationContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – Document metadata.
raw_segment (medkit.core.text.annotation.Segment) –
Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:
```
>>> from medkit.core.text import TextDocument
>>> doc = TextDocument(text="hello")
>>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]
```

Methods:

`from_dict`(doc_dict)	Creates a TextDocument from a dict
`get_snippet`(segment, max_extend_length)	Return a portion of the original text containing the annotation

classmethod from_dict(doc_dict)[source]#

Creates a TextDocument from a dict

Parameters: doc_dict (dict) – A dictionary from a serialized TextDocument as generated by to_dict()
Return type: Self

get_snippet(segment, max_extend_length)[source]#

Return a portion of the original text containing the annotation

Parameters

segment (Segment) – The annotation
max_extend_length (int) – Maximum number of characters to use around the annotation

Return type

str

Returns

str – A portion of the text around the annotation
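
One plausible reading of max_extend_length is sketched below on a plain string. This is a standalone illustration, not the library's implementation: `sketch_snippet` is a hypothetical helper, and the symmetric, clamped extension around a single character range is an assumption.

```python
def sketch_snippet(text: str, start: int, end: int, max_extend_length: int) -> str:
    # Widen the [start, end) range by max_extend_length on both sides,
    # clamped to the bounds of the text
    snippet_start = max(0, start - max_extend_length)
    snippet_end = min(len(text), end + max_extend_length)
    return text[snippet_start:snippet_end]


text = "The patient has asthma and takes salbutamol."
start = text.index("asthma")
end = start + len("asthma")
snippet = sketch_snippet(text, start, end, max_extend_length=4)
```

With a max_extend_length larger than the text, the sketch returns the whole text.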

class EntityAttributeContainer(ann_id)[source]#

Manage a list of attributes attached to a text entity.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

Also provides retrieval of normalization attributes.

Attributes:

norms

Return the list of normalization attributes

Methods:

get_norms()

Return a list of the normalization attributes of the annotation

property norms: List[medkit.core.text.entity_norm_attribute.EntityNormAttribute]#

Return the list of normalization attributes

Return type: List[EntityNormAttribute]

get_norms()[source]#

Return a list of the normalization attributes of the annotation

Return type: List[EntityNormAttribute]

class EntityNormAttribute(kb_name, kb_id, kb_version=None, term=None, score=None, metadata=None, uid=None)[source]#

Normalization attribute linking an entity to an ID in a knowledge base

Variables

uid (str) – Identifier of the attribute
label (str) – The attribute label, always set to EntityNormAttribute.LABEL
kb_name (Optional[str]) – Name of the knowledge base (e.g., “icd”). Should always be provided except in special cases when we just want to store a normalized term.
kb_id (Optional[Any]) – ID in the knowledge base to which the annotation should be linked. Should always be provided except in special cases when we just want to store a normalized term.
kb_version (Optional[str]) – Optional version of the knowledge base.
term (Optional[str]) – Optional normalized version of the entity text.
score (Optional[float]) – Optional score reflecting confidence of this link.
metadata (Dict[str, Any]) – Metadata of the attribute

Attributes:

LABEL

Label used for all normalization attributes

LABEL: ClassVar[str] = 'NORMALIZATION'#: Label used for all normalization attributes

class ContextOperation(uid=None, name=None, **kwargs)[source]#

Abstract operation for context detection. It uses a list of segments as input for running the operation and creates attributes that are directly appended to these segments.

Common initialization for all annotators:

assigning identifier to operation
storing class name, name and config in description

Parameters

uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
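
The pattern above can be tried in isolation with a stand-in base class. This is a minimal sketch: `SketchOperation` and `SketchNegationDetector` are hypothetical classes, not part of medkit, and the base class only mimics the documented signature (uid, name, **kwargs).

```python
from typing import Any, Optional


class SketchOperation:
    # Hypothetical stand-in for the abstract operation base classes
    def __init__(self, uid: Optional[str] = None, name: Optional[str] = None, **kwargs: Any):
        self.uid = uid
        # The name defaults to the class name
        self.name = name if name is not None else type(self).__name__
        # Remaining arguments describe the configuration of the operation
        self.config = kwargs


class SketchNegationDetector(SketchOperation):
    def __init__(self, window: int = 5, uid: Optional[str] = None, name: Optional[str] = None):
        # Capture the arguments before defining any other local variable
        init_args = locals()
        init_args.pop("self")
        # Zero-argument super() can make __class__ visible in locals(); drop it if present
        init_args.pop("__class__", None)
        super().__init__(**init_args)
        self.window = window


op = SketchNegationDetector(window=3)
```

Calling locals() as the first statement matters: any variable defined before it would also be forwarded to the parent initializer.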

class NEROperation(uid=None, name=None, **kwargs)[source]#

Abstract operation for detecting entities. It uses a list of segments as input and produces a list of detected entities.

Common initialization for all annotators:

assigning identifier to operation
storing class name, name and config in description

Parameters

uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)

class SegmentationOperation(uid=None, name=None, **kwargs)[source]#

Abstract operation for segmenting text. It uses a list of segments as input and produces a list of new segments.

Common initialization for all annotators:

assigning identifier to operation
storing class name, name and config in description

Parameters

uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)

class CustomTextOpType(value)[source]#

Enum class listing all supported function types for creating custom text operations

Variables

CREATE_ONE_TO_N – Takes 1 data item, returns N new data items
EXTRACT_ONE_TO_N – Takes 1 data item, returns N existing data items
FILTER – Takes 1 data item, returns True/False

create_text_operation(function, function_type, name=None)[source]#

Function for instantiating a custom text operation from a user-defined function

Parameters

function (Callable) – User-defined function
function_type (CustomTextOpType) – Type of function. Supported values are defined in CustomTextOpType
name (Optional[str]) – Name of the operation used for provenance info (default: function name)

Return type

_CustomTextOperation

Returns

operation – An instance of a custom text operation
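
How the three function types might shape the way a user-defined function is applied can be sketched without medkit. `SketchOpType` and `run_sketch` are hypothetical, and the behavior shown (FILTER keeps the items for which the function returns True, the one-to-N types concatenate the per-item results) is an assumption drawn from the descriptions of CustomTextOpType, not the library's code.

```python
from enum import Enum
from typing import Any, Callable, List


class SketchOpType(Enum):
    # Mirrors the three documented members; the values are arbitrary here
    CREATE_ONE_TO_N = "create_one_to_n"
    EXTRACT_ONE_TO_N = "extract_one_to_n"
    FILTER = "filter"


def run_sketch(function: Callable, function_type: SketchOpType, items: List[Any]) -> List[Any]:
    if function_type is SketchOpType.FILTER:
        # The function returns True/False for each item; keep the True ones
        return [item for item in items if function(item)]
    # For both one-to-N types, the function returns a list per item;
    # the per-item lists are concatenated
    results: List[Any] = []
    for item in items:
        results.extend(function(item))
    return results


sentences = ["no fever", "cough and asthma"]
words = run_sketch(str.split, SketchOpType.CREATE_ONE_TO_N, sentences)
long_words = run_sketch(lambda w: len(w) > 3, SketchOpType.FILTER, words)
```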

class Span(start, end)[source]#

Slice of text extracted from the original text

Parameters

start (int) – Index of the first character in the original text
end (int) – Index of the last character in the original text, plus one

Methods:

`from_dict`(span_dict)	Creates a Span from a dict
`overlaps`(other)	Test if 2 spans reference at least one character in common

overlaps(other)[source]#

Test if 2 spans reference at least one character in common
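
Because end is the index of the last character plus one, two spans share a character exactly when each one starts before the other ends. A minimal standalone sketch of that test follows; `SketchSpan` is a hypothetical stand-in, not the library class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSpan:
    start: int  # index of the first character
    end: int  # index of the last character, plus one

    def overlaps(self, other: "SketchSpan") -> bool:
        # Half-open ranges share a character when each starts before the other ends
        return self.start < other.end and other.start < self.end


a = SketchSpan(0, 5)
b = SketchSpan(4, 8)
c = SketchSpan(5, 8)
```

Here `a.overlaps(b)` is True (character 4 is shared), while `a.overlaps(c)` is False because adjacent spans share no character.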

classmethod from_dict(span_dict)[source]#

Creates a Span from a dict

Parameters: span_dict (dict) – A dictionary from a serialized span as generated by to_dict()
Return type: Self

class ModifiedSpan(length, replaced_spans)[source]#

Slice of text not present in the original text

Parameters

length (int) – Number of characters
replaced_spans (List[medkit.core.text.span.Span]) – Slices of the original text that this span is replacing

Methods:

from_dict(modified_span_dict)

Creates a ModifiedSpan from a dict

classmethod from_dict(modified_span_dict)[source]#

Creates a ModifiedSpan from a dict

Parameters: modified_span_dict (dict) – A dictionary from a serialized ModifiedSpan as generated by to_dict()
Return type: Self
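
The relationship between Span and ModifiedSpan can be illustrated on a small text edit. The sketch below uses hypothetical stand-in dataclasses rather than the library classes, and assumes that the spans of a modified text are listed in order and that their lengths add up to the length of that text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchSpan:
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


@dataclass
class SketchModifiedSpan:
    length: int
    replaced_spans: List[SketchSpan]


original = "Dr. Smith"
# "Dr." (characters 0 to 3) is replaced by "Doctor" (6 characters)
modified = "Doctor" + original[3:]
spans = [
    # The 6 new characters are not in the original text; they replace characters 0 to 3
    SketchModifiedSpan(length=6, replaced_spans=[SketchSpan(0, 3)]),
    # " Smith" is carried over unchanged from the original text
    SketchSpan(3, 9),
]
total_length = sum(s.length for s in spans)
```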

Subpackages / Submodules#

`medkit.core.text.annotation`
`medkit.core.text.annotation_container`
`medkit.core.text.document`
`medkit.core.text.entity_attribute_container`
`medkit.core.text.entity_norm_attribute`
`medkit.core.text.operation`
`medkit.core.text.span`
`medkit.core.text.span_utils`
`medkit.core.text.utils`
