medkit.core.text

APIs

To access these APIs, import them like this:

from medkit.core.text import <api_to_import>

Classes:

ContextOperation([uid, name])

Abstract operation for context detection.

CustomTextOpType(value)

Enum class listing all supported function types for creating custom text operations

Entity(label, text, spans[, attrs, ...])

Text entity referencing part of a TextDocument.

EntityAttributeContainer(ann_id)

Manage a list of attributes attached to a text entity.

EntityNormAttribute(kb_name, kb_id[, ...])

Normalization attribute linking an entity to an ID in a knowledge base

ModifiedSpan(length, replaced_spans)

Slice of text not present in the original text

NEROperation([uid, name])

Abstract operation for detecting entities.

Relation(label, source_id, target_id[, ...])

Relation between two text entities.

Segment(label, text, spans[, attrs, ...])

Text segment referencing part of a TextDocument.

SegmentationOperation([uid, name])

Abstract operation for segmenting text.

Span(start, end)

Slice of text extracted from the original text

TextAnnotation(label[, attrs, metadata, ...])

Base abstract class for all text annotations

TextAnnotationContainer(doc_id, raw_segment)

Manage a list of text annotations belonging to a text document.

TextDocument(text[, anns, metadata, uid])

Document holding text annotations

Functions:

create_text_operation(function, function_type)

Function for instantiating a custom text operation from a user-defined function

class TextAnnotation(label, attrs=None, metadata=None, uid=None, attr_container_class=<class 'AttributeContainer'>)[source]#

Base abstract class for all text annotations

Variables
  • uid (str) – Unique identifier of the annotation.

  • label (str) – The label for this annotation (e.g., SENTENCE)

  • attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the annotation. Stored in an AttributeContainer but can be passed as a list at init.

  • metadata (Dict[str, Any]) – The metadata of the annotation

  • keys (Set[str]) – Pipeline output keys to which the annotation belongs.

class Segment(label, text, spans, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'AttributeContainer'>)[source]#

Text segment referencing part of a TextDocument.

Variables
  • uid (str) – The segment identifier.

  • label (str) – The label for this segment (e.g., SENTENCE)

  • text (str) – Text of the segment.

  • spans (List[medkit.core.text.span.AnySpan]) – List of spans indicating which parts of the segment text correspond to which part of the document’s full text.

  • attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the segment. Stored in an AttributeContainer but can be passed as a list at init.

  • metadata (Dict[str, Any]) – The metadata of the segment

  • keys (Set[str]) – Pipeline output keys to which the segment belongs.

Methods:

from_dict(segment_dict)

Creates a Segment from a dict

classmethod from_dict(segment_dict)[source]#

Creates a Segment from a dict

Parameters

segment_dict (dict) – A dictionary from a serialized segment as generated by to_dict()

Return type

Self

class Entity(label, text, spans, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'EntityAttributeContainer'>)[source]#

Text entity referencing part of a TextDocument.

Variables
  • uid (str) – The entity identifier.

  • label (str) – The label for this entity (e.g., DISEASE)

  • text (str) – Text of the entity.

  • spans (List[medkit.core.text.span.AnySpan]) – List of spans indicating which parts of the entity text correspond to which part of the document’s full text.

  • attrs (medkit.core.text.entity_attribute_container.EntityAttributeContainer) – Attributes of the entity. Stored in an EntityAttributeContainer but can be passed as a list at init.

  • metadata (Dict[str, Any]) – The metadata of the entity

  • keys (Set[str]) – Pipeline output keys to which the entity belongs.

class Relation(label, source_id, target_id, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'AttributeContainer'>)[source]#

Relation between two text entities.

Variables
  • uid (str) – The identifier of the relation

  • label (str) – The relation label

  • source_id (str) – The identifier of the entity from which the relation is defined

  • target_id (str) – The identifier of the entity to which the relation is defined

  • attrs (medkit.core.attribute_container.AttributeContainer) – The attributes of the relation

  • metadata (Dict[str, Any]) – The metadata of the relation

  • keys (Set[str]) – Pipeline output keys to which the relation belongs

Methods:

from_dict(relation_dict)

Creates a Relation from a dict

classmethod from_dict(relation_dict)[source]#

Creates a Relation from a dict

Parameters

relation_dict (dict) – A dictionary from a serialized relation as generated by to_dict()

Return type

Self

class TextAnnotationContainer(doc_id, raw_segment)[source]#

Manage a list of text annotations belonging to a text document.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

Also provides retrieval of entities, segments and relations, and handling of the raw segment.

Attributes:

entities

Return the list of entities

relations

Return the list of relations

segments

Return the list of segments

Methods:

get_entities(*[, label, key])

Return a list of the entities of the document, optionally filtering by label or key.

get_relations(*[, label, key, source_id])

Return a list of the relations of the document, optionally filtering by label, key or source entity.

get_segments(*[, label, key])

Return a list of the segments of the document (not including entities), optionally filtering by label or key.

property segments: List[medkit.core.text.annotation.Segment]#

Return the list of segments

Return type

List[Segment]

property entities: List[medkit.core.text.annotation.Entity]#

Return the list of entities

Return type

List[Entity]

property relations: List[medkit.core.text.annotation.Relation]#

Return the list of relations

Return type

List[Relation]

get_segments(*, label=None, key=None)[source]#

Return a list of the segments of the document (not including entities), optionally filtering by label or key.

Parameters
  • label (Optional[str]) – Label to use to filter segments.

  • key (Optional[str]) – Key to use to filter segments.

Return type

List[Segment]

get_entities(*, label=None, key=None)[source]#

Return a list of the entities of the document, optionally filtering by label or key.

Parameters
  • label (Optional[str]) – Label to use to filter entities.

  • key (Optional[str]) – Key to use to filter entities.

Return type

List[Entity]

get_relations(*, label=None, key=None, source_id=None)[source]#

Return a list of the relations of the document, optionally filtering by label, key or source entity.

Parameters
  • label (Optional[str]) – Label to use to filter relations.

  • key (Optional[str]) – Key to use to filter relations.

  • source_id (Optional[str]) – Identifier of the source entity to use to filter relations.

Return type

List[Relation]
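
The keyword-only filters above can be illustrated with a minimal sketch. RelationSketch and ContainerSketch are hypothetical stand-ins written for this example, not medkit classes, and the sketch assumes that several filters combine with a logical AND and that None means "do not filter"; the actual behaviour of TextAnnotationContainer may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class RelationSketch:
    """Stand-in for a relation; NOT medkit's Relation class."""
    label: str
    source_id: str
    target_id: str
    keys: Set[str] = field(default_factory=set)


class ContainerSketch:
    """Stand-in container; NOT medkit's TextAnnotationContainer."""

    def __init__(self, relations: List[RelationSketch]):
        self._relations = list(relations)

    def get_relations(
        self,
        *,  # every filter after the bare * must be passed by keyword
        label: Optional[str] = None,
        key: Optional[str] = None,
        source_id: Optional[str] = None,
    ) -> List[RelationSketch]:
        # Assumption: a filter left at None is ignored, and filters are ANDed.
        return [
            r
            for r in self._relations
            if (label is None or r.label == label)
            and (key is None or key in r.keys)
            and (source_id is None or r.source_id == source_id)
        ]


container = ContainerSketch(
    [
        RelationSketch("treats", "e1", "e2", {"out"}),
        RelationSketch("treats", "e3", "e2"),
        RelationSketch("causes", "e1", "e4", {"out"}),
    ]
)
treats = container.get_relations(label="treats")
treats_in_out = container.get_relations(label="treats", key="out")
from_e1 = container.get_relations(source_id="e1")
```

Because of the bare * in the signature, a positional call such as get_relations("treats") raises a TypeError; the filters always have to be named.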

class TextDocument(text, anns=None, metadata=None, uid=None)[source]#

Document holding text annotations

Annotations must be subclasses of TextAnnotation.

Variables
  • uid (str) – Unique identifier of the document.

  • text – Full document text.

  • anns (medkit.core.text.annotation_container.TextAnnotationContainer) – Annotations of the document. Stored in a TextAnnotationContainer but can be passed as a list at init.

  • metadata (Dict[str, Any]) – Document metadata.

  • raw_segment (medkit.core.text.annotation.Segment) –

    Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:

    >>> from medkit.core.text import TextDocument
    >>> doc = TextDocument(text="hello")
    >>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]
    

Methods:

from_dict(doc_dict)

Creates a TextDocument from a dict

get_snippet(segment, max_extend_length)

Return a portion of the original text containing the annotation

classmethod from_dict(doc_dict)[source]#

Creates a TextDocument from a dict

Parameters

doc_dict (dict) – A dictionary from a serialized TextDocument as generated by to_dict()

Return type

Self

get_snippet(segment, max_extend_length)[source]#

Return a portion of the original text containing the annotation

Parameters
  • segment (Segment) – The annotation

  • max_extend_length (int) – Maximum number of characters to use around the annotation

Return type

str

Returns

str – A portion of the text around the annotation

class EntityAttributeContainer(ann_id)[source]#

Manage a list of attributes attached to a text entity.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

Also provides retrieval of normalization attributes.

Attributes:

norms

Return the list of normalization attributes

Methods:

get_norms()

Return a list of the normalization attributes of the annotation

property norms: List[medkit.core.text.entity_norm_attribute.EntityNormAttribute]#

Return the list of normalization attributes

Return type

List[EntityNormAttribute]

get_norms()[source]#

Return a list of the normalization attributes of the annotation

Return type

List[EntityNormAttribute]

class EntityNormAttribute(kb_name, kb_id, kb_version=None, term=None, score=None, metadata=None, uid=None)[source]#

Normalization attribute linking an entity to an ID in a knowledge base

Variables
  • uid (str) – Identifier of the attribute

  • label (str) – The attribute label, always set to EntityNormAttribute.LABEL

  • kb_name (Optional[str]) – Name of the knowledge base (e.g., “icd”). Should always be provided except in special cases when we just want to store a normalized term.

  • kb_id (Optional[Any]) – ID in the knowledge base to which the annotation should be linked. Should always be provided except in special cases when we just want to store a normalized term.

  • kb_version (Optional[str]) – Optional version of the knowledge base.

  • term (Optional[str]) – Optional normalized version of the entity text.

  • score (Optional[float]) – Optional score reflecting confidence of this link.

  • metadata (Dict[str, Any]) – Metadata of the attribute

Attributes:

LABEL

Label used for all normalization attributes

LABEL: ClassVar[str] = 'NORMALIZATION'#

Label used for all normalization attributes

class ContextOperation(uid=None, name=None, **kwargs)[source]#

Abstract operation for context detection. It uses a list of segments as input for running the operation and creates attributes that are directly appended to these segments.

Common initialization for all annotators:
  • assigning identifier to operation

  • storing class name, name and config in description

Parameters
  • uid (str) – Operation identifier

  • name – Operation name (defaults to class name)

  • kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
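
The pattern above can be tried out with a minimal, self-contained sketch. OperationSketch and NegationDetectorSketch are hypothetical stand-ins written for this example, not medkit classes; the base class only mimics the documented behaviour (name defaulting to the class name, extra arguments kept as config). One caveat worth knowing: in a method that uses zero-argument super(), locals() also contains a `__class__` entry, so the sketch drops it before forwarding the arguments.

```python
class OperationSketch:
    """Stand-in base class; NOT medkit's operation base class."""

    def __init__(self, uid=None, name=None, **kwargs):
        self.uid = uid
        # The name defaults to the class name.
        self.name = name if name is not None else type(self).__name__
        # All other init arguments are kept to describe the operation.
        self.config = kwargs


class NegationDetectorSketch(OperationSketch):
    """Hypothetical annotator with one extra init argument."""

    def __init__(self, output_label, uid=None, name=None):
        # Capture the arguments first, before any other local is created.
        init_args = locals()
        init_args.pop("self")
        # Zero-argument super() exposes a __class__ cell through locals();
        # remove it so it does not end up in the stored config.
        init_args.pop("__class__", None)
        super().__init__(**init_args)
        self.output_label = output_label


op = NegationDetectorSketch(output_label="negated")
```

Here op.name falls back to the class name and op.config holds only the extra argument, which is what makes the description of the operation reproducible.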

class NEROperation(uid=None, name=None, **kwargs)[source]#

Abstract operation for detecting entities. It uses a list of segments as input and produces a list of detected entities.

Common initialization for all annotators:
  • assigning identifier to operation

  • storing class name, name and config in description

Parameters
  • uid (str) – Operation identifier

  • name – Operation name (defaults to class name)

  • kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)

class SegmentationOperation(uid=None, name=None, **kwargs)[source]#

Abstract operation for segmenting text. It uses a list of segments as input and produces a list of new segments.

Common initialization for all annotators:
  • assigning identifier to operation

  • storing class name, name and config in description

Parameters
  • uid (str) – Operation identifier

  • name – Operation name (defaults to class name)

  • kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)

class CustomTextOpType(value)[source]#

Enum class listing all supported function types for creating custom text operations

Variables
  • CREATE_ONE_TO_N – Takes 1 data item, returns N new data items

  • EXTRACT_ONE_TO_N – Takes 1 data item, returns N existing data items

  • FILTER – Takes 1 data item, returns True/False
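
The three function shapes listed above can be sketched with plain Python functions. ItemSketch and the function names below are hypothetical and chosen for illustration only; real user-defined functions would operate on medkit data items, and nothing here calls create_text_operation().

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ItemSketch:
    """Stand-in data item; NOT a medkit Segment."""
    label: str
    text: str
    children: List["ItemSketch"] = field(default_factory=list)


def split_words(item: ItemSketch) -> List[ItemSketch]:
    # CREATE_ONE_TO_N shape: 1 item in, N newly created items out.
    return [ItemSketch(label="word", text=w) for w in item.text.split()]


def pick_children(item: ItemSketch) -> List[ItemSketch]:
    # EXTRACT_ONE_TO_N shape: 1 item in, N already existing items out.
    return [c for c in item.children if c.label == "word"]


def is_long(item: ItemSketch) -> bool:
    # FILTER shape: 1 item in, True/False out.
    return len(item.text) > 3


sentence = ItemSketch(label="sentence", text="no sign of fever")
words = split_words(sentence)           # creates four new items
sentence.children.extend(words)
extracted = pick_children(sentence)     # returns the same, existing items
long_words = [w for w in words if is_long(w)]
```

The difference between the first two shapes is whether the returned items are new objects or ones that already exist elsewhere, which matters for provenance tracking.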

create_text_operation(function, function_type, name=None)[source]#

Function for instantiating a custom text operation from a user-defined function

Parameters
  • function (Callable) – User-defined function

  • function_type (CustomTextOpType) – Type of function. Supported values are defined in CustomTextOpType

  • name (Optional[str]) – Name of the operation used for provenance info (default: function name)

Return type

_CustomTextOperation

Returns

operation – An instance of a custom text operation

class Span(start, end)[source]#

Slice of text extracted from the original text

Parameters
  • start (int) – Index of the first character in the original text

  • end (int) – Index of the last character in the original text, plus one

Methods:

from_dict(span_dict)

Creates a Span from a dict

overlaps(other)

Test if 2 spans reference at least one character in common

overlaps(other)[source]#

Test if 2 spans reference at least one character in common
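
Since end is the index of the last character plus one, spans are half-open intervals, and the overlap test can be sketched as follows. SpanSketch is a stand-in written for this example; it is an assumption about how such a check can be done, not a copy of medkit's Span.overlaps().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Stand-in for a span; NOT medkit's Span class."""
    start: int  # index of the first character
    end: int    # index of the last character, plus one

    def overlaps(self, other: "SpanSketch") -> bool:
        # Two half-open intervals share at least one character
        # exactly when each one starts before the other one ends.
        return self.start < other.end and other.start < self.end


a = SpanSketch(0, 5)
b = SpanSketch(4, 8)  # shares the character at index 4 with a
c = SpanSketch(5, 8)  # starts exactly where a ends: nothing in common

overlap_ab = a.overlaps(b)
overlap_ac = a.overlaps(c)
```

Under this reading, two spans that merely touch (one ends where the other starts) have no character in common and do not overlap.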

classmethod from_dict(span_dict)[source]#

Creates a Span from a dict

Parameters

span_dict (dict) – A dictionary from a serialized span as generated by to_dict()

Return type

Self

class ModifiedSpan(length, replaced_spans)[source]#

Slice of text not present in the original text

Parameters
  • length (int) – Number of characters

  • replaced_spans (List[medkit.core.text.span.Span]) – Slices of the original text that this span is replacing
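
How regular and modified spans can describe an edited text together is sketched below for the replacement of "don't" by "do not". SpanSketch and ModifiedSpanSketch are stand-ins mirroring only the parameters documented above; they are not medkit classes, and the length property on SpanSketch is an addition made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpanSketch:
    """Stand-in for Span(start, end); NOT the medkit class."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


@dataclass
class ModifiedSpanSketch:
    """Stand-in for ModifiedSpan(length, replaced_spans); NOT the medkit class."""
    length: int
    replaced_spans: List[SpanSketch]


original = "I don't know"
modified = "I do not know"

# "don't" occupies characters 2..7 of the original text and is replaced
# by the 6 characters of "do not", which do not exist in the original.
spans = [
    SpanSketch(0, 2),                           # "I "
    ModifiedSpanSketch(6, [SpanSketch(2, 7)]),  # "do not"
    SpanSketch(7, 12),                          # " know"
]
total_length = sum(s.length for s in spans)
```

The span lengths add up to the length of the modified text, while every part of it can still be traced back to a region of the original text.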

Methods:

from_dict(modified_span_dict)

Creates a ModifiedSpan from a dict

classmethod from_dict(modified_span_dict)[source]#

Creates a ModifiedSpan from a dict

Parameters

modified_span_dict (dict) – A dictionary from a serialized ModifiedSpan as generated by to_dict()

Return type

Self

Subpackages / Submodules

medkit.core.text.annotation

medkit.core.text.annotation_container

medkit.core.text.document

medkit.core.text.entity_attribute_container

medkit.core.text.entity_norm_attribute

medkit.core.text.operation

medkit.core.text.span

medkit.core.text.span_utils

medkit.core.text.utils