medkit.core.text
APIs
To access these APIs, use an import such as:
from medkit.core.text import <api_to_import>
Classes:
ContextOperation – Abstract operation for context detection.
CustomTextOpType – Enum class listing all supported function types for creating custom text operations.
Entity – Text entity referencing part of a TextDocument.
EntityAttributeContainer – Manage a list of attributes attached to a text entity.
EntityNormAttribute – Normalization attribute linking an entity to an ID in a knowledge base.
ModifiedSpan – Slice of text not present in the original text.
NEROperation – Abstract operation for detecting entities.
Relation – Relation between two text entities.
Segment – Text segment referencing part of a TextDocument.
SegmentationOperation – Abstract operation for segmenting text.
Span – Slice of text extracted from the original text.
TextAnnotation – Base abstract class for all text annotations.
TextAnnotationContainer – Manage a list of text annotations belonging to a text document.
TextDocument – Document holding text annotations.
Functions:
create_text_operation – Function for instantiating a custom text operation from a user-defined function.
- class TextAnnotation(label, attrs=None, metadata=None, uid=None, attr_container_class=<class 'AttributeContainer'>)
Base abstract class for all text annotations
- Variables
uid (str) – Unique identifier of the annotation.
label (str) – The label for this annotation (e.g., SENTENCE)
attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the annotation. Stored in an AttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – The metadata of the annotation
keys (Set[str]) – Pipeline output keys to which the annotation belongs.
- class Segment(label, text, spans, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'AttributeContainer'>)
Text segment referencing part of a TextDocument.
- Variables
uid (str) – The segment identifier.
label (str) – The label for this segment (e.g., SENTENCE)
text (str) – Text of the segment.
spans (List[medkit.core.text.span.AnySpan]) – List of spans indicating which parts of the segment text correspond to which part of the document’s full text.
attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the segment. Stored in an AttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – The metadata of the segment
keys (Set[str]) – Pipeline output keys to which the segment belongs.
Methods:
from_dict(segment_dict) – Creates a Segment from a dict
- class Entity(label, text, spans, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'EntityAttributeContainer'>)
Text entity referencing part of a TextDocument.
- Variables
uid (str) – The entity identifier.
label (str) – The label for this entity (e.g., DISEASE)
text (str) – Text of the entity.
spans (List[medkit.core.text.span.AnySpan]) – List of spans indicating which parts of the entity text correspond to which part of the document’s full text.
attrs (medkit.core.text.entity_attribute_container.EntityAttributeContainer) – Attributes of the entity. Stored in an EntityAttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – The metadata of the entity
keys (Set[str]) – Pipeline output keys to which the entity belongs.
- class Relation(label, source_id, target_id, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'AttributeContainer'>)
Relation between two text entities.
- Variables
uid (str) – The identifier of the relation
label (str) – The relation label
source_id (str) – The identifier of the entity from which the relation is defined
target_id (str) – The identifier of the entity to which the relation is defined
attrs (medkit.core.attribute_container.AttributeContainer) – The attributes of the relation
metadata (Dict[str, Any]) – The metadata of the relation
keys (Set[str]) – Pipeline output keys to which the relation belongs
Methods:
from_dict(relation_dict) – Creates a Relation from a dict
- class TextAnnotationContainer(doc_id, raw_segment)
Manage a list of text annotations belonging to a text document.
This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.
Also provides retrieval of entities, segments and relations, and handling of the raw segment.
Attributes:
entities – Return the list of entities
relations – Return the list of relations
segments – Return the list of segments
Methods:
get_entities(*[, label, key]) – Return a list of the entities of the document, optionally filtering by label or key.
get_relations(*[, label, key, source_id]) – Return a list of the relations of the document, optionally filtering by label, key or source entity.
get_segments(*[, label, key]) – Return a list of the segments of the document (not including entities), optionally filtering by label or key.
- property segments: List[medkit.core.text.annotation.Segment]
Return the list of segments
- Return type
List[Segment]
- property entities: List[medkit.core.text.annotation.Entity]
Return the list of entities
- Return type
List[Entity]
- property relations: List[medkit.core.text.annotation.Relation]
Return the list of relations
- Return type
List[Relation]
- get_segments(*, label=None, key=None)
Return a list of the segments of the document (not including entities), optionally filtering by label or key.
- Parameters
label (Optional[str]) – Label to use to filter segments.
key (Optional[str]) – Key to use to filter segments.
- Return type
List[Segment]
- get_entities(*, label=None, key=None)
Return a list of the entities of the document, optionally filtering by label or key.
- Parameters
label (Optional[str]) – Label to use to filter entities.
key (Optional[str]) – Key to use to filter entities.
- Return type
List[Entity]
- get_relations(*, label=None, key=None, source_id=None)
Return a list of the relations of the document, optionally filtering by label, key or source entity.
- Parameters
label (Optional[str]) – Label to use to filter relations.
key (Optional[str]) – Key to use to filter relations.
source_id (Optional[str]) – Identifier of the source entity to use to filter relations.
- Return type
List[Relation]
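The optional label and key filters described above can be illustrated with a small standalone sketch. It does not use medkit: `Ann` and `filter_anns` are hypothetical stand-ins, and combining the filters with a logical AND is an assumption about how the container behaves rather than something this page states.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Ann:
    """Hypothetical stand-in for an annotation: only a label and its keys."""

    label: str
    keys: Set[str] = field(default_factory=set)


def filter_anns(
    anns: List[Ann], *, label: Optional[str] = None, key: Optional[str] = None
) -> List[Ann]:
    """Keep annotations matching every filter that is not None (assumed AND)."""
    return [
        ann
        for ann in anns
        if (label is None or ann.label == label)
        and (key is None or key in ann.keys)
    ]


anns = [
    Ann("sentence", {"sentences"}),
    Ann("disease", {"entities"}),
    Ann("disease"),
]
diseases = filter_anns(anns, label="disease")
keyed_diseases = filter_anns(anns, label="disease", key="entities")
```

Calling the function with no filter returns every annotation, which mirrors the keyword-only, all-optional signatures listed above.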
- class TextDocument(text, anns=None, metadata=None, uid=None)
Document holding text annotations
Annotations must be subclasses of TextAnnotation.
- Variables
uid (str) – Unique identifier of the document.
text (str) – Full document text.
anns (medkit.core.text.annotation_container.TextAnnotationContainer) – Annotations of the document. Stored in a TextAnnotationContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – Document metadata.
raw_segment (medkit.core.text.annotation.Segment) – Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:
>>> doc = TextDocument(text="hello")
>>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]
Methods:
from_dict(doc_dict) – Creates a TextDocument from a dict
get_snippet(segment, max_extend_length) – Return a portion of the original text containing the annotation
- classmethod from_dict(doc_dict)
Creates a TextDocument from a dict
- Parameters
doc_dict (dict) – A dictionary from a serialized TextDocument as generated by to_dict()
- Return type
Self
- get_snippet(segment, max_extend_length)
Return a portion of the original text containing the annotation
- Parameters
segment (Segment) – The annotation
max_extend_length (int) – Maximum number of characters to use around the annotation
- Return type
str
- Returns
str – A portion of the text around the annotation
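To make the role of max_extend_length concrete, here is a minimal sketch on plain strings. The `snippet` helper is hypothetical, and the way the character budget is split (half before the annotation, the remainder after it) is an assumption, since this page only says the value bounds the characters used around the annotation.

```python
def snippet(text: str, start: int, end: int, max_extend_length: int) -> str:
    """Return text[start:end] plus at most max_extend_length extra characters.

    Assumption: up to half of the budget is spent before the annotation,
    and whatever is left is spent after it, clamped to the text bounds.
    """
    start_ext = max(start - max_extend_length // 2, 0)
    remaining = max_extend_length - (start - start_ext)
    end_ext = min(end + remaining, len(text))
    return text[start_ext:end_ext]


text = "The patient has asthma and takes salbutamol daily."
start = text.index("asthma")
end = start + len("asthma")
context = snippet(text, start, end, max_extend_length=10)
# context == " has asthma and "
```

When the annotation sits at the very start of the text, the unused "before" budget carries over to the right-hand side under this scheme.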
- class EntityAttributeContainer(ann_id)
Manage a list of attributes attached to a text entity.
This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.
Also provides retrieval of normalization attributes.
Attributes:
norms – Return the list of normalization attributes
Methods:
get_norms() – Return a list of the normalization attributes of the annotation
- property norms: List[medkit.core.text.entity_norm_attribute.EntityNormAttribute]
Return the list of normalization attributes
- Return type
List[EntityNormAttribute]
- get_norms()
Return a list of the normalization attributes of the annotation
- Return type
List[EntityNormAttribute]
- class EntityNormAttribute(kb_name, kb_id, kb_version=None, term=None, score=None, metadata=None, uid=None)
Normalization attribute linking an entity to an ID in a knowledge base
- Variables
uid (str) – Identifier of the attribute
label (str) – The attribute label, always set to EntityNormAttribute.LABEL
kb_name (Optional[str]) – Name of the knowledge base (e.g., “icd”). Should always be provided except in special cases when we just want to store a normalized term.
kb_id (Optional[Any]) – ID in the knowledge base to which the annotation should be linked. Should always be provided except in special cases when we just want to store a normalized term.
kb_version (Optional[str]) – Optional version of the knowledge base.
term (Optional[str]) – Optional normalized version of the entity text.
score (Optional[float]) – Optional score reflecting confidence of this link.
metadata (Dict[str, Any]) – Metadata of the attribute
Attributes:
LABEL – Label used for all normalization attributes
- LABEL: ClassVar[str] = 'NORMALIZATION'
Label used for all normalization attributes
- class ContextOperation(uid=None, name=None, **kwargs)
Abstract operation for context detection. It uses a list of segments as input for running the operation and creates attributes that are directly appended to these segments.
- Common initialization for all annotators:
assigning identifier to operation
storing class name, name and config in description
- Parameters
uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation
Examples
In the __init__ function of your annotator, use:
>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
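The `locals()` idiom from the example can be tried in isolation. The sketch below is a simplified illustration, not medkit code: `BaseOp` and `NegationDetector` are hypothetical classes, and the stand-in base simply stores the extra keyword arguments as a config dict. It also drops the implicit `__class__` entry that zero-argument `super()` can expose through `locals()`.

```python
from typing import Any, Dict, Optional


class BaseOp:
    """Hypothetical stand-in for an operation base class (not medkit's)."""

    def __init__(
        self, uid: Optional[str] = None, name: Optional[str] = None, **kwargs: Any
    ):
        # Zero-argument super() in the child can make an implicit
        # "__class__" entry show up in locals(); drop it if present.
        kwargs.pop("__class__", None)
        self.uid = uid
        self.name = name if name is not None else type(self).__name__
        # The remaining init arguments describe the operation.
        self.config: Dict[str, Any] = kwargs


class NegationDetector(BaseOp):
    def __init__(
        self, window: int = 5, uid: Optional[str] = None, name: Optional[str] = None
    ):
        # Capture the arguments first, before defining any other local name.
        init_args = locals()
        init_args.pop("self")
        super().__init__(**init_args)
        self.window = window


op = NegationDetector(window=3, name="neg")
```

Calling `locals()` as the first statement matters: any variable defined earlier in `__init__` would otherwise be forwarded to the parent as if it were an init argument.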
- class NEROperation(uid=None, name=None, **kwargs)
Abstract operation for detecting entities. It uses a list of segments as input and produces a list of detected entities.
- Common initialization for all annotators:
assigning identifier to operation
storing class name, name and config in description
- Parameters
uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation
Examples
In the __init__ function of your annotator, use:
>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
- class SegmentationOperation(uid=None, name=None, **kwargs)
Abstract operation for segmenting text. It uses a list of segments as input and produces a list of new segments.
- Common initialization for all annotators:
assigning identifier to operation
storing class name, name and config in description
- Parameters
uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation
Examples
In the __init__ function of your annotator, use:
>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
- class CustomTextOpType(value)
Enum class listing all supported function types for creating custom text operations
- Variables
CREATE_ONE_TO_N – Takes 1 data item, returns N new data items
EXTRACT_ONE_TO_N – Takes 1 data item, returns N existing data items
FILTER – Takes 1 data item, returns True/False
- create_text_operation(function, function_type, name=None)
Function for instantiating a custom text operation from a user-defined function
- Parameters
function (Callable) – User-defined function
function_type (CustomTextOpType) – Type of function. Supported values are defined in CustomTextOpType
name (Optional[str]) – Name of the operation used for provenance info (default: function name)
- Return type
_CustomTextOperation
- Returns
operation – An instance of a custom text operation
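The three function types can be pictured with a standalone sketch that works on plain Python values. It does not call medkit: `OpType` and `run` are hypothetical stand-ins, and the assumption that a FILTER function keeps the items for which it returns True is a guess at the intended behaviour. The sketch also ignores any bookkeeping that may distinguish new items from existing ones.

```python
from enum import Enum
from typing import Any, Callable, List


class OpType(Enum):
    """Local stand-in mirroring the three documented function types."""

    CREATE_ONE_TO_N = "create_one_to_n"
    EXTRACT_ONE_TO_N = "extract_one_to_n"
    FILTER = "filter"


def run(
    function: Callable[[Any], Any], function_type: OpType, items: List[Any]
) -> List[Any]:
    """Apply a user-defined function to each item according to its type."""
    if function_type is OpType.FILTER:
        # The function returns True/False; keep the accepted items (assumed).
        return [item for item in items if function(item)]
    # Both one-to-N types return a list per item; flatten the results.
    return [out for item in items for out in function(item)]


sentences = ["no fever", "mild asthma"]
# CREATE_ONE_TO_N: each sentence yields N newly created words.
words = run(str.split, OpType.CREATE_ONE_TO_N, sentences)
# FILTER: keep only the words longer than four characters.
long_words = run(lambda w: len(w) > 4, OpType.FILTER, words)
# EXTRACT_ONE_TO_N: each record yields items it already holds.
records = [{"entities": ["asthma"]}, {"entities": ["fever", "cough"]}]
entities = run(lambda r: r["entities"], OpType.EXTRACT_ONE_TO_N, records)
```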
- class Span(start, end)
Slice of text extracted from the original text
- Parameters
start (int) – Index of the first character in the original text
end (int) – Index of the last character in the original text, plus one
Methods:
from_dict(span_dict) – Creates a Span from a dict
overlaps(other) – Test if 2 spans reference at least one character in common
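Because end is the index of the last character plus one, a span behaves like a half-open Python slice. The sketch below shows the overlap test that follows from that convention; `SimpleSpan` is a hypothetical stand-in, not medkit's Span class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleSpan:
    """Hypothetical stand-in for a span over the original text."""

    start: int  # index of the first character
    end: int  # index of the last character, plus one

    def overlaps(self, other: "SimpleSpan") -> bool:
        # Half-open intervals share a character only if each one
        # starts strictly before the other one ends.
        return self.start < other.end and other.start < self.end


text = "hello world"
hello = SimpleSpan(0, 5)
world = SimpleSpan(6, 11)
lo_wo = SimpleSpan(3, 8)
```

With this convention, `text[hello.start:hello.end]` yields "hello", and two spans that merely touch, such as (0, 5) and (5, 8), do not overlap.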
- class ModifiedSpan(length, replaced_spans)
Slice of text not present in the original text
- Parameters
length (int) – Number of characters
replaced_spans (List[medkit.core.text.span.Span]) – Slices of the original text that this span is replacing
Methods:
from_dict(modified_span_dict) – Creates a ModifiedSpan from a dict
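One way to read the relationship between the two span types: a modified span records how long the replacement text is and which original slices it stands for. The sketch below uses hypothetical stand-in classes and helper functions (none of them medkit's) to map a mixed list of spans back to the original text.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class SimpleSpan:
    start: int
    end: int


@dataclass(frozen=True)
class SimpleModifiedSpan:
    length: int  # number of characters in the replacement text
    replaced_spans: List[SimpleSpan]  # original slices being replaced


AnySimpleSpan = Union[SimpleSpan, SimpleModifiedSpan]


def to_original_spans(spans: List[AnySimpleSpan]) -> List[SimpleSpan]:
    """Swap each modified span for the original slices it replaces."""
    result: List[SimpleSpan] = []
    for span in spans:
        if isinstance(span, SimpleModifiedSpan):
            result.extend(span.replaced_spans)
        else:
            result.append(span)
    return result


def modified_length(spans: List[AnySimpleSpan]) -> int:
    """Length of the text described by the spans, after modification."""
    return sum(
        s.length if isinstance(s, SimpleModifiedSpan) else s.end - s.start
        for s in spans
    )


original = "pt has asthma"
# "pt" (characters 0-2) was expanded to "patient" (7 characters).
spans = [SimpleModifiedSpan(7, [SimpleSpan(0, 2)]), SimpleSpan(2, 13)]
original_spans = to_original_spans(spans)
```

Here the modified text "patient has asthma" is 18 characters long, while the recovered spans still point into the 13-character original.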