medkit.core.text.document

Classes:

TextDocument(text[, anns, metadata, uid])

Document holding text annotations

class TextDocument(text, anns=None, metadata=None, uid=None)

Document holding text annotations

Annotations must be subclasses of TextAnnotation.

Variables
  • uid (str) – Unique identifier of the document.

  • text (str) – Full document text.

  • anns (medkit.core.text.annotation_container.TextAnnotationContainer) – Annotations of the document. Stored in a TextAnnotationContainer, but can be passed as a list at init.

  • metadata (Dict[str, Any]) – Document metadata.

  • raw_segment (medkit.core.text.annotation.Segment) –

    Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:

    >>> doc = TextDocument(text="hello")
    >>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]
    

Methods:

from_dict(doc_dict)

Creates a TextDocument from a dict

get_snippet(segment, max_extend_length)

Return a portion of the original text containing the annotation

classmethod from_dict(doc_dict)

Creates a TextDocument from a dict

Parameters

doc_dict (dict) – A dictionary produced by serializing a TextDocument with to_dict()

Return type

Self
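
The round trip between to_dict() and from_dict() can be illustrated with a minimal stand-in class. This is only a sketch: MiniDocument and its dictionary keys ("uid", "text", "metadata") are hypothetical and chosen for illustration. They are not medkit's actual serialization format, and annotations are left out entirely.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class MiniDocument:
    """Hypothetical stand-in for TextDocument (not part of medkit)."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    # Generate a fresh identifier when none is supplied.
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_dict(self) -> Dict[str, Any]:
        # The key names here are an assumption made for this sketch.
        return {"uid": self.uid, "text": self.text, "metadata": dict(self.metadata)}

    @classmethod
    def from_dict(cls, doc_dict: Dict[str, Any]) -> "MiniDocument":
        # Rebuild the instance, reusing the serialized uid so identity is preserved.
        return cls(
            text=doc_dict["text"],
            metadata=dict(doc_dict["metadata"]),
            uid=doc_dict["uid"],
        )


original = MiniDocument(text="hello", metadata={"source": "demo"})
restored = MiniDocument.from_dict(original.to_dict())
```

In this sketch the uid is carried through the dictionary, so the restored object compares equal to the original rather than receiving a new identifier.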

get_snippet(segment, max_extend_length)

Return a portion of the original text containing the annotation

Parameters
  • segment (Segment) – The annotation around which to extract the snippet

  • max_extend_length (int) – Maximum number of characters to use around the annotation

Return type

str

Returns

str – A portion of the text around the annotation
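
One way to picture snippet extraction is a plain function over character offsets. The sketch below is an assumption, not medkit's implementation: snippet_around is a hypothetical helper, it takes start and end offsets instead of a Segment, and it treats max_extend_length as a total context budget, spending up to half on the left and the remainder on the right.

```python
def snippet_around(text: str, start: int, end: int, max_extend_length: int) -> str:
    """Hypothetical helper (not part of medkit): text[start:end] plus context."""
    if not (0 <= start <= end <= len(text)):
        raise ValueError("span must lie within the text")
    # Spend up to half of the budget on the left context, clamped at the text start.
    left = max(start - max_extend_length // 2, 0)
    # Whatever was not used on the left is available on the right.
    remaining = max_extend_length - (start - left)
    right = min(end + remaining, len(text))
    return text[left:right]


text = "The patient reports mild chest pain since Monday."
start = text.index("chest pain")
end = start + len("chest pain")
snippet = snippet_around(text, start, end, max_extend_length=10)
print(snippet)  # mild chest pain sinc
```

Under this interpretation the snippet never exceeds the annotation length plus max_extend_length, and a span near the beginning of the text gets more context on the right instead of being shortened.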