medkit.core.text.document
=========================

.. py:module:: medkit.core.text.document


Classes
-------

.. autoapisummary::

   medkit.core.text.document.TextDocument


Module Contents
---------------

.. py:class:: TextDocument(text: str, anns: Sequence[medkit.core.text.annotation.TextAnnotation] | None = None, attrs: Sequence[medkit.core.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)

   Bases: :py:obj:`medkit.core.dict_conv.SubclassMapping`


   Document holding text annotations.

   Annotations must be subclasses of `TextAnnotation`.


   .. rubric:: Examples

   >>> doc = TextDocument(text="hello")
   >>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]

   :Attributes:

       **uid** : str
           Unique identifier of the document.

       **text** : str
           Full document text.

       **anns** : TextAnnotationContainer
           Annotations of the document. Stored in an
           :class:`~.text.TextAnnotationContainer` but can be passed as a list at init.

       **attrs** : AttributeContainer
           Attributes of the document. Stored in an
           :class:`~.core.AttributeContainer` but can be passed as a list at init.

       **metadata** : dict of str to Any
           Document metadata.

       **raw_segment** : Segment
           Auto-generated segment containing the full unprocessed document text. To
           get the raw text as an annotation to pass to processing operations, look
           it up in ``anns`` by its label, ``TextDocument.RAW_LABEL``.


   ..
       !! processed by numpydoc !!
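
   The relationship between ``text``, ``raw_segment`` and ``RAW_LABEL`` can be
   pictured with the stand-alone sketch below. It does not import medkit:
   ``SketchSegment`` and ``make_raw_segment`` are hypothetical stand-ins, and the
   single span covering the whole text is an assumption drawn from the
   description of ``raw_segment``, not from medkit's source.

   .. code-block:: python

      from dataclasses import dataclass

      RAW_LABEL = "RAW_TEXT"  # same value as the documented TextDocument.RAW_LABEL


      @dataclass
      class SketchSegment:
          """Hypothetical stand-in for a text segment (not medkit's Segment)."""

          label: str
          text: str
          spans: list  # list of (start, end) character offsets


      def make_raw_segment(text: str) -> SketchSegment:
          # Assumption: the raw segment covers the full, unprocessed text.
          return SketchSegment(label=RAW_LABEL, text=text, spans=[(0, len(text))])


      raw = make_raw_segment("hello")

   In this sketch ``raw.label`` is ``"RAW_TEXT"`` and ``raw.spans`` is
   ``[(0, 5)]``, i.e. the segment covers all five characters of the text.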

   .. py:attribute:: RAW_LABEL
      :type:  ClassVar[str]
      :value: 'RAW_TEXT'


   .. py:attribute:: uid
      :type:  str


   .. py:attribute:: anns
      :type:  medkit.core.text.annotation_container.TextAnnotationContainer


   .. py:attribute:: attrs
      :type:  medkit.core.AttributeContainer


   .. py:attribute:: metadata
      :type:  dict[str, Any]


   .. py:attribute:: raw_segment
      :type:  medkit.core.text.annotation.Segment


   .. py:method:: _generate_raw_segment(text: str, doc_id: str) -> medkit.core.text.annotation.Segment
      :classmethod:


   .. py:property:: text
      :type: str


   .. py:method:: __init_subclass__()
      :classmethod:


   .. py:method:: to_dict(with_anns: bool = True) -> dict[str, Any]


   .. py:method:: from_dict(doc_dict: dict[str, Any]) -> typing_extensions.Self
      :classmethod:


      Create a TextDocument from a dict.


      :Parameters:

          **doc_dict** : dict of str to Any
              Dictionary form of a serialized TextDocument, as generated by ``to_dict()``.


      ..
          !! processed by numpydoc !!
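
      The serialize-then-restore round trip between ``to_dict()`` and
      ``from_dict()`` can be sketched without medkit as follows.
      ``SketchDocument`` is a hypothetical stand-in, and since the keys medkit
      actually writes are not listed on this page, the ``"text"`` and
      ``"metadata"`` keys are assumptions.

      .. code-block:: python

         from dataclasses import dataclass, field
         from typing import Any


         @dataclass
         class SketchDocument:
             """Hypothetical stand-in, not medkit's TextDocument."""

             text: str
             metadata: dict[str, Any] = field(default_factory=dict)

             def to_dict(self) -> dict[str, Any]:
                 # Plain-dict form that could be written out as JSON.
                 return {"text": self.text, "metadata": dict(self.metadata)}

             @classmethod
             def from_dict(cls, doc_dict: dict[str, Any]) -> "SketchDocument":
                 # Rebuild the document from the dict produced by to_dict().
                 return cls(text=doc_dict["text"], metadata=dict(doc_dict["metadata"]))


         doc = SketchDocument(text="hello", metadata={"source": "demo"})
         restored = SketchDocument.from_dict(doc.to_dict())

      The restored object compares equal to the original but shares no mutable
      state with it, which is the property a round trip is usually expected to have.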


   .. py:method:: from_file(path: os.PathLike, encoding: str = 'utf-8') -> typing_extensions.Self
      :classmethod:


      Create a document from a text file.


      :Parameters:

          **path** : Path
              Path of the text file

          **encoding** : str, default="utf-8"
              Text encoding to use

      :Returns:

          TextDocument
              Text document with contents of `path` as text. The file path is
              included in the document metadata.


      ..
          !! processed by numpydoc !!


   .. py:method:: from_dir(path: os.PathLike, pattern: str = '*.txt', encoding: str = 'utf-8') -> list[typing_extensions.Self]
      :classmethod:


      Create documents from text files in a directory.


      :Parameters:

          **path** : Path
              Path of the directory containing text files

          **pattern** : str, default="\*.txt"
              Glob pattern to match text files in `path`

          **encoding** : str, default="utf-8"
              Text encoding to use

      :Returns:

          list of TextDocument
              Text documents with contents of each file as text


      ..
          !! processed by numpydoc !!
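
      The glob-and-read behaviour described above can be approximated with the
      standard library alone. ``read_text_files`` is a hypothetical helper that
      returns plain ``(file name, text)`` pairs rather than documents. Sorting
      the matches is an assumption made here for reproducibility; this page does
      not say in which order medkit returns the documents.

      .. code-block:: python

         import tempfile
         from pathlib import Path


         def read_text_files(path, pattern="*.txt", encoding="utf-8"):
             """Hypothetical helper: (file name, text) pairs for matching files."""
             # Path.glob() gives no ordering guarantee, so sort for determinism.
             return [
                 (file.name, file.read_text(encoding=encoding))
                 for file in sorted(Path(path).glob(pattern))
             ]


         with tempfile.TemporaryDirectory() as tmp:
             Path(tmp, "a.txt").write_text("first note", encoding="utf-8")
             Path(tmp, "b.txt").write_text("second note", encoding="utf-8")
             Path(tmp, "ignored.md").write_text("not matched", encoding="utf-8")
             loaded = read_text_files(tmp)

      Only the two ``.txt`` files are picked up; ``ignored.md`` does not match
      the default pattern.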


   .. py:method:: get_snippet(segment: medkit.core.text.annotation.Segment, max_extend_length: int) -> str

      
      Return a portion of the original text containing the annotation.


      :Parameters:

          **segment** : Segment
              The annotation to build the snippet around

          **max_extend_length** : int
              Maximum number of characters to use around the annotation

      :Returns:

          str
              A portion of the text around the annotation


      ..
          !! processed by numpydoc !!
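
      As a rough illustration of what a snippet looks like, the sketch below
      slices a string around a character span. ``snippet_around`` is a
      hypothetical helper that takes raw offsets instead of a ``Segment``. It
      assumes ``max_extend_length`` is a total budget, with up to half spent
      before the span and the remainder after it; this page does not specify how
      medkit distributes the extra characters.

      .. code-block:: python

         def snippet_around(text: str, start: int, end: int, max_extend_length: int) -> str:
             """Hypothetical helper: slice `text` around the span [start, end)."""
             # Spend up to half of the budget before the span, clamped at 0.
             start_extended = max(start - max_extend_length // 2, 0)
             # Whatever was not used before the span may be used after it.
             remaining = max_extend_length - (start - start_extended)
             end_extended = min(end + remaining, len(text))
             return text[start_extended:end_extended]


         text = "The patient has asthma and a mild fever."
         start = text.index("asthma")
         snippet = snippet_around(text, start, start + len("asthma"), max_extend_length=10)

      With a budget of 10, the snippet holds the 6 characters of ``asthma`` plus
      5 characters of context on each side. When the span sits at the very start
      of the text, the unused left-hand budget carries over to the right.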