medkit.text.postprocessing
==========================

.. py:module:: medkit.text.postprocessing


Submodules
----------

.. toctree::
   :maxdepth: 1

   /reference/api/medkit/text/postprocessing/alignment_utils/index
   /reference/api/medkit/text/postprocessing/attribute_duplicator/index
   /reference/api/medkit/text/postprocessing/document_splitter/index
   /reference/api/medkit/text/postprocessing/overlapping/index


Classes
-------

.. autoapisummary::

   medkit.text.postprocessing.AttributeDuplicator
   medkit.text.postprocessing.DocumentSplitter


Functions
---------

.. autoapisummary::

   medkit.text.postprocessing.compute_nested_segments
   medkit.text.postprocessing.filter_overlapping_entities


Package Contents
----------------

.. py:function:: compute_nested_segments(source_segments: list[medkit.core.text.Segment], target_segments: list[medkit.core.text.Segment]) -> list[tuple[medkit.core.text.Segment, list[medkit.core.text.Segment]]]

   
   Return each source segment paired with its nested segments.

   Only target segments fully contained in a source segment are returned.

   :Parameters:

       **source_segments** : list of Segment
           List of source segments

       **target_segments** : list of Segment
           List of segments to align


   :Returns:

       list of tuple
           List of (source segment, nested segments) tuples


   ..
       !! processed by numpydoc !!
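
   The containment rule can be illustrated with a minimal pure-Python sketch.
   It is not medkit's implementation: ``Seg`` and ``nested_pairs`` are
   hypothetical stand-ins, and each segment is assumed to have a single
   contiguous span.

   .. code-block:: python

      from __future__ import annotations

      from typing import NamedTuple


      class Seg(NamedTuple):
          """Hypothetical stand-in for a segment: a label and one (start, end) span."""

          label: str
          start: int
          end: int


      def nested_pairs(sources: list[Seg], targets: list[Seg]) -> list[tuple[Seg, list[Seg]]]:
          """Pair each source with the targets whose span lies fully inside it."""
          return [
              (src, [t for t in targets if src.start <= t.start and t.end <= src.end])
              for src in sources
          ]


      sentence = Seg("sentence", 0, 30)
      inside = Seg("disease", 10, 18)
      straddling = Seg("drug", 25, 40)  # ends after the sentence, so it is not nested

      pairs = nested_pairs([sentence], [inside, straddling])

   Here only ``inside`` is paired with ``sentence``; the straddling segment is
   dropped because it is not fully contained.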

.. py:class:: AttributeDuplicator(attr_labels: list[str], uid: str | None = None)

   Bases: :py:obj:`medkit.core.Operation`


   Annotator to copy attributes from a source segment to its nested segments.

   For each attribute to be duplicated, a new attribute is created in the nested segment.

   :Parameters:

       **attr_labels** : list of str
           Labels of the attributes to copy

       **uid** : str, optional
           Identifier of the annotator


   ..
       !! processed by numpydoc !!

   .. py:attribute:: attr_labels


   .. py:method:: run(source_segments: list[medkit.core.text.Segment], target_segments: list[medkit.core.text.Segment])

      
      Add attributes from source segments to all nested segments.

      The nested segments are chosen among the `target_segments` based on their spans.

      :Parameters:

          **source_segments** : list of Segment
              List of segments with attributes to copy

          **target_segments** : list of Segment
              List of target segments


      ..
          !! processed by numpydoc !!
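
      The copying behaviour described above might look like the following
      sketch. It is a simplified model under stated assumptions, not the
      library's code: ``Seg`` and ``duplicate_attrs`` are hypothetical names,
      attributes are modelled as a plain dict, and nesting is decided on a
      single span per segment.

      .. code-block:: python

         from __future__ import annotations

         from dataclasses import dataclass, field


         @dataclass
         class Seg:
             """Hypothetical stand-in for a segment: one span and a dict of attributes."""

             start: int
             end: int
             attrs: dict[str, object] = field(default_factory=dict)


         def duplicate_attrs(sources: list[Seg], targets: list[Seg], attr_labels: list[str]) -> None:
             """Copy the selected attributes from each source to the targets nested in it."""
             for src in sources:
                 for tgt in targets:
                     # A target is nested when its span lies fully inside the source span.
                     if src.start <= tgt.start and tgt.end <= src.end:
                         for label in attr_labels:
                             if label in src.attrs:
                                 tgt.attrs[label] = src.attrs[label]


         sentence = Seg(0, 30, {"is_negated": True, "family": False})
         entity = Seg(10, 18)
         outside = Seg(35, 42)

         duplicate_attrs([sentence], [entity, outside], attr_labels=["is_negated"])

      After the call, ``entity`` carries ``is_negated`` only, and ``outside``
      is left untouched because it is not nested in ``sentence``.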


   .. py:method:: _duplicate_attr(attr: medkit.core.Attribute, target: medkit.core.text.Segment)


.. py:class:: DocumentSplitter(segment_label: str, entity_labels: list[str] | None = None, attr_labels: list[str] | None = None, relation_labels: list[str] | None = None, name: str | None = None, uid: str | None = None)

   Bases: :py:obj:`medkit.core.Operation`


   Split text documents using their segments as a reference.

   The resulting 'mini-documents' contain the entities belonging to each
   segment along with their attributes.

   This operation can be used to create datasets from medkit text documents.

   :Parameters:

       **segment_label** : str
           Label of the segments to use as references for the splitter

       **entity_labels** : list of str, optional
           Labels of entities to be included in the mini-documents.
           If None, all entities from the document will be included.

       **attr_labels** : list of str, optional
           Labels of the attributes to be included into the new annotations.
           If None, all attributes will be included.

       **relation_labels** : list of str, optional
           Labels of relations to be included in the mini-documents.
           If None, all relations will be included.

       **name** : str, optional
           Name describing the splitter (defaults to the class name).

       **uid** : str, optional
           Identifier of the operation


   ..
       !! processed by numpydoc !!

   .. py:attribute:: segment_label


   .. py:attribute:: entity_labels
      :value: None


   .. py:attribute:: attr_labels
      :value: None


   .. py:attribute:: relation_labels
      :value: None


   .. py:method:: run(docs: list[medkit.core.text.TextDocument]) -> list[medkit.core.text.TextDocument]

      
      Split documents into mini-documents.


      :Parameters:

          **docs** : list of TextDocument
              List of text documents to split


      :Returns:

          list of TextDocument
              List of documents created from the selected segments


      ..
          !! processed by numpydoc !!
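
      The idea of splitting can be sketched without medkit as follows. This is
      an illustration only: ``split_by_segments`` is a hypothetical helper,
      segments and entities are modelled as plain offset tuples, and shifting
      entity offsets so they are relative to the mini-document text is an
      assumption of the sketch.

      .. code-block:: python

         from __future__ import annotations


         def split_by_segments(
             text: str,
             segments: list[tuple[int, int]],
             entities: list[tuple[str, int, int]],
         ) -> list[dict]:
             """Build one mini-document per segment, keeping the entities inside it."""
             mini_docs = []
             for seg_start, seg_end in segments:
                 kept = [
                     # Shift offsets so they are relative to the mini-document text (assumption).
                     (label, start - seg_start, end - seg_start)
                     for label, start, end in entities
                     if seg_start <= start and end <= seg_end
                 ]
                 mini_docs.append({"text": text[seg_start:seg_end], "entities": kept})
             return mini_docs


         text = "No fever. Asthma treated."
         segments = [(0, 9), (10, 25)]
         entities = [("symptom", 3, 8), ("disease", 10, 16)]
         mini_docs = split_by_segments(text, segments, entities)

      Each resulting dict holds the text of one segment and the entities that
      fall inside it, which mirrors the 'mini-documents' described for this
      class.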


   .. py:method:: _create_segment_doc(segment: medkit.core.text.Segment, entities: list[medkit.core.text.Entity], relations: list[medkit.core.text.Relation], doc_source: medkit.core.text.TextDocument) -> medkit.core.text.TextDocument

      
      Create a TextDocument from a segment and its entities.

      The original zone of the segment becomes the text of the document.

      :Parameters:

          **segment** : Segment
              Segment to use as reference for the new document

          **entities** : list of Entity
              Entities inside the segment

          **relations** : list of Relation
              Relations inside the segment

          **doc_source** : TextDocument
              Initial document from which annotations were extracted


      :Returns:

          TextDocument
              A new document containing the entities; its metadata includes the original span and the original metadata


      ..
          !! processed by numpydoc !!


   .. py:method:: _filter_attrs_from_ann(ann: medkit.core.text.TextAnnotation) -> list[medkit.core.Attribute]

      
      Filter attributes from an annotation using 'attr_labels'.


      ..
          !! processed by numpydoc !!


.. py:function:: filter_overlapping_entities(entities: list[medkit.core.text.Entity]) -> list[medkit.core.text.Entity]

   
   Filter a list of entities and remove overlaps.

   This function may be useful when creating data for named entity recognition,
   where each 'word' of the text can belong to at most one entity.
   When an overlap is detected, the longest entity is kept.

   :Parameters:

       **entities** : list of Entity
           Entities to filter


   :Returns:

       list of Entity
           Filtered entities


   ..
       !! processed by numpydoc !!
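
   One way to realise the "longest entity is preferred" rule is a greedy pass
   over spans sorted by length. The sketch below is an assumption about the
   strategy, not medkit's actual code: ``filter_longest`` is a hypothetical
   helper working on plain ``(start, end)`` tuples, and tie-breaking between
   spans of equal length may differ in the library.

   .. code-block:: python

      from __future__ import annotations


      def filter_longest(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
          """Drop overlapping (start, end) spans, keeping the longest of each clash."""
          kept: list[tuple[int, int]] = []
          # Visit longer spans first so they win against shorter overlapping ones.
          for start, end in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
              # Keep the span only if it is disjoint from every span already kept.
              if all(end <= k_start or k_end <= start for k_start, k_end in kept):
                  kept.append((start, end))
          return sorted(kept)


      spans = [(0, 5), (3, 12), (12, 15), (20, 24), (21, 23)]
      filtered = filter_longest(spans)

   In this sketch, spans that merely touch (one ends where the next starts) are
   not treated as overlapping.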