medkit.text.segmentation
========================

.. py:module:: medkit.text.segmentation


Submodules
----------

.. toctree::
   :maxdepth: 1

   /reference/api/medkit/text/segmentation/rush_sentence_tokenizer/index
   /reference/api/medkit/text/segmentation/section_tokenizer/index
   /reference/api/medkit/text/segmentation/sentence_tokenizer/index
   /reference/api/medkit/text/segmentation/syntagma_tokenizer/index


Classes
-------

.. autoapisummary::

   medkit.text.segmentation.SectionModificationRule
   medkit.text.segmentation.SectionTokenizer
   medkit.text.segmentation.SentenceTokenizer
   medkit.text.segmentation.SyntagmaTokenizer


Package Contents
----------------

.. py:class:: SectionModificationRule

   .. py:attribute:: section_name
      :type:  str


   .. py:attribute:: new_section_name
      :type:  str


   .. py:attribute:: other_sections
      :type:  list[str]


   .. py:attribute:: order
      :type:  typing_extensions.Literal[BEFORE, AFTER]

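The four attributes above describe a conditional rename of a section. The sketch below is not medkit's implementation: ``RenameRule`` and ``apply_rules`` are hypothetical stand-ins, and it assumes that ``BEFORE`` means the section occurs earlier in the document than one of ``other_sections`` (and ``AFTER`` that it occurs later).

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class RenameRule:
    """Hypothetical stand-in mirroring the four documented attributes."""

    section_name: str
    new_section_name: str
    other_sections: List[str]
    order: Literal["BEFORE", "AFTER"]


def apply_rules(section_names: List[str], rules: List[RenameRule]) -> List[str]:
    """Return a copy of section_names with matching sections renamed."""
    renamed = list(section_names)
    for rule in rules:
        for i, name in enumerate(section_names):
            if name != rule.section_name:
                continue
            # Assumption: BEFORE inspects the sections that come later in the
            # document, AFTER inspects the sections that came earlier.
            if rule.order == "BEFORE":
                neighbours = section_names[i + 1:]
            else:
                neighbours = section_names[:i]
            if any(other in neighbours for other in rule.other_sections):
                renamed[i] = rule.new_section_name
    return renamed


# Illustrative section names, not taken from medkit's default definition file.
rule = RenameRule(
    section_name="treatment",
    new_section_name="discharge_treatment",
    other_sections=["history"],
    order="AFTER",
)
result = apply_rules(["history", "treatment", "follow_up"], [rule])
```

Under these assumptions, ``treatment`` is renamed only when it follows a ``history`` section, and it is left untouched otherwise.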

.. py:class:: SectionTokenizer(section_dict: dict[str, list[str]] | None = None, output_label: str = _DEFAULT_LABEL, section_rules: Iterable[SectionModificationRule] = (), strip_chars: str = _DEFAULT_STRIP_CHARS, uid: str | None = None)

   Bases: :py:obj:`medkit.core.text.SegmentationOperation`


   
   Section segmentation annotator based on keyword rules.


   :Parameters:

       **section_dict** : dict of str to list of str, optional
           Dictionary containing the section name as key and the list of mappings as
           value. If None, the content of default_section_definition.yml will be used.

       **output_label** : str, optional
           Segment label to use for annotation output.

       **section_rules** : iterable of SectionModificationRule, optional
           Rules for renaming a section according to its position relative to the
           other sections. If section_dict is None, the rules from
           default_section_definition.yml will be used.

       **strip_chars** : str, optional
           Characters to strip from the beginning of each returned segment.

       **uid** : str, optional
           Identifier of the tokenizer.














   ..
       !! processed by numpydoc !!

   .. py:attribute:: _DEFAULT_LABEL
      :type:  str
      :value: 'section'



   .. py:attribute:: _DEFAULT_STRIP_CHARS
      :type:  str
      :value: Multiline-String

      .. raw:: html

         <details><summary>Show Value</summary>

      .. code-block:: python

         """.;,?! 
         
         	"""

      .. raw:: html

         </details>




   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]

      
      Return sections detected in `segments`.

      Each section is a segment with an attached attribute
      (label: <same as self.output_label>, value: <the name of the section>).

      :Parameters:

          **segments** : list of Segment
              List of segments in which to look for sections

      :Returns:

          list of Segment
              Section segments found in `segments`













      ..
          !! processed by numpydoc !!


   .. py:method:: _find_sections_in_segment(segment: medkit.core.text.Segment)


   .. py:method:: _get_sections_to_rename(match: list[tuple])


   .. py:method:: get_example()
      :classmethod:



   .. py:method:: load_section_definition(filepath: pathlib.Path, encoding: str | None = None) -> tuple[dict[str, list[str]], tuple[SectionModificationRule, Ellipsis]]
      :staticmethod:


      
      Load the sections definition stored in a yml file.


      :Parameters:

          **filepath** : Path
              Path to a yml file containing the sections (name + mappings) and rules

          **encoding** : str, optional
              Encoding of the file to open

      :Returns:

          tuple
              Tuple containing:

              - the dictionary where the key is the section name and the value is the
                list of all equivalent strings;
              - the section modification rules, which rename some sections according
                to their order relative to other sections.













      ..
          !! processed by numpydoc !!


   .. py:method:: save_section_definition(section_dict: dict[str, list[str]], section_rules: Iterable[SectionModificationRule], filepath: pathlib.Path, encoding: str | None = None)
      :staticmethod:


      
      Save the section definition to a yml file.


      :Parameters:

          **section_dict** : dict of str to list of str
              Dictionary containing the section name as key and the list of mappings
              as value (see the content of default_section_definition.yml for an example)

          **section_rules** : iterable of SectionModificationRule
              Rules for renaming a section according to its position relative to the
              other sections.

          **filepath** : Path
              Path to the file to save

          **encoding** : str, optional
              File encoding














      ..
          !! processed by numpydoc !!

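As a rough illustration of keyword-based section detection, the sketch below splits a text at each occurrence of a known keyword and labels the resulting span with the section name that the keyword maps to. It uses only the standard library and is not medkit's implementation: ``find_sections`` is a hypothetical helper, the example ``section_dict`` is invented, and text before the first keyword is simply ignored here.

```python
import re
from typing import Dict, List, Tuple


def find_sections(
    text: str, section_dict: Dict[str, List[str]]
) -> List[Tuple[str, int, int]]:
    """Return (section_name, start, end) for each keyword-delimited span."""
    # Invert the mapping: each keyword points to its section name.
    keyword_to_name = {
        keyword: name
        for name, keywords in section_dict.items()
        for keyword in keywords
    }
    if not keyword_to_name:
        return []
    # Try longer keywords first so that they win over their own prefixes.
    alternatives = sorted(keyword_to_name, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(kw) for kw in alternatives))
    matches = list(pattern.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # A section runs from its keyword to the next keyword (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((keyword_to_name[match.group()], match.start(), end))
    return sections


# Invented definition, in the same shape as the section_dict parameter.
section_dict = {
    "history": ["History", "Past medical history"],
    "treatment": ["Treatment"],
}
text = "History: asthma since childhood.\nTreatment: salbutamol as needed."
sections = find_sections(text, section_dict)
```

Each returned tuple plays the role of a section segment: the name corresponds to the attribute value described for ``run``, and the offsets delimit the span in the source text.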

.. py:class:: SentenceTokenizer(output_label: str = _DEFAULT_LABEL, punct_chars: tuple[str, Ellipsis] = _DEFAULT_PUNCT_CHARS, keep_punct: bool = False, split_on_newlines: bool = True, attrs_to_copy: list[str] | None = None, uid: str | None = None)

   Bases: :py:obj:`medkit.core.text.SegmentationOperation`


   
   Sentence segmentation annotator based on end punctuation rules.
















   ..
       !! processed by numpydoc !!

   .. py:attribute:: _DEFAULT_LABEL
      :value: 'sentence'



   .. py:attribute:: _DEFAULT_PUNCT_CHARS
      :value: ('.', ';', '?', '!')



   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]

      
      Return sentences detected in `segments`.


      :Parameters:

          **segments** : list of Segment
              List of segments in which to look for sentences

      :Returns:

          list of Segment
              Sentence segments found in `segments`













      ..
          !! processed by numpydoc !!


   .. py:method:: _find_sentences_in_segment(segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Segment]


   .. py:method:: _split_text(text: str, pattern: re.Pattern, keep_separator: bool) -> Iterator[tuple[int, int]]
      :staticmethod:



   .. py:method:: _build_sentence(source_segment: medkit.core.text.Segment, range_: tuple[int, int]) -> medkit.core.text.Segment

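To make the effect of ``punct_chars``, ``keep_punct`` and ``split_on_newlines`` more concrete, here is a minimal standard-library sketch of punctuation-based splitting that yields character ranges. It is not medkit's implementation: ``split_sentences`` is a hypothetical helper, and trimming surrounding whitespace from each range is an assumption made for readability.

```python
import re
from typing import Iterator, Tuple


def split_sentences(
    text: str,
    punct_chars: Tuple[str, ...] = (".", ";", "?", "!"),
    keep_punct: bool = False,
    split_on_newlines: bool = True,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) character ranges of the sentences found in text."""
    separators = [re.escape(char) for char in punct_chars]
    if split_on_newlines:
        separators.append(r"\n")
    # A run of consecutive separators counts as a single sentence boundary.
    regex = "(?:" + "|".join(separators) + ")+"
    matches = re.finditer(regex, text) if separators else ()

    def trim(start: int, end: int) -> Tuple[int, int]:
        # Assumption: surrounding whitespace is not part of the sentence.
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        return start, end

    cursor = 0
    for match in matches:
        end = match.end() if keep_punct else match.start()
        start, end = trim(cursor, end)
        if start < end:
            yield start, end
        cursor = match.end()
    # Whatever follows the last separator is the final sentence.
    start, end = trim(cursor, len(text))
    if start < end:
        yield start, end


text = "No fever. Cough for 3 days!\nChest X-ray normal"
without_punct = [text[s:e] for s, e in split_sentences(text)]
with_punct = [text[s:e] for s, e in split_sentences(text, keep_punct=True)]
```

Working with ranges rather than strings keeps each sentence tied to its position in the source segment, which is what a segment-producing operation needs.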

.. py:class:: SyntagmaTokenizer(separators: tuple[str, Ellipsis] | None = None, output_label: str = _DEFAULT_LABEL, strip_chars: str = _DEFAULT_STRIP_CHARS, attrs_to_copy: list[str] | None = None, uid: str | None = None)

   Bases: :py:obj:`medkit.core.text.SegmentationOperation`


   
   Syntagma segmentation annotator based on provided separators.
















   ..
       !! processed by numpydoc !!

   .. py:attribute:: _DEFAULT_LABEL
      :value: 'syntagma'



   .. py:attribute:: _DEFAULT_STRIP_CHARS
      :value: Multiline-String

      .. raw:: html

         <details><summary>Show Value</summary>

      .. code-block:: python

         """.;,?! 
         
         	"""

      .. raw:: html

         </details>




   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]

      
      Return syntagmas detected in `segments`.


      :Parameters:

          **segments** : list of Segment
              List of segments in which to look for syntagmas

      :Returns:

          list of Segment
              Syntagma segments found in `segments`













      ..
          !! processed by numpydoc !!


   .. py:method:: _find_syntagmas_in_segment(segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Segment]


   .. py:method:: get_example()
      :classmethod:



   .. py:method:: load_syntagma_definition(filepath: pathlib.Path, encoding: str | None = None) -> tuple[str, Ellipsis]
      :staticmethod:


      
      Load the syntagma definition stored in a yml file.


      :Parameters:

          **filepath** : Path
              Path to a yml file containing the syntagma separators

          **encoding** : str, optional
              Encoding of the file to open

      :Returns:

          tuple of str
              Tuple containing the separators













      ..
          !! processed by numpydoc !!


   .. py:method:: save_syntagma_definition(syntagma_seps: tuple[str, Ellipsis], filepath: pathlib.Path, encoding: str | None = None)
      :staticmethod:


      
      Save the syntagma definition to a yml file.


      :Parameters:

          **syntagma_seps** : tuple of str
              The tuple of regular expressions corresponding to separators

          **filepath** : Path
              The path of the file to save

          **encoding** : str, optional
              The encoding of the file. Default: None














      ..
          !! processed by numpydoc !!

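The separators of ``SyntagmaTokenizer`` are documented as regular expressions. The sketch below shows one way separator-based splitting could work, using only the standard library. It is not medkit's implementation: ``split_syntagmas`` is a hypothetical helper, the separators are invented, each separator match is assumed to open a new syntagma, and stripping is applied to both ends of each piece for simplicity.

```python
import re
from typing import List, Tuple


def split_syntagmas(
    text: str,
    separators: Tuple[str, ...],
    strip_chars: str = ".;,?! \n\r\t",
) -> List[str]:
    """Split text into pieces, starting a new piece at each separator match."""
    starts = []
    if separators:
        pattern = re.compile("|".join(f"(?:{sep})" for sep in separators))
        # Assumption: the separator itself belongs to the piece it opens.
        starts = [m.start() for m in pattern.finditer(text) if m.start() > 0]
    boundaries = [0] + starts + [len(text)]
    pieces = []
    for start, end in zip(boundaries, boundaries[1:]):
        piece = text[start:end].strip(strip_chars)
        if piece:
            pieces.append(piece)
    return pieces


# Invented separators, in the same shape as the separators parameter.
separators = (r"\bbut\b", r"\band\b")
text = "The patient has a cough but no fever, and sleeps well."
syntagmas = split_syntagmas(text, separators)
```

With these assumptions, each conjunction starts a new syntagma, and leftover punctuation and whitespace are removed from the edges of each piece.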

