medkit.text.preprocessing
=========================

.. py:module:: medkit.text.preprocessing


Submodules
----------

.. toctree::
   :maxdepth: 1

   /reference/api/medkit/text/preprocessing/char_replacer/index
   /reference/api/medkit/text/preprocessing/char_rules/index
   /reference/api/medkit/text/preprocessing/duplicate_finder/index
   /reference/api/medkit/text/preprocessing/eds_cleaner/index
   /reference/api/medkit/text/preprocessing/regexp_replacer/index


Attributes
----------

.. autoapisummary::

   medkit.text.preprocessing.ALL_CHAR_RULES
   medkit.text.preprocessing.DOT_RULES
   medkit.text.preprocessing.FRACTION_RULES
   medkit.text.preprocessing.LIGATURE_RULES
   medkit.text.preprocessing.QUOTATION_RULES
   medkit.text.preprocessing.SIGN_RULES
   medkit.text.preprocessing.SPACE_RULES


Classes
-------

.. autoapisummary::

   medkit.text.preprocessing.CharReplacer
   medkit.text.preprocessing.DuplicateFinder
   medkit.text.preprocessing.DuplicationAttribute
   medkit.text.preprocessing.EDSCleaner
   medkit.text.preprocessing.RegexpReplacer


Package Contents
----------------

.. py:class:: CharReplacer(output_label: str, rules: list[tuple[str, str]] | None = None, name: str | None = None, uid: str | None = None)

   Bases: :py:obj:`medkit.core.operation.Operation`


   Generic character replacer to be used as pre-processing module.

   This operation is non-destructive: it replaces selected single characters with
   strings of any length and keeps track of every change by creating a new
   text-bound annotation whose spans record how the input text was modified.

   :Parameters:

       **output_label** : str
           The output label of the created annotations

       **rules** : list of tuple, optional
           The list of replacement rules. Default: ALL_CHAR_RULES

       **name** : str, optional
           Name describing the pre-processing module (defaults to the class name)

       **uid** : str, optional
           Identifier of the pre-processing module
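
   The span tracking described above can be pictured with a small plain-Python
   sketch. It does not use medkit and is not the library's implementation: the
   helper ``replace_chars`` and the ``origins`` list are hypothetical names
   introduced only for illustration, and the rule pairs are copied from
   ``LIGATURE_RULES`` and ``DOT_RULES``.

   .. code-block:: python

      # Rule pairs copied from LIGATURE_RULES and DOT_RULES (illustration only).
      RULES = [("Æ", "AE"), ("æ", "ae"), ("Œ", "OE"), ("œ", "oe"),
               ("…", "..."), ("⋯", "...")]


      def replace_chars(text, rules):
          # Hypothetical helper: rewrite `text` and, for every output character,
          # remember the index of the input character it was produced from.
          mapping = dict(rules)
          pieces = []
          origins = []
          for index, char in enumerate(text):
              new = mapping.get(char, char)
              pieces.append(new)
              origins.extend([index] * len(new))
          return "".join(pieces), origins


      new_text, origins = replace_chars("Le cœur…", RULES)
      print(new_text)  # Le coeur...
      print(origins)   # [0, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7]

   Because each output character keeps the offset it came from, anything found
   in the rewritten text can still be located in the original text, which is
   the point of a non-destructive replacement.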


   ..
       !! processed by numpydoc !!

   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]

      
      Run the module on a list of input segments and return a new list of segments.


      :Parameters:

          **segments** : list of Segment
              List of segments to process

      :Returns:

          list of Segment
              List of new segments


      ..
          !! processed by numpydoc !!


   .. py:method:: _process_segment_text(segment: medkit.core.text.Segment)


.. py:data:: ALL_CHAR_RULES

.. py:data:: DOT_RULES
   :value: [('…', '...'), ('⋯', '...')]


.. py:data:: FRACTION_RULES
   :value: [('¼', '1/4'), ('½', '1/2'), ('¾', '3/4'), ('⅐', '1/7'), ('⅑', '1/9'), ('⅒', '1/10'), ('⅓',...


.. py:data:: LIGATURE_RULES
   :value: [('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'), ('œ', 'oe')]


.. py:data:: QUOTATION_RULES
   :value: [('»', '"'), ('«', '"'), ('“', '"'), ('”', '"'), ('„', '"'), ('‟', '"'), ('‹', '"'), ('›', '"'),...


.. py:data:: SIGN_RULES
   :value: [('©', ''), ('®', ''), ('™', '')]


.. py:data:: SPACE_RULES
   :value: [('\xa0', ' '), ('\u1680', ' '), ('\u2002', ' '), ('\u2003', ' '), ('\u2004', ' '), ('\u2005', '...
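
Each rule list is a plain list of ``(character, replacement)`` tuples, so lists
can be concatenated to build a custom ``rules`` argument. The sketch below is
plain Python and does not call medkit: it copies the three lists whose values
are shown in full above and applies them with ``str.translate``, which is only
a stand-in for ``CharReplacer`` (it does not track spans). The name
``custom_rules`` is hypothetical.

.. code-block:: python

   # Values copied from the data entries above (illustration only).
   DOT_RULES = [("…", "..."), ("⋯", "...")]
   LIGATURE_RULES = [("Æ", "AE"), ("æ", "ae"), ("Œ", "OE"), ("œ", "oe")]
   SIGN_RULES = [("©", ""), ("®", ""), ("™", "")]

   # Rule lists are ordinary lists, so they can be concatenated.
   custom_rules = DOT_RULES + LIGATURE_RULES + SIGN_RULES

   # str.maketrans accepts single-character keys mapped to strings of any length.
   table = str.maketrans(dict(custom_rules))
   cleaned = "œsophagus™…".translate(table)
   print(cleaned)  # oesophagus...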


.. py:class:: DuplicateFinder(output_label: str, segments_to_output: typing_extensions.Literal[dup, nondup, both] = 'dup', min_duplicate_length: int = 5, fingerprint_type: typing_extensions.Literal[char, word] = 'word', fingerprint_length: int = 2, date_metadata_key: str | None = None, case_sensitive: bool = True, allow_multiline: bool = True, orf: int = 1)

   Bases: :py:obj:`medkit.core.Operation`


   Detect duplicated chunks of text across a collection of text documents.

   When a duplicated chunk of text is found, a segment is created on the newest
   document covering the span that is duplicated. A
   :class:`~.DuplicationAttribute` having `"is_duplicate"` as label and `True`
   as value is attached to the segment. It can later be propagated to the
   entities created from the duplicate segments.

   The attribute also holds the id of the source document from which the text
   was copied, the spans of the text in the source document, and optionally the
   date of the source document if provided.

   Optionally, segments can also be created for non-duplicate zones to make it
   easier to process only those parts of the documents. For these segments, the
   attribute value is `False` and the source, spans and date fields are `None`.
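
   How non-duplicate zones relate to duplicate ones can be sketched in plain
   Python: they are the parts of the text not covered by any duplicate range.
   The helper ``non_duplicate_ranges`` is hypothetical and not part of medkit;
   the actual operation may differ, for instance in how it treats
   whitespace-only zones.

   .. code-block:: python

      def non_duplicate_ranges(text_length, duplicate_ranges):
          # Hypothetical helper: return the (start, end) ranges of the text
          # that are not covered by any duplicate range.
          gaps = []
          cursor = 0
          for start, end in sorted(duplicate_ranges):
              if start > cursor:
                  gaps.append((cursor, start))
              cursor = max(cursor, end)
          if cursor < text_length:
              gaps.append((cursor, text_length))
          return gaps


      text = "Follow-up visit. Patient admitted for chest pain."
      duplicate_ranges = [(17, 49)]  # assumed to have been detected already
      nondup = non_duplicate_ranges(len(text), duplicate_ranges)
      print(nondup)  # [(0, 17)]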

   NB: better performance may be achieved by installing the `ncls` Python
   package, which will then be used by the `duptextfinder` library.

   :Parameters:

       **output_label** : str
           Label of created segments

       **segments_to_output** : str, default="dup"
           Type of segments to create: only duplicate segments (`"dup"`), only
           non-duplicate segments (`"nondup"`), or both (`"both"`)

       **min_duplicate_length** : int, default=5
           Minimum length of duplicated segments, in characters (shorter
           segments will be discarded)

       **fingerprint_type** : str, default="word"
           Base unit to use for fingerprinting (either `"char"` or `"word"`)

       **fingerprint_length** : int, default=2
           Number of chars or words in each fingerprint. If `fingerprint_type`
           is set to `"char"`, this should be the same value as
           `min_duplicate_length`. If `fingerprint_type` is set to `"word"`,
           this should be around the average word size multiplied by
           `min_duplicate_length`

       **date_metadata_key** : str, optional
           Key used to retrieve the date of each document from its metadata
           dict. When provided, the date determines which document is the
           source of a duplicate (the older one) and which is the recipient
           (the newer one). If None, the order of the documents in the
           collection is used.

       **case_sensitive** : bool, default=True
           Whether duplication detection should be case-sensitive or not

       **allow_multiline** : bool, default=True
           Whether detected duplicates can span multiple lines, or
           each line should be handled separately

       **orf** : int, default=1
           Step size when building fingerprints; see the `duptextfinder`
           documentation
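
   As a rough picture of word-based fingerprinting with a fingerprint length
   of 2, the plain-Python sketch below indexes every pair of consecutive words
   and looks for pairs shared by two texts. It is a simplified illustration,
   not the algorithm used by `duptextfinder`; ``word_fingerprints`` and the
   sample texts are hypothetical.

   .. code-block:: python

      import re


      def word_fingerprints(text, length=2):
          # Map each run of `length` consecutive words to the character
          # offsets at which that run starts in `text`.
          words = [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]
          index = {}
          for i in range(len(words) - length + 1):
              key = " ".join(word for word, _ in words[i:i + length])
              index.setdefault(key, []).append(words[i][1])
          return index


      older = "Patient admitted for chest pain. No fever."
      newer = "Follow-up visit. Patient admitted for chest pain."

      older_index = word_fingerprints(older)
      newer_index = word_fingerprints(newer)

      # Fingerprints present in both texts point at candidate duplicate zones.
      shared = sorted(set(older_index) & set(newer_index))
      print(shared)
      print(newer_index["Patient admitted"])  # start offset in the newer text

   A real detector would then extend each shared fingerprint to the longest
   matching span and discard spans shorter than `min_duplicate_length`.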


   ..
       !! processed by numpydoc !!

   .. py:attribute:: _NON_EMPTY_REGEXP


   .. py:method:: run(collections: list[medkit.core.Collection])

      
      Find duplicates in each collection of documents.

      For each duplicate found, a :class:`~.core.text.Segment` object with a
      :class:`~.DuplicationAttribute` will be created and attached to the document that
      is the recipient of the duplication (i.e., not the source document).


      ..
          !! processed by numpydoc !!


   .. py:method:: _find_duplicate_in_docs(docs: list[medkit.core.text.TextDocument])

      
      Find duplicates among a set of documents.


      ..
          !! processed by numpydoc !!


   .. py:method:: _find_duplicates_in_doc(doc: medkit.core.text.TextDocument, duplicate_finder: duptextfinder.DuplicateFinder, docs_by_id: dict[str, medkit.core.text.TextDocument])

      
      Find duplicates between a document and previously processed documents.


      :Parameters:

          **doc** : TextDocument
              Document in which to look for duplicates

          **duplicate_finder** : DuplicateFinder
              Duplicate finder to use, that has already processed previous documents if any

          **docs_by_id** : dict of str to TextDocument
              Previously processed documents, by id


      ..
          !! processed by numpydoc !!


   .. py:method:: _create_nondup_segment(target_segment, range_)

      
      Create a segment representing a non-duplicated zone.


      ..
          !! processed by numpydoc !!


   .. py:method:: _create_duplicate_segment(target_segment, target_range, source_doc, source_range)

      
      Create a segment representing a duplicated zone.


      ..
          !! processed by numpydoc !!


.. py:class:: DuplicationAttribute(value: bool, source_doc_id: str | None = None, source_spans: list[medkit.core.text.AnySpan] | None = None, source_doc_date: Any | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)

   Bases: :py:obj:`medkit.core.Attribute`


   Attribute indicating if some text is a duplicate of some other text in another document.


   :Attributes:

       **uid** : str
           Identifier of the attribute

       **label** : str
           The attribute label, always set to :attr:`DuplicationAttribute.LABEL`

       **value** : bool
           `True` if the segment or entity to which the attribute belongs is a
           duplicate of part of another document, `False` otherwise.

       **source_doc_id** : str, optional
           Identifier of the document from which the text was copied

       **source_spans** : list of AnySpan, optional
           Spans of the duplicated text in the source document

       **source_doc_date** : Any, optional
           Date of the source document, if known


   ..
       !! processed by numpydoc !!

   .. py:attribute:: source_doc_id
      :type:  str | None


   .. py:attribute:: source_spans
      :type:  list[medkit.core.text.AnySpan] | None


   .. py:attribute:: source_doc_date
      :type:  Any | None


   .. py:attribute:: LABEL
      :type:  ClassVar[str]
      :value: 'is_duplicate'


      Label used for all duplication attributes


      ..
          !! processed by numpydoc !!


   .. py:method:: to_dict() -> dict[str, Any]


   .. py:method:: from_dict(attr_dict: dict[str, Any]) -> typing_extensions.Self
      :classmethod:


      Create an Attribute from a dict.


      :Parameters:

          **attr_dict** : dict of str to Any
              A dictionary from a serialized Attribute as generated by to_dict()


      ..
          !! processed by numpydoc !!


.. py:class:: EDSCleaner(output_label: str = _DEFAULT_LABEL, keep_endlines: bool = False, handle_parentheses_eds: bool = True, handle_points_eds: bool = True, uid: str | None = None)

   Bases: :py:obj:`medkit.core.Operation`


   EDS pre-processing annotation module.

   This operation is non-destructive: it removes or cleans selected periods and
   newline characters and keeps track of every change by creating a new
   text-bound annotation whose spans record how the input text was modified.

   :Parameters:

       **output_label** : str, optional
           The output label of the created annotations.

       **keep_endlines** : bool, default=False
           If True, replace multiple endlines with `.\\n`.
           If False (default), replace multiple endlines with a period followed by whitespace (`.\\s`).

       **handle_parentheses_eds** : bool, default=True
           If True (default), modify the text near parentheses or keywords according to
           predefined rules for French documents.
           If False, the text near parentheses or keywords is not modified.

       **handle_points_eds** : bool, default=True
           Whether to modify periods near predefined keywords for French documents.
           If True (default), modify the periods near keywords.
           If False, the periods near keywords are not modified.

       **uid** : str, optional
           Identifier of the pre-processing module
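
   The effect of `keep_endlines` can be pictured with the plain-Python sketch
   below. It does not use medkit, and the regular expression is an assumption
   made for illustration (a run of two or more newlines counts as "multiple
   endlines"); the rules actually applied by the operation may be more
   involved. ``collapse_endlines`` is a hypothetical helper.

   .. code-block:: python

      import re


      def collapse_endlines(text, keep_endlines=False):
          # Assumed pattern: two or more consecutive newlines.
          # The run becomes a period plus a newline, or a period plus a space.
          replacement = ".\n" if keep_endlines else ". "
          return re.sub(r"\n{2,}", replacement, text)


      raw = "Reason for visit\n\nChest pain"
      as_one_line = collapse_endlines(raw)
      with_endline = collapse_endlines(raw, keep_endlines=True)
      print(repr(as_one_line))   # 'Reason for visit. Chest pain'
      print(repr(with_endline))  # 'Reason for visit.\nChest pain'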


   ..
       !! processed by numpydoc !!

   .. py:attribute:: _DEFAULT_LABEL
      :value: 'clean_text'


   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]

      
      Run the module on a list of input segments and return a new list of segments.


      :Parameters:

          **segments** : list of Segment
              List of segments to normalize

      :Returns:

          list of Segment
              List of cleaned segments.


      ..
          !! processed by numpydoc !!


   .. py:method:: _clean_segment_text(segment: medkit.core.text.Segment)

      
      Clean up a segment non-destructively, removing periods between numbers and uppercase letters.

      Then remove multiple whitespace or newline characters.
      Finally, modify parentheses or periods after keywords if necessary.


      ..
          !! processed by numpydoc !!


.. py:class:: RegexpReplacer(output_label: str, rules: list[tuple[str, str]] | None = None, name: str | None = None, uid: str | None = None)

   Bases: :py:obj:`medkit.core.operation.Operation`


   Generic pattern replacer to be used as pre-processing module.

   This operation is non-destructive: it replaces every match of a regular
   expression pattern with new text and keeps track of every change by creating a new
   text-bound annotation whose spans record how the input text was modified.

   :Parameters:

       **output_label** : str
           The output label of the created annotations

       **rules** : list of tuple, optional
           The list of replacement rules [(pattern_to_replace, new_text)]

       **name** : str, optional
           Name describing the pre-processing module (defaults to the class name)

       **uid** : str, optional
           Identifier of the pre-processing module
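
   The shape of the `rules` argument can be illustrated with a plain-Python
   sketch built on the standard ``re`` module. It is not medkit's
   implementation: ``apply_rules``, the ``log`` list and the two sample rules
   are hypothetical, and the sketch assumes rules are applied one after
   another in list order.

   .. code-block:: python

      import re


      def apply_rules(text, rules):
          # Hypothetical helper: apply each (pattern, new_text) rule in order
          # and log every replaced range. Offsets logged for a rule refer to
          # the text as it stood just before that rule was applied.
          log = []
          for pattern, new_text in rules:
              regex = re.compile(pattern)
              for match in regex.finditer(text):
                  log.append((match.start(), match.end(), match.group(), new_text))
              # Note: re.sub interprets backslash escapes in `new_text`.
              text = regex.sub(new_text, text)
          return text, log


      rules = [(r"\bn°\s*", "number "), (r"\s{2,}", " ")]
      cleaned, log = apply_rules("Room  n° 12", rules)
      print(cleaned)  # Room number 12
      print(log)

   Keeping the replaced ranges alongside the new text is what lets a
   non-destructive replacer relate the rewritten text back to its input.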


   ..
       !! processed by numpydoc !!

   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]

      
      Run the module on a list of input segments and return a new list of segments.


      :Parameters:

          **segments** : list of Segment
              List of segments to normalize

      :Returns:

          list of Segment
              List of normalized segments


      ..
          !! processed by numpydoc !!


   .. py:method:: _normalize_segment_text(segment: medkit.core.text.Segment)