:py:mod:`medkit.text.ner._base_simstring_matcher`
=================================================

.. py:module:: medkit.text.ner._base_simstring_matcher


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherRule
   medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherNormalization
   medkit.text.ner._base_simstring_matcher.BaseSimstringMatcher


Functions
~~~~~~~~~

.. autoapisummary::

   medkit.text.ner._base_simstring_matcher.build_simstring_matcher_databases


.. py:class:: BaseSimstringMatcherRule


   Rule to use with :class:`~.BaseSimstringMatcher`.


   :Attributes:

       **term** : str
           Term to match using similarity-based fuzzy matching

       **label** : str
           Label to use for the entities created when a match is found

       **case_sensitive** : bool, default=False
           Whether to take case into account when looking for matches.

       **unicode_sensitive** : bool, default=False
           Whether to take non-ASCII characters into account when looking for
           matches. If `False`, ASCII-only versions of the rule term and input
           texts are used (non-ASCII chars replaced by closest ASCII chars).

       **normalizations** : list of BaseSimstringMatcherNormalization, optional
           Optional list of normalization attributes that should be attached to the
           entities created
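
   As a rough illustration of the two sensitivity flags, the sketch below
   normalizes a string before comparison. It uses only the standard library.
   ``normalize_for_matching`` is a hypothetical helper, not part of medkit, and
   its ASCII folding (dropping diacritics through ``unicodedata``) is an
   assumption that only approximates "closest ASCII chars".

   .. code-block:: python

      import unicodedata


      def normalize_for_matching(text, case_sensitive=False, unicode_sensitive=False):
          """Return the form of `text` that would be compared (hypothetical helper)."""
          if not unicode_sensitive:
              # Assumption: decompose accented chars, then drop anything non-ASCII.
              decomposed = unicodedata.normalize("NFKD", text)
              text = decomposed.encode("ascii", "ignore").decode("ascii")
          if not case_sensitive:
              text = text.lower()
          return text


      folded = normalize_for_matching("Diabète")
      case_kept = normalize_for_matching("Diabète", case_sensitive=True)
      accents_kept = normalize_for_matching("Diabète", unicode_sensitive=True)

   With both flags left at `False`, a rule term and an input text that differ
   only by case or accents end up with the same comparison string.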


   ..
       !! processed by numpydoc !!
   .. py:attribute:: term
      :type: str

      
   .. py:attribute:: label
      :type: str

      
   .. py:attribute:: case_sensitive
      :type: bool
      :value: False

      
   .. py:attribute:: unicode_sensitive
      :type: bool
      :value: False

      
   .. py:attribute:: normalizations
      :type: list[BaseSimstringMatcherNormalization]

      
.. py:class:: BaseSimstringMatcherNormalization


   Descriptor of normalization attributes to attach to entities created from a :class:`~.BaseSimstringMatcherRule`.


   :Attributes:

       **kb_name** : str
           The name of the knowledge base we are referencing. Ex: "umls"

       **kb_id** : int or str
           The id of the entity in the knowledge base, for instance a CUI

       **kb_version** : str, optional
           The version of the knowledge base we are referencing. Ex: "202AB"

       **term** : str, optional
           Normalized version of the entity text in the knowledge base


   ..
       !! processed by numpydoc !!
   .. py:attribute:: kb_name
      :type: str

      
   .. py:attribute:: kb_id
      :type: int | str

      
   .. py:attribute:: kb_version
      :type: str | None

      
   .. py:attribute:: term
      :type: str | None

      
   .. py:method:: to_attribute(score: float) -> medkit.core.text.EntityNormAttribute

      
      Create a normalization attribute based on the normalization descriptor.


      :Parameters:

          **score** : float
              Score of similarity between the normalized term and the entity text

      :Returns:

          EntityNormAttribute
              Normalization attribute to add to entity
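
      The descriptor-to-attribute step can be pictured with the stand-in
      classes below. ``NormalizationSketch`` and ``NormAttributeSketch`` are
      hypothetical dataclasses written for this illustration; they are not the
      medkit classes, and the field values are example data only.

      .. code-block:: python

         from dataclasses import dataclass
         from typing import Optional, Union


         @dataclass
         class NormAttributeSketch:
             """Stand-in for a normalization attribute (hypothetical)."""
             kb_name: str
             kb_id: Union[int, str]
             kb_version: Optional[str]
             term: Optional[str]
             score: float


         @dataclass
         class NormalizationSketch:
             """Stand-in for the normalization descriptor (hypothetical)."""
             kb_name: str
             kb_id: Union[int, str]
             kb_version: Optional[str] = None
             term: Optional[str] = None

             def to_attribute(self, score: float) -> NormAttributeSketch:
                 # Copy the descriptor fields and attach the similarity score.
                 return NormAttributeSketch(
                     kb_name=self.kb_name,
                     kb_id=self.kb_id,
                     kb_version=self.kb_version,
                     term=self.term,
                     score=score,
                 )


         norm = NormalizationSketch(kb_name="umls", kb_id="C0004096", term="Asthma")
         attr = norm.to_attribute(score=0.92)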


      ..
          !! processed by numpydoc !!


.. py:class:: BaseSimstringMatcher(simstring_db_file: pathlib.Path, rules_db_file: pathlib.Path, threshold: float = 0.9, min_length: int = 3, max_length: int = 50, similarity: typing_extensions.Literal['cosine', 'dice', 'jaccard', 'overlap'] = 'jaccard', spacy_tokenization_language: str | None = None, blacklist: list[str] | None = None, same_beginning: bool = False, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.text.NEROperation`

   
   Base class for entity matcher using the `simstring` fuzzy matching algorithm.


   :Parameters:

       **simstring_db_file** : Path
           Simstring database to use

       **rules_db_file** : Path
           Rules database (in Python `shelve` format) mapping matched terms to
           corresponding rules

       **threshold** : float, default=0.9
           Minimum similarity (between 0.0 and 1.0) between a rule term and the
           text of an entity matched on that rule.

       **min_length** : int, default=3
           Minimum number of chars in matched entities.

       **max_length** : int, default=50
           Maximum number of chars in matched entities.

       **similarity** : {"cosine", "dice", "jaccard", "overlap"}, default="jaccard"
           Similarity metric used to compare rule terms with candidate text.

       **spacy_tokenization_language** : str, optional
           2-letter code (ex: "fr", "en", etc) designating the language of the
           spacy model to use for tokenization. If provided, spacy will be used
           to tokenize input segments and filter out some tokens based on their
           part-of-speech tags, such as determiners, conjunctions and
           prepositions. If `None`, a simple regexp-based tokenization will be
           used, which is faster but might give more false positives.

       **blacklist** : list of str, optional
           List of exact terms to ignore.

       **same_beginning** : bool, default=False
           Ignore all matches that start with a different character than the
           term of the rule. This can be convenient to get rid of false
           positives on words that are very similar but have opposite meanings
           because of a prefix, for instance "activation" and
           "inactivation".

       **attrs_to_copy** : list of str, optional
           Labels of the attributes that should be copied from the source
           segment to the created entity. Useful for propagating context
           attributes (negation, antecedent, etc).

       **name** : str, optional
           Name describing the matcher (defaults to the class name).

       **uid** : str, optional
           Identifier of the matcher.
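
   To make `threshold`, `similarity` and `same_beginning` more concrete, here
   is a minimal standard-library sketch of the four metrics computed on sets
   of character trigrams. The helpers ``ngrams``, ``similarity_score`` and
   ``accept`` are hypothetical, and the unpadded trigram extraction is an
   assumption; the `simstring` library may extract features differently, so
   actual scores can differ.

   .. code-block:: python

      import math


      def ngrams(text, n=3):
          """Set of character n-grams of `text` (assumption: no padding)."""
          return {text[i:i + n] for i in range(len(text) - n + 1)}


      def similarity_score(a, b, measure="jaccard", n=3):
          """Set-based similarity between the n-grams of `a` and `b`."""
          x, y = ngrams(a, n), ngrams(b, n)
          if not x or not y:
              return 0.0
          common = len(x & y)
          if measure == "jaccard":
              return common / len(x | y)
          if measure == "dice":
              return 2 * common / (len(x) + len(y))
          if measure == "cosine":
              return common / math.sqrt(len(x) * len(y))
          if measure == "overlap":
              return common / min(len(x), len(y))
          raise ValueError(f"unknown measure: {measure}")


      def accept(term, candidate, threshold=0.9, measure="jaccard",
                 same_beginning=False):
          """Decide whether `candidate` counts as a match for `term`."""
          if same_beginning and candidate[:1] != term[:1]:
              # Reject candidates that start with a different character.
              return False
          return similarity_score(term, candidate, measure) >= threshold


      jaccard = similarity_score("activation", "inactivation", "jaccard")
      overlap = similarity_score("activation", "inactivation", "overlap")

   In this sketch, "activation" and "inactivation" share 8 of their 10
   distinct trigrams, so the Jaccard score is 0.8 and the pair is rejected at
   the default threshold of 0.9. The overlap score is 1.0, so with that metric
   only `same_beginning=True` would filter the pair out.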


   ..
       !! processed by numpydoc !!
   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Entity]

      
      Return entities (with optional normalization attributes) matched in `segments`.


      :Parameters:

          **segments** : list of Segment
              List of segments in which to look for matches

      :Returns:

          list of Entity
              Entities found in `segments` (with optional normalization
              attributes)


      ..
          !! processed by numpydoc !!

   .. py:method:: _find_matches_in_segment(segment: medkit.core.text.Segment, spacy_doc: Any | None) -> Iterator[medkit.core.text.Entity]

      
      Return an iterator over the entities matched in a segment.


      ..
          !! processed by numpydoc !!

   .. py:method:: _filter_overlapping_matches(matches: list[_Match]) -> list[_Match]
      :staticmethod:

      
      Find and remove overlapping matches.

      Among overlapping matches, keep the one with the best score; if scores
      are tied, keep the longest one.
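
      One plausible reading of this rule is a greedy selection, sketched
      below. ``Match`` and ``filter_overlapping`` are hypothetical stand-ins
      written for this illustration (the real ``_Match`` type is not
      documented here), and the greedy strategy itself is an assumption.

      .. code-block:: python

         from typing import NamedTuple


         class Match(NamedTuple):
             """Stand-in for a match: character span plus score (hypothetical)."""
             start: int
             end: int
             score: float


         def filter_overlapping(matches):
             """Keep the best-scoring, then longest, match of each overlap group."""
             # Rank by descending score, then by descending length.
             ranked = sorted(matches, key=lambda m: (-m.score, -(m.end - m.start)))
             kept = []
             for m in ranked:
                 # Keep a match only if it overlaps none of the matches kept so far.
                 if all(m.end <= k.start or m.start >= k.end for k in kept):
                     kept.append(m)
             return sorted(kept, key=lambda m: m.start)


         candidates = [
             Match(0, 5, 0.9),
             Match(3, 10, 0.95),
             Match(12, 15, 0.9),
             Match(12, 18, 0.9),
         ]
         kept = filter_overlapping(candidates)

      Here ``Match(3, 10, 0.95)`` wins over ``Match(0, 5, 0.9)`` on score, and
      ``Match(12, 18, 0.9)`` wins over ``Match(12, 15, 0.9)`` on length.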


      ..
          !! processed by numpydoc !!

   .. py:method:: _build_entity(segment: medkit.core.text.Segment, match: _Match) -> medkit.core.text.Entity

      
      Build an entity from a match in a segment.


      ..
          !! processed by numpydoc !!


.. py:function:: build_simstring_matcher_databases(simstring_db_file: pathlib.Path, rules_db_file: pathlib.Path, rules: Iterable[BaseSimstringMatcherRule])

   
   Generate the databases needed by :class:`BaseSimstringMatcher`.


   :Parameters:

       **simstring_db_file** : Path
           Database used by the fuzzy matching `simstring` library.

       **rules_db_file** : Path
           `shelve` database storing the mapping between terms to match and
           corresponding :class:`BaseSimstringMatcherRule` objects (one term to
           match may correspond to several rules)

       **rules** : iterable of BaseSimstringMatcherRule
           Rules to add to databases
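
   The term-to-rules mapping can be illustrated with the standard `shelve`
   module alone. In the sketch below, rules are plain dicts and
   ``build_rules_db`` / ``lookup_rules`` are hypothetical helpers; the layout
   medkit actually writes to the rules database is not specified here, so
   treat this as an assumption about its general shape.

   .. code-block:: python

      import shelve
      import tempfile
      from pathlib import Path


      def build_rules_db(rules_db_file, rules):
          """Store, for each term, the list of rules sharing that term."""
          with shelve.open(str(rules_db_file), flag="n") as db:
              for rule in rules:
                  term = rule["term"]
                  # Re-assign the list: without writeback, shelve does not
                  # persist in-place mutations of stored values.
                  db[term] = db.get(term, []) + [rule]


      def lookup_rules(rules_db_file, term):
          """Return every rule registered for `term` (empty list if none)."""
          with shelve.open(str(rules_db_file), flag="r") as db:
              return db.get(term, [])


      with tempfile.TemporaryDirectory() as tmp_dir:
          db_file = Path(tmp_dir) / "rules"
          build_rules_db(
              db_file,
              [
                  {"term": "asthma", "label": "problem"},
                  {"term": "asthma", "label": "respiratory"},
                  {"term": "diabetes", "label": "problem"},
              ],
          )
          asthma_rules = lookup_rules(db_file, "asthma")
          unknown_rules = lookup_rules(db_file, "fever")

   Storing a list per term is what lets one matched term map back to several
   rules, each of which may carry a different label.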


   ..
       !! processed by numpydoc !!