:py:mod:`medkit.text.ner.umls_matcher`
======================================

.. py:module:: medkit.text.ner.umls_matcher


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medkit.text.ner.umls_matcher.UMLSMatcher


.. py:class:: UMLSMatcher(umls_dir: str | pathlib.Path, cache_dir: str | pathlib.Path, language: str, threshold: float = 0.9, min_length: int = 3, max_length: int = 50, similarity: typing_extensions.Literal['cosine', 'dice', 'jaccard', 'overlap'] = 'jaccard', lowercase: bool = True, normalize_unicode: bool = False, spacy_tokenization: bool = False, semgroups: Sequence[str] = ('ANAT', 'CHEM', 'DEVI', 'DISO', 'PHYS', 'PROC'), blacklist: list[str] | None = None, same_beginning: bool = False, output_labels_by_semgroup: str | dict[str, str] | None = None, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.text.ner._base_simstring_matcher.BaseSimstringMatcher`

   
   Entity annotator identifying UMLS concepts using the `simstring`_ fuzzy matching algorithm.

   This operation is heavily inspired by the `QuickUMLS` library
   (https://github.com/Georgetown-IR-Lab/QuickUMLS).

   By default, only terms belonging to the `ANAT` (anatomy), `CHEM` (Chemicals &
   Drugs), `DEVI` (Devices), `DISO` (Disorders), `PHYS` (Physiology) and `PROC`
   (Procedures) semgroups will be considered. This behavior can be changed with
   the `semgroups` parameter.

   Note that setting `spacy_tokenization` to `True` might reduce the
   number of false positives. This requires the `spacy` optional dependency,
   which can be installed with `pip install medkit-lib[spacy]`.

   .. _simstring: http://chokkan.org/software/simstring/

   :Parameters:

       **umls_dir** : str or Path
           Path to the UMLS directory containing the MRCONSO.RRF and
           MRSTY.RRF files.

       **cache_dir** : str or Path
           Path to the directory in which the UMLS database will be cached.
           If it doesn't exist yet, the database will be automatically
           generated (which can take a long time) and stored there, ready to be
           reused on further instantiations. If it already exists, a check will
           be done to make sure the params used when the database was generated
           are consistent with the params of the current instance. If you want
           to rebuild the database with new params using the same cache dir,
           you will have to manually delete it first.

       **language** : str
           Language to consider as found in the MRCONSO.RRF file. Example:
           `"FRE"`. Will trigger a regeneration of the database if changed.

       **threshold** : float, default=0.9
           Minimum similarity threshold (between 0.0 and 1.0) between a UMLS term
           and the text of a matched entity.

       **min_length** : int, default=3
           Minimum number of chars in matched entities.

       **max_length** : int, default=50
           Maximum number of chars in matched entities.

       **similarity** : str, default="jaccard"
           Similarity metric to use. One of `"cosine"`, `"dice"`, `"jaccard"`
           or `"overlap"`.

       **lowercase** : bool, default=True
           Whether to use lowercased versions of UMLS terms and input entities
           (except for acronyms for which the uppercase term is always used).
           Will trigger a regeneration of the database if changed.

       **normalize_unicode** : bool, default=False
           Whether to use ASCII-only versions of UMLS terms and input entities
           (non-ASCII chars replaced by closest ASCII chars). Will trigger a
           regeneration of the database if changed.

       **spacy_tokenization** : bool, default=False
           If `True`, spacy will be used to tokenize input segments and filter
           out some tokens based on their part-of-speech tags, such as
           determiners, conjunctions and prepositions. If `False`, a simple
           regexp-based tokenization will be used, which is faster but might
           give more false positives.

       **semgroups** : sequence of str, default=("ANAT", "CHEM", "DEVI", "DISO", "PHYS", "PROC")
           Ids of UMLS semantic groups that matched concepts should belong to
           (see https://lhncbc.nlm.nih.gov/semanticnetwork/download/sg_archive/SemGroups-v04.txt).
           If set to `None`, all concepts can be matched.
           Will trigger a regeneration of the database if changed.

       **blacklist** : list of str, optional
           Optional list of exact terms to ignore.

       **same_beginning** : bool, default=False
           Ignore all matches that start with a different character than the
           term of the rule. This can be convenient to get rid of false
           positives on words that are very similar but have opposite meanings
           because of a prefix, for instance "activation" and
           "inactivation".

       **output_labels_by_semgroup** : str or dict, optional
           By default, `medkit.text.ner.umls.SEMGROUP_LABELS` will be used as
           entity labels. Use this parameter to override them. Example:
           `{"DISO": "problem", "PROC": "test"}`. If `output_labels_by_semgroup`
           is a string, all entities will use this string as label instead.
           Will trigger a regeneration of the database if changed.

       **attrs_to_copy** : list of str, optional
           Labels of the attributes that should be copied from the source
           segment to the created entity. Useful for propagating context
           attributes (negation, antecedent, etc.).

       **name** : str, optional
           Name describing the matcher (defaults to the class name).

       **uid** : str, optional
           Identifier of the matcher.
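
   The interplay of `threshold`, `similarity` and `same_beginning` can be
   illustrated with a small standalone sketch. It is not the medkit or
   simstring implementation: the helpers `char_ngrams`, `similarity_score` and
   `is_match` are hypothetical, and the use of unpadded character trigrams is
   a simplifying assumption. Only the four set-based formulas (cosine, dice,
   jaccard, overlap) are the standard definitions.

   .. code-block:: python

      import math


      def char_ngrams(text: str, n: int = 3) -> set:
          """Return the set of character n-grams of `text` (no padding; an assumption)."""
          return {text[i:i + n] for i in range(len(text) - n + 1)}


      def similarity_score(a: str, b: str, measure: str = "jaccard") -> float:
          """Compute a set-based similarity between the n-gram sets of two strings."""
          x, y = char_ngrams(a), char_ngrams(b)
          if not x or not y:
              return 0.0
          common = len(x & y)
          if measure == "jaccard":
              return common / len(x | y)
          if measure == "dice":
              return 2 * common / (len(x) + len(y))
          if measure == "cosine":
              return common / math.sqrt(len(x) * len(y))
          if measure == "overlap":
              return common / min(len(x), len(y))
          raise ValueError(f"Unknown similarity measure: {measure}")


      def is_match(term, candidate, threshold=0.9, measure="jaccard", same_beginning=False):
          """Hypothetical match test combining the threshold and the first-character filter."""
          if same_beginning and term[:1] != candidate[:1]:
              return False
          return similarity_score(term, candidate, measure) >= threshold


      # "activation" and "inactivation" share 8 trigrams out of 10 distinct ones.
      print(similarity_score("activation", "inactivation", "jaccard"))
      # With the overlap measure the pair passes a 0.9 threshold,
      # unless same_beginning rejects it ("a" differs from "i").
      print(is_match("activation", "inactivation", measure="overlap"))
      print(is_match("activation", "inactivation", measure="overlap", same_beginning=True))

   In this simplified setting, the more permissive the measure, the more useful
   `same_beginning` becomes as a guard against prefix-induced false positives.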


   ..
       !! processed by numpydoc !!
   .. py:attribute:: _SEMGROUP_BY_SEMTYPE

      
   .. py:method:: _get_labels_by_semgroup(output_labels: str | dict[str, str] | None) -> dict[str, str]
      :classmethod:

      
      Return a mapping giving the label to use for all entries of a given semgroup.

      :Parameters:

          **output_labels** : str or dict of str to str, optional
              Optional mapping of labels to use. Can be used to override the default
              labels. If `output_labels` is a single string, it will be used as a unique
              label for all semgroups.


      :Returns:

          dict of str to str
              A mapping with semgroups as keys and the corresponding labels as values.
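
      The behavior described above could look roughly like the following sketch.
      It is an assumption about the logic, not the actual method body:
      `labels_by_semgroup` is a hypothetical standalone function, and
      `ILLUSTRATIVE_DEFAULT_LABELS` is a made-up stand-in for the real default
      label mapping.

      .. code-block:: python

         from typing import Dict, Optional, Union

         # Placeholder defaults for illustration only; the real defaults live in medkit.
         ILLUSTRATIVE_DEFAULT_LABELS = {
             "CHEM": "chemical",
             "DISO": "disorder",
             "PROC": "procedure",
         }


         def labels_by_semgroup(
             output_labels: Optional[Union[str, Dict[str, str]]] = None,
         ) -> Dict[str, str]:
             """Return a semgroup-to-label mapping, applying optional overrides."""
             if output_labels is None:
                 # No override: use the defaults as they are.
                 return dict(ILLUSTRATIVE_DEFAULT_LABELS)
             if isinstance(output_labels, str):
                 # A single string becomes the label of every semgroup.
                 return {semgroup: output_labels for semgroup in ILLUSTRATIVE_DEFAULT_LABELS}
             # A dict overrides only the semgroups it mentions.
             mapping = dict(ILLUSTRATIVE_DEFAULT_LABELS)
             mapping.update(output_labels)
             return mapping


         print(labels_by_semgroup({"DISO": "problem", "PROC": "test"}))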


      ..
          !! processed by numpydoc !!

   .. py:method:: _build_rules(umls_dir: pathlib.Path, language: str, lowercase: bool, normalize_unicode: bool, semgroups: set[str] | None, labels_by_semgroup: dict[str, str]) -> Iterator[medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherRule]
      :classmethod:

      
      Create rules for all UMLS entries with appropriate labels.
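
      As a rough illustration of the kind of filtering such a rule builder has to
      perform, the sketch below reads MRCONSO.RRF-style lines and keeps only the
      terms in the requested language. It is not the medkit implementation:
      `iter_terms` and the sample line are hypothetical, and the column positions
      (CUI at index 0, LAT at index 1, STR at index 14) are assumed from the
      standard pipe-delimited MRCONSO layout. Semgroup filtering, normalization
      and label assignment are left out.

      .. code-block:: python

         from typing import Iterable, Iterator, Tuple

         # Assumed column positions in a pipe-delimited MRCONSO.RRF line.
         CUI_INDEX = 0   # concept unique identifier
         LAT_INDEX = 1   # language of the term
         STR_INDEX = 14  # term string


         def iter_terms(lines: Iterable[str], language: str) -> Iterator[Tuple[str, str]]:
             """Yield (cui, term) pairs for the lines written in `language`."""
             for line in lines:
                 fields = line.rstrip("\n").split("|")
                 if fields[LAT_INDEX] != language:
                     continue
                 yield fields[CUI_INDEX], fields[STR_INDEX]


         # Made-up sample lines, for illustration only.
         SAMPLE_LINES = [
             "C0000001|FRE|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|X1|terme exemple|0|N||",
             "C0000001|ENG|P|L0000002|PF|S0000002|Y|A0000002||||SRC|PT|X1|example term|0|N||",
         ]

         print(list(iter_terms(SAMPLE_LINES, "FRE")))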


      ..
          !! processed by numpydoc !!