:py:mod:`medkit.text.ner.umls_coder_normalizer`
===============================================

.. py:module:: medkit.text.ner.umls_coder_normalizer


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   medkit.text.ner.umls_coder_normalizer.UMLSCoderNormalizer


.. py:class:: UMLSCoderNormalizer(umls_mrconso_file: str | pathlib.Path, language: str, model: str | pathlib.Path, embeddings_cache_dir: str | pathlib.Path, summary_method: typing_extensions.Literal['mean', 'cls'] = 'cls', normalize_embeddings: bool = True, lowercase: bool = False, normalize_unicode: bool = False, threshold: float | None = None, max_nb_matches: int = 1, device: int = -1, batch_size: int = 128, hf_auth_token: str | None = None, nb_umls_embeddings_chunks: int | None = None, hf_cache_dir: str | pathlib.Path | None = None, name: str | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.Operation`

   
   Normalizer adding UMLS normalization attributes to pre-existing entities.

   Based on https://github.com/GanjinZero/CODER/.

   A UMLS `MRCONSO.RRF` file is needed. The normalizer identifies UMLS concepts by
   comparing embeddings of reference UMLS terms with the embeddings of the input
   entities. Any text transformer model from the Hugging Face Hub can be used,
   but "GanjinZero/UMLSBert_ENG" was specifically trained for this task (for English).

   When `UMLSCoderNormalizer` is used for the first time for a given `MRCONSO.RRF`,
   the embeddings of all UMLS terms are pre-computed (this can take a very long time)
   and stored in `embeddings_cache_dir`, so they can be reused next time.

   If another `MRCONSO.RRF` file is used, or if a parameter impacting the computation
   of embeddings (`model`, `summary_method`, etc.) is changed, then another `embeddings_cache_dir`
   must be used, or `embeddings_cache_dir` must be deleted so it can be created properly.
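
   The consistency check described above can be pictured with a small sketch.
   This is not medkit's actual implementation: the `params.json` file name and
   the `check_or_init_cache()` helper are hypothetical, and only the standard
   library is used. The idea is to store the parameters next to the cached
   embeddings and to refuse to reuse a cache built with different ones.

   ```python
   import json
   import tempfile
   from pathlib import Path

   # Hypothetical file name; the file medkit actually writes may differ.
   PARAMS_FILENAME = "params.json"


   def check_or_init_cache(cache_dir, params):
       """Return True if embeddings must be built, False if the cache is reusable.

       Raises ValueError when the cache was built with different params.
       """
       cache_dir = Path(cache_dir)
       params_file = cache_dir / PARAMS_FILENAME
       if not params_file.exists():
           # First use: record the params next to the embeddings to be computed
           cache_dir.mkdir(parents=True, exist_ok=True)
           params_file.write_text(json.dumps(params, sort_keys=True))
           return True
       cached = json.loads(params_file.read_text())
       if cached != params:
           raise ValueError(
               f"Cache at {cache_dir} was built with {cached}, not {params}; "
               "use another directory or delete this one"
           )
       return False


   with tempfile.TemporaryDirectory() as tmp:
       cache = Path(tmp) / "umls_embeddings"
       params = {"model": "GanjinZero/UMLSBert_ENG", "summary_method": "cls"}
       first = check_or_init_cache(cache, params)   # cache created
       second = check_or_init_cache(cache, params)  # cache reused
       try:
           check_or_init_cache(cache, {**params, "summary_method": "mean"})
           mismatch_detected = False
       except ValueError:
           mismatch_detected = True
   ```

   Changing any parameter that affects the embeddings (here `summary_method`)
   makes the check fail, which mirrors the requirement to use or recreate
   another `embeddings_cache_dir`.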

   If the UMLS embeddings are too big to be held in memory, use `nb_umls_embeddings_chunks`.
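
   To illustrate why loading embeddings by chunks bounds memory use, here is a
   toy sketch in pure Python. It is an assumption-laden illustration, not
   medkit's code: medkit works on tensors, and `best_matches_chunked()`
   and `cosine()` are hypothetical helpers. Only the current chunk and a short
   list of running best matches are held at any time.

   ```python
   import heapq
   import math


   def cosine(u, v):
       """Cosine similarity between two vectors given as lists of floats."""
       dot = sum(a * b for a, b in zip(u, v))
       norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
       return dot / norm


   def best_matches_chunked(query, chunks, max_nb_matches=1):
       """Return the top (score, term_index) pairs, reading one chunk at a time."""
       best = []   # running top matches across the chunks seen so far
       offset = 0  # index of the first term of the current chunk
       for chunk in chunks:
           scored = [(cosine(query, emb), offset + i) for i, emb in enumerate(chunk)]
           # Merge this chunk's scores with the running best, keep only the top ones
           best = heapq.nlargest(max_nb_matches, best + scored)
           offset += len(chunk)
       return best


   # Toy data: 2 chunks of 2 two-dimensional "term embeddings" each
   chunks = [
       [[1.0, 0.0], [0.0, 1.0]],
       [[1.0, 1.0], [-1.0, 0.0]],
   ]
   matches = best_matches_chunked([2.0, 1.0], chunks, max_nb_matches=2)
   matched_indices = [index for _, index in matches]
   ```

   The result is the same as scoring all terms at once; only the peak memory
   differs, at the cost of reloading the chunks on each call.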

   :Parameters:

       **umls_mrconso_file** : str or Path
           Path to the UMLS `MRCONSO.RRF` file.

       **language** : str
           Language of the UMLS terms to use (e.g. `"ENG"`, `"FRE"`).

       **model** : str or Path
           Name on the Hugging Face Hub, or path, of the transformers model used to extract
           embeddings (e.g. `"GanjinZero/UMLSBert_ENG"`).

       **embeddings_cache_dir** : str or Path
           Path to the directory into which pre-computed embeddings of UMLS terms should be cached.
           If it doesn't exist yet, the embeddings will be automatically generated (it can take a long
           time) and stored there, ready to be reused on further instantiations.
           If it already exists, a check will be done to make sure the params used when the embeddings
           were computed are consistent with the params of the current instance.

       **summary_method** : {"mean", "cls"}, default="cls"
           If set to `"mean"`, the extracted embedding is the mean over all tokens of the
           model's last hidden layer. If set to `"cls"`, the last hidden state of the
           first (`[CLS]`) token is used.

       **normalize_embeddings** : bool, default=True
           Whether to normalize the extracted embeddings.

       **lowercase** : bool, default=False
           Whether to use lowercased versions of UMLS terms and input entities.

       **normalize_unicode** : bool, default=False
           Whether to use ASCII-only versions of UMLS terms and input entities
           (non-ASCII chars replaced by closest ASCII chars).

       **threshold** : float, optional
           Minimum similarity threshold (between 0.0 and 1.0) between the embeddings
           of an entity and of a UMLS term for a normalization attribute to be added.

       **max_nb_matches** : int, default=1
           Maximum number of normalization attributes to add to each entity.

       **device** : int, default=-1
           Device to use for transformers models. Follows the Hugging Face convention
           (-1 for "cpu" and device number for gpu, for instance 0 for "cuda:0").

       **batch_size** : int, default=128
           Number of entities in batches processed by the embeddings extraction pipeline.

       **hf_auth_token** : str, optional
           Hugging Face authentication token (to access private models on the Hub).

       **nb_umls_embeddings_chunks** : int, optional
           Number of UMLS embeddings chunks to load at the same time when computing
           embeddings similarities (a chunk contains 65536 embeddings).
           If `None`, all pre-computed UMLS embeddings are pre-loaded in memory and
           similarities are computed in one shot. Otherwise, at each call to `run()`,
           UMLS embeddings are loaded by groups of chunks and similarities are computed
           for each group.
           Use this when the UMLS embeddings are too big to be fully loaded in memory.
           The higher this value, the more memory is needed.

       **hf_cache_dir** : str or Path, optional
           Directory in which to store downloaded models. If not set, the default
           Hugging Face cache dir is used.

       **name** : str, optional
           Name describing the normalizer (defaults to the class name).

       **uid** : str, optional
           Identifier of the normalizer.
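
   A minimal instantiation sketch using the parameters documented above. The
   `build_normalizer()` wrapper is hypothetical, the paths passed to it are
   placeholders to replace with your own, and the `threshold` value is purely
   illustrative. The import is done lazily so the snippet can be loaded even
   where medkit is not installed.

   ```python
   def build_normalizer(mrconso_file, cache_dir):
       """Create a UMLSCoderNormalizer for English UMLS terms."""
       # Imported lazily so this sketch can be loaded without medkit installed
       from medkit.text.ner.umls_coder_normalizer import UMLSCoderNormalizer

       return UMLSCoderNormalizer(
           umls_mrconso_file=mrconso_file,
           language="ENG",
           model="GanjinZero/UMLSBert_ENG",
           embeddings_cache_dir=cache_dir,
           summary_method="cls",
           threshold=0.9,     # illustrative value, not a recommendation
           max_nb_matches=1,
           device=-1,         # CPU
       )
   ```

   The returned object is then used by calling `run()` with a list of
   `medkit.core.text.Entity` objects. The first instantiation for a given
   cache directory triggers the pre-computation of UMLS embeddings.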


   ..
       !! processed by numpydoc !!
   .. py:method:: run(entities: list[medkit.core.text.Entity])

      
      Add normalization attributes to each entity in `entities`.

      Each entity will have zero, one or more normalization attributes depending
      on `max_nb_matches` and on how many matches with a similarity above `threshold`
      are found.

      :Parameters:

          **entities** : list of Entity
              List of entities to add normalization attributes to.


      ..
          !! processed by numpydoc !!
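
   How `threshold` and `max_nb_matches` interact in `run()` can be sketched as
   follows. This is a toy illustration rather than medkit's code:
   `select_matches()` is a hypothetical helper, the scores are made up, and
   whether the comparison with `threshold` is strict or inclusive is an
   assumption (inclusive here).

   ```python
   def select_matches(scores, threshold=None, max_nb_matches=1):
       """Return (term_index, score) pairs kept for one entity, best first."""
       ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
       top = ranked[:max_nb_matches]
       if threshold is not None:
           # Assumption: a score equal to the threshold is kept
           top = [(i, s) for i, s in top if s >= threshold]
       return top


   scores = [0.42, 0.97, 0.91, 0.10]  # toy similarities with 4 reference terms
   one = select_matches(scores, threshold=0.9, max_nb_matches=1)
   two = select_matches(scores, threshold=0.9, max_nb_matches=3)
   no_match = select_matches(scores, threshold=0.99, max_nb_matches=3)
   ```

   With these toy scores an entity gets one, two or zero matches depending on
   the settings, which is why the number of normalization attributes per
   entity can vary.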

   .. py:method:: _find_best_matches(entities: list[medkit.core.text.Entity]) -> tuple[list[list[int]], list[list[float]]]


   .. py:method:: _load_umls_embeddings(files: list[pathlib.Path]) -> torch.Tensor


   .. py:method:: _normalize_entity(entity: medkit.core.text.Entity, match_indices: list[int], match_scores: list[float])


   .. py:method:: _build_umls_embeddings(show_progress=True)