:py:mod:`medkit.text.ner`
=========================

.. py:module:: medkit.text.ner


Submodules
----------
.. toctree::
   :titlesonly:
   :maxdepth: 1

   _base_simstring_matcher/index.rst
   adicap_norm_attribute/index.rst
   date_attribute/index.rst
   duckling_matcher/index.rst
   edsnlp_date_matcher/index.rst
   edsnlp_tnm_matcher/index.rst
   hf_entity_matcher/index.rst
   hf_entity_matcher_trainable/index.rst
   hf_tokenization_utils/index.rst
   iamsystem_matcher/index.rst
   nlstruct_entity_matcher/index.rst
   quick_umls_matcher/index.rst
   regexp_matcher/index.rst
   simstring_matcher/index.rst
   tnm_attribute/index.rst
   umls_coder_normalizer/index.rst
   umls_matcher/index.rst
   umls_utils/index.rst


Package Contents
----------------

Classes
~~~~~~~

.. autoapisummary::

   medkit.text.ner.ADICAPNormAttribute
   medkit.text.ner.DateAttribute
   medkit.text.ner.DurationAttribute
   medkit.text.ner.RelativeDateAttribute
   medkit.text.ner.RelativeDateDirection
   medkit.text.ner.DucklingMatcher
   medkit.text.ner.IAMSystemMatcher
   medkit.text.ner.MedkitKeyword
   medkit.text.ner.RegexpMatcher
   medkit.text.ner.RegexpMatcherNormalization
   medkit.text.ner.RegexpMatcherRule
   medkit.text.ner.RegexpMetadata
   medkit.text.ner.SimstringMatcher
   medkit.text.ner.SimstringMatcherNormalization
   medkit.text.ner.SimstringMatcherRule
   medkit.text.ner.UMLSMatcher




.. py:class:: ADICAPNormAttribute(code: str, sampling_mode: str | None = None, technic: str | None = None, organ: str | None = None, pathology: str | None = None, pathology_type: str | None = None, behaviour_type: str | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.text.EntityNormAttribute`

   
   Attribute describing tissue sample using the ADICAP coding.

   ADICAP: Association pour le Développement de l'Informatique en Cytologie et Anatomo-Pathologie

   :see: https://smt.esante.gouv.fr/wp-json/ans/terminologies/document?terminologyId=terminologie-adicap&fileName=cgts_sem_adicap_fiche-detaillee.pdf

   This class is replicating EDS-NLP's `Adicap` class, making it a medkit
   `Attribute`.

   The `code` field fully describes the tissue sample. Additional information
   is derived from `code` in human-readable fields (`sampling_mode`,
   `technic`, `organ`, `pathology`, `pathology_type`, `behaviour_type`).













   :Attributes:

       **uid:**
           Identifier of the attribute

       **label:**
           The attribute label, always set to :attr:`EntityNormAttribute.LABEL
           <.core.text.EntityNormAttribute.LABEL>`

       **value:**
           ADICAP code prefixed with "adicap:" (ex: "adicap:BHGS0040")

       **code:**
           ADICAP code as a string (ex: "BHGS0040")

       **kb_id:**
           Same as `code`

       **sampling_mode:**
           Sampling mode (ex: "BIOPSIE CHIRURGICALE")

       **technic:**
           Sampling technic (ex: "HISTOLOGIE ET CYTOLOGIE PAR INCLUSION")

       **organ:**
           Organ and regions (ex: "SEIN (ÉGALEMENT UTILISÉ CHEZ L'HOMME)")

       **pathology:**
           General pathology (ex: "PATHOLOGIE GÉNÉRALE NON TUMORALE")

       **pathology_type:**
           Pathology type (ex: "ETAT SUBNORMAL - LESION MINEURE")

       **behaviour_type:**
           Behaviour type (ex: "CARACTERES GENERAUX")

       **metadata:**
           Metadata of the attribute


   ..
       !! processed by numpydoc !!
   .. py:property:: code
      :type: str


   .. py:attribute:: sampling_mode
      :type: str | None

      

   .. py:attribute:: technic
      :type: str | None

      

   .. py:attribute:: organ
      :type: str | None

      

   .. py:attribute:: pathology
      :type: str | None

      

   .. py:attribute:: pathology_type
      :type: str | None

      

   .. py:attribute:: behaviour_type
      :type: str | None

      

   .. py:method:: to_dict() -> dict[str, Any]


   .. py:method:: from_dict(adicap_dict: dict[str, Any]) -> typing_extensions.Self
      :classmethod:

      
      Create an Attribute from a dict.


      :Parameters:

          **adicap_dict: dict of str to Any**
              A dictionary from a serialized Attribute as generated by to_dict()














      ..
          !! processed by numpydoc !!


.. py:class:: DateAttribute(label: str, year: int | None = None, month: int | None = None, day: int | None = None, hour: int | None = None, minute: int | None = None, second: int | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.Attribute`

   
   Attribute representing an absolute date or time associated to a segment or entity.

   The date or time can be incomplete: each date/time component is optional but
   at least one must be provided.













   :Attributes:

       **uid** : str
           Identifier of the attribute

       **label** : str
           Label of the attribute

       **value** : Any, optional
           String representation of the date with YYYY-MM-DD format for the date
           part and HH:MM:SS for the time part, if present. Missing components are
           replaced with question marks.

       **year** : int, optional
           Year component of the date

       **month** : int, optional
           Month component of the date

       **day** : int, optional
           Day component of the date

       **hour** : int, optional
           Hour component of the time

       **minute** : int, optional
           Minute component of the time

       **second** : int, optional
           Second component of the time

       **metadata** : dict of str to Any
           Metadata of the attribute
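
   As an illustration of the `value` format described above, here is a minimal
   standalone sketch that renders a partial date. It does not use medkit: the
   helper name `format_partial_date` and the space separating the date part
   from the time part are assumptions made for this example, not the
   library's implementation.

   .. code-block:: python

      from __future__ import annotations


      def _fmt(component: int | None, width: int) -> str:
          """Zero-pad a component, or use question marks if it is missing."""
          return "?" * width if component is None else f"{component:0{width}d}"


      def format_partial_date(
          year: int | None = None,
          month: int | None = None,
          day: int | None = None,
          hour: int | None = None,
          minute: int | None = None,
          second: int | None = None,
      ) -> str:
          """Render YYYY-MM-DD, followed by HH:MM:SS if a time component is set."""
          components = (year, month, day, hour, minute, second)
          if all(c is None for c in components):
              raise ValueError("At least one date/time component must be provided")
          text = f"{_fmt(year, 4)}-{_fmt(month, 2)}-{_fmt(day, 2)}"
          if any(c is not None for c in (hour, minute, second)):
              # Assumption: date and time parts are separated by a single space.
              text += f" {_fmt(hour, 2)}:{_fmt(minute, 2)}:{_fmt(second, 2)}"
          return text

   With this sketch, `format_partial_date(year=2023, month=5)` gives
   `"2023-05-??"`.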


   ..
       !! processed by numpydoc !!
   .. py:attribute:: year
      :type: int | None

      

   .. py:attribute:: month
      :type: int | None

      

   .. py:attribute:: day
      :type: int | None

      

   .. py:attribute:: hour
      :type: int | None

      

   .. py:attribute:: minute
      :type: int | None

      

   .. py:attribute:: second
      :type: int | None

      

   .. py:method:: to_brat() -> str

      
      Return a value compatible with the brat format.
















      ..
          !! processed by numpydoc !!

   .. py:method:: to_spacy() -> str

      
      Return a value compatible with spaCy.
















      ..
          !! processed by numpydoc !!

   .. py:method:: to_dict() -> dict[str, Any]


   .. py:method:: from_dict(date_dict: dict[str, Any]) -> typing_extensions.Self
      :classmethod:

      
      Create an Attribute from a dict.


      :Parameters:

          **date_dict: dict of str to Any**
              A dictionary from a serialized Attribute as generated by to_dict()














      ..
          !! processed by numpydoc !!


.. py:class:: DurationAttribute(label: str, years: int = 0, months: int = 0, weeks: int = 0, days: int = 0, hours: int = 0, minutes: int = 0, seconds: int = 0, metadata: dict[str, Any] | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.Attribute`

   
   Attribute representing a time quantity associated to a segment or entity.

   Each date/time component is optional but at least one must be provided.













   :Attributes:

       **uid** : str
           Identifier of the attribute

       **label** : str
           Label of the attribute

       **value** : Any, optional
           String representation of the duration (ex: "1 year 10 months 2 days")

       **years** : int
           Year component of the date quantity

       **months** : int
           Month component of the date quantity

       **weeks** : int
           Week component of the date quantity

       **days** : int
           Day component of the date quantity

       **hours** : int
           Hour component of the time quantity

       **minutes** : int
           Minute component of the time quantity

       **seconds** : int
           Second component of the time quantity

       **metadata** : dict of str to Any
           Metadata of the attribute


   ..
       !! processed by numpydoc !!
   .. py:attribute:: years
      :type: int

      

   .. py:attribute:: months
      :type: int

      

   .. py:attribute:: weeks
      :type: int

      

   .. py:attribute:: days
      :type: int

      

   .. py:attribute:: hours
      :type: int

      

   .. py:attribute:: minutes
      :type: int

      

   .. py:attribute:: seconds
      :type: int

      

   .. py:method:: to_brat() -> str

      
      Return a value compatible with the brat format.
















      ..
          !! processed by numpydoc !!

   .. py:method:: to_spacy() -> str

      
      Return a value compatible with spaCy.
















      ..
          !! processed by numpydoc !!

   .. py:method:: to_dict() -> dict[str, Any]


   .. py:method:: from_dict(duration_dict: dict[str, Any]) -> typing_extensions.Self
      :classmethod:

      
      Create an Attribute from a dict.


      :Parameters:

          **duration_dict: dict of str to Any**
              A dictionary from a serialized Attribute as generated by to_dict()














      ..
          !! processed by numpydoc !!


.. py:class:: RelativeDateAttribute(label: str, direction: RelativeDateDirection, years: int = 0, months: int = 0, weeks: int = 0, days: int = 0, hours: int = 0, minutes: int = 0, seconds: int = 0, metadata: dict[str, Any] | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.Attribute`

   
   Attribute representing a relative date or time associated to a segment or entity.

   A date or time offset from an (unknown) reference date time with a direction.

   At least one date or time component must be non-zero.













   :Attributes:

       **uid** : str
           Identifier of the attribute

       **label** : str
           Label of the attribute

       **value** : Any, optional
           String representation of the relative date (ex: "+ 1 year 10 months 2
           days")

       **direction** : RelativeDateDirection
           Direction of the relative date. Ex: "2 years ago" corresponds to the `PAST`
           direction and "in 2 weeks" to the `FUTURE` direction.

       **years** : int
           Year component of the date offset

       **months** : int
           Month component of the date offset

       **weeks** : int
           Week component of the date offset

       **days** : int
           Day component of the date offset

       **hours** : int
           Hour component of the time offset

       **minutes** : int
           Minute component of the time offset

       **seconds** : int
           Second component of the time offset

       **metadata** : dict of str to Any
           Metadata of the attribute
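
   The following standalone sketch shows how a string such as
   "+ 1 year 10 months 2 days" could be built from a direction and its
   components. It does not use medkit: `Direction` is a stand-in for
   `RelativeDateDirection`, `format_relative_date` is a hypothetical helper,
   and the use of "-" for the `PAST` direction is an assumption (only the "+"
   form appears in the example above).

   .. code-block:: python

      from enum import Enum


      class Direction(Enum):
          """Stand-in for RelativeDateDirection, for illustration only."""

          PAST = "past"
          FUTURE = "future"


      UNITS = ("years", "months", "weeks", "days", "hours", "minutes", "seconds")


      def format_relative_date(direction: Direction, **components: int) -> str:
          """Render a relative date, e.g. "+ 1 year 10 months 2 days"."""
          parts = []
          for unit in UNITS:
              quantity = components.get(unit, 0)
              if quantity:
                  # Drop the plural "s" when the quantity is exactly 1
                  name = unit[:-1] if quantity == 1 else unit
                  parts.append(f"{quantity} {name}")
          if not parts:
              raise ValueError("At least one date or time component must be non-zero")
          # Assumption: "+" marks the future, "-" marks the past.
          sign = "+" if direction is Direction.FUTURE else "-"
          return sign + " " + " ".join(parts)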


   ..
       !! processed by numpydoc !!
   .. py:attribute:: direction
      :type: RelativeDateDirection

      

   .. py:attribute:: years
      :type: int

      

   .. py:attribute:: months
      :type: int

      

   .. py:attribute:: weeks
      :type: int

      

   .. py:attribute:: days
      :type: int

      

   .. py:attribute:: hours
      :type: int

      

   .. py:attribute:: minutes
      :type: int

      

   .. py:attribute:: seconds
      :type: int

      

   .. py:method:: to_brat() -> str

      
      Return a value compatible with the brat format.
















      ..
          !! processed by numpydoc !!

   .. py:method:: to_spacy() -> str

      
      Return a value compatible with spaCy.
















      ..
          !! processed by numpydoc !!

   .. py:method:: to_dict() -> dict[str, Any]


   .. py:method:: from_dict(date_dict: dict[str, Any]) -> typing_extensions.Self
      :classmethod:

      
      Create an Attribute from a dict.


      :Parameters:

          **date_dict: dict of str to Any**
              A dictionary from a serialized Attribute as generated by to_dict()














      ..
          !! processed by numpydoc !!


.. py:class:: RelativeDateDirection(*args, **kwds)


   Bases: :py:obj:`enum.Enum`

   
   Direction of a :class:`~.RelativeDateAttribute`.
















   ..
       !! processed by numpydoc !!
   .. py:attribute:: PAST
      :value: 'past'

      

   .. py:attribute:: FUTURE
      :value: 'future'

      


.. py:class:: DucklingMatcher(output_label: str, version: str, url: str = 'http://localhost:8000', locale: str = 'fr_FR', dims: list[str] | None = None, attrs_to_copy: list[str] | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.text.NEROperation`

   
   Entity annotator using Duckling (https://github.com/facebook/duckling).

   This annotator can parse several types of information in multiple languages:
       amount of money, credit card numbers, distance, duration, email, numeral,
       ordinal, phone number, quantity, temperature, time, url, volume.

   This annotator currently requires a running Duckling server. The easiest
   method is to run a docker container::

       docker run --rm -d -p <PORT>:8000 --name duckling rasa/duckling:<TAG>

   This command will start a Duckling server listening on port <PORT>.
   The version of the server is identified by <TAG>
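
   To give an idea of what talking to such a server involves, the sketch below
   builds (without sending) a request for the `/parse` endpoint of Duckling's
   example HTTP server. This is not how `DucklingMatcher` is necessarily
   implemented: the helper name `build_parse_request` is hypothetical, and the
   form fields (`locale`, `text`, `dims`) follow the Duckling server's
   documented usage as understood here.

   .. code-block:: python

      from __future__ import annotations

      import json
      from urllib.parse import urlencode
      from urllib.request import Request


      def build_parse_request(
          text: str,
          url: str = "http://localhost:8000",
          locale: str = "fr_FR",
          dims: list[str] | None = None,
      ) -> Request:
          """Build a form-encoded POST request for the /parse endpoint."""
          fields = {"locale": locale, "text": text}
          if dims is not None:
              # The dimensions are sent as a JSON-encoded list, e.g. ["time"]
              fields["dims"] = json.dumps(dims)
          body = urlencode(fields).encode("utf-8")
          return Request(f"{url}/parse", data=body, method="POST")

   The request can then be sent with `urllib.request.urlopen`, provided a
   server is listening at `url`.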















   ..
       !! processed by numpydoc !!
   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Entity]

      
      Return entities for each match in `segments`.


      :Parameters:

          **segments** : list of Segment
              List of segments into which to look for matches

      :Returns:

          list of Entity
              Entities found in `segments`













      ..
          !! processed by numpydoc !!

   .. py:method:: _find_matches_in_segment(segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Entity]


   .. py:method:: _test_connection()



.. py:class:: IAMSystemMatcher(matcher: iamsystem.Matcher, label_provider: LabelProvider | None = None, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.text.NEROperation`

   
   Entity annotator and linker based on iamsystem library.
















   ..
       !! processed by numpydoc !!
   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Entity]


   .. py:method:: _create_entity_from_iamsystem_ann(ann: iamsystem.Annotation, segment: medkit.core.text.Segment)



.. py:class:: MedkitKeyword


   
   A recommended iamsystem's IEntity implementation.

   This class allows users to attach a `kb_id` and/or a `kb_name` to an
   iamsystem keyword.
   An entity label may also be provided if the user wants to define a category
   for the searched keyword (e.g., "drug" label for "Vicodin" keyword).















   ..
       !! processed by numpydoc !!
   .. py:attribute:: label
      :type: str

      

   .. py:attribute:: kb_id
      :type: str

      

   .. py:attribute:: kb_name
      :type: str | None

      

   .. py:attribute:: ent_label
      :type: str | None

      


.. py:class:: RegexpMatcher(rules: list[RegexpMatcherRule] | None = None, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.text.NEROperation`

   
   Entity annotator relying on regexp-based rules.

   Rules may or may not be unicode-sensitive. When a rule is not
   unicode-sensitive, unicode chars are converted to the closest ASCII chars
   before matching. However, some characters expand to several chars when
   converted (e.g., `n°` -> `number`) and need to be pre-processed beforehand.
   So, if the converted text and the original text have different lengths, the
   matcher falls back on the original unicode text, even if the rule is not
   unicode-sensitive. In this case, a warning is logged recommending that the
   data be pre-processed.
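
   The length check can be sketched as follows. This is a standalone
   illustration using only the standard library: the helper name
   `to_ascii_or_none` is hypothetical, and `unicodedata` is used here as a
   stand-in for whatever transliteration medkit applies, so the converted
   strings may differ from the library's.

   .. code-block:: python

      from __future__ import annotations

      import unicodedata


      def to_ascii_or_none(text: str) -> str | None:
          """Return an ASCII version of `text`, or None if its length changes."""
          # Decompose accented chars (e.g. "é" -> "e" + combining accent),
          # then drop everything that is not ASCII.
          decomposed = unicodedata.normalize("NFKD", text)
          ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
          # Match offsets found on the ASCII text are only valid on the
          # original text if both texts have the same length.
          return ascii_text if len(ascii_text) == len(text) else None

   Here `"caf\u00e9"` is converted to `"cafe"`, while `"n\u00b01"` yields
   `None` because the degree sign has no ASCII counterpart in this scheme, so
   a caller would fall back on the unicode text.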















   ..
       !! processed by numpydoc !!
   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Entity]

      
      Return entities (with optional normalization attributes) matched in `segments`.


      :Parameters:

          **segments: list of Segment**
              List of segments into which to look for matches

      :Returns:

          list of Entity:
              Entities found in `segments` (with optional normalization attributes).
              Entities have a metadata dict with fields described in :class:`.RegexpMetadata`













      ..
          !! processed by numpydoc !!

   .. py:method:: _find_matches_in_segment(segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Entity]


   .. py:method:: _find_matches_in_segment_for_rule(rule_index: int, segment: medkit.core.text.Segment, text_ascii: str | None) -> Iterator[medkit.core.text.Entity]


   .. py:method:: _create_norm_attr(norm: RegexpMatcherNormalization) -> medkit.core.text.EntityNormAttribute
      :staticmethod:


   .. py:method:: load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) -> list[RegexpMatcherRule]
      :staticmethod:

      
      Load all rules stored in a yml file.


      :Parameters:

          **path_to_rules: Path**
              Path to a yml file containing a list of mappings
              with the same structure as `RegexpMatcherRule`

          **encoding: str, optional**
              Encoding of the file to open

      :Returns:

          list of RegexpMatcherRule
              List of all the rules in `path_to_rules`,
              can be used to init a `RegexpMatcher`













      ..
          !! processed by numpydoc !!

   .. py:method:: check_rules_sanity(rules: list[RegexpMatcherRule])
      :staticmethod:

      
      Check consistency of a set of rules.
















      ..
          !! processed by numpydoc !!

   .. py:method:: save_rules(rules: list[RegexpMatcherRule], path_to_rules: pathlib.Path, encoding: str | None = None)
      :staticmethod:

      
      Store rules in a yml file.


      :Parameters:

          **rules: list of RegexpMatcherRule**
              The rules to save

          **path_to_rules: Path**
              Path to a .yml file that will contain the rules

          **encoding: str, optional**
              Encoding of the .yml file














      ..
          !! processed by numpydoc !!


.. py:class:: RegexpMatcherNormalization


   
   Descriptor of normalization attributes to attach to entities created from a :class:`~.RegexpMatcherRule`.














   :Attributes:

       **kb_name: str**
           The name of the knowledge base we are referencing. Ex: "umls"

       **kb_version: str**
           The version of the knowledge base we are referencing. Ex: "202AB"

       **kb_id: str, optional**
           The id of the entity in the knowledge base, for instance a CUI


   ..
       !! processed by numpydoc !!
   .. py:attribute:: kb_name
      :type: str

      

   .. py:attribute:: kb_id
      :type: Any

      

   .. py:attribute:: kb_version
      :type: str | None

      


.. py:class:: RegexpMatcherRule


   
   Regexp-based rule to use with :class:`~.RegexpMatcher`.














   :Attributes:

       **regexp: str**
           The regexp pattern used to match entities

       **label: str**
           The label to attribute to entities created based on this rule

       **term: str, optional**
           The optional normalized version of the entity text

       **id: str, optional**
           Unique identifier of the rule to store in the metadata of the entities

       **version: str, optional**
           Version string to store in the metadata of the entities

       **index_extract: int, default=0**
           If the regexp has groups, the index of the group to use to extract
           the entity

       **case_sensitive: bool, default=True**
           Whether to take case into account when running `regexp` and
           `exclusion_regexp`

       **unicode_sensitive: bool, default=True**
           If True, regexp rule matches are searched on unicode text.
           If False, regexp rule matches are searched on closest ASCII text when
           possible (cf. RegexpMatcher). In that case, `regexp` and
           `exclusion_regexp` shall not contain non-ASCII chars because they would
           never be matched.

       **exclusion_regexp: str, optional**
           An optional exclusion pattern. Note that this exclusion pattern will be
           executed on the whole input annotation, so when relying on `exclusion_regexp`
           make sure the input annotations passed to `RegexpMatcher` are "local"-enough
           (sentences or syntagmas) rather than the whole text or paragraphs

       **normalizations: list of RegexpMatcherNormalization, optional**
           Optional list of normalization attributes that should be attached to
           the entities created
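
   The interplay of `index_extract`, `case_sensitive` and `exclusion_regexp`
   can be sketched with the standard `re` module. This is a simplified
   illustration under stated assumptions, not medkit's implementation: the
   function `find_spans` is hypothetical and returns plain character spans
   rather than entities.

   .. code-block:: python

      from __future__ import annotations

      import re


      def find_spans(
          text: str,
          regexp: str,
          index_extract: int = 0,
          case_sensitive: bool = True,
          exclusion_regexp: str | None = None,
      ) -> list[tuple[int, int]]:
          """Return the (start, end) spans matched by a rule in `text`."""
          flags = 0 if case_sensitive else re.IGNORECASE
          # The exclusion pattern is searched in the whole input text, so a
          # single hit anywhere discards every match of the rule in that text.
          if exclusion_regexp is not None and re.search(exclusion_regexp, text, flags):
              return []
          # index_extract selects which group delimits the entity
          # (0 is the whole match).
          return [m.span(index_extract) for m in re.finditer(regexp, text, flags)]

   This is why short inputs (sentences or syntagmas) matter: in a whole
   paragraph, an unrelated hit of the exclusion pattern would suppress valid
   matches elsewhere.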


   ..
       !! processed by numpydoc !!
   .. py:attribute:: regexp
      :type: str

      

   .. py:attribute:: label
      :type: str

      

   .. py:attribute:: term
      :type: str | None

      

   .. py:attribute:: id
      :type: str | None

      

   .. py:attribute:: version
      :type: str | None

      

   .. py:attribute:: index_extract
      :type: int
      :value: 0

      

   .. py:attribute:: case_sensitive
      :type: bool
      :value: True

      

   .. py:attribute:: unicode_sensitive
      :type: bool
      :value: True

      

   .. py:attribute:: exclusion_regexp
      :type: str | None

      

   .. py:attribute:: normalizations
      :type: list[RegexpMatcherNormalization]

      

   .. py:method:: __post_init__()



.. py:class:: RegexpMetadata


   Bases: :py:obj:`typing_extensions.TypedDict`

   
   Metadata dict added to entities matched by :class:`.RegexpMatcher`.


   :Parameters:

       **rule_id: str or int**
           Identifier of the rule used to match an entity.
           If the rule has no id, then the index of the rule in
           the list of rules is used instead.

       **version: str, optional**
           Optional version of the rule used to match an entity














   ..
       !! processed by numpydoc !!
   .. py:attribute:: rule_id
      :type: str | int

      

   .. py:attribute:: version
      :type: str | None

      


.. py:class:: SimstringMatcher(rules: list[SimstringMatcherRule], threshold: float = 0.9, min_length: int = 3, max_length: int = 50, similarity: typing_extensions.Literal[cosine, dice, jaccard, overlap] = 'jaccard', spacy_tokenization_language: str | None = None, blacklist: list[str] | None = None, same_beginning: bool = False, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.text.ner._base_simstring_matcher.BaseSimstringMatcher`

   
   Entity matcher relying on string similarity.

   Uses the `simstring` fuzzy matching algorithm
   (http://chokkan.org/software/simstring/).

   Note that setting `spacy_tokenization_language` to a language code might
   reduce the number of false positives. This requires the `spacy` optional
   dependency, which can be installed with `pip install medkit-lib[spacy]`.

   :Parameters:

       **rules: list of SimstringMatcherRule**
           Rules to use for matching entities.

       **threshold: float, default=0.9**
           Minimum similarity (between 0.0 and 1.0) between a rule term
           and the text of an entity matched on that rule.

       **min_length: int, default=3**
           Minimum number of chars in matched entities.

       **max_length: int, default=50**
           Maximum number of chars in matched entities.

       **similarity: str, default="jaccard"**
           Similarity metric to use.

       **spacy_tokenization_language: str, optional**
           2-letter code (ex: "fr", "en", etc.) designating the language of the
           spacy model to use for tokenization. If provided, spacy will be used
           to tokenize input segments and filter out some tokens based on their
           part-of-speech tags, such as determiners, conjunctions and
           prepositions. If `None`, a simple regexp based tokenization will be
           used, which is faster but might give more false positives.

       **blacklist: list of str, optional**
           Optional list of exact terms to ignore.

       **same_beginning: bool, default=False**
           Ignore all matches that start with a different character than the
           term of the rule. This can be convenient to get rid of false
           positives on words that are very similar but have opposite meanings
           because of a prefix, for instance "activation" and
           "inactivation".

       **attrs_to_copy: list of str, optional**
           Labels of the attributes that should be copied from the source
           segment to the created entity. Useful for propagating context
           attributes (negation, antecedent, etc.).

       **name: str, optional**
           Name describing the matcher (defaults to the class name).

       **uid: str, optional**
           Identifier of the matcher.
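
   To illustrate how `threshold`, `similarity` and `same_beginning` interact,
   here is a small standalone sketch of Jaccard similarity over character
   trigrams. It is only an approximation of the idea: the real `simstring`
   library extracts its features differently, so actual scores will differ,
   and the helper names (`char_ngrams`, `jaccard`, `is_match`) are
   hypothetical.

   .. code-block:: python

      def char_ngrams(text: str, n: int = 3) -> set:
          """Return the set of character n-grams of `text`."""
          return {text[i:i + n] for i in range(len(text) - n + 1)}


      def jaccard(a: str, b: str, n: int = 3) -> float:
          """Jaccard similarity: shared n-grams divided by all n-grams."""
          grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
          union = grams_a | grams_b
          if not union:
              # Both strings are shorter than n: fall back on equality
              return 1.0 if a == b else 0.0
          return len(grams_a & grams_b) / len(union)


      def is_match(term: str, candidate: str, threshold: float = 0.9,
                   same_beginning: bool = False) -> bool:
          """Decide whether `candidate` is close enough to the rule `term`."""
          if same_beginning and candidate[:1] != term[:1]:
              return False
          return jaccard(term, candidate) >= threshold

   In this sketch, "activation" and "inactivation" share 8 of their 10
   distinct trigrams (similarity 0.8), so they match with `threshold=0.8`
   unless `same_beginning=True` is set.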














   ..
       !! processed by numpydoc !!
   .. py:method:: load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) -> list[SimstringMatcherRule]
      :staticmethod:

      
      Load all rules stored in a yml file.


      :Parameters:

          **path_to_rules**
              The path to a yml file containing a list of mappings with the same
              structure as :class:`~.SimstringMatcherRule`

          **encoding: str, optional**
              The encoding of the file to open

      :Returns:

          List[SimstringMatcherRule]
              List of all the rules in `path_to_rules`, can be used to init a
              :class:`~.SimstringMatcher`













      ..
          !! processed by numpydoc !!

   .. py:method:: save_rules(rules: list[SimstringMatcherRule], path_to_rules: pathlib.Path, encoding: str | None = None)
      :staticmethod:

      
      Store rules in a yml file.


      :Parameters:

          **rules: list of SimstringMatcherRule**
              The rules to save

          **path_to_rules: Path**
              The path to a yml file that will contain the rules

          **encoding: str, optional**
              The encoding of the yml file














      ..
          !! processed by numpydoc !!


.. py:class:: SimstringMatcherNormalization


   Bases: :py:obj:`medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherNormalization`

   
   Descriptor of normalization attributes to attach to entities created from a :class:`~.SimstringMatcherRule`.














   :Attributes:

       **kb_name:**
           The name of the knowledge base we are referencing. Ex: "umls"

       **kb_version:**
           The version of the knowledge base we are referencing. Ex: "202AB"

       **kb_id:**
           The id of the entity in the knowledge base, for instance a CUI

       **term:**
           Optional normalized version of the entity text in the knowledge base


   ..
       !! processed by numpydoc !!
   .. py:method:: from_dict(data: dict[str, Any]) -> SimstringMatcherNormalization
      :staticmethod:

      
      Create a SimstringMatcherNormalization object from a dict.
















      ..
          !! processed by numpydoc !!


.. py:class:: SimstringMatcherRule


   Bases: :py:obj:`medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherRule`

   
   Rule to use with :class:`~.SimstringMatcher`.














   :Attributes:

       **term:**
           Term to match using similarity-based fuzzy matching

       **label:**
           Label to use for the entities created when a match is found

       **case_sensitive:**
           Whether to take case into account when looking for matches.

       **unicode_sensitive:**
           Whether to use ASCII-only versions of the rule term and input texts when
           looking for matches (non-ASCII chars replaced by closest ASCII chars).

       **normalizations:**
           Optional list of normalization attributes that should be attached to the
           entities created


   ..
       !! processed by numpydoc !!
   .. py:method:: from_dict(data: dict[str, Any]) -> SimstringMatcherRule
      :staticmethod:

      
      Create a SimstringMatcherRule from a dict.
















      ..
          !! processed by numpydoc !!


.. py:class:: UMLSMatcher(umls_dir: str | pathlib.Path, cache_dir: str | pathlib.Path, language: str, threshold: float = 0.9, min_length: int = 3, max_length: int = 50, similarity: typing_extensions.Literal[cosine, dice, jaccard, overlap] = 'jaccard', lowercase: bool = True, normalize_unicode: bool = False, spacy_tokenization: bool = False, semgroups: Sequence[str] = ('ANAT', 'CHEM', 'DEVI', 'DISO', 'PHYS', 'PROC'), blacklist: list[str] | None = None, same_beginning: bool = False, output_labels_by_semgroup: str | dict[str, str] | None = None, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.text.ner._base_simstring_matcher.BaseSimstringMatcher`

   
   Entity annotator identifying UMLS concepts using the `simstring`_ fuzzy matching algorithm.

   This operation is heavily inspired by the `QuickUMLS` library
   (https://github.com/Georgetown-IR-Lab/QuickUMLS).

   By default, only terms belonging to the `ANAT` (anatomy), `CHEM` (Chemicals &
   Drugs), `DEVI` (Devices), `DISO` (Disorders), `PHYS` (Physiology) and `PROC`
   (Procedures) semgroups will be considered. This behavior can be changed with
   the `semgroups` parameter.

   Note that setting `spacy_tokenization` to `True` might reduce the
   number of false positives. This requires the `spacy` optional dependency,
   which can be installed with `pip install medkit-lib[spacy]`.

   .. _simstring: http://chokkan.org/software/simstring/

   :Parameters:

       **umls_dir** : str or Path
           Path to the UMLS directory containing the MRCONSO.RRF and
           MRSTY.RRF files.

       **cache_dir** : str or Path
           Path to the directory into which the umls database will be cached.
           If it doesn't exist yet, the database will be automatically
           generated (it can take a long time) and stored there, ready to be
           reused on further instantiations. If it already exists, a check will
           be done to make sure the params used when the database was generated
           are consistent with the params of the current instance. If you want
           to rebuild the database with new params using the same cache dir,
           you will have to manually delete it first.

       **language** : str
           Language to consider as found in the MRCONSO.RRF file. Example:
           `"FRE"`. Will trigger a regeneration of the database if changed.

       **threshold** : float, default=0.9
           Minimum similarity threshold (between 0.0 and 1.0) between a UMLS term
           and the text of a matched entity.

       **min_length** : int, default=3
           Minimum number of chars in matched entities.

       **max_length** : int, default=50
           Maximum number of chars in matched entities.

       **similarity** : str, default="jaccard"
           Similarity metric to use.

       **lowercase** : bool, default=True
           Whether to use lowercased versions of UMLS terms and input entities
           (except for acronyms for which the uppercase term is always used).
           Will trigger a regeneration of the database if changed.

       **normalize_unicode** : bool, default=False
           Whether to use ASCII-only versions of UMLS terms and input entities
           (non-ASCII chars replaced by closest ASCII chars). Will trigger a
           regeneration of the database if changed.

       **spacy_tokenization** : bool, default=False
           If `True`, spacy will be used to tokenize input segments and filter
           out some tokens based on their part-of-speech tags, such as
           determiners, conjunctions and prepositions. If `False`, a simple
           regexp based tokenization will be used, which is faster but might
           give more false positives.

       **semgroups** : sequence of str, default=("ANAT", "CHEM", "DEVI", "DISO", "PHYS", "PROC")
           Ids of UMLS semantic groups that matched concepts should belong to.
           :see: https://lhncbc.nlm.nih.gov/semanticnetwork/download/sg_archive/SemGroups-v04.txt
           If set to `None`, all concepts can be matched.
           Will trigger a regeneration of the database if changed.

       **blacklist** : list of str, optional
           Optional list of exact terms to ignore.

       **same_beginning** : bool, default=False
           Ignore all matches that start with a different character than the
           term of the rule. This can be convenient to get rid of false
           positives on words that are very similar but have opposite meanings
           because of a prefix, for instance "activation" and
           "inactivation".

       **output_labels_by_semgroup** : str or dict, optional
           By default, `medkit.text.ner.umls.SEMGROUP_LABELS` will be used as
           entity labels. Use this parameter to override them. Example:
           `{"DISO": "problem", "PROC": "test"}`. If `output_labels_by_semgroup`
           is a string, all entities will use this string as label instead.
           Will trigger a regeneration of the database if changed.

       **attrs_to_copy** : list of str, optional
           Labels of the attributes that should be copied from the source
           segment to the created entity. Useful for propagating context
           attributes (negation, antecedent, etc.).

       **name** : str, optional
           Name describing the matcher (defaults to the class name).

       **uid** : str, optional
           Identifier of the matcher.














   ..
       !! processed by numpydoc !!
   .. py:attribute:: _SEMGROUP_BY_SEMTYPE

      

   .. py:method:: _get_labels_by_semgroup(output_labels: str | dict[str, str] | None) -> dict[str, str]
      :classmethod:

      
      Return a mapping giving the label to use for all entries of a given semgroup.

      :Parameters:

          **output_labels** : str or dict of str to str, optional
              Optional mapping of labels to use. Can be used to override the default
              labels. If `output_labels` is a single string, it will be used as a unique
              label for all semgroups


      :Returns:

          dict of str to str
              A mapping with semgroups as keys and corresponding label as values
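
      The behaviour described above can be sketched as follows. This is a
      standalone illustration, not medkit's code: the function name
      `labels_by_semgroup` is hypothetical, and `DEFAULT_LABELS` contains
      placeholder values standing in for the library's default labels.

      .. code-block:: python

         from __future__ import annotations

         # Placeholder defaults for illustration only (not medkit's values)
         DEFAULT_LABELS = {"ANAT": "anatomy", "DISO": "disorder", "PROC": "procedure"}


         def labels_by_semgroup(
             output_labels: str | dict[str, str] | None = None,
         ) -> dict[str, str]:
             """Return the label to use for each semgroup."""
             if output_labels is None:
                 return dict(DEFAULT_LABELS)
             if isinstance(output_labels, str):
                 # A single string is used as the label of every semgroup
                 return {semgroup: output_labels for semgroup in DEFAULT_LABELS}
             # A dict overrides the defaults only for the semgroups it lists
             merged = dict(DEFAULT_LABELS)
             merged.update(output_labels)
             return merged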













      ..
          !! processed by numpydoc !!

   .. py:method:: _build_rules(umls_dir: pathlib.Path, language: str, lowercase: bool, normalize_unicode: bool, semgroups: set[str] | None, labels_by_semgroup: dict[str, str]) -> Iterator[medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherRule]
      :classmethod:

      
      Create rules for all UMLS entries with appropriate labels.
















      ..
          !! processed by numpydoc !!


