medkit.text.ner.regexp_matcher
==============================

.. py:module:: medkit.text.ner.regexp_matcher


Classes
-------

.. autoapisummary::

   medkit.text.ner.regexp_matcher.RegexpMatcherRule
   medkit.text.ner.regexp_matcher.RegexpMatcherNormalization
   medkit.text.ner.regexp_matcher.RegexpMetadata
   medkit.text.ner.regexp_matcher.RegexpMatcher


Module Contents
---------------

.. py:class:: RegexpMatcherRule

   
   Regexp-based rule to use with :class:`~.RegexpMatcher`.


   :Attributes:

       **regexp: str**
           The regexp pattern used to match entities

       **label: str**
           The label to attribute to entities created based on this rule

       **term: str, optional**
           The optional normalized version of the entity text

       **id: str, optional**
           Unique identifier of the rule to store in the metadata of the entities

       **version: str, optional**
           Version string to store in the metadata of the entities

       **index_extract: int, default=0**
           If the regexp has groups, the index of the group to use to extract
           the entity

       **case_sensitive: bool, default=True**
           Whether case is taken into account when running `regexp` and
           `exclusion_regexp`; if False, case is ignored

       **unicode_sensitive: bool, default=True**
           If True, regexp rule matches are searched on the unicode text.
           If False, regexp rule matches are searched on the closest ASCII text when
           possible (cf. :class:`~.RegexpMatcher`). In that case, `regexp` and
           `exclusion_regexp` shall not contain non-ASCII chars because they would
           never be matched.

       **exclusion_regexp: str, optional**
           An optional exclusion pattern. Note that this exclusion pattern is
           executed on the whole input annotation, so when relying on `exclusion_regexp`
           make sure the input annotations passed to `RegexpMatcher` are "local" enough
           (sentences or syntagmas) rather than the whole text or paragraphs

       **normalizations: list of RegexpMatcherNormalization, optional**
           Optional list of normalization attributes that should be attached to
           the entities created
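
   The following is a minimal sketch, using only the standard `re` module, of how
   `index_extract`, `case_sensitive` and `exclusion_regexp` could interact.
   `match_rule` is a hypothetical helper, not part of medkit, and treating a single
   exclusion hit as discarding every match in the text is an assumption based on the
   description above.

   .. code-block:: python

      import re


      def match_rule(text, regexp, index_extract=0, case_sensitive=True,
                     exclusion_regexp=None):
          """Hypothetical helper (not medkit code): yield (start, end, text) spans."""
          flags = 0 if case_sensitive else re.IGNORECASE
          # Assumption: the exclusion pattern is searched over the whole input
          # text, so one hit anywhere discards every match of the rule.
          if exclusion_regexp and re.search(exclusion_regexp, text, flags):
              return
          for match in re.finditer(regexp, text, flags):
              # index_extract selects the group delimiting the entity
              # (0 means the whole match).
              start, end = match.span(index_extract)
              yield start, end, text[start:end]


      # Group 1 keeps only the number, and case is ignored for "MG" vs "mg".
      spans = list(match_rule("Dose: 50 mg daily", r"(\d+) ?MG",
                              index_extract=1, case_sensitive=False))

      # The exclusion pattern hits the same text, so nothing is returned.
      excluded = list(match_rule("No history of diabetes", r"diabetes",
                                 case_sensitive=False,
                                 exclusion_regexp=r"\bno history of\b"))

   In this sketch `spans` holds the single span `(6, 8, "50")` and `excluded` is
   empty, which shows why short input annotations matter when an exclusion pattern
   is used.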


   ..
       !! processed by numpydoc !!

   .. py:attribute:: regexp
      :type:  str


   .. py:attribute:: label
      :type:  str


   .. py:attribute:: term
      :type:  str | None
      :value: None


   .. py:attribute:: id
      :type:  str | None
      :value: None


   .. py:attribute:: version
      :type:  str | None
      :value: None


   .. py:attribute:: index_extract
      :type:  int
      :value: 0


   .. py:attribute:: case_sensitive
      :type:  bool
      :value: True


   .. py:attribute:: unicode_sensitive
      :type:  bool
      :value: True


   .. py:attribute:: exclusion_regexp
      :type:  str | None
      :value: None


   .. py:attribute:: normalizations
      :type:  list[RegexpMatcherNormalization]


   .. py:method:: __post_init__()


.. py:class:: RegexpMatcherNormalization

   
   Descriptor of normalization attributes to attach to entities created from a :class:`~.RegexpMatcherRule`.


   :Attributes:

       **kb_name: str**
           The name of the knowledge base we are referencing. Ex: "umls"

       **kb_version: str, optional**
           The version of the knowledge base we are referencing. Ex: "202AB"

       **kb_id: Any**
           The id of the entity in the knowledge base, for instance a CUI


   ..
       !! processed by numpydoc !!

   .. py:attribute:: kb_name
      :type:  str


   .. py:attribute:: kb_id
      :type:  Any


   .. py:attribute:: kb_version
      :type:  str | None
      :value: None


.. py:class:: RegexpMetadata

   Bases: :py:obj:`typing_extensions.TypedDict`


   Metadata dict added to entities matched by :class:`.RegexpMatcher`.


   :Parameters:

       **rule_id: str or int**
           Identifier of the rule used to match an entity.
           If the rule has no id, then the index of the rule in
           the list of rules is used instead.

       **version: str, optional**
           Optional version of the rule used to match an entity
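
   As an illustration of the fallback described for `rule_id`, here is a minimal
   sketch in plain Python. `RuleMetadata` and `metadata_for_rules` are hypothetical
   names that only mirror the fields listed above, and rules are represented as
   plain dicts for brevity; this is not medkit's implementation.

   .. code-block:: python

      from typing import List, Optional, TypedDict, Union


      class RuleMetadata(TypedDict):
          """Hypothetical stand-in mirroring the fields of RegexpMetadata."""

          rule_id: Union[str, int]
          version: Optional[str]


      def metadata_for_rules(rules: List[dict]) -> List[RuleMetadata]:
          """Build one metadata dict per rule, using the index when no id is set."""
          result = []
          for index, rule in enumerate(rules):
              rule_id = rule.get("id")
              if rule_id is None:
                  # No explicit id: fall back on the position in the rule list.
                  rule_id = index
              result.append(
                  RuleMetadata(rule_id=rule_id, version=rule.get("version"))
              )
          return result


      metadata = metadata_for_rules([
          {"regexp": "diabetes", "label": "problem",
           "id": "id_diabetes", "version": "1"},
          {"regexp": "asthma", "label": "problem"},
      ])

   Here the first entry keeps its string id, while the second one gets the integer
   `1` as `rule_id` and `None` as `version`.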


   ..
       !! processed by numpydoc !!

   .. py:attribute:: rule_id
      :type:  str | int


   .. py:attribute:: version
      :type:  str | None


.. py:class:: RegexpMatcher(rules: list[RegexpMatcherRule] | None = None, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)

   Bases: :py:obj:`medkit.core.text.NEROperation`


   Entity annotator relying on regexp-based rules.

   To detect entities, the module uses rules that may or may not be sensitive to
   unicode. When a rule is not unicode-sensitive, we try to convert unicode chars to
   the closest ASCII chars. However, some characters must be pre-processed beforehand
   (e.g., `n°` -> `number`), because their conversion changes the length of the text.
   So, if the converted text and the original text have different lengths, we fall
   back on the initial unicode text for detection even if the rule is not
   unicode-sensitive. In this case, a warning is logged recommending that the data
   be pre-processed.
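
   The length check described above can be sketched with the standard library
   alone. `to_ascii_or_none` is a hypothetical helper, and the NFKD-based conversion
   is only a stand-in for whatever conversion the matcher actually uses; the point
   is that the ASCII text is usable only if character offsets are preserved.

   .. code-block:: python

      import unicodedata


      def to_ascii_or_none(text):
          """Hypothetical helper: closest-ASCII text, or None if offsets shift."""
          # Stand-in conversion: decompose accented chars, then drop whatever
          # is not ASCII (combining marks and chars without an ASCII base).
          decomposed = unicodedata.normalize("NFKD", text)
          ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
          # Same length means spans found in the ASCII text map directly onto
          # the original text; otherwise the caller keeps the unicode text.
          return ascii_text if len(ascii_text) == len(text) else None


      same_length = to_ascii_or_none("diab\u00e8te")  # "diabète"
      fallback = to_ascii_or_none("n\u00b0 3")        # "n° 3"

   In this sketch `same_length` is `"diabete"`, while `fallback` is `None` because
   the degree sign has no same-length ASCII counterpart, which corresponds to the
   situation where the original unicode text is used instead.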


   ..
       !! processed by numpydoc !!

   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Entity]

      
      Return entities (with optional normalization attributes) matched in `segments`.


      :Parameters:

          **segments: list of Segment**
              List of segments in which to look for matches

      :Returns:

          list of Entity:
              Entities found in `segments` (with optional normalization attributes).
              Entities have a metadata dict with fields described in :class:`.RegexpMetadata`


      ..
          !! processed by numpydoc !!


   .. py:method:: _find_matches_in_segment(segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Entity]


   .. py:method:: _find_matches_in_segment_for_rule(rule_index: int, segment: medkit.core.text.Segment, text_ascii: str | None) -> Iterator[medkit.core.text.Entity]


   .. py:method:: _create_norm_attr(norm: RegexpMatcherNormalization) -> medkit.core.text.EntityNormAttribute
      :staticmethod:


   .. py:method:: load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) -> list[RegexpMatcherRule]
      :staticmethod:


      Load all rules stored in a yml file.


      :Parameters:

          **path_to_rules: Path**
              Path to a yml file containing a list of mappings
              with the same structure as `RegexpMatcherRule`

          **encoding: str, optional**
              Encoding of the file to open

      :Returns:

          list of RegexpMatcherRule
              List of all the rules in `path_to_rules`,
              can be used to init a `RegexpMatcher`


      ..
          !! processed by numpydoc !!


   .. py:method:: check_rules_sanity(rules: list[RegexpMatcherRule])
      :staticmethod:


      Check consistency of a set of rules.


      ..
          !! processed by numpydoc !!


   .. py:method:: save_rules(rules: list[RegexpMatcherRule], path_to_rules: pathlib.Path, encoding: str | None = None)
      :staticmethod:


      Store rules in a yml file.


      :Parameters:

          **rules: list of RegexpMatcherRule**
              The rules to save

          **path_to_rules: Path**
              Path to a .yml file that will contain the rules

          **encoding: str, optional**
              Encoding of the .yml file


      ..
          !! processed by numpydoc !!