:py:mod:`medkit.text.context`
=============================

.. py:module:: medkit.text.context


Submodules
----------
.. toctree::
   :titlesonly:
   :maxdepth: 1

   family_detector/index.rst
   hypothesis_detector/index.rst
   negation_detector/index.rst


Package Contents
----------------

Classes
~~~~~~~

.. autoapisummary::

   medkit.text.context.FamilyDetector
   medkit.text.context.FamilyDetectorRule
   medkit.text.context.FamilyMetadata
   medkit.text.context.HypothesisDetector
   medkit.text.context.HypothesisDetectorRule
   medkit.text.context.HypothesisRuleMetadata
   medkit.text.context.HypothesisVerbMetadata
   medkit.text.context.NegationDetector
   medkit.text.context.NegationDetectorRule
   medkit.text.context.NegationMetadata


.. py:class:: FamilyDetector(output_label: str, rules: list[FamilyDetectorRule] | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.text.ContextOperation`

   
   Annotator for creating family attributes.

   Annotator creating family attributes with boolean values
   indicating if a family reference has been detected.

   Because family attributes will be attached to whole annotations,
   each input annotation should be "local" enough (i.e. a sentence or a syntagma)
   rather than a big chunk of text.

   To detect family references, the module uses rules that may or may not be
   sensitive to unicode. When a rule is not unicode-sensitive, we try to convert
   unicode chars to the closest ASCII chars. However, some characters need to be
   pre-processed beforehand (e.g., `n°` -> `number`). So, if the converted and
   original texts have different lengths, we fall back on the initial unicode text
   for detection, even if the rule is not unicode-sensitive.
   In this case, a warning is logged recommending that the data be pre-processed.
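
   The fallback described above can be sketched with the standard library only.
   This is an illustration under stated assumptions, not medkit's implementation:
   ``to_ascii_or_fallback`` is a hypothetical helper, and ``unicodedata`` is used
   as a stand-in for whatever ASCII conversion the detector actually performs.

   .. code-block:: python

      from __future__ import annotations

      import unicodedata


      def to_ascii_or_fallback(text: str) -> tuple[str, bool]:
          """Return (text_to_search, converted) for a rule that is not unicode-sensitive."""
          # Decompose accented chars (e.g. "é" -> "e" + combining accent), then
          # drop every char that cannot be encoded as ASCII.
          decomposed = unicodedata.normalize("NFKD", text)
          ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
          # If the length changed, keep the original unicode text instead.
          if len(ascii_text) != len(text):
              return text, False
          return ascii_text, True


      # "père décédé": accents are stripped and the length is preserved
      print(to_ascii_or_fallback("p\u00e8re d\u00e9c\u00e9d\u00e9"))  # ('pere decede', True)
      # "n°": the degree sign has no ASCII equivalent here, so we fall back
      print(to_ascii_or_fallback("lit n\u00b0 3"))  # ('lit n° 3', False)

   In the second call the conversion would shorten the text, which is the
   situation where pre-processing the data beforehand is recommended.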

   Note that for better results, family detection should be run at the sentence
   level (i.e. on sentence segments) rather than at the syntagma level [1].

   :Parameters:

       **output_label** : str
           The label of the created attributes

       **rules** : list of FamilyDetectorRule, optional
           The set of rules to use when detecting family references. If none provided,
           the rules in "family_detector_default_rules.yml" will be used

       **uid** : str, optional
           Identifier of the detector


   .. rubric:: References

   [1] Garcelon, N., Neuraz, A., Benoit, V., Salomon, R., & Burgun, A. (2017).
       Improving a full-text search engine: the importance of negation detection and family history context
       to identify cases in a biomedical data warehouse.
       Journal of the American Medical Informatics Association : JAMIA, 24(3), 607-613.
       https://doi.org/10.1093/jamia/ocw144

   .. only:: latex

      
   ..
       !! processed by numpydoc !!
   .. py:method:: run(segments: list[medkit.core.text.Segment])

      
      Run the operation.

      Add a family attribute to each segment with a boolean value
      indicating if a family reference has been detected.

      Family attributes with a `True` value have a metadata dict with
      fields described in :class:`.FamilyMetadata`.

      :Parameters:

          **segments** : list of Segment
              List of segments to detect as being family references or not


      ..
          !! processed by numpydoc !!

   .. py:method:: _detect_family_ref_in_segment(segment: medkit.core.text.Segment) -> medkit.core.Attribute | None


   .. py:method:: _find_matching_rule(text: str) -> str | int | None


   .. py:method:: load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) -> list[FamilyDetectorRule]
      :staticmethod:

      
      Load all rules stored in a yml file.


      :Parameters:

          **path_to_rules** : Path
              Path to a yml file containing a list of mappings
              with the same structure as `FamilyDetectorRule`

          **encoding** : str, optional
              Encoding of the file to open

      :Returns:

          list of FamilyDetectorRule
              List of all the rules in `path_to_rules`,
              can be used to init a `FamilyDetector`
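
      As a rough sketch of what "a list of mappings with the same structure as
      `FamilyDetectorRule`" means, the snippet below builds rule objects from
      already-parsed mappings. ``RuleSketch`` is a hypothetical stand-in
      dataclass and the patterns are made up for illustration; the YAML parsing
      step itself is not shown.

      .. code-block:: python

         from __future__ import annotations

         from dataclasses import asdict, dataclass, field


         @dataclass
         class RuleSketch:
             # Hypothetical stand-in with the same fields as FamilyDetectorRule
             regexp: str
             exclusion_regexps: list[str] = field(default_factory=list)
             id: str | None = None
             case_sensitive: bool = False
             unicode_sensitive: bool = False


         # What a YAML parser would typically return for a rules file:
         # one mapping per rule, with optional fields simply left out.
         mappings = [
             {"id": "mother", "regexp": r"\bmere\b", "exclusion_regexps": [r"\bbelle[- ]mere\b"]},
             {"regexp": r"\bantecedents? familiaux\b"},
         ]

         # Loading: one rule object per mapping
         rules = [RuleSketch(**mapping) for mapping in mappings]

         # Saving goes the other way: one mapping per rule object
         dumped = [asdict(rule) for rule in rules]

      Fields omitted from a mapping take the defaults of the dataclass, so the
      second rule ends up with no id and no exclusion patterns.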


      ..
          !! processed by numpydoc !!

   .. py:method:: check_rules_sanity(rules: list[FamilyDetectorRule])
      :staticmethod:

      
      Check consistency of a set of rules.


      ..
          !! processed by numpydoc !!

   .. py:method:: save_rules(rules: list[FamilyDetectorRule], path_to_rules: pathlib.Path, encoding: str | None = None)
      :staticmethod:

      
      Store rules in a YAML file.


      :Parameters:

          **rules** : list of FamilyDetectorRule
              The rules to save

          **path_to_rules** : Path
              Path to a .yml file that will contain the rules

          **encoding** : str, optional
              Encoding of the .yml file


      ..
          !! processed by numpydoc !!


.. py:class:: FamilyDetectorRule


   Regexp-based rule to use with `FamilyDetector`.

   Input text may be converted before detecting rule.

   :Parameters:

       **regexp** : str
           The regexp pattern used to match a family reference

       **exclusion_regexps** : list of str, optional
           Optional exclusion patterns

       **id** : str, optional
           Unique identifier of the rule to store in the metadata of the entities

       **case_sensitive** : bool, default=False
           Whether to consider case when running `regexp` and `exclusion_regexps`

       **unicode_sensitive** : bool, default=False
           If True, rule matches are searched on unicode text.
           If False, rule matches are searched on closest ASCII text when possible
           (cf. FamilyDetector). In that case, `regexp` and `exclusion_regexps` shall
           not contain non-ASCII chars because they would never be matched.
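
   How `regexp`, `exclusion_regexps`, `id` and `case_sensitive` could interact
   is sketched below. This is a minimal illustration, not the detector's actual
   code: ``RuleSketch`` and ``find_matching_rule`` are hypothetical names, and
   the sketch assumes that an exclusion pattern found anywhere in the text
   cancels the match.

   .. code-block:: python

      from __future__ import annotations

      import re
      from dataclasses import dataclass, field


      @dataclass
      class RuleSketch:
          # Hypothetical stand-in for a regexp-based rule
          regexp: str
          exclusion_regexps: list[str] = field(default_factory=list)
          id: str | None = None
          case_sensitive: bool = False


      def find_matching_rule(rules: list[RuleSketch], text: str) -> str | int | None:
          """Return the id (or list index) of the first rule matching text, else None."""
          for index, rule in enumerate(rules):
              flags = 0 if rule.case_sensitive else re.IGNORECASE
              if not re.search(rule.regexp, text, flags):
                  continue
              # Assumption: any exclusion pattern found in the text cancels the match
              if any(re.search(p, text, flags) for p in rule.exclusion_regexps):
                  continue
              # Rules without an id are identified by their position in the list
              return rule.id if rule.id is not None else index
          return None


      rules = [
          RuleSketch(regexp=r"\bmere\b", exclusion_regexps=[r"\bbelle[- ]mere\b"], id="mother"),
          RuleSketch(regexp=r"\bpere\b"),
      ]
      print(find_matching_rule(rules, "Sa mere est diabetique"))        # mother
      print(find_matching_rule(rules, "Sa belle-mere est diabetique"))  # None
      print(find_matching_rule(rules, "Son PERE fume"))                 # 1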


   ..
       !! processed by numpydoc !!
   .. py:attribute:: regexp
      :type: str

      
   .. py:attribute:: exclusion_regexps
      :type: list[str]

      
   .. py:attribute:: id
      :type: str | None

      
   .. py:attribute:: case_sensitive
      :type: bool
      :value: False

      
   .. py:attribute:: unicode_sensitive
      :type: bool
      :value: False

      
   .. py:method:: __post_init__()


.. py:class:: FamilyMetadata


   Bases: :py:obj:`typing_extensions.TypedDict`

   
   Metadata dict added to family attributes with `True` value.


   :Parameters:

       **rule_id** : str or int
           Identifier of the rule used to detect a family reference.
           If the rule has no id, then the index of the rule in
           the list of rules is used instead.


   ..
       !! processed by numpydoc !!
   .. py:attribute:: rule_id
      :type: str | int

      
.. py:class:: HypothesisDetector(output_label: str = 'hypothesis', rules: list[HypothesisDetectorRule] | None = None, verbs: dict[str, dict[str, dict[str, list[str]]]] | None = None, modes_and_tenses: list[tuple[str, str]] | None = None, max_length: int = 150, uid: str | None = None)


   Bases: :py:obj:`medkit.core.text.ContextOperation`

   
   Annotator detecting and creating hypothesis attributes.

   Hypothesis will be considered present either because of the presence of a
   certain text pattern in a segment, or because of the usage of a certain verb
   at a specific mode and tense (for instance conditional).

   Because hypothesis attributes will be attached to whole segments,
   each input segment should be "local" enough (i.e. a sentence or a syntagma)
   rather than a big chunk of text.

   :Parameters:

       **output_label** : str, default="hypothesis"
           The label of the created attributes

       **rules** : list of HypothesisDetectorRule, optional
           The set of rules to use when detecting hypothesis. If none provided,
           the rules in "hypothesis_detector_default_rules.yml" will be used

       **verbs** : dict of str to dict, optional
           Conjugated verb forms, to be used in association with `modes_and_tenses`.
           Conjugated forms of a verb at a specific mode and tense must be provided
           in nested dicts with the 1st key being the verb's root, the 2nd key the mode
           and the 3rd key the tense.
           For instance `verbs["aller"]["indicatif"]["présent"]` would hold the list
           `["vais", "vas", "va", "allons", "allez", "vont"]`.
           When `verbs` is provided, `modes_and_tenses` must also be provided.
           If none provided, the verbs in "hypothesis_detector_default_verbs.yml" will
           be used.

       **modes_and_tenses** : list of tuple of str, optional
           List of tuples of all modes and tenses associated with hypothesis.
           Will be used to select conjugated forms in `verbs` that denote hypothesis.

       **max_length** : int, default=150
           Maximum number of characters in a hypothesis segment. Segments longer than
           this will never be considered as hypothesis

       **uid** : str, optional
           Identifier of the detector
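
   The way `verbs`, `modes_and_tenses` and `max_length` fit together can be
   sketched as follows. This is an illustration only: ``find_matching_verb`` is
   a hypothetical function, the verb table is a tiny hand-written sample, and
   whole-word, case-insensitive matching is an assumption of the sketch.

   .. code-block:: python

      from __future__ import annotations

      import re

      # Nested mapping: verb root -> mode -> tense -> conjugated forms
      verbs = {
          "aller": {
              "indicatif": {"présent": ["vais", "vas", "va", "allons", "allez", "vont"]},
              "conditionnel": {"présent": ["irais", "irait", "irions", "iriez", "iraient"]},
          },
      }
      # Only these (mode, tense) pairs are taken to denote a hypothesis
      modes_and_tenses = [("conditionnel", "présent")]


      def find_matching_verb(text: str, max_length: int = 150) -> str | None:
          """Return the root of the first verb with a selected form in text, else None."""
          # Segments longer than max_length are never considered as hypothesis
          if len(text) > max_length:
              return None
          for root, forms_by_mode in verbs.items():
              for mode, tense in modes_and_tenses:
                  for form in forms_by_mode.get(mode, {}).get(tense, []):
                      # Assumption: forms are matched as whole words, ignoring case
                      if re.search(rf"\b{re.escape(form)}\b", text, re.IGNORECASE):
                          return root
          return None


      print(find_matching_verb("Le patient irait mieux avec ce traitement"))  # aller
      print(find_matching_verb("Le patient va mieux"))                        # None

   The second sentence uses the indicative present, which is not listed in
   ``modes_and_tenses``, so no hypothesis is reported for it.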


   ..
       !! processed by numpydoc !!
   .. py:method:: run(segments: list[medkit.core.text.Segment])

      
      Run the operation.

      Add a hypothesis attribute to each segment with a boolean value
      indicating if a hypothesis has been detected.

      Hypothesis attributes with a `True` value have a metadata dict with
      fields described in either :class:`.HypothesisRuleMetadata` or :class:`.HypothesisVerbMetadata`.

      :Parameters:

          **segments** : list of Segment
              List of segments to detect as being hypothesis or not


      ..
          !! processed by numpydoc !!

   .. py:method:: _detect_hypothesis_in_segment(segment: medkit.core.text.Segment) -> medkit.core.Attribute | None


   .. py:method:: _find_matching_verb(text: str) -> str | None


   .. py:method:: _find_matching_rule(text: str) -> str | int | None


   .. py:method:: load_verbs(path_to_verbs: pathlib.Path, encoding: str | None = None) -> dict[str, dict[str, dict[str, list[str]]]]
      :staticmethod:

      
      Load all conjugated verb forms stored in a YAML file.

      Conjugated verb forms at a specific mode and tense must be stored in nested mappings
      with the 1st key being the verb root, the 2nd key the mode and the 3rd key the tense.

      :Parameters:

          **path_to_verbs** : Path
              Path to a yml file containing verb forms,
              arranged by mode and tense.

          **encoding** : str, optional
              Encoding of the file to open

      :Returns:

          dict of str to dict
              Conjugated verb forms in `path_to_verbs`,
              can be used to init a `HypothesisDetector`


      ..
          !! processed by numpydoc !!

   .. py:method:: load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) -> list[HypothesisDetectorRule]
      :staticmethod:

      
      Load all rules stored in a YAML file.


      :Parameters:

          **path_to_rules** : Path
              Path to a yml file containing a list of mappings
              with the same structure as `HypothesisDetectorRule`

          **encoding** : str, optional
              Encoding of the file to open

      :Returns:

          list of HypothesisDetectorRule
              List of all the rules in `path_to_rules`,
              can be used to init a `HypothesisDetector`


      ..
          !! processed by numpydoc !!

   .. py:method:: get_example() -> HypothesisDetector
      :classmethod:

      
      Instantiate a HypothesisDetector with example rules and verbs, designed for usage with EDS documents.


      ..
          !! processed by numpydoc !!

   .. py:method:: check_rules_sanity(rules: list[HypothesisDetectorRule])
      :staticmethod:

      
      Check consistency of a set of rules.


      ..
          !! processed by numpydoc !!

   .. py:method:: save_rules(rules: list[HypothesisDetectorRule], path_to_rules: pathlib.Path, encoding: str | None = None)
      :staticmethod:

      
      Store rules in a YAML file.


      :Parameters:

          **rules** : list of HypothesisDetectorRule
              The rules to save

          **path_to_rules** : Path
              Path to a .yml file that will contain the rules

          **encoding** : str, optional
              Encoding of the .yml file


      ..
          !! processed by numpydoc !!


.. py:class:: HypothesisDetectorRule


   Regexp-based rule to use with `HypothesisDetector`.


   :Attributes:

       **regexp** : str
           The regexp pattern used to match a hypothesis

       **exclusion_regexps** : list of str, optional
           Optional exclusion patterns

       **id** : str, optional
           Unique identifier of the rule to store in the metadata of the entities

       **case_sensitive** : bool, default=False
           Whether to consider case when running `regexp` and `exclusion_regexps`

       **unicode_sensitive** : bool, default=False
           Whether to run `regexp` and `exclusion_regexps` on the unicode input text.
           If False, all non-ASCII chars are replaced by the closest ASCII chars
           on input text before running `regexp` and `exclusion_regexps`, so they
           shouldn't contain non-ASCII chars because they would never be matched.


   ..
       !! processed by numpydoc !!
   .. py:attribute:: regexp
      :type: str

      
   .. py:attribute:: exclusion_regexps
      :type: list[str]

      
   .. py:attribute:: id
      :type: str | None

      
   .. py:attribute:: case_sensitive
      :type: bool
      :value: False

      
   .. py:attribute:: unicode_sensitive
      :type: bool
      :value: False

      
   .. py:method:: __post_init__()


.. py:class:: HypothesisRuleMetadata


   Bases: :py:obj:`typing_extensions.TypedDict`

   
   Metadata added to hypothesis attributes with `True` value detected by a rule.


   :Parameters:

       **type** : str
           Metadata type, here `"rule"` (used to differentiate
           between rule/verb metadata dicts)

       **rule_id** : str
           Identifier of the rule used to detect a hypothesis.
           If the rule has no id, then the index of the rule in
           the list of rules is used instead


   ..
       !! processed by numpydoc !!
   .. py:attribute:: type
      :type: typing_extensions.Literal[rule]

      
   .. py:attribute:: rule_id
      :type: str

      
.. py:class:: HypothesisVerbMetadata


   Bases: :py:obj:`typing_extensions.TypedDict`

   
   Metadata added to hypothesis attributes with `True` value detected by a verb.


   :Parameters:

       **type** : str
           Metadata type, here `"verb"` (used to differentiate
           between rule/verb metadata dicts).

       **matched_verb** : str
           Root of the verb used to detect a hypothesis.
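
   A consumer can use the `type` field to tell the two metadata dicts apart.
   The sketch below mirrors the documented fields with hypothetical stand-in
   classes (``RuleMetadataSketch``, ``VerbMetadataSketch``) rather than
   importing the real ones, and ``describe`` is a made-up helper.

   .. code-block:: python

      from typing import Literal, TypedDict, Union


      class RuleMetadataSketch(TypedDict):
          type: Literal["rule"]
          rule_id: str


      class VerbMetadataSketch(TypedDict):
          type: Literal["verb"]
          matched_verb: str


      def describe(metadata: Union[RuleMetadataSketch, VerbMetadataSketch]) -> str:
          """Summarize why a hypothesis was detected, based on the metadata type."""
          if metadata["type"] == "rule":
              return f"matched rule {metadata['rule_id']}"
          return f"matched verb {metadata['matched_verb']}"


      print(describe({"type": "rule", "rule_id": "id_si"}))       # matched rule id_si
      print(describe({"type": "verb", "matched_verb": "aller"}))  # matched verb aller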


   ..
       !! processed by numpydoc !!
   .. py:attribute:: type
      :type: typing_extensions.Literal[verb]

      
   .. py:attribute:: matched_verb
      :type: str

      
.. py:class:: NegationDetector(output_label: str, rules: list[NegationDetectorRule] | None = None, uid: str | None = None)


   Bases: :py:obj:`medkit.core.text.ContextOperation`

   
   Annotator creating negation attributes.

   Because negation attributes will be attached to whole annotations,
   each input annotation should be "local" enough (i.e. a sentence or a syntagma)
   rather than a big chunk of text.

   To detect negation, the module uses rules that may or may not be
   sensitive to unicode. When a rule is not unicode-sensitive, we try to convert
   unicode chars to the closest ASCII chars. However, some characters need to be
   pre-processed beforehand (e.g., `n°` -> `number`). So, if the converted and
   original texts have different lengths, we fall back on the initial unicode text
   for detection, even if the rule is not unicode-sensitive.
   In this case, a warning is logged recommending that the data be pre-processed.


   ..
       !! processed by numpydoc !!
   .. py:method:: run(segments: list[medkit.core.text.Segment])

      
      Run the operation.

      Add a negation attribute to each segment with a boolean value
      indicating if a negation has been found.

      Negation attributes with a `True` value have a metadata dict with
      fields described in :class:`.NegationMetadata`.

      :Parameters:

          **segments** : list of Segment
              List of segments to detect as being negated or not


      ..
          !! processed by numpydoc !!

   .. py:method:: _detect_negation_in_segment(segment: medkit.core.text.Segment) -> medkit.core.Attribute | None


   .. py:method:: _find_matching_rule(text: str) -> str | int | None


   .. py:method:: load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) -> list[NegationDetectorRule]
      :staticmethod:

      
      Load all rules stored in a yml file.


      :Parameters:

          **path_to_rules** : Path
              Path to a yml file containing a list of mappings
              with the same structure as `NegationDetectorRule`

          **encoding** : str, optional
              Encoding of the file to open

      :Returns:

          list of NegationDetectorRule
              List of all the rules in `path_to_rules`,
              can be used to init a `NegationDetector`


      ..
          !! processed by numpydoc !!

   .. py:method:: check_rules_sanity(rules: list[NegationDetectorRule])
      :staticmethod:

      
      Check consistency of a set of rules.


      ..
          !! processed by numpydoc !!

   .. py:method:: save_rules(rules: list[NegationDetectorRule], path_to_rules: pathlib.Path, encoding: str | None = None)
      :staticmethod:

      
      Store rules in a yml file.


      :Parameters:

          **rules** : list of NegationDetectorRule
              The rules to save

          **path_to_rules** : Path
              Path to a .yml file that will contain the rules

          **encoding** : str, optional
              Encoding of the .yml file


      ..
          !! processed by numpydoc !!


.. py:class:: NegationDetectorRule


   Regexp-based rule to use with `NegationDetector`.

   Input text may be converted before detecting rule.

   :Parameters:

       **regexp** : str
           The regexp pattern used to match a negation

       **exclusion_regexps** : list of str, optional
           Optional exclusion patterns

       **id** : str, optional
           Unique identifier of the rule to store in the metadata of the entities

       **case_sensitive** : bool, default=False
           Whether to consider case when running `regexp` and `exclusion_regexps`

       **unicode_sensitive** : bool, default=False
           If True, rule matches are searched on unicode text.
           If False, rule matches are searched on closest ASCII text when possible
           (cf. NegationDetector). In that case, `regexp` and `exclusion_regexps` shall
           not contain non-ASCII chars because they would never be matched.


   ..
       !! processed by numpydoc !!
   .. py:attribute:: regexp
      :type: str

      
   .. py:attribute:: exclusion_regexps
      :type: list[str]

      
   .. py:attribute:: id
      :type: str | None

      
   .. py:attribute:: case_sensitive
      :type: bool
      :value: False

      
   .. py:attribute:: unicode_sensitive
      :type: bool
      :value: False

      
   .. py:method:: __post_init__()


.. py:class:: NegationMetadata


   Bases: :py:obj:`typing_extensions.TypedDict`

   
   Metadata dict added to negation attributes with `True` value.


   :Parameters:

       **rule_id** : str or int
           Identifier of the rule used to detect a negation.
           If the rule has no id, then the index of the rule in
           the list of rules is used instead.


   ..
       !! processed by numpydoc !!
   .. py:attribute:: rule_id
      :type: str | int