medkit.text.ner.hf_entity_matcher
=================================

.. py:module:: medkit.text.ner.hf_entity_matcher


Classes
-------

.. autoapisummary::

   medkit.text.ner.hf_entity_matcher.HFEntityMatcher


Module Contents
---------------

.. py:class:: HFEntityMatcher(model: str | pathlib.Path, aggregation_strategy: typing_extensions.Literal['none', 'simple', 'first', 'average', 'max'] = 'max', attrs_to_copy: list[str] | None = None, device: int = -1, batch_size: int = 1, hf_auth_token: str | None = None, cache_dir: str | pathlib.Path | None = None, name: str | None = None, uid: str | None = None)

   Bases: :py:obj:`medkit.core.text.NEROperation`


   Entity matcher based on a HuggingFace transformers model.

   Any token classification model from the HuggingFace hub can be used
   (for instance "samrawal/bert-base-uncased_clinical-ner").
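
   A minimal usage sketch, assuming medkit is installed together with its
   HuggingFace dependencies and that the model can be downloaded from the hub.
   The helper name ``find_entities`` is hypothetical and only serves the
   illustration; the medkit imports are placed inside it so that they are
   loaded only when it is called.

   .. code-block:: python

      def find_entities(text, model_name="samrawal/bert-base-uncased_clinical-ner"):
          """Return (label, text) pairs for the entities found in ``text``."""
          # Imported lazily: these pull in transformers and its backend.
          from medkit.core.text import TextDocument
          from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

          doc = TextDocument(text=text)
          matcher = HFEntityMatcher(model=model_name)
          # The raw segment covers the full text of the document.
          entities = matcher.run([doc.raw_segment])
          return [(entity.label, entity.text) for entity in entities]

   Calling ``find_entities("The patient has asthma.")`` would download the
   model on first use; the labels returned depend on the chosen model.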

   :Parameters:

       **model** : str or Path
           Name (on the HuggingFace models hub) or path of the NER model. Must be a model compatible
           with the `TokenClassification` transformers class.

       **aggregation_strategy** : str, default="max"
           Strategy used to merge token-level predictions into entities, passed to
           `TokenClassificationPipeline`. See
           https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy
           for details.

       **attrs_to_copy** : list of str, optional
           Labels of the attributes that should be copied from the input segment
           to the created entity. Useful for propagating context attributes
           (negation, antecedent, etc.).

       **device** : int, default=-1
           Device to use for the transformer model. Follows the HuggingFace convention
           (-1 for "cpu", otherwise the GPU device number, for instance 0 for "cuda:0").

       **batch_size** : int, default=1
           Number of segments per batch passed to the transformer model.

       **hf_auth_token** : str, optional
           HuggingFace authentication token (to access private models on the
           hub).

       **cache_dir** : str or Path, optional
           Directory in which to store downloaded models. If not set, the default
           HuggingFace cache dir is used.

       **name** : str, optional
           Name describing the matcher (defaults to the class name).

       **uid** : str, optional
           Identifier of the matcher.
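
   How the word-level values of ``aggregation_strategy`` differ can be sketched
   in plain Python. This is only an illustration of the behavior described in
   the transformers documentation, not the pipeline's actual code; the function
   ``aggregate_word`` and the scores are made up, and the ``"none"`` and
   ``"simple"`` strategies (which do not resolve conflicts inside a word) are
   left out.

   .. code-block:: python

      from collections import defaultdict


      def aggregate_word(token_scores, strategy="max"):
          """Pick one label for a word from the scores of its sub-tokens.

          token_scores: one dict per sub-token, mapping label -> score.
          """
          if strategy == "first":
              # The first sub-token decides for the whole word.
              scores = token_scores[0]
          elif strategy == "max":
              # The sub-token holding the single highest score decides.
              scores = max(token_scores, key=lambda s: max(s.values()))
          elif strategy == "average":
              # Scores are averaged over sub-tokens before picking a label.
              totals = defaultdict(float)
              for s in token_scores:
                  for label, score in s.items():
                      totals[label] += score
              scores = {label: t / len(token_scores) for label, t in totals.items()}
          else:
              raise ValueError(f"unsupported strategy: {strategy}")
          return max(scores, key=scores.get)


      # A word split into two sub-tokens with conflicting predictions.
      word = [
          {"problem": 0.55, "O": 0.45},
          {"problem": 0.10, "O": 0.90},
      ]
      print(aggregate_word(word, "first"))    # problem
      print(aggregate_word(word, "max"))      # O
      print(aggregate_word(word, "average"))  # O

   With these scores, ``"first"`` keeps the label of the first sub-token, while
   ``"max"`` and ``"average"`` follow the more confident second sub-token.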



   .. py:attribute:: model


   .. py:attribute:: attrs_to_copy
      :value: None


   .. py:attribute:: _pipeline


   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Entity]

      
      Return entities for each match in `segments`.


      :Parameters:

          **segments** : list of Segment
              List of segments in which to look for matches.


      :Returns:

          list of Entity
              Entities found in `segments`.
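
      The sketch below shows how ``run`` might be combined with
      ``attrs_to_copy`` so that a context attribute set on the input segment
      ends up on each entity. It assumes medkit's ``Attribute`` class and the
      ``attrs.add`` / ``attrs.get`` accessors; the helper name
      ``run_with_context`` and the ``is_negated`` label are hypothetical.

      .. code-block:: python

         def run_with_context(text, model_name="samrawal/bert-base-uncased_clinical-ner"):
             """Return (label, text, is_negated values) for each entity found."""
             # Imported lazily: these pull in transformers and its backend.
             from medkit.core import Attribute
             from medkit.core.text import TextDocument
             from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

             doc = TextDocument(text=text)
             segment = doc.raw_segment
             # Context attribute normally produced by an upstream operation.
             segment.attrs.add(Attribute(label="is_negated", value=False))

             matcher = HFEntityMatcher(model=model_name, attrs_to_copy=["is_negated"])
             entities = matcher.run([segment])
             return [
                 (e.label, e.text, [a.value for a in e.attrs.get(label="is_negated")])
                 for e in entities
             ]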




   .. py:method:: _matches_to_entities(matches: list[dict], segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Entity]


   .. py:method:: make_trainable(model_name_or_path: str | pathlib.Path, labels: list[str], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'], tag_subtokens: bool = False, tokenizer_max_length: int | None = None, hf_auth_token: str | None = None, device: int = -1)
      :staticmethod:


      Return the trainable component of the operation.

      This component can be trained using :class:`~medkit.training.Trainer`, and then
      used in a new `HFEntityMatcher` operation.
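
      To make the ``labels`` and ``tagging_scheme`` arguments more concrete,
      the following plain-Python sketch lists the tags that each scheme
      conventionally derives from a set of entity labels. The helper
      ``build_tags`` is hypothetical, and the order medkit uses internally may
      differ.

      .. code-block:: python

         def build_tags(labels, tagging_scheme):
             """List the token-level tags implied by entity labels and a scheme."""
             prefixes = {
                 "iob2": ["B", "I"],             # Begin, Inside
                 "bilou": ["B", "I", "L", "U"],  # Begin, Inside, Last, Unit
             }[tagging_scheme]
             tags = ["O"]  # "Outside": token belongs to no entity
             for label in labels:
                 tags.extend(f"{prefix}-{label}" for prefix in prefixes)
             return tags


         print(build_tags(["problem", "treatment"], "iob2"))
         # ['O', 'B-problem', 'I-problem', 'B-treatment', 'I-treatment']

      A matching call would then look like
      ``HFEntityMatcher.make_trainable(model_name_or_path="samrawal/bert-base-uncased_clinical-ner", labels=["problem", "treatment"], tagging_scheme="iob2")``.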

