medkit.text.ner.hf_entity_matcher_trainable
===========================================

.. py:module:: medkit.text.ner.hf_entity_matcher_trainable


Classes
-------

.. autoapisummary::

   medkit.text.ner.hf_entity_matcher_trainable.HFEntityMatcherTrainable


Module Contents
---------------

.. py:class:: HFEntityMatcherTrainable(model_name_or_path: str | pathlib.Path, labels: list[str], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'], tag_subtokens: bool = False, tokenizer_max_length: int | None = None, hf_auth_token: str | None = None, device: int = -1)

   
   Trainable entity matcher based on a HuggingFace transformers model.

   Any token classification model from the HuggingFace hub can be used
   (for instance "samrawal/bert-base-uncased_clinical-ner").

   :Parameters:

       **model_name_or_path** : str or Path
           Name (on the HuggingFace models hub) or path of the NER model. Must be a model compatible
           with the `TokenClassification` transformers class.

       **labels** : list of str
           List of entity labels to detect.

       **tagging_scheme** : {"bilou", "iob2"}
           Tagging scheme used when converting segment entities to tags during preprocessing
           and when defining the label mapping.

       **tag_subtokens** : bool, default=False
           Whether to tag all subtokens of a word. Pretrained models require a tokenization step:
           if a word of the segment is not in the vocabulary of the tokenizer used by the pretrained model,
           the word is split into subtokens.
           It is recommended to tag only the first subtoken of a word. However, all subtokens can be tagged
           by setting this value to `True`. This may affect both the duration and the results of fine-tuning.

       **tokenizer_max_length** : int, optional
           Optional max length for the tokenizer. By default, the `model_max_length` of the tokenizer is used.

       **hf_auth_token** : str, optional
           HuggingFace authentication token (to access private models on the
           hub).

       **device** : int, default=-1
           Device to use for the transformer model. Follows the HuggingFace convention
           (-1 for "cpu" and the device number for a GPU, for instance 0 for "cuda:0").


   ..
       !! processed by numpydoc !!

   .. py:attribute:: model_name_or_path


   .. py:attribute:: tagging_scheme


   .. py:attribute:: tag_subtokens
      :value: False
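
      The effect of this flag can be illustrated with a minimal sketch. It assumes
      word-level tag ids and a ``word_ids`` list shaped like the one exposed by
      HuggingFace fast tokenizers (``None`` for special tokens). The ``align_tags``
      helper and the use of ``-100`` as the ignored label id are assumptions made for
      illustration and are not taken from the medkit source.

      .. code-block:: python

         from __future__ import annotations

         # Label id skipped by default by PyTorch's cross-entropy loss (ignore_index).
         IGNORE_INDEX = -100


         def align_tags(
             word_tag_ids: list[int],
             word_ids: list[int | None],
             tag_subtokens: bool = False,
         ) -> list[int]:
             """Hypothetical helper: spread word-level tag ids over subtokens."""
             aligned = []
             previous = None
             for word_id in word_ids:
                 if word_id is None:
                     # Special token such as [CLS] or [SEP]: never tagged.
                     aligned.append(IGNORE_INDEX)
                 elif word_id != previous:
                     # First subtoken of a word: always tagged.
                     aligned.append(word_tag_ids[word_id])
                 else:
                     # Following subtokens: tagged only when tag_subtokens is True.
                     aligned.append(word_tag_ids[word_id] if tag_subtokens else IGNORE_INDEX)
                 previous = word_id
             return aligned


         # Three words; the second one is split into three subtokens.
         word_ids = [None, 0, 1, 1, 1, 2, None]
         word_tag_ids = [0, 1, 0]

         first_only = align_tags(word_tag_ids, word_ids)
         all_subtokens = align_tags(word_tag_ids, word_ids, tag_subtokens=True)

      In this sketch, tagging all subtokens simply copies the tag of the word to each
      of its subtokens; the actual class may handle continuation subtokens differently.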


   .. py:attribute:: tokenizer_max_length
      :value: None


   .. py:attribute:: model_config


   .. py:attribute:: label_to_id


   .. py:attribute:: id_to_label
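
      As a rough illustration of how a tagging scheme and a list of labels can
      determine a pair of mappings such as ``label_to_id`` and ``id_to_label``, the
      sketch below builds them by hand. The ``build_label_to_id`` helper, the example
      labels, and the ordering of the ids are assumptions; the mapping produced by
      the class itself may be ordered differently.

      .. code-block:: python

         from __future__ import annotations

         # Tag prefixes for each scheme; "O" (outside) is shared by both.
         PREFIXES = {
             "iob2": ["B", "I"],
             "bilou": ["B", "I", "L", "U"],
         }


         def build_label_to_id(labels: list[str], tagging_scheme: str) -> dict[str, int]:
             """Hypothetical helper: one id per (prefix, label) pair, plus "O"."""
             label_to_id = {"O": 0}
             for label in labels:
                 for prefix in PREFIXES[tagging_scheme]:
                     label_to_id[f"{prefix}-{label}"] = len(label_to_id)
             return label_to_id


         label_to_id = build_label_to_id(["problem", "treatment"], "iob2")
         id_to_label = {idx: tag for tag, idx in label_to_id.items()}

         # The BILOU scheme adds "L" (last) and "U" (unit) tags for each label.
         bilou_label_to_id = build_label_to_id(["problem", "treatment"], "bilou")

      With two labels, the ``iob2`` scheme yields five tags and the ``bilou`` scheme
      yields nine, which gives the size of the classification layer of the model.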


   .. py:attribute:: device
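
      The integer convention described for the ``device`` parameter can be sketched
      as a small conversion function. ``device_name`` is a hypothetical helper written
      for illustration; how the class stores or converts the device internally is not
      stated here.

      .. code-block:: python

         def device_name(device: int) -> str:
             """Hypothetical helper: map the integer convention to a device string."""
             # -1 (treated here like any negative value) selects the CPU,
             # a non-negative value selects the GPU with that index.
             return "cpu" if device < 0 else f"cuda:{device}"


         cpu = device_name(-1)
         first_gpu = device_name(0)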


   .. py:attribute:: _data_collator


   .. py:method:: configure_optimizer(lr: float) -> torch


   .. py:method:: preprocess(data_item: medkit.core.text.TextDocument) -> dict[str, Any]


   .. py:method:: _encode_text(text)

      
      Return an EncodingFast instance.


      ..
          !! processed by numpydoc !!


   .. py:method:: collate(batch: list[dict[str, Any]]) -> medkit.training.utils.BatchData


   .. py:method:: forward(input_batch: medkit.training.utils.BatchData, return_loss: bool, eval_mode: bool) -> tuple[medkit.training.utils.BatchData, torch | None]


   .. py:method:: save(path: str | pathlib.Path)


   .. py:method:: load(path: str | pathlib.Path, hf_auth_token: str | None = None)


   .. py:method:: _get_valid_model_config(labels: list[str], hf_auth_token: str | None = None)

      
      Return a config file with the correct mapping of labels.


      ..
          !! processed by numpydoc !!