medkit.text.translation.hf_translator
=====================================

.. py:module:: medkit.text.translation.hf_translator


Classes
-------

.. autoapisummary::

   medkit.text.translation.hf_translator.HFTranslator


Module Contents
---------------

.. py:class:: HFTranslator(output_label: str = _DEFAULT_LABEL, translation_model: str | pathlib.Path = _DEFAULT_TRANSLATION_MODEL, alignment_model: str | pathlib.Path = _DEFAULT_ALIGNMENT_MODEL, alignment_layer: int = 8, alignment_threshold: float = 0.001, device: int = -1, batch_size: int = 1, hf_auth_token: str | None = None, cache_dir: str | pathlib.Path | None = None, uid: str | None = None)

   Bases: :py:obj:`medkit.core.Operation`


   Translator based on a HuggingFace transformers model.

   Any translation model from the HuggingFace hub can be used.

   For each segment given as input, a translated segment is returned.
   The spans of the translated segment are "aligned" to the original segment.
   An alignment model is used to find matches between translated words and
   original words, and for each of these matches a :class:`~medkit.core.text.ModifiedSpan`
   is created, referencing the original span in the original text.

   Each input segment should contain no more than one sentence, because only the first
   sentence will be translated and the others will be discarded (this may vary with the model).
   Formatting is not preserved. Note that the translation and alignment models have a
   maximum token length (typically 512), so there is a hard limit on the length of each segment anyway.
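
   Because only the first sentence of a segment is reliably translated, longer text
   has to be split upstream. The sketch below is a minimal illustration using a naive
   regular expression. ``split_sentences`` is a hypothetical helper, not part of medkit,
   and a real pipeline would more likely rely on a dedicated sentence tokenizer. It
   returns each sentence with its character offsets, so that one segment per sentence
   can be built.

   .. code-block:: python

      import re

      # Naive rule (assumption): a sentence is a run of text ending with
      # ".", "!" or "?", or reaching the end of the text. Abbreviations
      # such as "Dr." will be split incorrectly.
      _SENTENCE_PATTERN = re.compile(r"[^.!?]+[.!?]*")


      def split_sentences(text):
          """Return (start, end, sentence) tuples, one per sentence found in text."""
          sentences = []
          for match in _SENTENCE_PATTERN.finditer(text):
              raw = match.group()
              stripped = raw.strip()
              if not stripped:
                  continue
              # Shift the start offset past any leading whitespace
              start = match.start() + (len(raw) - len(raw.lstrip()))
              end = start + len(stripped)
              sentences.append((start, end, stripped))
          return sentences


      text = "Le patient tousse. Il a de la fièvre."
      for start, end, sentence in split_sentences(text):
          print(start, end, sentence)

   Keeping the offsets makes it possible to give each per-sentence segment a span
   that still points into the full document.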

   :Parameters:

       **output_label** : str, optional
           Label of the translated segments

       **translation_model** : str or Path, optional
           Name (on the HuggingFace models hub) or path of the translation model. Must be a model compatible
           with the `TranslationPipeline` transformers class.

       **alignment_model** : str or Path, optional
           Name (on the HuggingFace models hub) or path of the alignment model. Must be a multilingual BERT model
           compatible with the `BertModel` transformers class.

       **alignment_layer** : int, default=8
           Index of the layer in the alignment model that contains the token embeddings
           (the original and translated embeddings will be compared)

       **alignment_threshold** : float, default=1e-3
           Threshold value used to decide if embeddings are similar enough to be aligned

       **device** : int, default=-1
           Device to use for transformers models. Follows the HuggingFace convention
           (-1 for "cpu" and the device number for a GPU, for instance 0 for "cuda:0")

       **batch_size** : int, default=1
           Number of segments per batch processed by the translation and alignment models

       **hf_auth_token** : str, optional
           HuggingFace authentication token (to access private models on the
           hub)

       **cache_dir** : str or Path, optional
           Directory in which to store downloaded models. If not set, the default
           HuggingFace cache dir is used.

       **uid** : str, optional
           Identifier of the translator
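
   To give an intuition for ``alignment_layer`` and ``alignment_threshold``, the sketch
   below shows one common way of aligning tokens from their embeddings: similarity
   scores are normalized with a softmax in both directions, and a pair is kept only
   when both probabilities exceed the threshold. This illustrates the general
   technique under stated assumptions and is not a description of medkit's internal
   aligner. ``align_tokens`` and the toy embeddings are hypothetical.

   .. code-block:: python

      import math


      def softmax(values):
          """Numerically stable softmax over a list of floats."""
          highest = max(values)
          exps = [math.exp(v - highest) for v in values]
          total = sum(exps)
          return [e / total for e in exps]


      def align_tokens(original_embeddings, translated_embeddings, threshold=1e-3):
          """Return (original_index, translated_index) pairs considered aligned."""
          # Dot-product similarity between every original and translated token
          similarity = [
              [sum(a * b for a, b in zip(orig, trans)) for trans in translated_embeddings]
              for orig in original_embeddings
          ]
          # Probability of each translated token given an original token (row-wise)
          forward = [softmax(row) for row in similarity]
          # Probability of each original token given a translated token (column-wise)
          backward = [softmax(list(column)) for column in zip(*similarity)]
          return [
              (i, j)
              for i in range(len(original_embeddings))
              for j in range(len(translated_embeddings))
              if forward[i][j] > threshold and backward[j][i] > threshold
          ]


      # Toy 2-dimensional embeddings standing in for the hidden states that
      # would be read at ``alignment_layer`` (assumption for illustration)
      original = [[10.0, 0.0], [0.0, 10.0]]
      translated = [[0.0, 10.0], [10.0, 0.0]]
      print(align_tokens(original, translated))

   In this scheme, raising the threshold keeps fewer and more confident pairs, while
   lowering it aligns more words at the cost of more spurious matches.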


   ..
       !! processed by numpydoc !!

   .. py:attribute:: _DEFAULT_LABEL
      :value: 'translation'


   .. py:attribute:: _DEFAULT_TRANSLATION_MODEL
      :value: 'Helsinki-NLP/opus-mt-fr-en'


   .. py:attribute:: _DEFAULT_ALIGNMENT_MODEL
      :value: 'bert-base-multilingual-cased'


   .. py:attribute:: output_label
      :value: 'translation'


   .. py:attribute:: translation_model
      :value: 'Helsinki-NLP/opus-mt-fr-en'


   .. py:attribute:: alignment_model
      :value: 'bert-base-multilingual-cased'


   .. py:attribute:: alignment_layer
      :value: 8


   .. py:attribute:: alignment_threshold
      :value: 0.001


   .. py:attribute:: device
      :value: -1


   .. py:attribute:: batch_size
      :value: 1


   .. py:attribute:: _translation_pipeline


   .. py:attribute:: _aligner


   .. py:method:: run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]

      
      Translate short segments (each must not contain more than one sentence).


      :Parameters:

          **segments** : list of Segment
              List of segments to translate


      :Returns:

          list of Segment
              Translated segments (with spans referring to words in original text, for translated
              words that have been aligned to original words)


      ..
          !! processed by numpydoc !!


   .. py:method:: _translate_segments(segments: list[medkit.core.text.Segment]) -> Iterator[medkit.core.text.Segment]


   .. py:method:: _get_translated_spans(alignment, translated_text, original_text, original_spans)

      
      Compute spans for translated segments.

      Translated words are made to reference words in the original text through ModifiedSpans when possible.
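
      As a rough illustration of the idea (not the actual implementation), the sketch
      below maps each word of the translated text to the character range of the
      original word it is aligned with, or to ``None`` when no alignment exists.
      ``word_ranges`` and ``map_translated_words`` are hypothetical helpers, plain
      tuples stand in for medkit span objects, and the alignment is assumed to be a
      dict from translated word index to original word index.

      .. code-block:: python

         import re


         def word_ranges(text):
             """Return the (start, end) range of each whitespace-separated word."""
             return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


         def map_translated_words(alignment, translated_text, original_text):
             """Pair each translated word range with its aligned original range."""
             original = word_ranges(original_text)
             mapping = []
             for index, translated_range in enumerate(word_ranges(translated_text)):
                 original_index = alignment.get(index)
                 # Unaligned words get no reference to the original text
                 if original_index is None:
                     original_range = None
                 else:
                     original_range = original[original_index]
                 mapping.append((translated_range, original_range))
             return mapping


         original_text = "douleur thoracique"
         translated_text = "chest pain"
         # Assumed alignment: "chest" -> "thoracique", "pain" -> "douleur"
         alignment = {0: 1, 1: 0}
         for translated_range, original_range in map_translated_words(
             alignment, translated_text, original_text
         ):
             print(translated_range, original_range)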


      ..
          !! processed by numpydoc !!