medkit.text.translation.hf_translator

This module requires extra dependencies that are not installed with the core dependencies of medkit. To install them, run `pip install medkit-lib[hf-translator]`.

Classes:

HFTranslator([output_label, ...])

Translator based on a HuggingFace transformers model

class HFTranslator(output_label='translation', translation_model='Helsinki-NLP/opus-mt-fr-en', alignment_model='bert-base-multilingual-cased', alignment_layer=8, alignment_threshold=0.001, device=-1, batch_size=1, hf_auth_token=None, cache_dir=None, uid=None)

Translator based on a HuggingFace transformers model

Any translation model from the HuggingFace hub can be used.

For each input segment, a translated segment is returned. The spans of the translated segment are “aligned” to the original segment: an alignment model finds matches between translated words and original words, and for each match a ModifiedSpan is created that references the corresponding span in the original text.
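The matching step can be pictured with a small sketch. It assumes one common rule (used, for instance, by awesome-align): compute dot-product similarities between token embeddings, apply a softmax in both directions, and keep the pairs whose probability exceeds the threshold both ways. This page does not spell out the exact rule, so treat the sketch as an illustration of what a threshold on embedding similarity can mean, not as medkit's implementation. The embeddings are toy vectors and `align_tokens` is a hypothetical helper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def align_tokens(src_vecs, tgt_vecs, threshold=1e-3):
    """Return (source_index, target_index) pairs judged to be aligned.

    Hypothetical helper: a pair is kept when its softmax probability
    exceeds `threshold` in both directions.
    """
    # Dot-product similarity between every source and target embedding
    sim = [[sum(a * b for a, b in zip(s, t)) for t in tgt_vecs] for s in src_vecs]
    # Normalize each row over target tokens, each column over source tokens
    src_to_tgt = [softmax(row) for row in sim]
    tgt_to_src = [softmax([row[j] for row in sim]) for j in range(len(tgt_vecs))]
    return [
        (i, j)
        for i in range(len(src_vecs))
        for j in range(len(tgt_vecs))
        if src_to_tgt[i][j] > threshold and tgt_to_src[j][i] > threshold
    ]


# Toy embeddings: source token 0 resembles target token 1, and vice versa
source = [[10.0, 0.0], [0.0, 10.0]]
target = [[0.0, 10.0], [10.0, 0.0]]
pairs = align_tokens(source, target)
print(pairs)
```

Under this rule, a lower threshold admits more (and noisier) matches, while a higher one leaves more translated words without a link to the original text.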

Each input segment should contain no more than one sentence: only the first sentence is translated and the others are discarded (this may vary with the model). Formatting is not preserved. Note that the translation and alignment models have a maximum token length (typically 512), so there is a hard limit on the length of each segment in any case.
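Because of the one-sentence constraint, longer text has to be split before translation. The following is a minimal sketch using a naive regular expression; `split_sentences` and `check_length` are hypothetical helpers, not part of medkit, and a proper sentence tokenizer would handle abbreviations such as "Dr." far better. The length check counts whitespace-separated words, which undercounts model tokens, so it is only a coarse guard.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_length(sentence, max_tokens=512):
    """Rough guard only: word counts underestimate subword token counts."""
    return len(sentence.split()) <= max_tokens


text = "Le patient tousse. Il a de la fièvre depuis trois jours."
sentences = split_sentences(text)
print(sentences)
```

Each resulting string can then be wrapped in its own segment before being passed to the translator.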

Parameters:

output_label (str, optional) – Label of the translated segments
translation_model (str or Path, optional) – Name (on the HuggingFace models hub) or path of the translation model. Must be a model compatible with the TranslationPipeline transformers class.
alignment_model (str or Path, optional) – Name (on the HuggingFace models hub) or path of the alignment model. Must be a multilingual BERT model compatible with the BertModel transformers class.
alignment_layer (int, default=8) – Index of the layer in the alignment model that contains the token embeddings (the original and translated embeddings will be compared)
alignment_threshold (float, default=1e-3) – Threshold value used to decide if embeddings are similar enough to be aligned
device (int, default=-1) – Device to use for transformers models. Follows the HuggingFace convention (-1 for “cpu” and device number for gpu, for instance 0 for “cuda:0”)
batch_size (int, default=1) – Number of segments per batch processed by the translation and alignment models
hf_auth_token (str, optional) – HuggingFace authentication token (to access private models on the hub)
cache_dir (str or Path, optional) – Directory in which to store downloaded models. If not set, the default HuggingFace cache dir is used.
uid (str, optional) – Identifier of the translator
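The `device` and `batch_size` conventions described above can be made concrete with a short sketch. Both functions are hypothetical helpers written for illustration; they only mirror the documented conventions (-1 for CPU, a non-negative index for a GPU, segments processed in groups of `batch_size`) and say nothing about how medkit implements them internally. Rejecting values below -1 is an assumption of this sketch.

```python
def device_name(device):
    """Map the integer device convention to a device string."""
    if device == -1:
        return "cpu"
    if device >= 0:
        return f"cuda:{device}"
    # Assumption of this sketch: other negative values are treated as errors
    raise ValueError(f"unsupported device index: {device}")


def make_batches(items, batch_size=1):
    """Split a list into consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


print(device_name(-1), device_name(0))
print(make_batches(["s1", "s2", "s3"], batch_size=2))
```

With the default `batch_size=1`, every segment forms its own batch; larger values trade memory for throughput, which mostly matters on a GPU.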

Methods:

`run`(segments)	Translate short segments (can't contain multiple sentences)
`set_prov_tracer`(prov_tracer)	Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)

Translate short segments (can’t contain multiple sentences)

Parameters: segments (list of Segment) – List of segments to translate
Return type: list[Segment]
Returns: list of Segment – Translated segments (with spans referring to words in original text, for translated words that have been aligned to original words)
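A typical call might look like the sketch below. It assumes that `Segment` and `Span` can be imported from `medkit.core.text` and constructed as shown, which this page does not state; the `"sentence"` label and the `translate_sentences` wrapper are hypothetical. Imports are placed inside the function because the `hf-translator` extra is optional, and the first call downloads the default models from the HuggingFace hub.

```python
def translate_sentences(sentences, **translator_kwargs):
    """Translate one-sentence strings and return (text, spans) per result.

    Hypothetical wrapper around HFTranslator.run(); keyword arguments are
    forwarded to the HFTranslator constructor.
    """
    # Lazy imports: these require `pip install medkit-lib[hf-translator]`
    from medkit.core.text import Segment, Span
    from medkit.text.translation.hf_translator import HFTranslator

    translator = HFTranslator(**translator_kwargs)
    # One segment per sentence, each spanning its whole text (assumed API)
    segments = [
        Segment(label="sentence", text=s, spans=[Span(0, len(s))])
        for s in sentences
    ]
    translated = translator.run(segments)
    # Spans of translated segments point back to the original text
    return [(seg.text, seg.spans) for seg in translated]
```

For example, `translate_sentences(["Le patient tousse."], batch_size=4)` would return one tuple per input sentence, with the translated text and the spans that link aligned words back to the original.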

property description: OperationDescription

Contains all the operation init parameters.

Return type: OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
