medkit.text.ner.hf_tokenization_utils
=====================================

.. py:module:: medkit.text.ner.hf_tokenization_utils


Functions
---------

.. autoapisummary::

   medkit.text.ner.hf_tokenization_utils.convert_labels_to_tags
   medkit.text.ner.hf_tokenization_utils.transform_entities_to_tags
   medkit.text.ner.hf_tokenization_utils.align_and_map_tokens_with_tags


Module Contents
---------------

.. py:function:: convert_labels_to_tags(labels: list[str], tagging_scheme: typing_extensions.Literal[bilou, iob2] = 'bilou') -> dict[str, int]

   
   Convert a list of labels into a mapping of NER tags to ids.


   :Parameters:

       **labels** : list of str
           List of labels to convert

       **tagging_scheme** : {"bilou", "iob2"}, default="bilou"
           Tagging scheme to use for the conversion. "iob2" follows the BIO scheme.


   :Returns:

       dict of str to int
           Mapping with NER tags.


   .. rubric:: Examples

   >>> convert_labels_to_tags(labels=["test", "problem"], tagging_scheme="iob2")
   {'O': 0, 'B-test': 1, 'I-test': 2, 'B-problem': 3, 'I-problem': 4}
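
   The mapping shown above can be reproduced with a short standalone sketch.
   The helper name ``sketch_labels_to_tags`` is hypothetical and this is not
   medkit's actual code. The "B", "I", "L", "U" prefix order used for "bilou"
   is an assumption; only the "iob2" output is confirmed by the example above.

   .. code-block:: python

       def sketch_labels_to_tags(labels, tagging_scheme="bilou"):
           """Illustrative re-implementation, not medkit's actual code."""
           # Assumed prefix order for "bilou"; the "iob2" order matches the
           # documented output.
           prefixes = ["B", "I", "L", "U"] if tagging_scheme == "bilou" else ["B", "I"]
           # "O" (outside any entity) always takes id 0.
           tag_to_id = {"O": 0}
           for label in labels:
               for prefix in prefixes:
                   # Ids are assigned in insertion order.
                   tag_to_id[f"{prefix}-{label}"] = len(tag_to_id)
           return tag_to_id


       mapping = sketch_labels_to_tags(["test", "problem"], tagging_scheme="iob2")
       print(mapping)

   Under these assumptions, each label contributes two ids with "iob2" and
   four with "bilou", plus the shared "O" tag.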

   ..
       !! processed by numpydoc !!

.. py:function:: transform_entities_to_tags(text_encoding: transformers.tokenization_utils_fast.EncodingFast, entities: list[medkit.core.text.Entity], tagging_scheme: typing_extensions.Literal[bilou, iob2] = 'bilou') -> list[str]

   
   Transform entities from an encoded document into a list of BILOU/IOB2 tags.


   :Parameters:

       **text_encoding** : EncodingFast
           Encoding of the reference document, created by a HuggingFace fast tokenizer.
           It contains a tokenized version of the document to tag.

       **entities** : list of Entity
           The list of entities to transform

       **tagging_scheme** : {"bilou", "iob2"}, default="bilou"
           Scheme used to tag the tokens, either `bilou` or `iob2`.


   :Returns:

       list of str
           A list of tags describing the document, one per token. By default,
           tags use the prefixes "B", "I", "L" and "U", plus the "O" tag. If
           `tagging_scheme` is `iob2`, tags use "B", "I" and "O".


   .. rubric:: Examples

   >>> # Define a fast tokenizer, e.g. the BERT tokenizer
   >>> from transformers import AutoTokenizer
   >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

   >>> from medkit.core.text import Entity, Span, TextDocument
   >>> document = TextDocument(text="medkit")
   >>> entities = [
   ...     Entity(label="corporation", spans=[Span(start=0, end=6)], text="medkit")
   ... ]
   >>> # Get text encoding of the document using the tokenizer
   >>> text_encoding = tokenizer(document.text).encodings[0]
   >>> print(text_encoding.tokens)
   ['[CLS]', 'med', '##kit', '[SEP]']

   Transform to BILOU tags

   >>> tags = transform_entities_to_tags(text_encoding, entities)
   >>> assert tags == ["O", "B-corporation", "L-corporation", "O"]

   Transform to IOB2 tags

   >>> tags = transform_entities_to_tags(text_encoding, entities, "iob2")
   >>> assert tags == ["O", "B-corporation", "I-corporation", "O"]
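
   To illustrate how character-level entity spans can be projected onto
   tokens, here is a simplified standalone sketch. It is not medkit's
   implementation: the helper ``sketch_entities_to_tags`` is hypothetical,
   entities are plain ``(start, end, label)`` tuples, and token offsets are
   passed as a list of ``(start, end)`` pairs, with special tokens assumed to
   have the empty offset ``(0, 0)``.

   .. code-block:: python

       def sketch_entities_to_tags(offsets, entities, tagging_scheme="bilou"):
           """Tag tokens from character offsets (illustrative only).

           offsets: one (start, end) pair per token, (0, 0) for special tokens.
           entities: (start, end, label) character spans, assumed non-overlapping.
           """
           tags = ["O"] * len(offsets)
           for ent_start, ent_end, label in entities:
               # Indices of non-special tokens lying fully inside the entity span.
               idx = [
                   i
                   for i, (start, end) in enumerate(offsets)
                   if start < end and start >= ent_start and end <= ent_end
               ]
               if not idx:
                   continue
               # In BILOU, an entity covered by a single token gets the "U" prefix.
               if tagging_scheme == "bilou" and len(idx) == 1:
                   tags[idx[0]] = f"U-{label}"
                   continue
               tags[idx[0]] = f"B-{label}"
               for i in idx[1:]:
                   tags[i] = f"I-{label}"
               # In BILOU, the last token of a multi-token entity gets "L".
               if tagging_scheme == "bilou":
                   tags[idx[-1]] = f"L-{label}"
           return tags


       # Offsets for ['[CLS]', 'med', '##kit', '[SEP]'] over the text "medkit".
       offsets = [(0, 0), (0, 3), (3, 6), (0, 0)]
       entities = [(0, 6, "corporation")]
       bilou_tags = sketch_entities_to_tags(offsets, entities)
       iob2_tags = sketch_entities_to_tags(offsets, entities, "iob2")

   With these inputs, the sketch yields the same tag lists as the two
   examples above.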

   ..
       !! processed by numpydoc !!

.. py:function:: align_and_map_tokens_with_tags(text_encoding: transformers.tokenization_utils_fast.EncodingFast, tags: list[str], tag_to_id: dict[str, int], map_sub_tokens: bool = True) -> list[int]

   
   Return a list of tag ids aligned with the text encoding.

   Tags aligned with special tokens are set to `SPECIAL_TAG_ID_HF`.

   :Parameters:

       **text_encoding** : EncodingFast
           Text encoding after tokenization with a HuggingFace fast tokenizer

       **tags** : list of str
           A list of tags, e.g. BILOU tags

       **tag_to_id** : dict of str to int
           Mapping from tag to id

       **map_sub_tokens** : bool, default=True
           A tokenizer may split a token that is not in its vocabulary into
           multiple subtokens.
           If `map_sub_tokens` is True, the tags of all subtokens are converted.
           If `map_sub_tokens` is False, only the tag of the first subtoken of a
           split token is converted; the remaining subtokens get `SPECIAL_TAG_ID_HF`.


   :Returns:

       list of int
           A list of tag ids


   .. rubric:: Examples

   >>> # Define a fast tokenizer, e.g. the BERT tokenizer
   >>> from transformers import AutoTokenizer
   >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

   >>> # define data to map
   >>> text_encoding = tokenizer("medkit").encodings[0]
   >>> tags = ["O", "B-corporation", "I-corporation", "O"]
   >>> tag_to_id = {"O": 0, "B-corporation": 1, "I-corporation": 2}
   >>> print(text_encoding.tokens)
   ['[CLS]', 'med', '##kit', '[SEP]']

   Map all tags to tag ids

   >>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id)
   >>> assert tags_ids == [-100, 1, 2, -100]

   Map only the tag of the first subtoken of each token

   >>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id, False)
   >>> assert tags_ids == [-100, 1, -100, -100]
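
   The alignment logic can be sketched without a tokenizer by working
   directly on word indices. This is a hypothetical illustration, not
   medkit's implementation: ``sketch_align_tags`` and ``SPECIAL_TAG_ID`` are
   made-up names, ``-100`` is taken from the examples above, and
   ``word_ids`` is assumed to hold one word index per token, with ``None``
   for special tokens.

   .. code-block:: python

       SPECIAL_TAG_ID = -100  # value shown for ignored tokens in the examples above


       def sketch_align_tags(word_ids, tags, tag_to_id, map_sub_tokens=True):
           """Map one tag per token to ids (illustrative only)."""
           tag_ids = []
           previous_word = None
           for word_id, tag in zip(word_ids, tags):
               if word_id is None:
                   # Special tokens such as [CLS] and [SEP] are always ignored.
                   tag_ids.append(SPECIAL_TAG_ID)
               elif word_id != previous_word or map_sub_tokens:
                   # First subtoken of a word, or every subtoken when requested.
                   tag_ids.append(tag_to_id[tag])
               else:
                   # Later subtokens of a split word are ignored.
                   tag_ids.append(SPECIAL_TAG_ID)
               previous_word = word_id
           return tag_ids


       # Assumed word indices for ['[CLS]', 'med', '##kit', '[SEP]'].
       word_ids = [None, 0, 0, None]
       tags = ["O", "B-corporation", "I-corporation", "O"]
       tag_to_id = {"O": 0, "B-corporation": 1, "I-corporation": 2}
       all_ids = sketch_align_tags(word_ids, tags, tag_to_id)
       first_only_ids = sketch_align_tags(word_ids, tags, tag_to_id, map_sub_tokens=False)

   Using ``-100`` for ignored positions matches the default ``ignore_index``
   of PyTorch's cross-entropy loss, so those positions do not contribute to
   the training loss.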

   ..
       !! processed by numpydoc !!