medkit.text.ner.hf_tokenization_utils#
Functions#
| convert_labels_to_tags | Convert a list of labels into a mapping of NER tags. |
| transform_entities_to_tags | Transform entities from an encoded document into a list of BILOU/IOB2 tags. |
| align_and_map_tokens_with_tags | Return a list of tag ids aligned with the text encoding. |
Module Contents#
- medkit.text.ner.hf_tokenization_utils.convert_labels_to_tags(labels: list[str], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'] = 'bilou') → dict[str, int] #
Convert a list of labels into a mapping of NER tags.
- Parameters:
- labels : list of str
List of labels to convert.
- tagging_scheme : str, default="bilou"
Scheme to use in the conversion; "iob2" follows the BIO scheme.
- Returns:
- dict of str to int
Mapping from NER tag to integer id.
Examples
>>> convert_labels_to_tags(labels=["test", "problem"], tagging_scheme="iob2")
{'O': 0, 'B-test': 1, 'I-test': 2, 'B-problem': 3, 'I-problem': 4}
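The resulting mapping can then be used to configure a HuggingFace token-classification model; a minimal sketch (the checkpoint name and the num_labels/id2label/label2id arguments are standard transformers parameters shown for illustration, not part of medkit):
>>> tag_to_id = convert_labels_to_tags(labels=["test", "problem"], tagging_scheme="iob2")
>>> # Reverse mapping from id to tag, useful for model configuration
>>> id_to_tag = {tag_id: tag for tag, tag_id in tag_to_id.items()}
>>> from transformers import AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained(
...     "bert-base-uncased",
...     num_labels=len(tag_to_id),
...     id2label=id_to_tag,
...     label2id=tag_to_id,
... )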
- medkit.text.ner.hf_tokenization_utils.transform_entities_to_tags(text_encoding: transformers.tokenization_utils_fast.EncodingFast, entities: list[medkit.core.text.Entity], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'] = 'bilou') → list[str] #
Transform entities from an encoded document into a list of BILOU/IOB2 tags.
- Parameters:
- text_encoding : EncodingFast
Encoding of the reference document, created by a HuggingFace fast tokenizer. It contains a tokenized version of the document to tag.
- entities : list of Entity
The list of entities to transform.
- tagging_scheme : {"bilou", "iob2"}, default="bilou"
Scheme to use when tagging the tokens, either "bilou" or "iob2".
- Returns:
- list of str
A list of tags describing the document. By default the tags can be "B", "I", "L", "O", "U"; if tagging_scheme is "iob2", the tags can be "B", "I", "O".
Examples
>>> # Define a fast tokenizer, e.g. a BERT tokenizer
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> document = TextDocument(text="medkit")
>>> entities = [
...     Entity(label="corporation", spans=[Span(start=0, end=6)], text="medkit")
... ]
>>> # Get the text encoding of the document using the tokenizer
>>> text_encoding = tokenizer(document.text).encodings[0]
>>> print(text_encoding.tokens)
['[CLS]', 'med', '##kit', '[SEP]']
Transform to BILOU tags
>>> tags = transform_entities_to_tags(text_encoding, entities)
>>> assert tags == ["O", "B-corporation", "L-corporation", "O"]
Transform to IOB2 tags
>>> tags = transform_entities_to_tags(text_encoding, entities, "iob2")
>>> assert tags == ["O", "B-corporation", "I-corporation", "O"]
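The returned tags are positionally aligned with text_encoding.tokens, so the two can be inspected side by side; a small sketch reusing the objects defined above:
>>> # Print each token with its IOB2 tag
>>> for token, tag in zip(text_encoding.tokens, tags):
...     print(token, tag)
[CLS] O
med B-corporation
##kit I-corporation
[SEP] O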
- medkit.text.ner.hf_tokenization_utils.align_and_map_tokens_with_tags(text_encoding: transformers.tokenization_utils_fast.EncodingFast, tags: list[str], tag_to_id: dict[str, int], map_sub_tokens: bool = True) → list[int] #
Return a list of tag ids aligned with the text encoding.
Tags corresponding to special tokens (e.g. [CLS], [SEP]) are assigned SPECIAL_TAG_ID_HF (-100 in the examples below).
- Parameters:
- text_encoding : EncodingFast
Text encoding after tokenization with a HuggingFace fast tokenizer.
- tags : list of str
A list of tags, e.g. BILOU tags.
- tag_to_id : dict of str to int
Mapping from tag to id.
- map_sub_tokens : bool, default=True
When a token is not in the tokenizer's vocabulary, the tokenizer may split it into multiple subtokens. If map_sub_tokens is True, the tags of all subtokens within a token are converted to ids. If map_sub_tokens is False, only the first subtoken of a split token is converted.
- Returns:
- list of int
A list of tags ids
Examples
>>> # Define a fast tokenizer, e.g. a BERT tokenizer
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> # Define the data to map
>>> text_encoding = tokenizer("medkit").encodings[0]
>>> tags = ["O", "B-corporation", "I-corporation", "O"]
>>> tag_to_id = {"O": 0, "B-corporation": 1, "I-corporation": 2}
>>> print(text_encoding.tokens)
['[CLS]', 'med', '##kit', '[SEP]']
Mapping all tags to tag ids
>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id)
>>> assert tags_ids == [-100, 1, 2, -100]
Mapping only the first tag of each token
>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id, False)
>>> assert tags_ids == [-100, 1, -100, -100]
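Taken together, the three functions produce target ids suitable for training a token-classification model; a minimal sketch reusing the document, entities, tokenizer and text_encoding from the transform_entities_to_tags examples above (the asserted values follow from the examples on this page rather than from a fresh run):
>>> # Build the tag-to-id mapping from the corpus labels
>>> tag_to_id = convert_labels_to_tags(labels=["corporation"], tagging_scheme="iob2")
>>> # Tag the tokenized document from its entity annotations
>>> tags = transform_entities_to_tags(text_encoding, entities, tagging_scheme="iob2")
>>> # Align the tags with the encoding and map them to ids usable as training targets
>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id)
>>> assert tags_ids == [-100, 1, 2, -100]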