medkit.text.ner.hf_tokenization_utils#

Functions#

convert_labels_to_tags(→ dict[str, int])

Convert a list of labels into a mapping of NER tags.

transform_entities_to_tags(→ list[str])

Transform entities from an encoded document into a list of BILOU/IOB2 tags.

align_and_map_tokens_with_tags(→ list[int])

Return a list of tag ids aligned with the text encoding.

Module Contents#

medkit.text.ner.hf_tokenization_utils.convert_labels_to_tags(labels: list[str], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'] = 'bilou') → dict[str, int]#

Convert a list of labels into a mapping of NER tags.

Parameters:
labels : list of str

List of labels to convert.

tagging_scheme : {"bilou", "iob2"}, default="bilou"

Scheme to use for the conversion; "iob2" follows the BIO scheme.

Returns:
dict of str to int

Mapping with NER tags.

Examples

>>> convert_labels_to_tags(labels=["test", "problem"], tagging_scheme="iob2")
{'O': 0, 'B-test': 1, 'I-test': 2, 'B-problem': 3, 'I-problem': 4}
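
The returned mapping can also be inverted to recover tag names from ids, e.g. when configuring a HuggingFace token-classification model, whose configs conventionally expect label2id and id2label dictionaries. A minimal sketch:

>>> label2id = convert_labels_to_tags(labels=["test", "problem"], tagging_scheme="iob2")
>>> id2label = {id_: tag for tag, id_ in label2id.items()}
>>> id2label[1]
'B-test'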
medkit.text.ner.hf_tokenization_utils.transform_entities_to_tags(text_encoding: transformers.tokenization_utils_fast.EncodingFast, entities: list[medkit.core.text.Entity], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'] = 'bilou') → list[str]#

Transform entities from an encoded document into a list of BILOU/IOB2 tags.

Parameters:
text_encoding : EncodingFast

Encoding of the reference document, created by a HuggingFace fast tokenizer. It contains a tokenized version of the document to tag.

entities : list of Entity

The list of entities to transform.

tagging_scheme : {"bilou", "iob2"}, default="bilou"

Scheme used to tag the tokens; either "bilou" or "iob2".

Returns:
list of str

A list of tags describing the document. By default the tags can be "B", "I", "L", "O" and "U"; if tagging_scheme is "iob2", the tags can be "B", "I" and "O".

Examples

>>> # Define a fast tokenizer, e.g. a BERT tokenizer
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> from medkit.core.text import TextDocument, Entity, Span
>>> document = TextDocument(text="medkit")
>>> entities = [
...     Entity(label="corporation", spans=[Span(start=0, end=6)], text="medkit")
... ]
>>> # Get text encoding of the document using the tokenizer
>>> text_encoding = tokenizer(document.text).encodings[0]
>>> print(text_encoding.tokens)
['[CLS]', 'med', '##kit', '[SEP]']

Transform to BILOU tags

>>> tags = transform_entities_to_tags(text_encoding, entities)
>>> assert tags == ["O", "B-corporation", "L-corporation", "O"]

Transform to IOB2 tags

>>> tags = transform_entities_to_tags(text_encoding, entities, "iob2")
>>> assert tags == ["O", "B-corporation", "I-corporation", "O"]
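
In a typical preprocessing pipeline, these tags are then converted to integer ids with convert_labels_to_tags and align_and_map_tokens_with_tags (documented below). A minimal sketch reusing the objects defined above, assuming the default bilou mapping contains an entry for every tag produced:

>>> tag_to_id = convert_labels_to_tags(labels=["corporation"])
>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id)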
medkit.text.ner.hf_tokenization_utils.align_and_map_tokens_with_tags(text_encoding: transformers.tokenization_utils_fast.EncodingFast, tags: list[str], tag_to_id: dict[str, int], map_sub_tokens: bool = True) → list[int]#

Return a list of tag ids aligned with the text encoding.

Tags corresponding to special tokens are assigned SPECIAL_TAG_ID_HF (-100 in the examples below).

Parameters:
text_encoding : EncodingFast

Text encoding after tokenization with a HuggingFace fast tokenizer.

tags : list of str

A list of tags, e.g. BILOU tags.

tag_to_id : dict of str to int

Mapping from tag to tag id.

map_sub_tokens : bool, default=True

When a token is not in the tokenizer's vocabulary, the tokenizer may split it into multiple subtokens. If map_sub_tokens is True, all tags inside a token are converted; if map_sub_tokens is False, only the tag of the first subtoken of a split token is converted.

Returns:
list of int

A list of tag ids.

Examples

>>> # Define a fast tokenizer, e.g. a BERT tokenizer
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> # Define the data to map
>>> text_encoding = tokenizer("medkit").encodings[0]
>>> tags = ["O", "B-corporation", "I-corporation", "O"]
>>> tag_to_id = {"O": 0, "B-corporation": 1, "I-corporation": 2}
>>> print(text_encoding.tokens)
['[CLS]', 'med', '##kit', '[SEP]']

Mapping all tags to tags_ids

>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id)
>>> assert tags_ids == [-100, 1, 2, -100]

Mapping only first tag in tokens

>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id, False)
>>> assert tags_ids == [-100, 1, -100, -100]
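
These tag ids can be fed directly as labels to a HuggingFace token-classification model: -100 is the default ignore_index of PyTorch's CrossEntropyLoss, so special tokens (and, with map_sub_tokens=False, trailing subtokens) are excluded from the loss. A minimal sketch, assuming PyTorch is installed and using dummy logits in place of real model output:

>>> import torch
>>> logits = torch.randn(1, len(tags_ids), len(tag_to_id))  # dummy model output
>>> labels = torch.tensor([tags_ids])
>>> loss = torch.nn.CrossEntropyLoss()(logits.view(-1, len(tag_to_id)), labels.view(-1))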