medkit.text.ner.hf_tokenization_utils#
Functions#
| convert_labels_to_tags | Convert a list of labels into a mapping of NER tags. |
| transform_entities_to_tags | Transform entities from an encoded document into a list of BILOU/IOB2 tags. |
| align_and_map_tokens_with_tags | Return a list of tag ids aligned with the text encoding. |
Module Contents#
- medkit.text.ner.hf_tokenization_utils.convert_labels_to_tags(labels: list[str], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'] = 'bilou') → dict[str, int] #
Convert a list of labels into a mapping of NER tags.
- Parameters:
- labels : list of str
List of labels to convert.
- tagging_scheme : str, default="bilou"
Scheme to use in the conversion; "iob2" follows the BIO scheme.
- Returns:
- dict of str to int
Mapping from NER tag to integer id.
Examples
>>> convert_labels_to_tags(labels=["test", "problem"], tagging_scheme="iob2")
{'O': 0, 'B-test': 1, 'I-test': 2, 'B-problem': 3, 'I-problem': 4}
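The resulting mapping can then be used to configure a HuggingFace token-classification model; a minimal sketch (the checkpoint name and the num_labels/id2label/label2id arguments are standard transformers parameters shown for illustration, not part of medkit):
>>> tag_to_id = convert_labels_to_tags(labels=["test", "problem"], tagging_scheme="iob2")
>>> # Reverse mapping from id to tag, useful for model configuration
>>> id_to_tag = {tag_id: tag for tag, tag_id in tag_to_id.items()}
>>> from transformers import AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained(
...     "bert-base-uncased",
...     num_labels=len(tag_to_id),
...     id2label=id_to_tag,
...     label2id=tag_to_id,
... )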
- medkit.text.ner.hf_tokenization_utils.transform_entities_to_tags(text_encoding: transformers.tokenization_utils_fast.EncodingFast, entities: list[medkit.core.text.Entity], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'] = 'bilou') → list[str] #
Transform entities from an encoded document into a list of BILOU/IOB2 tags.
- Parameters:
- text_encoding : EncodingFast
Encoding of the reference document, created by a HuggingFace fast tokenizer. It contains a tokenized version of the document to tag.
- entities : list of Entity
The list of entities to transform.
- tagging_scheme : {"bilou", "iob2"}, default="bilou"
Scheme to use when tagging the tokens, either "bilou" or "iob2".
- Returns:
- list of str
A list of tags describing the document. By default the tags can be "B", "I", "L", "O", "U"; if tagging_scheme is "iob2", the tags can be "B", "I", "O".
Examples
>>> # Define a fast tokenizer, e.g. a BERT tokenizer
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> document = TextDocument(text="medkit")
>>> entities = [
...     Entity(label="corporation", spans=[Span(start=0, end=6)], text="medkit")
... ]
>>> # Get the text encoding of the document using the tokenizer
>>> text_encoding = tokenizer(document.text).encodings[0]
>>> print(text_encoding.tokens)
['[CLS]', 'med', '##kit', '[SEP]']
Transform to BILOU tags
>>> tags = transform_entities_to_tags(text_encoding, entities)
>>> assert tags == ["O", "B-corporation", "L-corporation", "O"]
Transform to IOB2 tags
>>> tags = transform_entities_to_tags(text_encoding, entities, "iob2")
>>> assert tags == ["O", "B-corporation", "I-corporation", "O"]
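The returned tags are positionally aligned with text_encoding.tokens, so the two can be inspected side by side; a small sketch reusing the objects defined above:
>>> # Print each token with its IOB2 tag
>>> for token, tag in zip(text_encoding.tokens, tags):
...     print(token, tag)
[CLS] O
med B-corporation
##kit I-corporation
[SEP] O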
- medkit.text.ner.hf_tokenization_utils.align_and_map_tokens_with_tags(text_encoding: transformers.tokenization_utils_fast.EncodingFast, tags: list[str], tag_to_id: dict[str, int], map_sub_tokens: bool = True) → list[int] #
Return a list of tag ids aligned with the text encoding.
Tags corresponding to special tokens (e.g. [CLS], [SEP]) are assigned SPECIAL_TAG_ID_HF (-100 in the examples below).
- Parameters:
- text_encoding : EncodingFast
Text encoding after tokenization with a HuggingFace fast tokenizer.
- tags : list of str
A list of tags, e.g. BILOU tags.
- tag_to_id : dict of str to int
Mapping from tag to id.
- map_sub_tokens : bool, default=True
When a token is not in the tokenizer's vocabulary, the tokenizer may split it into multiple subtokens. If map_sub_tokens is True, the tags of all subtokens within a token are converted to ids. If map_sub_tokens is False, only the first subtoken of a split token is converted.
- Returns:
- list of int
A list of tags ids
Examples
>>> # Define a fast tokenizer, e.g. a BERT tokenizer
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> # Define the data to map
>>> text_encoding = tokenizer("medkit").encodings[0]
>>> tags = ["O", "B-corporation", "I-corporation", "O"]
>>> tag_to_id = {"O": 0, "B-corporation": 1, "I-corporation": 2}
>>> print(text_encoding.tokens)
['[CLS]', 'med', '##kit', '[SEP]']
Mapping all tags to tag ids
>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id)
>>> assert tags_ids == [-100, 1, 2, -100]
Mapping only the first tag of each token
>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id, False)
>>> assert tags_ids == [-100, 1, -100, -100]
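Taken together, the three functions produce target ids suitable for training a token-classification model; a minimal sketch reusing the document, entities, tokenizer and text_encoding from the transform_entities_to_tags examples above (the asserted values follow from the examples on this page rather than from a fresh run):
>>> # Build the tag-to-id mapping from the corpus labels
>>> tag_to_id = convert_labels_to_tags(labels=["corporation"], tagging_scheme="iob2")
>>> # Tag the tokenized document from its entity annotations
>>> tags = transform_entities_to_tags(text_encoding, entities, tagging_scheme="iob2")
>>> # Align the tags with the encoding and map them to ids usable as training targets
>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id)
>>> assert tags_ids == [-100, 1, 2, -100]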