medkit.text.ner.hf_entity_matcher#

Classes#

HFEntityMatcher

Entity matcher based on a HuggingFace transformers model.

Module Contents#

class medkit.text.ner.hf_entity_matcher.HFEntityMatcher(model: str | pathlib.Path, aggregation_strategy: typing_extensions.Literal['none', 'simple', 'first', 'average', 'max'] = 'max', attrs_to_copy: list[str] | None = None, device: int = -1, batch_size: int = 1, hf_auth_token: str | None = None, cache_dir: str | pathlib.Path | None = None, name: str | None = None, uid: str | None = None)#

Bases: medkit.core.text.NEROperation

Entity matcher based on a HuggingFace transformers model.

Any token classification model from the HuggingFace hub can be used (for instance “samrawal/bert-base-uncased_clinical-ner”).

Parameters:
model : str or Path

Name (on the HuggingFace models hub) or path of the NER model. Must be a model compatible with the TokenClassification transformers class.

aggregation_strategy : str, default=“max”

Strategy used to fuse tokens based on the model prediction, passed to TokenClassificationPipeline. One of “none”, “simple”, “first”, “average” or “max”. See https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy for details.

attrs_to_copy : list of str, optional

Labels of the attributes that should be copied from the input segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc).

device : int, default=-1

Device to use for the transformer model. Follows the HuggingFace convention (-1 for “cpu” and the device number for a GPU, for instance 0 for “cuda:0”).

batch_size : int, default=1

Number of segments in each batch processed by the transformer model.

hf_auth_token : str, optional

HuggingFace authentication token (to access private models on the hub).

cache_dir : str or Path, optional

Directory in which to store downloaded models. If not set, the default HuggingFace cache dir is used.

name : str, optional

Name describing the matcher (defaults to the class name).

uid : str, optional

Identifier of the matcher.
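The parameters above can be put together as in the following sketch. It is an illustration rather than a reference implementation: `device_label` and `build_matcher` are hypothetical helper names, the `is_negated` attribute label and the batch size of 8 are assumptions chosen for the example, and medkit is imported lazily so that the helper for the device convention can be used on its own.

```python
def device_label(device: int) -> str:
    """Translate the HuggingFace-style device index into a readable name.

    Hypothetical helper illustrating the convention described above:
    -1 means CPU, any non-negative number n means GPU "cuda:n".
    """
    if device < -1:
        raise ValueError(f"Invalid device index: {device}")
    return "cpu" if device == -1 else f"cuda:{device}"


def build_matcher(device: int = -1):
    """Create an HFEntityMatcher using the parameters documented above."""
    # Imported here so that this sketch loads even if medkit is not installed
    from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

    return HFEntityMatcher(
        model="samrawal/bert-base-uncased_clinical-ner",
        aggregation_strategy="max",
        # Assumption: the input segments carry an attribute with this label
        attrs_to_copy=["is_negated"],
        device=device,
        # Assumption: arbitrary batch size picked for the example
        batch_size=8,
    )
```

Calling `build_matcher()` downloads the model from the HuggingFace hub on first use unless it is already present in the cache directory.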

init_args#
model#
attrs_to_copy#
valid_model#
_pipeline#
run(segments: list[medkit.core.text.Segment]) → list[medkit.core.text.Entity]#

Return entities for each match in segments.

Parameters:
segments : list of Segment

List of segments in which to look for matches.

Returns:
list of Entity

Entities found in segments.
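A minimal usage sketch of `run()` follows, assuming the text to annotate is wrapped in a `TextDocument` and that its raw segment is passed to the matcher. The function names `find_entities` and `summarize` are hypothetical, and the model name is the one quoted in the class description; medkit is imported lazily so that the summarizing helper can be tried without it.

```python
def summarize(entities):
    """Return (label, text) pairs for a list of entity-like objects."""
    return [(entity.label, entity.text) for entity in entities]


def find_entities(text: str, model: str = "samrawal/bert-base-uncased_clinical-ner"):
    """Run an HFEntityMatcher on a single piece of text and return the entities."""
    # Imported here so that this sketch loads even if medkit is not installed
    from medkit.core.text import TextDocument
    from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

    doc = TextDocument(text=text)
    matcher = HFEntityMatcher(model=model)
    # run() takes a list of segments; here only the raw segment covering the full text
    entities = matcher.run([doc.raw_segment])
    # Attach the entities to the document so that later operations can use them
    for entity in entities:
        doc.anns.add(entity)
    return entities
```

For instance, `summarize(find_entities("The patient has asthma."))` would list the label and text of each entity returned by the model; the actual labels depend on the model used.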

_matches_to_entities(matches: list[dict], segment: medkit.core.text.Segment) → Iterator[medkit.core.text.Entity]#
static make_trainable(model_name_or_path: str | pathlib.Path, labels: list[str], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'], tag_subtokens: bool = False, tokenizer_max_length: int | None = None, hf_auth_token: str | None = None, device: int = -1)#

Return the trainable component of the operation.

This component can be trained using Trainer, and then used in a new HFEntityMatcher operation.
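The `tagging_scheme` argument selects how entity spans are turned into per-token tags during training. The sketch below illustrates the usual definitions of the two schemes on token indices; it is not medkit's internal implementation. `tag_tokens` and `build_trainable` are hypothetical names, and the entity labels are made up for the example.

```python
from typing import List, Tuple


def tag_tokens(
    nb_tokens: int,
    entities: List[Tuple[int, int, str]],
    scheme: str = "iob2",
) -> List[str]:
    """Tag tokens given entities as (start, end_exclusive, label) token ranges.

    iob2:  B- for the first token of an entity, I- for the following ones.
    bilou: like iob2, plus L- for the last token and U- for single-token entities.
    Tokens outside any entity are tagged "O" in both schemes.
    """
    if scheme not in ("iob2", "bilou"):
        raise ValueError(f"Unknown tagging scheme: {scheme}")
    tags = ["O"] * nb_tokens
    for start, end, label in entities:
        if end <= start:
            continue  # ignore empty ranges
        if scheme == "bilou" and end - start == 1:
            tags[start] = f"U-{label}"
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
        if scheme == "bilou":
            tags[end - 1] = f"L-{label}"
    return tags


def build_trainable(labels: List[str]):
    """Create the trainable component, using the signature documented above."""
    # Imported here so that this sketch loads even if medkit is not installed
    from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

    return HFEntityMatcher.make_trainable(
        model_name_or_path="samrawal/bert-base-uncased_clinical-ner",
        labels=labels,
        tagging_scheme="iob2",
        tag_subtokens=False,
        device=-1,
    )
```

With five tokens and the entities `[(1, 3, "problem"), (4, 5, "treatment")]`, the "iob2" scheme yields `O, B-problem, I-problem, O, B-treatment`, while "bilou" yields `O, B-problem, L-problem, O, U-treatment`.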