medkit.text.ner.hf_entity_matcher

Classes

HFEntityMatcher

Entity matcher based on a HuggingFace transformers model.

Module Contents

class medkit.text.ner.hf_entity_matcher.HFEntityMatcher(model: str | pathlib.Path, aggregation_strategy: typing_extensions.Literal['none', 'simple', 'first', 'average', 'max'] = 'max', attrs_to_copy: list[str] | None = None, device: int = -1, batch_size: int = 1, hf_auth_token: str | None = None, cache_dir: str | pathlib.Path | None = None, name: str | None = None, uid: str | None = None)

Bases: medkit.core.text.NEROperation

Entity matcher based on a HuggingFace transformers model.

Any token classification model from the HuggingFace hub can be used (for instance “samrawal/bert-base-uncased_clinical-ner”).

Parameters:

model (str or Path): Name (on the HuggingFace models hub) or path of the NER model. Must be a model compatible with the TokenClassification transformers class.
aggregation_strategy (str, default=“max”): Strategy used to fuse tokens based on the model prediction, passed to TokenClassificationPipeline. See https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy for details.
attrs_to_copy (list of str, optional): Labels of the attributes that should be copied from the input segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).
device (int, default=-1): Device to use for the transformer model. Follows the HuggingFace convention (-1 for “cpu” and the device number for a GPU, for instance 0 for “cuda:0”).
batch_size (int, default=1): Number of segments in each batch processed by the transformer model.
hf_auth_token (str, optional): HuggingFace authentication token (to access private models on the hub).
cache_dir (str or Path, optional): Directory in which to store downloaded models. If not set, the default HuggingFace cache directory is used.
name (str, optional): Name describing the matcher (defaults to the class name).
uid (str, optional): Identifier of the matcher.
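
The device convention described above can be sketched as a small helper. This is only an illustration of the documented mapping (-1 for “cpu”, a non-negative number for the matching CUDA device); the function name describe_device is hypothetical and is not part of medkit or transformers, and rejecting values below -1 is a choice made for this sketch.

```python
def describe_device(device: int) -> str:
    """Translate an integer `device` argument into a torch-style device name.

    Hypothetical helper, for illustration only: it mirrors the convention
    documented for the `device` parameter and is not part of medkit.
    """
    if device == -1:
        # -1 is the documented value for running on the CPU
        return "cpu"
    if device < -1:
        # Values below -1 are not documented, so this sketch rejects them
        raise ValueError(f"Unsupported device index: {device}")
    # Non-negative values select the GPU with that index
    return f"cuda:{device}"


cpu_name = describe_device(-1)
first_gpu_name = describe_device(0)
```

With this helper, the default device=-1 corresponds to the CPU, and device=0 corresponds to the first GPU.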

init_args

model

attrs_to_copy

valid_model

_pipeline

run(segments: list[medkit.core.text.Segment]) → list[medkit.core.text.Entity]

Return entities for each match in segments.

Parameters:

segments (list of Segment): List of segments in which to look for matches.

Returns:

list of Entity: Entities found in segments.
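
A minimal usage sketch of run() follows. It assumes that medkit and transformers are installed, that the model named in this page can be downloaded from the hub, and that the whole text of a TextDocument is available as its raw_segment. The wrapper function find_entities is hypothetical; the imports are placed inside it so that the sketch can be loaded even where medkit is not available.

```python
def find_entities(text, model="samrawal/bert-base-uncased_clinical-ner"):
    """Run an HFEntityMatcher on `text` and return (label, text) pairs.

    Hypothetical wrapper, for illustration only. Calling it requires
    medkit, transformers, and access to the model weights.
    """
    # Local imports: nothing is loaded until the function is called
    from medkit.core.text import TextDocument
    from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

    # The raw segment of a document covers its full text
    doc = TextDocument(text=text)

    # Default settings: CPU (device=-1), batch_size=1, "max" aggregation
    matcher = HFEntityMatcher(model=model)

    # run() takes a list of segments and returns the entities found in them
    entities = matcher.run([doc.raw_segment])
    return [(entity.label, entity.text) for entity in entities]
```

Calling find_entities("The patient has severe chest pain and fever.") would return one pair per detected entity; the labels depend on the model that is loaded.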

_matches_to_entities(matches: list[dict], segment: medkit.core.text.Segment) → Iterator[medkit.core.text.Entity]

static make_trainable(model_name_or_path: str | pathlib.Path, labels: list[str], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'], tag_subtokens: bool = False, tokenizer_max_length: int | None = None, hf_auth_token: str | None = None, device: int = -1)

Return the trainable component of the operation.

This component can be trained using Trainer, and then used in a new HFEntityMatcher operation.
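
The two values accepted by tagging_scheme refer to standard ways of encoding entity spans as per-token tags. The sketch below shows how the same entities are tagged under each scheme. It is a generic illustration and not medkit's implementation: the function tag_tokens, the example tokens, and the “problem” label are all assumptions made for this example.

```python
def tag_tokens(n_tokens, entities, scheme):
    """Return one tag per token for the given entity spans.

    `entities` is a list of (start, end, label) token ranges, with `end`
    exclusive. Hypothetical helper, not part of medkit.
    """
    tags = ["O"] * n_tokens  # "O" marks tokens outside any entity
    for start, end, label in entities:
        if scheme == "iob2":
            # IOB2: first token is B-, every following token is I-
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
        elif scheme == "bilou":
            if end - start == 1:
                # BILOU: single-token entities get a U- (unit) tag
                tags[start] = f"U-{label}"
            else:
                # BILOU: B- on the first token, L- on the last, I- in between
                tags[start] = f"B-{label}"
                for i in range(start + 1, end - 1):
                    tags[i] = f"I-{label}"
                tags[end - 1] = f"L-{label}"
        else:
            raise ValueError(f"Unknown tagging scheme: {scheme}")
    return tags


tokens = ["The", "patient", "has", "severe", "chest", "pain", "and", "fever"]
# "severe chest pain" spans tokens 3 to 5, "fever" is token 7
entities = [(3, 6, "problem"), (7, 8, "problem")]

iob2_tags = tag_tokens(len(tokens), entities, "iob2")
bilou_tags = tag_tokens(len(tokens), entities, "bilou")
```

Under iob2 the single-token entity “fever” is tagged B-problem, whereas under bilou it is tagged U-problem and the last token of “severe chest pain” is tagged L-problem.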