medkit.text.ner.hf_entity_matcher
Classes
- HFEntityMatcher: Entity matcher based on a HuggingFace transformers model.
Module Contents
- class medkit.text.ner.hf_entity_matcher.HFEntityMatcher(model: str | pathlib.Path, aggregation_strategy: typing_extensions.Literal['none', 'simple', 'first', 'average', 'max'] = 'max', attrs_to_copy: list[str] | None = None, device: int = -1, batch_size: int = 1, hf_auth_token: str | None = None, cache_dir: str | pathlib.Path | None = None, name: str | None = None, uid: str | None = None)
Bases: medkit.core.text.NEROperation
Entity matcher based on a HuggingFace transformers model.
Any token classification model from the HuggingFace hub can be used (for instance “samrawal/bert-base-uncased_clinical-ner”).
- Parameters:
- model : str or Path
Name (on the HuggingFace models hub) or path of the NER model. Must be a model compatible with the TokenClassification transformers class.
- aggregation_strategy : str, default="max"
Strategy used to fuse tokens based on the model predictions, passed to TokenClassificationPipeline. See https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy for details.
- attrs_to_copy : list of str, optional
Labels of the attributes that should be copied from the input segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).
- device : int, default=-1
Device to use for the transformer model. Follows the HuggingFace convention: -1 for “cpu”, or the device number for a GPU (for instance 0 for “cuda:0”).
- batch_size : int, default=1
Number of segments in each batch processed by the transformer model.
- hf_auth_token : str, optional
HuggingFace authentication token (to access private models on the hub).
- cache_dir : str or Path, optional
Directory where to store downloaded models. If not set, the default HuggingFace cache dir is used.
- name : str, optional
Name describing the matcher (defaults to the class name).
- uid : str, optional
Identifier of the matcher.
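The parameters above can be combined as in the following minimal sketch. It is not taken from the medkit documentation: `hf_device_name` and `build_matcher` are hypothetical helper names, and the `"is_negated"` attribute label is an assumption standing in for whatever context attributes exist on the input segments. The first helper only spells out the integer `device` convention described above; the second builds a matcher using the documented constructor arguments and requires medkit to be installed.

```python
from __future__ import annotations


def hf_device_name(device: int) -> str:
    """Translate the integer `device` convention into a readable device name.

    Hypothetical helper, for illustration only: -1 means CPU, and a
    non-negative number N means the GPU "cuda:N".
    """
    if device == -1:
        return "cpu"
    if device < -1:
        raise ValueError(f"Invalid device number: {device}")
    return f"cuda:{device}"


def build_matcher(device: int = -1, batch_size: int = 8):
    """Create an HFEntityMatcher with explicit values for the main parameters."""
    # medkit is imported lazily so that the helper above can be used
    # (and tested) on a machine where medkit is not installed.
    from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

    return HFEntityMatcher(
        # Model name taken from the class description above.
        model="samrawal/bert-base-uncased_clinical-ner",
        aggregation_strategy="max",
        # Assumption: "is_negated" is a hypothetical attribute label that
        # would already be attached to the input segments.
        attrs_to_copy=["is_negated"],
        device=device,
        batch_size=batch_size,
    )
```

Calling `build_matcher(device=0)` would place the model on the first GPU, while the default of -1 keeps it on the CPU.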
- init_args
- model
- attrs_to_copy
- valid_model
- _pipeline
- run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Entity]
Return entities for each match in segments.
- Parameters:
- segments : list of Segment
List of segments in which to look for matches.
- Returns:
- list of Entity
Entities found in segments.
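A call to `run()` might look like the sketch below. It assumes that medkit is installed, that the model can be downloaded from the HuggingFace hub, and that each text is wrapped in a `TextDocument` whose `raw_segment` is passed to the matcher. `find_entities` and `DEFAULT_MODEL` are hypothetical names introduced for this illustration.

```python
from __future__ import annotations

# Model name taken from the class description above.
DEFAULT_MODEL = "samrawal/bert-base-uncased_clinical-ner"


def find_entities(texts: list[str], model: str = DEFAULT_MODEL) -> list[tuple[str, str]]:
    """Run the matcher over raw texts and return (label, text) pairs."""
    # Lazy imports: medkit (and the model weights) are only needed when
    # this function is actually called.
    from medkit.core.text import TextDocument
    from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

    matcher = HFEntityMatcher(model=model)
    # Each document exposes its full text as a segment the matcher can process.
    segments = [TextDocument(text=text).raw_segment for text in texts]
    # run() returns one flat list of entities for all input segments.
    entities = matcher.run(segments)
    return [(entity.label, entity.text) for entity in entities]
```

The labels returned depend entirely on the chosen model, so no expected output is shown here.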
- _matches_to_entities(matches: list[dict], segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Entity]
- static make_trainable(model_name_or_path: str | pathlib.Path, labels: list[str], tagging_scheme: typing_extensions.Literal['bilou', 'iob2'], tag_subtokens: bool = False, tokenizer_max_length: int | None = None, hf_auth_token: str | None = None, device: int = -1)
Return the trainable component of the operation.
This component can be trained using Trainer, and then used in a new HFEntityMatcher operation.
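To make the `labels` and `tagging_scheme` arguments more concrete, the sketch below lists the token-level tags that the usual IOB2 and BILOU conventions imply for a set of entity labels, then passes the same arguments to `make_trainable()`. Both function names are hypothetical, and the tag naming and ordering follow the general conventions rather than anything confirmed about medkit's internals. The label names are also only examples. Training itself is left out because the Trainer configuration is not described in this section.

```python
from __future__ import annotations


def tag_names(labels: list[str], tagging_scheme: str) -> list[str]:
    """List the tags a token classifier would predict for the given labels.

    Assumption: tags are named "<prefix>-<label>" plus a single "O" tag.
    The ordering is illustrative and may differ from medkit's own.
    """
    prefixes = {"iob2": ["B", "I"], "bilou": ["B", "I", "L", "U"]}
    if tagging_scheme not in prefixes:
        raise ValueError(f"Unsupported tagging scheme: {tagging_scheme!r}")
    tags = ["O"]
    for label in labels:
        tags.extend(f"{prefix}-{label}" for prefix in prefixes[tagging_scheme])
    return tags


def make_trainable_matcher(labels: list[str], tagging_scheme: str = "iob2"):
    """Build the trainable component for the given labels (requires medkit)."""
    # Lazy import so that tag_names() stays usable without medkit installed.
    from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

    return HFEntityMatcher.make_trainable(
        model_name_or_path="samrawal/bert-base-uncased_clinical-ner",
        labels=labels,
        tagging_scheme=tagging_scheme,
        tag_subtokens=False,
        device=-1,  # CPU, following the convention of the constructor
    )


print(tag_names(["problem", "treatment"], "iob2"))
# ['O', 'B-problem', 'I-problem', 'B-treatment', 'I-treatment']
```

With the BILOU scheme each label contributes four tags instead of two, which enlarges the classification head accordingly.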