medkit.text.ner.nlstruct_entity_matcher

Classes

NLStructEntityMatcher

Entity matcher based on an NLstruct InformationExtraction model.

Module Contents

class medkit.text.ner.nlstruct_entity_matcher.NLStructEntityMatcher(model_name_or_dirpath: str | pathlib.Path, attrs_to_copy: list[str] | None = None, device: int = -1, hf_auth_token: str | None = None, cache_dir: str | pathlib.Path | None = None, name: str | None = None, uid: str | None = None)

Bases: medkit.core.text.NEROperation

Entity matcher based on an NLstruct InformationExtraction model.

The matcher expects a directory containing a torch checkpoint and, if the model was pretrained using word embeddings, a text file with those embeddings.

The paper [1] presents a model trained with the NLstruct [2] library using the mimic learning approach. In that work, a private teacher model was used to annotate the unlabeled [CAS clinical French corpus](https://aclanthology.org/W18-5614/). The weights of the resulting CAS student model are shared via the HuggingFace Hub; to load them, pass the model name NesrineBannour/CAS-privacy-preserving-model when creating an NLStructEntityMatcher.

Parameters:
model_name_or_dirpath : str or Path

Name (on the HuggingFace models hub) or directory path of the NLstruct model. The model directory must contain a PyTorch checkpoint file (.ckpt or .pt) and, if required, a text file (.txt) containing the FastText embeddings.

attrs_to_copy : list of str, optional

Labels of the attributes that should be copied from the input segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).

device : int, default=-1

Device to use for the NLstruct model. Follows the HuggingFace convention (-1 for “cpu” and device number for gpu, for instance 0 for “cuda:0”).

hf_auth_token : str, optional

HuggingFace authentication token (to access private models on the hub).

cache_dir : str or Path, optional

Directory where to store downloaded models. If not set, the default HuggingFace cache dir is used.

name : str, optional

Name describing the matcher (defaults to the class name).

uid : str, optional

Identifier of the matcher.
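The constructor parameters above can be combined as in the following minimal sketch. It assumes medkit and its NLstruct dependencies are installed and that the model can be downloaded from the hub. The helper names describe_device and build_cas_matcher, as well as the attribute label "is_negated", are hypothetical and only serve the illustration.

```python
def describe_device(device: int) -> str:
    """Translate the documented device convention into a torch-style name.

    -1 means "cpu"; any other value is read as a GPU index (0 -> "cuda:0").
    """
    return "cpu" if device == -1 else f"cuda:{device}"


def build_cas_matcher(device: int = -1):
    """Create a matcher for the CAS student model shared on the hub."""
    # Imported lazily so the helper above stays usable without medkit installed.
    from medkit.text.ner.nlstruct_entity_matcher import NLStructEntityMatcher

    return NLStructEntityMatcher(
        model_name_or_dirpath="NesrineBannour/CAS-privacy-preserving-model",
        # Hypothetical attribute label; use the labels set by your own
        # context-detection operations.
        attrs_to_copy=["is_negated"],
        device=device,
    )
```

Calling build_cas_matcher() with the default device runs the model on the CPU; passing device=0 would target the first GPU.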

References

[1]

Nesrine Bannour, Perceval Wajsbürt, Bastien Rance, Xavier Tannier, and Aurélie Névéol. 2022. Privacy-preserving mimic models for clinical named entity recognition in French. Journal of Biomedical Informatics 130 (2022), 104073. DOI: https://doi.org/10.1016/j.jbi.2022.104073

[2]

Perceval Wajsbürt. 2021. Extraction and normalization of simple and structured entities in medical documents. Theses. Sorbonne Université. Retrieved from https://hal.archives-ouvertes.fr/tel-03624928

init_args
cache_dir
attrs_to_copy
model_name_or_dirpath
device
model
static _load_from_checkpoint_dir(checkpoint_dir: pathlib.Path, device)

Get the location of the checkpoint and fix the path of the FastText file in the configuration.

Return the nlstruct model created with the modified config.
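The implementation of this private helper is not shown here. As an illustration of the file-discovery step only, the sketch below locates a checkpoint and an optional embeddings file in a directory. The function name find_model_files and the suffixes .ckpt and .pt are assumptions, and the sketch does not load the model or touch its configuration.

```python
from pathlib import Path
from typing import Optional, Tuple


def find_model_files(
    checkpoint_dir: Path,
    checkpoint_suffixes: Tuple[str, ...] = (".ckpt", ".pt"),  # assumed suffixes
) -> Tuple[Path, Optional[Path]]:
    """Return (checkpoint_path, embeddings_path_or_None) found in checkpoint_dir."""
    # Sort for a deterministic choice when several candidates exist.
    files = sorted(p for p in checkpoint_dir.iterdir() if p.is_file())

    checkpoints = [p for p in files if p.suffix in checkpoint_suffixes]
    if not checkpoints:
        raise FileNotFoundError(f"No checkpoint file found in {checkpoint_dir}")

    # The embeddings text file is only present for models that need it.
    embeddings = [p for p in files if p.suffix == ".txt"]
    return checkpoints[0], (embeddings[0] if embeddings else None)
```

Returning None for the embeddings path keeps the case of a model without word embeddings explicit for the caller.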

run(segments: list[medkit.core.text.Segment]) → list[medkit.core.text.Entity]

Return entities for each match in segments.

Parameters:
segments : list of Segment

List of segments in which to look for matches.

Returns:
list of Entity

Entities found in segments.
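A minimal sketch of how the result of run might be consumed, assuming a matcher has already been created and the segments come from an existing document. The helper summarize_entities is hypothetical; it relies only on the label and text attributes of the returned entities.

```python
def summarize_entities(matcher, segments):
    """Run the matcher and return (label, text) pairs for the entities found."""
    entities = matcher.run(segments)
    return [(entity.label, entity.text) for entity in entities]


# Usage, assuming medkit is installed and `matcher` is an NLStructEntityMatcher:
#
#   from medkit.core.text import TextDocument
#   doc = TextDocument(text="...")
#   for label, text in summarize_entities(matcher, [doc.raw_segment]):
#       print(label, text)
```

Because the helper only calls run and reads two attributes, it works with any operation that follows the same interface.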

_matches_to_entities(matches: list[dict], segment: medkit.core.text.Segment) → Iterator[medkit.core.text.Entity]