medkit.text.ner.nlstruct_entity_matcher#

Classes#

NLStructEntityMatcher

Entity matcher based on a NLstruct InformationExtraction model.

Module Contents#

Bases: medkit.core.text.NEROperation

Entity matcher based on a NLstruct InformationExtraction model.

The matcher expects a directory containing a PyTorch checkpoint and, if the model was pretrained using word embeddings, a text file with those embeddings.

The paper [1] presents a model trained with the NLstruct [2] library using the mimic learning approach: a private teacher model was used to annotate the unlabeled [CAS clinical French corpus](https://aclanthology.org/W18-5614/). The weights of the resulting CAS student model are shared on the HuggingFace Hub. To load them, create a NLStructEntityMatcher with the model name NesrineBannour/CAS-privacy-preserving-model.

Parameters:

model_name_or_dirpath (str or Path): Name (on the HuggingFace models hub) or directory path of the NLstruct model. The model directory must contain a PyTorch checkpoint file ('.ckpt' or '.pt') and, if required, a text file ('.txt') holding the FastText embeddings.
attrs_to_copy (list of str, optional): Labels of the attributes that should be copied from the input segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).
device (int, default=-1): Device to use for the NLstruct model. Follows the HuggingFace convention (-1 for "cpu" and the device number for a GPU, for instance 0 for "cuda:0").
hf_auth_token (str, optional): HuggingFace authentication token (to access private models on the hub).
cache_dir (str or Path, optional): Directory where downloaded models are stored. If not set, the default HuggingFace cache dir is used.
name (str, optional): Name describing the matcher (defaults to the class name).
uid (str, optional): Identifier of the matcher.
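
The `device` convention described above can be spelled out with a small helper. This is only an illustrative sketch: `device_index_to_name` is a hypothetical function, not part of medkit or NLstruct, and treating every negative value as the CPU is an assumption.

```python
def device_index_to_name(device: int) -> str:
    """Translate a HuggingFace-style device index into a device name.

    Hypothetical helper for illustration; not part of medkit or NLstruct.
    """
    if device < 0:
        # -1 selects the CPU (other negative values are handled the same
        # way here, which is an assumption of this sketch)
        return "cpu"
    # A non-negative index selects the GPU with that number
    return f"cuda:{device}"


print(device_index_to_name(-1))  # cpu
print(device_index_to_name(0))   # cuda:0
```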

References

[1]

Nesrine Bannour, Perceval Wajsbürt, Bastien Rance, Xavier Tannier, and Aurélie Névéol. 2022. Privacy-preserving mimic models for clinical named entity recognition in French. Journal of Biomedical Informatics 130 (2022), 104073. DOI: https://doi.org/10.1016/j.jbi.2022.104073

[2]

Perceval Wajsbürt. 2021. Extraction and normalization of simple and structured entities in medical documents. Theses. Sorbonne Université. Retrieved from https://hal.archives-ouvertes.fr/tel-03624928

init_args#

cache_dir#

attrs_to_copy#

model_name_or_dirpath#

device#

model#

static _load_from_checkpoint_dir(checkpoint_dir: pathlib.Path, device)#

Get the location of the checkpoint and fix the path of the FastText file in the configuration.

Return the nlstruct model created with the modified config.
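
To illustrate the directory layout this method relies on, the sketch below locates a checkpoint and an optional embeddings file in a model directory. It is not the library's implementation: `find_model_files` is a hypothetical helper, and the file suffixes it looks for (`.ckpt`, `.pt`, `.txt`) are assumptions based on the parameter description.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def find_model_files(checkpoint_dir: Path) -> Tuple[Path, Optional[Path]]:
    """Return (checkpoint_file, embeddings_file_or_None) found in checkpoint_dir.

    Hypothetical helper; the suffixes checked are assumptions.
    """
    # Candidate PyTorch checkpoint files, sorted for a deterministic choice
    checkpoints = sorted(
        p for p in checkpoint_dir.iterdir() if p.suffix in {".ckpt", ".pt"}
    )
    if not checkpoints:
        raise FileNotFoundError(f"No PyTorch checkpoint found in {checkpoint_dir}")
    # The embeddings text file is optional
    embeddings = sorted(checkpoint_dir.glob("*.txt"))
    return checkpoints[0], (embeddings[0] if embeddings else None)


# Build a throwaway model directory to show the expected layout
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "model.pt").touch()
    (model_dir / "fasttext.txt").touch()
    checkpoint, embeddings = find_model_files(model_dir)
    print(checkpoint.name, embeddings.name)  # model.pt fasttext.txt
```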

run(segments: list[medkit.core.text.Segment]) → list[medkit.core.text.Entity]#

Return entities for each match in segments.

Parameters:

segments (list of Segment): List of segments in which to look for matches.

Returns:

list of Entity: Entities found in segments.
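
A minimal usage sketch of `run()` follows. It assumes medkit is installed with its NLstruct dependencies, that the model can be downloaded from the HuggingFace Hub on first use, and that `TextDocument` (with its `raw_segment` property) can be imported from `medkit.core.text`. The wrapper `extract_entities` is a hypothetical name introduced for illustration.

```python
def extract_entities(text: str):
    """Run the CAS student model on a piece of French clinical text.

    Sketch only: requires medkit with its NLstruct dependencies and
    network access to download the model on first use.
    """
    # Imported lazily so that defining this function does not require medkit
    from medkit.core.text import TextDocument
    from medkit.text.ner.nlstruct_entity_matcher import NLStructEntityMatcher

    matcher = NLStructEntityMatcher(
        model_name_or_dirpath="NesrineBannour/CAS-privacy-preserving-model",
        device=-1,  # run on CPU
    )
    doc = TextDocument(text=text)
    # run() takes a list of segments and returns the entities found in them
    return matcher.run([doc.raw_segment])
```

Calling `extract_entities("Le patient présente une fièvre depuis trois jours.")` (an invented sample sentence) would return a list of `Entity` objects. In practice, the matcher would be created once and reused across documents rather than rebuilt on every call.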

_matches_to_entities(matches: list[dict], segment: medkit.core.text.Segment) → Iterator[medkit.core.text.Entity]#