medkit.text.ner.nlstruct_entity_matcher
Classes
NLStructEntityMatcher | Entity matcher based on an NLstruct InformationExtraction model.
Module Contents
- class medkit.text.ner.nlstruct_entity_matcher.NLStructEntityMatcher(model_name_or_dirpath: str | pathlib.Path, attrs_to_copy: list[str] | None = None, device: int = -1, hf_auth_token: str | None = None, cache_dir: str | pathlib.Path | None = None, name: str | None = None, uid: str | None = None)
Bases:
medkit.core.text.NEROperation
Entity matcher based on an NLstruct InformationExtraction model.
The matcher expects a directory containing a PyTorch checkpoint and, if the model was pretrained using word embeddings, a text file holding those embeddings.
The paper [1] presents a model trained with the NLstruct [2] library using the mimic learning approach. A private teacher model was used to annotate the unlabeled [CAS clinical French corpus](https://aclanthology.org/W18-5614/), and the CAS student model was trained on those annotations. The weights of the CAS student model are shared via the HuggingFace Hub; to load them, create an NLStructEntityMatcher with the model name NesrineBannour/CAS-privacy-preserving-model.
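A minimal usage sketch, assuming medkit is installed together with its NLstruct dependencies and that the model can be downloaded from the HuggingFace Hub. `TextDocument` and its `raw_segment` attribute come from `medkit.core.text` and are not described in this section, so treat them as assumptions; the helper name `extract_entities` is hypothetical.

```python
def extract_entities(text, model_name="NesrineBannour/CAS-privacy-preserving-model", device=-1):
    """Run the CAS student model on `text` and return (label, text) pairs.

    Imports are done lazily so that this helper can be defined even when
    medkit or its NLstruct dependencies are not installed.
    """
    # Assumption: TextDocument wraps raw text and exposes it as `raw_segment`.
    from medkit.core.text import TextDocument
    from medkit.text.ner.nlstruct_entity_matcher import NLStructEntityMatcher

    # device=-1 runs on CPU; pass 0 to use the first GPU.
    matcher = NLStructEntityMatcher(model_name_or_dirpath=model_name, device=device)
    doc = TextDocument(text=text)
    # run() takes a list of segments and returns the entities found in them.
    entities = matcher.run([doc.raw_segment])
    return [(entity.label, entity.text) for entity in entities]
```

Calling `extract_entities("Le patient présente une fièvre.")` would download the model on first use and return the entities predicted for that sentence.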
- Parameters:
- model_name_or_dirpath : str or Path
Name (on the HuggingFace models hub) or dirpath of the NLstruct model. The model dir must contain a PyTorch file (".ckpt", ".pt") and, if required, a text file (".txt") representing the FastText embeddings.
- attrs_to_copy : list of str, optional
Labels of the attributes that should be copied from the input segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).
- device : int, default=-1
Device to use for the NLstruct model. Follows the HuggingFace convention (-1 for "cpu" and device number for GPU, for instance 0 for "cuda:0").
- hf_auth_token : str, optional
HuggingFace authentication token (to access private models on the hub).
- cache_dir : str or Path, optional
Directory where to store downloaded models. If not set, the default HuggingFace cache dir is used.
- name : str, optional
Name describing the matcher (defaults to the class name).
- uid : str, optional
Identifier of the matcher.
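The `device` convention described above can be illustrated with a small sketch. The helper `device_to_torch_name` is hypothetical and is not part of medkit; it only shows how an integer index maps to a torch-style device name.

```python
def device_to_torch_name(device: int) -> str:
    """Translate a HuggingFace-style device index into a torch-style device name."""
    if device < -1:
        raise ValueError(f"device must be -1 (CPU) or a GPU index >= 0, got {device}")
    # -1 selects the CPU; any non-negative index selects the matching CUDA device.
    return "cpu" if device == -1 else f"cuda:{device}"
```

For instance, `device_to_torch_name(-1)` gives `"cpu"` and `device_to_torch_name(0)` gives `"cuda:0"`.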
References
[1] Nesrine Bannour, Perceval Wajsbürt, Bastien Rance, Xavier Tannier, and Aurélie Névéol. 2022. Privacy-preserving mimic models for clinical named entity recognition in French. Journal of Biomedical Informatics 130 (2022), 104073. DOI: https://doi.org/10.1016/j.jbi.2022.104073
[2] Perceval Wajsbürt. 2021. Extraction and normalization of simple and structured entities in medical documents. Theses. Sorbonne Université. Retrieved from https://hal.archives-ouvertes.fr/tel-03624928
- init_args
- cache_dir
- attrs_to_copy
- model_name_or_dirpath
- device
- model
- static _load_from_checkpoint_dir(checkpoint_dir: pathlib.Path, device)
Get the location of the checkpoint and fix the path of the FastText file in the configuration.
Return the nlstruct model created with the modified config.
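The lookup step can be sketched as follows. This is an illustration under stated assumptions, not medkit's actual implementation: it assumes the checkpoint ends in `.ckpt` or `.pt`, that the optional embeddings file is the only `.txt` file in the directory, and the name `find_checkpoint_files` is hypothetical. Rewriting the FastText path inside the NLstruct configuration is left out because the configuration layout is not described here.

```python
from pathlib import Path
from typing import Optional, Tuple


def find_checkpoint_files(checkpoint_dir: Path) -> Tuple[Path, Optional[Path]]:
    """Return (checkpoint file, embeddings file or None) found in checkpoint_dir."""
    # Assumption: the checkpoint is the first file ending in .ckpt or .pt.
    checkpoints = sorted(checkpoint_dir.glob("*.ckpt")) + sorted(checkpoint_dir.glob("*.pt"))
    if not checkpoints:
        raise FileNotFoundError(f"No PyTorch checkpoint found in '{checkpoint_dir}'")
    # Assumption: the optional FastText embeddings are the first .txt file.
    text_files = sorted(checkpoint_dir.glob("*.txt"))
    return checkpoints[0], (text_files[0] if text_files else None)
```

Failing early with `FileNotFoundError` keeps a missing checkpoint from surfacing later as a less readable model-loading error.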
- run(segments: list[medkit.core.text.Segment]) → list[medkit.core.text.Entity]
Return entities for each match in segments.
- Parameters:
- segments : list of Segment
List of segments in which to look for matches.
- Returns:
- list of Entity
Entities found in segments.
- _matches_to_entities(matches: list[dict], segment: medkit.core.text.Segment) → Iterator[medkit.core.text.Entity]