medkit.text.ner.umls_matcher#

Classes#

UMLSMatcher

Entity annotator identifying UMLS concepts using the simstring fuzzy matching algorithm.

Module Contents#

class medkit.text.ner.umls_matcher.UMLSMatcher(umls_dir: str | pathlib.Path, cache_dir: str | pathlib.Path, language: str, threshold: float = 0.9, min_length: int = 3, max_length: int = 50, similarity: typing_extensions.Literal['cosine', 'dice', 'jaccard', 'overlap'] = 'jaccard', lowercase: bool = True, normalize_unicode: bool = False, spacy_tokenization: bool = False, semgroups: Sequence[str] = ('ANAT', 'CHEM', 'DEVI', 'DISO', 'PHYS', 'PROC'), blacklist: list[str] | None = None, same_beginning: bool = False, output_labels_by_semgroup: str | dict[str, str] | None = None, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)#

Bases: medkit.text.ner._base_simstring_matcher.BaseSimstringMatcher

Entity annotator identifying UMLS concepts using the simstring fuzzy matching algorithm.

This operation is heavily inspired by the QuickUMLS library (Georgetown-IR-Lab/QuickUMLS).

By default, only terms belonging to the ANAT (anatomy), CHEM (Chemicals & Drugs), DEVI (Devices), DISO (Disorders), PHYS (Physiology) and PROC (Procedures) semgroups will be considered. This behavior can be changed with the semgroups parameter.

Note that setting spacy_tokenization to True might reduce the number of false positives. This requires the spacy optional dependency, which can be installed with pip install medkit-lib[spacy].
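The following is a minimal usage sketch, not an official example. The constructor arguments come from the signature above. The TextDocument class, its raw_segment attribute, the run() method and the label and text attributes of the returned entities are assumptions about the wider medkit API that this page does not document. The helper name find_umls_entities is hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def find_umls_entities(text: str, umls_dir: Path, cache_dir: Path) -> list[tuple[str, str]]:
    """Return (label, text) pairs for the UMLS concepts found in `text`."""
    # Imported lazily so this helper can be defined without medkit installed.
    from medkit.core.text import TextDocument  # assumed import path
    from medkit.text.ner.umls_matcher import UMLSMatcher

    matcher = UMLSMatcher(
        umls_dir=umls_dir,    # must contain MRCONSO.RRF and MRSTY.RRF
        cache_dir=cache_dir,  # the database is generated here on first use
        language="ENG",
        threshold=0.9,
        semgroups=("DISO", "PROC"),
    )

    doc = TextDocument(text=text)
    # Assumption: run() takes a list of segments and returns the matched entities.
    entities = matcher.run([doc.raw_segment])
    return [(entity.label, entity.text) for entity in entities]
```

The first call with a new cache_dir can take a long time because the database has to be generated. Later calls with the same parameters reuse it.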

Parameters:
umls_dir : str or Path

Path to the UMLS directory containing the MRCONSO.RRF and MRSTY.RRF files.

cache_dir : str or Path

Path to the directory in which the UMLS database will be cached. If the directory doesn’t exist yet, the database will be generated automatically (this can take a long time) and stored there, ready to be reused by later instantiations. If it already exists, a check is performed to make sure the params used when the database was generated are consistent with the params of the current instance. To rebuild the database with new params using the same cache dir, you have to delete the directory manually first.

language : str

Language to consider as found in the MRCONSO.RRF file. Example: “FRE”. Will trigger a regeneration of the database if changed.

threshold : float, default=0.9

Minimum similarity threshold (between 0.0 and 1.0) between a UMLS term and the text of a matched entity.

min_length : int, default=3

Minimum number of chars in matched entities.

max_length : int, default=50

Maximum number of chars in matched entities.

similarity : str, default="jaccard"

Similarity metric to use. One of "cosine", "dice", "jaccard" or "overlap".

lowercase : bool, default=True

Whether to use lowercased versions of UMLS terms and input entities (except for acronyms for which the uppercase term is always used). Will trigger a regeneration of the database if changed.

normalize_unicode : bool, default=False

Whether to use ASCII-only versions of UMLS terms and input entities (non-ASCII chars replaced by closest ASCII chars). Will trigger a regeneration of the database if changed.

spacy_tokenization : bool, default=False

If True, spacy will be used to tokenize input segments and filter out some tokens based on their part-of-speech tags, such as determiners, conjunctions and prepositions. If False, a simple regexp-based tokenization will be used, which is faster but might give more false positives.

semgroups : sequence of str, default=("ANAT", "CHEM", "DEVI", "DISO", "PHYS", "PROC")

Ids of the UMLS semantic groups that matched concepts should belong to (see https://lhncbc.nlm.nih.gov/semanticnetwork/download/sg_archive/SemGroups-v04.txt). If set to None, all concepts can be matched. Will trigger a regeneration of the database if changed.

blacklist : list of str, optional

Optional list of exact terms to ignore.

same_beginning : bool, default=False

Ignore all matches that start with a different character than the term of the rule. This can be convenient to get rid of false positives on words that are very similar but have opposite meanings because of a prefix, for instance “activation” and “inactivation”.

output_labels_by_semgroup : str or dict, optional

By default, medkit.text.ner.umls.SEMGROUP_LABELS will be used as entity labels. Use this parameter to override them. Example: {"DISO": "problem", "PROC": "test"}. If output_labels_by_semgroup is a string, all entities will use this string as label instead. Will trigger a regeneration of the database if changed.

attrs_to_copy : list of str, optional

Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).

name : str, optional

Name describing the matcher (defaults to the class name).

uid : str, optional

Identifier of the matcher.
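To make the threshold, similarity and same_beginning parameters more concrete, here is a small self-contained sketch of the four set-based similarity measures applied to character trigrams. It is an illustration under stated assumptions, not medkit's implementation: the trigram features, the space padding and the helper names char_ngrams, similarity and is_match are all hypothetical, and the library's actual feature extraction may differ.

```python
from __future__ import annotations

import math


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of `text`, padded with spaces."""
    padded = " " * (n - 1) + text + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a: str, b: str, measure: str = "jaccard") -> float:
    """Compare the n-gram sets of two strings with a set-based measure."""
    x, y = char_ngrams(a), char_ngrams(b)
    common = len(x & y)
    if measure == "jaccard":
        return common / len(x | y)
    if measure == "dice":
        return 2 * common / (len(x) + len(y))
    if measure == "cosine":
        return common / math.sqrt(len(x) * len(y))
    if measure == "overlap":
        return common / min(len(x), len(y))
    raise ValueError(f"unknown similarity measure: {measure!r}")


def is_match(
    candidate: str,
    term: str,
    threshold: float = 0.9,
    measure: str = "jaccard",
    same_beginning: bool = False,
) -> bool:
    """Accept `candidate` if it is similar enough to the reference `term`."""
    # Hypothetical equivalent of same_beginning: reject on first-character mismatch.
    if same_beginning and candidate[:1] != term[:1]:
        return False
    return similarity(candidate, term, measure) >= threshold


print(similarity("activation", "inactivation"))  # Jaccard score on padded trigrams
print(is_match("inactivation", "activation", threshold=0.6))
print(is_match("inactivation", "activation", threshold=0.6, same_beginning=True))
```

Under these assumptions, “activation” and “inactivation” share 10 of 16 distinct padded trigrams, so their Jaccard score is 0.625. A permissive threshold such as 0.6 would accept the pair, while the default of 0.9 or the same_beginning check would reject it.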

_SEMGROUP_BY_SEMTYPE = None#
umls_dir#
cache_dir#
labels_by_semgroup#
cache_params#
cache_params_file#
simstring_db_file#
rules_db_file#
classmethod _get_labels_by_semgroup(output_labels: str | dict[str, str] | None) -> dict[str, str]#

Return a mapping giving the label to use for all entries of a given semgroup.

Parameters:
output_labels : str or dict of str to str, optional

Optional mapping of labels to use. Can be used to override the default labels. If output_labels is a single string, it will be used as a unique label for all semgroups.

Returns:
dict of str to str

A mapping with semgroups as keys and the corresponding labels as values.
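The behavior described above can be sketched as follows. This is a hypothetical re-implementation for illustration, not the library's code: the DEFAULT_LABELS values are placeholders (the real defaults live in medkit.text.ner.umls.SEMGROUP_LABELS), and the sketch assumes that a dict only overrides the semgroups it names.

```python
from __future__ import annotations

# Placeholder defaults for illustration only; not the actual SEMGROUP_LABELS values.
DEFAULT_LABELS = {
    "ANAT": "anatomy",
    "CHEM": "chemical",
    "DISO": "disorder",
    "PROC": "procedure",
}


def get_labels_by_semgroup(output_labels: str | dict[str, str] | None) -> dict[str, str]:
    """Return the label to use for each semgroup."""
    if output_labels is None:
        # No override: use the defaults as they are.
        return dict(DEFAULT_LABELS)
    if isinstance(output_labels, str):
        # A single string becomes the label of every semgroup.
        return {semgroup: output_labels for semgroup in DEFAULT_LABELS}
    # Assumption: a dict replaces only the entries it names.
    labels = dict(DEFAULT_LABELS)
    labels.update(output_labels)
    return labels


print(get_labels_by_semgroup({"DISO": "problem", "PROC": "test"}))
```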

classmethod _build_rules(umls_dir: pathlib.Path, language: str, lowercase: bool, normalize_unicode: bool, semgroups: set[str] | None, labels_by_semgroup: dict[str, str]) -> Iterator[medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherRule]#

Create rules for all UMLS entries with appropriate labels.