medkit.text.ner.umls_utils

Attributes

SEMGROUP_LABELS

Labels corresponding to UMLS semgroups

SEMGROUPS

Valid UMLS semgroups

Classes

UMLSEntry

Entry in MRCONSO.RRF file of a UMLS dictionary.

Functions

load_umls_entries(→ Iterator[UMLSEntry])

Load all terms and associated CUIs found in a UMLS MRCONSO.RRF file.

preprocess_term_to_match(term, lowercase, ...[, ...])

Preprocess a UMLS term for matching purposes.

preprocess_acronym(→ str | None)

Detect whether a term contains an acronym with the expanded form in parentheses.

guess_umls_version(→ str)

Try to infer the UMLS version (e.g. "2021AB") from any UMLS-related path.

Module Contents

medkit.text.ner.umls_utils.SEMGROUP_LABELS

Labels corresponding to UMLS semgroups

medkit.text.ner.umls_utils.SEMGROUPS

Valid UMLS semgroups

class medkit.text.ner.umls_utils.UMLSEntry

Entry in MRCONSO.RRF file of a UMLS dictionary.

Attributes:
cui : str

Unique identifier of the concept designated by the term

term : str

Original version of the term

semtypes : list of str, optional

Semantic types of the concept (TUIs)

semgroups : list of str, optional

Semantic groups of the concept

cui: str
term: str
semtypes: list[str] | None = None
semgroups: list[str] | None = None
to_dict()
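
To make the shape of an entry concrete, here is a stand-in dataclass that mirrors the four documented fields. It is a sketch, not medkit's class: the name UMLSEntrySketch, the placeholder values, and the assumption that to_dict() returns the fields as a plain dictionary are illustrative only.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class UMLSEntrySketch:
    """Hypothetical stand-in mirroring the documented UMLSEntry fields."""

    cui: str
    term: str
    semtypes: Optional[List[str]] = None
    semgroups: Optional[List[str]] = None

    def to_dict(self) -> dict:
        # Assumption: the real to_dict() exposes the same four fields
        return asdict(self)


# Placeholder values, not real UMLS data
entry = UMLSEntrySketch(cui="C0000001", term="example term")
entry_dict = entry.to_dict()
```

Because semtypes and semgroups default to None, an entry built from the two required fields alone still serializes with all four keys.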
medkit.text.ner.umls_utils.load_umls_entries(mrconso_file: str | pathlib.Path, mrsty_file: str | pathlib.Path | None = None, sources: list[str] | None = None, languages: list[str] | None = None, show_progress: bool = False) → Iterator[UMLSEntry]

Load all terms and associated CUIs found in a UMLS MRCONSO.RRF file.

Parameters:
mrconso_file : str or Path

Path to the UMLS MRCONSO.RRF file

mrsty_file : str or Path, optional

Path to the UMLS MRSTY.RRF file. If provided, semantic type information (semtypes) will be included in the returned entries.

sources : list of str, optional

Sources to consider (e.g. ICD10, CCS). If none are provided, CUIs and terms of all sources will be taken into account.

languages : list of str, optional

Languages to consider. If none are provided, CUIs and terms of all languages will be taken into account.

show_progress : bool, default=False

Whether to show a progress bar

Returns:
iterator of UMLSEntry

Iterator over all term entries found in the UMLS install
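
The filtering described by the sources and languages parameters can be pictured with a small standalone sketch. It does not call medkit and is not its implementation. It relies only on the MRCONSO.RRF layout (pipe-delimited rows with the CUI in column 0, the language in column 1, the source abbreviation in column 11 and the term string in column 14). The helper names make_row and iter_terms, the source names and the identifiers are hypothetical, and the rows are synthetic.

```python
from typing import Iterable, Iterator, Optional, Sequence, Tuple

# Zero-based column positions in a pipe-delimited MRCONSO.RRF row
CUI_COL, LAT_COL, SAB_COL, STR_COL = 0, 1, 11, 14
N_COLS = 18


def make_row(cui: str, lang: str, source: str, term: str) -> str:
    """Build a synthetic MRCONSO-style row (hypothetical helper, demo only)."""
    fields = [""] * N_COLS
    fields[CUI_COL] = cui
    fields[LAT_COL] = lang
    fields[SAB_COL] = source
    fields[STR_COL] = term
    # Each row ends with a trailing pipe
    return "|".join(fields) + "|\n"


def iter_terms(
    rows: Iterable[str],
    sources: Optional[Sequence[str]] = None,
    languages: Optional[Sequence[str]] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (cui, term) pairs, keeping only the requested sources and languages."""
    for line in rows:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= STR_COL:
            continue  # skip malformed rows
        if sources is not None and fields[SAB_COL] not in sources:
            continue
        if languages is not None and fields[LAT_COL] not in languages:
            continue
        yield fields[CUI_COL], fields[STR_COL]


# Synthetic rows: identifiers and source names are placeholders
rows = [
    make_row("C0000001", "ENG", "SRC_A", "example term"),
    make_row("C0000001", "FRE", "SRC_A", "terme d'exemple"),
    make_row("C0000002", "ENG", "SRC_B", "another term"),
]
french = list(iter_terms(rows, languages=["FRE"]))
from_src_b = list(iter_terms(rows, sources=["SRC_B"]))
```

Yielding entries one at a time, as the documented iterator return type suggests, avoids holding a very large MRCONSO.RRF file in memory.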

medkit.text.ner.umls_utils.preprocess_term_to_match(term: str, lowercase: bool, normalize_unicode: bool, clean_nos: bool = True, clean_brackets: bool = False, clean_dashes: bool = False)

Preprocess a UMLS term for matching purposes.

Parameters:
term : str

Term to preprocess

lowercase : bool

Whether the term should be lowercased

normalize_unicode : bool

Whether the term should be converted to ASCII only (non-ASCII characters replaced by the closest ASCII characters)

clean_nos : bool, default=True

Whether to remove “NOS”

clean_brackets : bool, default=False

Whether to remove brackets

clean_dashes : bool, default=False

Whether to remove dashes
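
The options above can be approximated with the standard library alone. The sketch below is not medkit's implementation, and the function name preprocess_term_sketch is hypothetical. It makes several assumptions that the documentation does not confirm: "NOS" is removed only as a standalone token, "remove brackets" is taken to mean dropping bracketed spans together with their content, dashes are replaced by spaces, and ASCII folding uses NFKD decomposition (which drops characters that have no decomposition instead of mapping them to a close ASCII character).

```python
import re
import unicodedata


def preprocess_term_sketch(
    term: str,
    lowercase: bool,
    normalize_unicode: bool,
    clean_nos: bool = True,
    clean_brackets: bool = False,
    clean_dashes: bool = False,
) -> str:
    if clean_nos:
        # Assumption: only the standalone uppercase token "NOS" is removed
        term = re.sub(r"\bNOS\b", "", term)
    if clean_brackets:
        # Assumption: bracketed spans are dropped along with their content
        term = re.sub(r"\([^)]*\)|\[[^\]]*\]", "", term)
    if clean_dashes:
        # Assumption: dashes become spaces rather than being deleted
        term = term.replace("-", " ")
    if normalize_unicode:
        # Strip diacritics by decomposing, then discarding non-ASCII code points
        decomposed = unicodedata.normalize("NFKD", term)
        term = decomposed.encode("ascii", "ignore").decode("ascii")
    if lowercase:
        term = term.lower()
    # Collapse the whitespace left behind by the removals
    return " ".join(term.split())


folded = preprocess_term_sketch("Céphalée NOS", lowercase=True, normalize_unicode=True)
cleaned = preprocess_term_sketch(
    "Beta-blocker (drug)",
    lowercase=True,
    normalize_unicode=False,
    clean_brackets=True,
    clean_dashes=True,
)
```

Here folded is "cephalee" and cleaned is "beta blocker". Removing "NOS" before lowercasing is a deliberate choice in this sketch, so the uppercase token can still be recognized.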

medkit.text.ner.umls_utils.preprocess_acronym(term: str) → str | None

Detect whether a term contains an acronym with the expanded form in parentheses.

Return the acronym if one is detected.

This works for terms such as “ECG (ÉlectroCardioGramme)”, where the acronym can be rebuilt by taking the ASCII version of each uppercase letter inside the parentheses.

Parameters:
term : str

Term that may contain an acronym, e.g. “ECG (ÉlectroCardioGramme)”

Returns:
str, optional

The acronym in the term if any, else None, e.g. “ECG”
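
The rule described above (rebuild the acronym from the ASCII version of each uppercase letter inside the parentheses) can be sketched as follows. This is an illustration under assumptions, not medkit's code: the function name detect_acronym_sketch, the regular expression, and the requirement that the term consist solely of the acronym followed by the parenthesized expansion are all hypothetical.

```python
import re
import unicodedata
from typing import Optional

# Assumed shape: "<ACRONYM> (<expanded form>)" and nothing else
_ACRONYM_PATTERN = re.compile(r"^\s*(?P<acronym>\w+)\s*\((?P<expanded>[^)]+)\)\s*$")


def _to_ascii(text: str) -> str:
    # Decompose accented letters, then drop the non-ASCII combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def detect_acronym_sketch(term: str) -> Optional[str]:
    match = _ACRONYM_PATTERN.match(term)
    if match is None:
        return None
    acronym = match.group("acronym")
    # Keep only the uppercase letters of the expanded form
    initials = "".join(char for char in match.group("expanded") if char.isupper())
    # The acronym is accepted only if those letters rebuild it exactly
    if _to_ascii(initials) == acronym:
        return acronym
    return None


found = detect_acronym_sketch("ECG (ÉlectroCardioGramme)")
not_found = detect_acronym_sketch("ECG (electrocardiogram)")
```

In this sketch found is "ECG", while not_found is None because the expanded form contains no uppercase letters from which the acronym could be rebuilt.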

medkit.text.ner.umls_utils.guess_umls_version(path: str | pathlib.Path) → str

Try to infer the UMLS version (e.g. “2021AB”) from any UMLS-related path.

Parameters:
path : str or Path

Path to the root directory of the UMLS install or any file inside that directory

Returns:
str

UMLS version, estimated by finding the leaf-most folder in path that is neither “META”, “NET” nor “LEX”, nor a subfolder of any of these folders
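
The heuristic stated above can be sketched with pure path manipulation. This is an approximation, not medkit's implementation: the name guess_version_sketch and the example paths are hypothetical, the path is never checked against the filesystem, and a final component with a file extension is simply assumed to be a file.

```python
from pathlib import PurePosixPath

UMLS_SUBDIRS = ("META", "NET", "LEX")


def guess_version_sketch(path: str) -> str:
    parts = list(PurePosixPath(path).parts)
    for index, part in enumerate(parts):
        if part in UMLS_SUBDIRS:
            # Drop META/NET/LEX and everything below it
            parts = parts[:index]
            break
    else:
        # Assumption: a last component with an extension is a file, not a folder
        if parts and PurePosixPath(parts[-1]).suffix:
            parts = parts[:-1]
    if not parts or parts[-1] == "/":
        raise ValueError(f"Cannot guess a UMLS version from {path!r}")
    return parts[-1]


# Hypothetical install locations
from_file = guess_version_sketch("/data/umls/2021AB/META/MRCONSO.RRF")
from_root = guess_version_sketch("/data/umls/2021AB")
```

Both calls yield "2021AB". The result is only as reliable as the folder naming: if the install root is not named after the release, the guess will be wrong.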