medkit.text.ner.umls_utils
Attributes
SEMGROUP_LABELS | Labels corresponding to UMLS semgroups |
SEMGROUPS | Valid UMLS semgroups |
Classes
UMLSEntry | Entry in MRCONSO.RRF file of a UMLS dictionary. |
Functions
load_umls_entries | Load all terms and associated CUIs found in a UMLS MRCONSO.RRF file. |
preprocess_term_to_match | Preprocess a UMLS term for matching purposes. |
preprocess_acronym | Detect if a term contains an acronym with the expanded form in parentheses. |
guess_umls_version | Try to infer UMLS version (ex: "2021AB") from any UMLS-related path. |
Module Contents
- medkit.text.ner.umls_utils.SEMGROUP_LABELS
Labels corresponding to UMLS semgroups
- medkit.text.ner.umls_utils.SEMGROUPS
Valid UMLS semgroups
- class medkit.text.ner.umls_utils.UMLSEntry
Entry in MRCONSO.RRF file of a UMLS dictionary.
- Attributes:
- cui: str
Unique identifier of the concept designated by the term
- term: str
Original version of the term
- semtypes: list of str, optional
Semantic types of the concept (TUIs)
- semgroups: list of str, optional
Semantic groups of the concept
- cui: str
- term: str
- semtypes: list[str] | None = None
- semgroups: list[str] | None = None
- to_dict()
- medkit.text.ner.umls_utils.load_umls_entries(mrconso_file: str | pathlib.Path, mrsty_file: str | pathlib.Path | None = None, sources: list[str] | None = None, languages: list[str] | None = None, show_progress: bool = False) -> Iterator[UMLSEntry]
Load all terms and associated CUIs found in a UMLS MRCONSO.RRF file.
- Parameters:
- mrconso_file: str or Path
Path to the UMLS MRCONSO.RRF file
- mrsty_file: str or Path, optional
Path to the UMLS MRSTY.RRF file. If provided, semantic type information is included in the returned entries.
- sources: list of str, optional
Sources to consider (ex: ICD10, CCS). If none are provided, CUIs and terms from all sources are taken into account.
- languages: list of str, optional
Languages to consider. If none are provided, CUIs and terms in all languages are taken into account.
- show_progress: bool, default=False
Whether to show a progress bar
- Returns:
- iterator of UMLSEntry
Iterator over all term entries found in the UMLS install
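To make the filtering behaviour concrete, here is a minimal sketch of how entries could be read from MRCONSO.RRF rows. It is not medkit's implementation: `SketchEntry` and `iter_mrconso_entries` are hypothetical names, the sample rows are made-up placeholders, and the sketch assumes the standard pipe-delimited MRCONSO column order (CUI at index 0, LAT at 1, SAB at 11, STR at 14).

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

# Column positions in a pipe-delimited MRCONSO.RRF row (assumed standard layout)
_CUI, _LAT, _SAB, _STR = 0, 1, 11, 14


@dataclass
class SketchEntry:
    """Hypothetical stand-in for UMLSEntry, limited to cui and term."""

    cui: str
    term: str


def iter_mrconso_entries(
    lines: Iterable[str],
    sources: Optional[list] = None,
    languages: Optional[list] = None,
) -> Iterator[SketchEntry]:
    """Lazily yield entries, keeping only the requested sources and languages."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= _STR:
            continue  # skip malformed or empty rows
        if sources is not None and fields[_SAB] not in sources:
            continue
        if languages is not None and fields[_LAT] not in languages:
            continue
        yield SketchEntry(cui=fields[_CUI], term=fields[_STR])


# Made-up rows for illustration only (not real UMLS content)
SAMPLE_LINES = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||ICD10|PT|X00|Example term|0|N||",
    "C0000002|FRE|P|L0000002|PF|S0000002|Y|A0000002||||MSHFRE|MH|D000|Terme d'exemple|0|N||",
]

french_entries = list(iter_mrconso_entries(SAMPLE_LINES, languages=["FRE"]))
icd10_entries = list(iter_mrconso_entries(SAMPLE_LINES, sources=["ICD10"]))
```

Yielding entries one at a time mirrors the documented `Iterator[UMLSEntry]` return type, so a caller can filter a large file without loading every row into memory.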
- medkit.text.ner.umls_utils.preprocess_term_to_match(term: str, lowercase: bool, normalize_unicode: bool, clean_nos: bool = True, clean_brackets: bool = False, clean_dashes: bool = False)
Preprocess a UMLS term for matching purposes.
- Parameters:
- term: str
Term to preprocess
- lowercase: bool
Whether the term should be lowercased
- normalize_unicode: bool
Whether the term should be converted to ASCII only (non-ASCII characters replaced by the closest ASCII characters)
- clean_nos: bool, default=True
Whether to remove “NOS”
- clean_brackets: bool, default=False
Whether to remove brackets
- clean_dashes: bool, default=False
Whether to remove dashes
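The effect of these flags can be illustrated with a rough standard-library sketch. It is an approximation under stated assumptions, not medkit's code: `sketch_preprocess_term` is a hypothetical name, "remove brackets" is assumed to mean dropping parenthesized content, dashes are assumed to be replaced by spaces, and the ASCII conversion here simply drops characters that have no Unicode decomposition rather than mapping them to the closest ASCII character.

```python
import re
import unicodedata


def sketch_preprocess_term(
    term: str,
    lowercase: bool,
    normalize_unicode: bool,
    clean_nos: bool = True,
    clean_brackets: bool = False,
    clean_dashes: bool = False,
) -> str:
    """Approximate the documented preprocessing options with the stdlib."""
    if lowercase:
        term = term.lower()
    if normalize_unicode:
        # Decompose accented characters, then drop the non-ASCII remainder
        decomposed = unicodedata.normalize("NFKD", term)
        term = decomposed.encode("ascii", "ignore").decode("ascii")
    if clean_nos:
        # "NOS" (not otherwise specified) as a standalone word, any case
        term = re.sub(r"\bnos\b", " ", term, flags=re.IGNORECASE)
    if clean_brackets:
        # Assumption: parenthesized content is removed along with the brackets
        term = re.sub(r"\([^)]*\)", " ", term)
    if clean_dashes:
        # Assumption: dashes become spaces so the words stay separated
        term = term.replace("-", " ")
    # Collapse the whitespace left behind by the removals
    return " ".join(term.split())


nos_result = sketch_preprocess_term("Diabetes NOS", lowercase=True, normalize_unicode=True)
accent_result = sketch_preprocess_term("Anémie", lowercase=True, normalize_unicode=True)
cleaned_result = sketch_preprocess_term(
    "Non-small cell (disorder)",
    lowercase=False,
    normalize_unicode=False,
    clean_brackets=True,
    clean_dashes=True,
)
```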
- medkit.text.ner.umls_utils.preprocess_acronym(term: str) -> str | None
Detect if a term contains an acronym with the expanded form in parentheses.
Return the acronym if one is detected.
This works for terms such as “ECG (ÉlectroCardioGramme)”, where the acronym can be rebuilt by taking the ASCII version of each uppercase letter inside the parentheses.
- Parameters:
- term: str
Term that may contain an acronym. Ex: “ECG (ÉlectroCardioGramme)”
- Returns:
- str, optional
The acronym in the term if any, else None. Ex: “ECG”
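The rebuilding rule described above can be sketched as follows. This is an illustrative approximation, not medkit's implementation: `sketch_preprocess_acronym` and the regular expression are assumptions, and the sketch only accepts terms made of an uppercase ASCII acronym followed by a single parenthesized expansion.

```python
import re
import unicodedata
from typing import Optional

# Assumed shape: "<ACRONYM> (<expanded form>)"
_ACRONYM_PATTERN = re.compile(r"^\s*(?P<acronym>[A-Z]+)\s*\((?P<expanded>[^()]+)\)\s*$")


def _to_ascii(text: str) -> str:
    """Strip accents by decomposing characters and dropping non-ASCII parts."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def sketch_preprocess_acronym(term: str) -> Optional[str]:
    """Return the acronym if the uppercase letters of the expansion rebuild it."""
    match = _ACRONYM_PATTERN.match(term)
    if match is None:
        return None
    acronym = match.group("acronym")
    # Collect the ASCII version of each uppercase letter inside the parentheses
    rebuilt = "".join(_to_ascii(char) for char in match.group("expanded") if char.isupper())
    return acronym if rebuilt == acronym else None


detected = sketch_preprocess_acronym("ECG (ÉlectroCardioGramme)")
not_detected = sketch_preprocess_acronym("Diabetes mellitus")
```

In this sketch, a term whose expansion carries no matching uppercase letters, such as "ECG (electrocardiogram)", yields None because the acronym cannot be rebuilt.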
- medkit.text.ner.umls_utils.guess_umls_version(path: str | pathlib.Path) -> str
Try to infer UMLS version (ex: “2021AB”) from any UMLS-related path.
- Parameters:
- path: str or Path
Path to the root directory of the UMLS install or any file inside that directory
- Returns:
- str
UMLS version, estimated by finding the leaf-most folder in path that is neither “META”, “NET”, nor “LEX”, nor a subfolder of any of these folders
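The heuristic described in the return value can be sketched with `pathlib`. This is a hedged illustration rather than medkit's code: `sketch_guess_umls_version` is a hypothetical name and the example paths are invented. Note that a file sitting directly in the root folder is only recognized as a file if it exists on disk.

```python
from pathlib import Path
from typing import Union

# Standard sub-folders of a UMLS install that never carry the version name
_UMLS_SUBDIRS = ("META", "NET", "LEX")


def sketch_guess_umls_version(path: Union[str, Path]) -> str:
    """Walk up from path until no component is META, NET or LEX."""
    path = Path(path)
    if path.is_file():
        # Start from the containing folder when given an existing file
        path = path.parent
    while any(part in _UMLS_SUBDIRS for part in path.parts):
        path = path.parent
    return path.name


# Invented example paths; they do not need to exist for the walk-up to work
from_file = sketch_guess_umls_version("/data/umls/2021AB/META/MRCONSO.RRF")
from_root = sketch_guess_umls_version("/data/umls/2021AB")
```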