medkit.text.ner.umls_coder_normalizer#

Classes#

UMLSCoderNormalizer

Normalizer adding UMLS normalization attributes to pre-existing entities.

Module Contents#

class medkit.text.ner.umls_coder_normalizer.UMLSCoderNormalizer(umls_mrconso_file: str | pathlib.Path, language: str, model: str | pathlib.Path, embeddings_cache_dir: str | pathlib.Path, summary_method: typing_extensions.Literal['mean', 'cls'] = 'cls', normalize_embeddings: bool = True, lowercase: bool = False, normalize_unicode: bool = False, threshold: float | None = None, max_nb_matches: int = 1, device: int = -1, batch_size: int = 128, hf_auth_token: str | None = None, nb_umls_embeddings_chunks: int | None = None, hf_cache_dir: str | pathlib.Path | None = None, name: str | None = None, uid: str | None = None)#

Bases: medkit.core.Operation

Normalizer adding UMLS normalization attributes to pre-existing entities.

Based on GanjinZero/CODER.

A UMLS MRCONSO.RRF file is needed. The normalizer identifies UMLS concepts by comparing embeddings of reference UMLS terms with the embeddings of the input entities. Any text transformer model from the HuggingFace Hub can be used, but “GanjinZero/UMLSBert_ENG” was specifically trained for this task (for English).

When UMLSCoderNormalizer is used for the first time with a given MRCONSO.RRF file, the embeddings of all UMLS terms are pre-computed (this can take a very long time) and stored in embeddings_cache_dir, so they can be reused next time.

If another MRCONSO.RRF file is used, or if a parameter impacting the computation of embeddings (model, summary_method, etc.) is changed, then another embeddings_cache_dir must be used, or embeddings_cache_dir must be deleted so it can be re-created properly.

If the UMLS embeddings are too big to be held in memory, use nb_umls_embeddings_chunks.
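At its core, the matching step is a nearest-neighbor search in embedding space. The sketch below (using NumPy and toy random vectors, not medkit’s actual internals) illustrates why, with normalize_embeddings=True, a plain dot product between L2-normalized vectors yields the cosine similarity used to rank UMLS terms; all values and dimensions are illustrative assumptions:

```python
import numpy as np

# Toy data standing in for real embeddings; dimensions and values are
# illustrative only.
rng = np.random.default_rng(0)
umls_embeddings = rng.normal(size=(5, 8))       # 5 hypothetical UMLS term embeddings
entity_embedding = umls_embeddings[2] + 0.01 * rng.normal(size=8)  # close to term #2

def l2_normalize(x):
    # With normalize_embeddings=True, vectors are L2-normalized so that a
    # plain dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

umls_norm = l2_normalize(umls_embeddings)
entity_norm = l2_normalize(entity_embedding)

similarities = umls_norm @ entity_norm          # shape (5,): one score per UMLS term
best = int(np.argmax(similarities))             # index of the closest UMLS term
```

Here the entity embedding was built as a slightly perturbed copy of term #2, so the search recovers index 2 with a similarity close to 1.0.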

Parameters:
umls_mrconso_file : str or Path

Path to the UMLS MRCONSO.RRF file.

language : str

Language of the UMLS terms to use (ex: “ENG”, “FRE”).

model : str or Path

Name on the Hugging Face hub or path to the transformers model that will be used to extract embeddings (ex: “GanjinZero/UMLSBert_ENG”).

embeddings_cache_dir : str or Path

Path to the directory into which pre-computed embeddings of UMLS terms should be cached. If it doesn’t exist yet, the embeddings will be automatically generated (it can take a long time) and stored there, ready to be reused on further instantiations. If it already exists, a check will be done to make sure the params used when the embeddings were computed are consistent with the params of the current instance.

summary_method : {“mean”, “cls”}, default=”cls”

If set to “mean”, the extracted embeddings are the mean of the model’s output token embeddings (mean pooling). If set to “cls”, the embedding of the [CLS] token from the last hidden layer is used.

normalize_embeddings : bool, default=True

Whether to normalize the extracted embeddings.

lowercase : bool, default=False

Whether to use lowercased versions of UMLS terms and input entities.

normalize_unicode : bool, default=False

Whether to use ASCII-only versions of UMLS terms and input entities (non-ASCII chars replaced by closest ASCII chars).

threshold : float, optional

Minimum similarity threshold (between 0.0 and 1.0) between the embeddings of an entity and of a UMLS term for a normalization attribute to be added.

max_nb_matches : int, default=1

Maximum number of normalization attributes to add to each entity.

device : int, default=-1

Device to use for transformers models. Follows the Hugging Face convention (-1 for “cpu” and device number for GPU, for instance 0 for “cuda:0”).

batch_size : int, default=128

Number of entities in batches processed by the embeddings extraction pipeline.

hf_auth_token : str, optional

HuggingFace authentication token (to access private models on the hub).

nb_umls_embeddings_chunks : int, optional

Number of UMLS embeddings chunks to load at the same time when computing embeddings similarities (a chunk contains 65536 embeddings). If None, all pre-computed UMLS embeddings are pre-loaded in memory and similarities are computed in one shot. Otherwise, at each call to run(), UMLS embeddings are loaded by groups of chunks and similarities are computed for each group. Use this when the UMLS embeddings are too big to be fully loaded in memory. The higher this value, the more memory is needed.

hf_cache_dir : str or Path, optional

Directory where to store downloaded models. If not set, the default HuggingFace cache dir is used.

name : str, optional

Name describing the normalizer (defaults to the class name).

uid : str, optional

Identifier of the normalizer.
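To pick a sensible nb_umls_embeddings_chunks, it helps to estimate the memory footprint of one chunk. The arithmetic below assumes 768-dimensional float32 embeddings (the hidden size of BERT-base models such as “GanjinZero/UMLSBert_ENG”; the dimension and dtype are assumptions, not guarantees of this class), while the chunk size of 65536 embeddings comes from the parameter description above:

```python
# Rough per-chunk memory estimate for nb_umls_embeddings_chunks.
CHUNK_SIZE = 65536        # embeddings per chunk, as documented above
EMBEDDING_DIM = 768       # assumption: hidden size of a BERT-base model
BYTES_PER_FLOAT = 4       # assumption: float32 storage

bytes_per_chunk = CHUNK_SIZE * EMBEDDING_DIM * BYTES_PER_FLOAT
mib_per_chunk = bytes_per_chunk / 2**20
print(f"{mib_per_chunk:.0f} MiB per chunk")  # → 192 MiB per chunk
```

Under these assumptions, nb_umls_embeddings_chunks=4 would load roughly 768 MiB of embeddings at a time; scale the value to fit your available RAM.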

init_args#
umls_mrconso_file#
embeddings_cache_dir#
language#
model#
summary_method#
normalize_embeddings#
lowercase#
normalize_unicode#
threshold#
max_nb_matches#
device#
nb_umls_embeddings_chunks#
_pipeline: _EmbeddingsPipeline#
_umls_version#
umls_terms_file#
_umls_entries#
run(entities: list[medkit.core.text.Entity])#

Add normalization attributes to each entity in entities.

Each entity will have zero, one or more normalization attributes depending on max_nb_matches and on how many matches with a similarity above threshold are found.

Parameters:
entities : list of Entity

List of entities to add normalization attributes to.
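The interplay of threshold and max_nb_matches can be sketched with a small hypothetical helper (select_matches below is illustrative, not medkit’s API; the real operation attaches normalization attributes to entities rather than returning tuples, and the CUIs/scores are made-up example values):

```python
def select_matches(scores, threshold=None, max_nb_matches=1):
    """Keep the best-scoring matches above threshold, at most max_nb_matches.

    scores -- mapping from CUI to similarity score for one entity.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        # Drop matches whose similarity does not reach the threshold.
        ranked = [(cui, score) for cui, score in ranked if score >= threshold]
    return ranked[:max_nb_matches]

# Illustrative similarity scores for a single entity.
scores = {"C0011849": 0.93, "C0011860": 0.88, "C0241863": 0.42}
print(select_matches(scores, threshold=0.5, max_nb_matches=2))
# → [('C0011849', 0.93), ('C0011860', 0.88)]
```

With a high threshold an entity may end up with zero attributes, which matches the “zero, one or more” behavior described above.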

_find_best_matches(entities: list[medkit.core.text.Entity]) → tuple[list[list[int]], list[list[float]]]#
_load_umls_embeddings(files: list[pathlib.Path]) → torch#
_normalize_entity(entity: medkit.core.text.Entity, match_indices: list[int], match_scores: list[float])#
_build_umls_embeddings(show_progress=True)#