medkit.text.ner._base_simstring_matcher

Classes

BaseSimstringMatcherRule

Rule to use with BaseSimstringMatcher.

BaseSimstringMatcherNormalization

Descriptor of normalization attributes to attach to entities created from a BaseSimstringMatcherRule.

BaseSimstringMatcher

Base class for entity matcher using the simstring fuzzy matching algorithm.

Functions

build_simstring_matcher_databases(simstring_db_file, ...)

Generate the databases needed by BaseSimstringMatcher.

Module Contents

class medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherRule

Rule to use with BaseSimstringMatcher.

Attributes:
term : str

Term to match using similarity-based fuzzy matching.

label : str

Label to use for the entities created when a match is found.

case_sensitive : bool, default=False

Whether to take case into account when looking for matches.

unicode_sensitive : bool, default=False

Whether to take non-ASCII characters into account when looking for matches. If False, ASCII-only versions of the rule term and input texts are used (non-ASCII chars replaced by closest ASCII chars).

normalizations : list of BaseSimstringMatcherNormalization, optional

List of normalization attributes that should be attached to the entities created.

term: str
label: str
case_sensitive: bool = False
unicode_sensitive: bool = False
normalizations: list[BaseSimstringMatcherNormalization]
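
The two sensitivity flags can be read as a normalization applied to both the rule term and the input text before they are compared. Below is a minimal sketch of that idea using only the standard library. `normalize_for_matching` is a hypothetical helper, not part of medkit. It assumes that a flag set to False means the normalized form is compared, and its accent stripping via `unicodedata` only approximates the "closest ASCII chars" replacement described above.

```python
import unicodedata


def normalize_for_matching(
    text: str, case_sensitive: bool = False, unicode_sensitive: bool = False
) -> str:
    """Hypothetical helper: normalize a term or a text before fuzzy matching."""
    if not unicode_sensitive:
        # Decompose accented characters (e.g. "è" becomes "e" + combining accent),
        # then drop every character that is not ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        text = decomposed.encode("ascii", "ignore").decode("ascii")
    if not case_sensitive:
        text = text.lower()
    return text


print(normalize_for_matching("Diabète"))                          # diabete
print(normalize_for_matching("Diabète", case_sensitive=True))     # Diabete
print(normalize_for_matching("Diabète", unicode_sensitive=True))  # diabète
```

With both flags left at False, "Diabète" and "diabete" normalize to the same string, so a rule on one would match the other exactly before any fuzzy comparison takes place.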
class medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherNormalization

Descriptor of normalization attributes to attach to entities created from a BaseSimstringMatcherRule.

Attributes:
kb_name : str

The name of the knowledge base we are referencing. Ex: “umls”

kb_id : int or str

The id of the entity in the knowledge base, for instance a CUI

kb_version : str, optional

The version of the knowledge base we are referencing. Ex: “202AB”

term : str, optional

Normalized version of the entity text in the knowledge base

kb_name: str
kb_id: int | str
kb_version: str | None = None
term: str | None = None
to_attribute(score: float) → medkit.core.text.EntityNormAttribute

Create a normalization attribute based on the normalization descriptor.

Parameters:
score : float

Score of similarity between the normalized term and the entity text

Returns:
EntityNormAttribute

Normalization attribute to add to entity

class medkit.text.ner._base_simstring_matcher.BaseSimstringMatcher(simstring_db_file: pathlib.Path, rules_db_file: pathlib.Path, threshold: float = 0.9, min_length: int = 3, max_length: int = 50, similarity: typing_extensions.Literal['cosine', 'dice', 'jaccard', 'overlap'] = 'jaccard', spacy_tokenization_language: str | None = None, blacklist: list[str] | None = None, same_beginning: bool = False, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)

Bases: medkit.core.text.NEROperation

Base class for entity matcher using the simstring fuzzy matching algorithm.

Parameters:
simstring_db_file : Path

Simstring database to use.

rules_db_file : Path

Rules database (in Python shelve format) mapping matched terms to the corresponding rules.

threshold : float, default=0.9

Minimum similarity (between 0.0 and 1.0) between a rule term and the text of an entity matched on that rule.

min_length : int, default=3

Minimum number of characters in matched entities.

max_length : int, default=50

Maximum number of characters in matched entities.

similarity : str, default="jaccard"

Similarity metric to use. One of "cosine", "dice", "jaccard" or "overlap".

spacy_tokenization_language : str, optional

2-letter code (ex: “fr”, “en”, etc.) designating the language of the spacy model to use for tokenization. If provided, spacy will be used to tokenize input segments and filter out some tokens based on their part-of-speech tags, such as determiners, conjunctions and prepositions. If None, a simple regexp-based tokenization will be used, which is faster but might give more false positives.

blacklist : list of str, optional

List of exact terms to ignore.

same_beginning : bool, default=False

Ignore all matches that start with a different character than the term of the rule. This can be convenient to get rid of false positives on words that are very similar but have opposite meanings because of a prefix, for instance “activation” and “inactivation”.

attrs_to_copy : list of str, optional

Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).

name : str, optional

Name describing the matcher (defaults to the class name).

uid : str, optional

Identifier of the matcher.
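
To make the interplay of similarity, threshold and same_beginning more concrete, here is a small self-contained sketch. It is not medkit's or simstring's implementation: `char_ngrams`, `similarity` and `accept` are hypothetical helpers, and the use of unpadded character trigrams as features is an assumption made for illustration. The four formulas are the standard set-based definitions of the cosine, dice, jaccard and overlap measures.

```python
import math


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Set of character n-grams of a string (assumed feature extraction)."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str, measure: str = "jaccard") -> float:
    """Set-based similarity between the n-gram sets of two strings."""
    x, y = char_ngrams(a), char_ngrams(b)
    common = len(x & y)
    if measure == "jaccard":
        return common / len(x | y)
    if measure == "dice":
        return 2 * common / (len(x) + len(y))
    if measure == "cosine":
        return common / math.sqrt(len(x) * len(y))
    if measure == "overlap":
        return common / min(len(x), len(y))
    raise ValueError(f"unknown measure: {measure}")


def accept(
    rule_term: str,
    candidate: str,
    threshold: float = 0.9,
    measure: str = "jaccard",
    same_beginning: bool = False,
) -> bool:
    """Hypothetical acceptance test combining threshold and same_beginning."""
    if same_beginning and candidate[:1] != rule_term[:1]:
        return False
    return similarity(rule_term, candidate, measure) >= threshold


print(similarity("activation", "inactivation", "jaccard"))  # 0.8
print(similarity("activation", "inactivation", "overlap"))  # 1.0
print(accept("activation", "inactivation", measure="overlap"))
print(accept("activation", "inactivation", measure="overlap", same_beginning=True))
```

In this sketch "inactivation" contains every trigram of "activation", so the overlap measure scores the pair at 1.0 and the match passes any threshold. Enabling same_beginning rejects it because the two strings start with different characters.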

init_args
min_length
max_length
threshold
similarity
blacklist
same_beginning
attrs_to_copy
_simstring_db_reader
measure
_rules_db
run(segments: list[medkit.core.text.Segment]) → list[medkit.core.text.Entity]

Return entities (with optional normalization attributes) matched in segments.

Parameters:
segments : list of Segment

List of segments in which to look for matches

Returns:
list of Entity

Entities found in segments (with optional normalization attributes)

_find_matches_in_segment(segment: medkit.core.text.Segment, spacy_doc: Any | None) → Iterator[medkit.core.text.Entity]

Return an iterator over the entities matched in a segment.

static _filter_overlapping_matches(matches: list[_Match]) → list[_Match]

Find and remove overlapping matches.

Among overlapping matches, keep the one with the best score, breaking ties by keeping the longest match.
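
One way to picture this filtering is a greedy pass over the matches ranked by score and then by length. The sketch below is an illustration under that assumption, not the library's actual code. `MatchSketch` is a hypothetical stand-in for the private `_Match` type, whose real fields are not documented here.

```python
from typing import NamedTuple


class MatchSketch(NamedTuple):
    """Hypothetical stand-in for a match: character span plus similarity score."""
    start: int
    end: int  # exclusive
    score: float


def filter_overlapping(matches: list[MatchSketch]) -> list[MatchSketch]:
    """Keep the best-scoring match of each overlapping group, longest on ties."""
    # Rank by best score first, then by longest span.
    ranked = sorted(matches, key=lambda m: (-m.score, -(m.end - m.start)))
    kept: list[MatchSketch] = []
    for m in ranked:
        # Keep a match only if it does not overlap anything already kept.
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    # Return the surviving matches in reading order.
    return sorted(kept, key=lambda m: m.start)


matches = [
    MatchSketch(0, 10, 0.90),
    MatchSketch(5, 20, 0.95),   # overlaps the first match, better score
    MatchSketch(25, 30, 0.90),
    MatchSketch(25, 35, 0.90),  # overlaps the previous match, same score, longer
]
for m in filter_overlapping(matches):
    print(m)
```

Here the spans (5, 20) and (25, 35) survive: the first wins on score, the second wins the tie on length.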

_build_entity(segment: medkit.core.text.Segment, match: _Match) → medkit.core.text.Entity

Build an entity from a match in a segment.

medkit.text.ner._base_simstring_matcher.build_simstring_matcher_databases(simstring_db_file: pathlib.Path, rules_db_file: pathlib.Path, rules: Iterable[BaseSimstringMatcherRule])

Generate the databases needed by BaseSimstringMatcher.

Parameters:
simstring_db_file : Path

Database used by the fuzzy matching simstring library.

rules_db_file : Path

Shelve database storing the mapping between terms to match and the corresponding BaseSimstringMatcherRule objects (one term to match may correspond to several rules).

rules : iterable of BaseSimstringMatcherRule

Rules to add to databases
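
The term-to-rules mapping held in the rules database can be illustrated with the standard library's shelve module alone. This is only a sketch of the data layout described above, not the function's real implementation. `build_rules_db_sketch` and `lookup_rules` are hypothetical helpers, plain dicts stand in for BaseSimstringMatcherRule objects, and the simstring database itself is left out.

```python
import shelve
import tempfile
from collections import defaultdict
from pathlib import Path


def build_rules_db_sketch(rules_db_file: Path, rules: list[dict]) -> None:
    """Store a mapping from each term to the list of rules sharing that term."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for rule in rules:
        grouped[rule["term"]].append(rule)
    # flag="n" always creates a new, empty database.
    with shelve.open(str(rules_db_file), flag="n") as db:
        for term, term_rules in grouped.items():
            db[term] = term_rules


def lookup_rules(rules_db_file: Path, term: str) -> list[dict]:
    """Return all rules registered for a term (empty list if none)."""
    with shelve.open(str(rules_db_file), flag="r") as db:
        return db.get(term, [])


with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = Path(tmp_dir) / "rules_db"
    build_rules_db_sketch(
        db_path,
        [
            {"term": "diabetes", "label": "disorder"},
            {"term": "diabetes", "label": "problem"},  # same term, second rule
            {"term": "asthma", "label": "disorder"},
        ],
    )
    print([rule["label"] for rule in lookup_rules(db_path, "diabetes")])
```

Storing a list per term is what lets one matched term yield several entities, one per rule, when a match is found.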