medkit.text.ner._base_simstring_matcher#
Classes#
BaseSimstringMatcherRule | Rule to use with BaseSimstringMatcher. |
BaseSimstringMatcherNormalization | Descriptor of normalization attributes to attach to entities created from a BaseSimstringMatcherRule. |
BaseSimstringMatcher | Base class for entity matcher using the simstring fuzzy matching algorithm. |
Functions#
build_simstring_matcher_databases(simstring_db_file, rules_db_file, rules) | Generate the databases needed by BaseSimstringMatcher. |
Module Contents#
- class medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherRule#
Rule to use with BaseSimstringMatcher.
- Attributes:
- term : str
Term to match using similarity-based fuzzy matching.
- label : str
Label to use for the entities created when a match is found.
- case_sensitive : bool, default=False
Whether to take case into account when looking for matches.
- unicode_sensitive : bool, default=False
Whether to take non-ASCII characters into account when looking for matches. If False, ASCII-only versions of the rule term and input texts are compared (non-ASCII chars replaced by closest ASCII chars).
- normalizations : list of BaseSimstringMatcherNormalization, optional
List of normalization attributes that should be attached to the entities created.
- term: str#
- label: str#
- case_sensitive: bool = False#
- unicode_sensitive: bool = False#
- normalizations: list[BaseSimstringMatcherNormalization]#
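The effect of case_sensitive and unicode_sensitive can be sketched with the standard library alone. The helper below is hypothetical and is not medkit's implementation; in particular, it only strips combining accents, whereas a full "closest ASCII" transliteration would also handle characters such as "ø" or "ß".

```python
import unicodedata


def normalize_for_matching(
    text: str, case_sensitive: bool = False, unicode_sensitive: bool = False
) -> str:
    """Hypothetical helper: prepare a rule term or input text for comparison."""
    if not unicode_sensitive:
        # Decompose accented characters, then drop the combining marks
        text = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in text if not unicodedata.combining(c))
    if not case_sensitive:
        text = text.lower()
    return text
```

With both flags left at False, "Diabète" and "diabete" normalize to the same string, so a rule written without accents can still match accented text.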
- class medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherNormalization#
Descriptor of normalization attributes to attach to entities created from a BaseSimstringMatcherRule.
- Attributes:
- kb_name : str
The name of the knowledge base we are referencing. Ex: “umls”
- kb_id : int or str
The id of the entity in the knowledge base, for instance a CUI.
- kb_version : str, optional
The version of the knowledge base we are referencing. Ex: “202AB”
- term : str, optional
Normalized version of the entity text in the knowledge base.
- kb_name: str#
- kb_id: int | str#
- kb_version: str | None = None#
- term: str | None = None#
- to_attribute(score: float) → medkit.core.text.EntityNormAttribute#
Create a normalization attribute based on the normalization descriptor.
- Parameters:
- score : float
Similarity score between the normalized term and the entity text.
- Returns:
- EntityNormAttribute
Normalization attribute to add to entity
- class medkit.text.ner._base_simstring_matcher.BaseSimstringMatcher(simstring_db_file: pathlib.Path, rules_db_file: pathlib.Path, threshold: float = 0.9, min_length: int = 3, max_length: int = 50, similarity: typing_extensions.Literal['cosine', 'dice', 'jaccard', 'overlap'] = 'jaccard', spacy_tokenization_language: str | None = None, blacklist: list[str] | None = None, same_beginning: bool = False, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)#
Bases: medkit.core.text.NEROperation
Base class for entity matcher using the simstring fuzzy matching algorithm.
- Parameters:
- simstring_db_file : Path
Simstring database to use.
- rules_db_file : Path
Rules database (in python shelve format) mapping matched terms to corresponding rules.
- threshold : float, default=0.9
Minimum similarity (between 0.0 and 1.0) between a rule term and the text of an entity matched on that rule.
- min_length : int, default=3
Minimum number of chars in matched entities.
- max_length : int, default=50
Maximum number of chars in matched entities.
- similarity : str, default="jaccard"
Similarity metric to use: one of "cosine", "dice", "jaccard" or "overlap".
- spacy_tokenization_language : str, optional
2-letter code (ex: “fr”, “en”, etc.) designating the language of the spacy model to use for tokenization. If provided, spacy will be used to tokenize input segments and filter out some tokens based on their part-of-speech tags, such as determiners, conjunctions and prepositions. If None, a simple regexp-based tokenization will be used, which is faster but might give more false positives.
- blacklist : list of str, optional
List of exact terms to ignore.
- same_beginning : bool, default=False
Ignore all matches that start with a different character than the term of the rule. This can be convenient to get rid of false positives on words that are very similar but have opposite meanings because of a prefix, for instance “activation” and “inactivation”.
- attrs_to_copy : list of str, optional
Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).
- name : str, optional
Name describing the matcher (defaults to the class name).
- uid : str, optional
Identifier of the matcher.
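To make the threshold and similarity parameters more concrete, here is a minimal sketch of the four measures computed on sets of character trigrams. The trigram features and both function names are assumptions made for illustration; the simstring library performs its own feature extraction and indexed lookup, so actual scores may differ.

```python
import math


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Set of character n-grams of text (a text shorter than n yields itself)."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str, measure: str = "jaccard", n: int = 3) -> float:
    """Similarity between 0.0 and 1.0 of two strings, based on shared n-grams."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    common = len(x & y)
    if measure == "jaccard":
        return common / len(x | y)
    if measure == "dice":
        return 2 * common / (len(x) + len(y))
    if measure == "cosine":
        return common / math.sqrt(len(x) * len(y))
    if measure == "overlap":
        return common / min(len(x), len(y))
    raise ValueError(f"unknown measure: {measure}")
```

Under these assumptions, "activation" and "inactivation" share all 8 trigrams of the shorter word out of 10 distinct trigrams overall, which gives a Jaccard similarity of 0.8 (below the default threshold of 0.9) but an overlap similarity of 1.0. That second case is the kind of false positive same_beginning is meant to discard.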
- init_args#
- min_length#
- max_length#
- threshold#
- similarity#
- blacklist#
- same_beginning#
- attrs_to_copy#
- _simstring_db_reader#
- measure#
- _rules_db#
- run(segments: list[medkit.core.text.Segment]) → list[medkit.core.text.Entity]#
Return entities (with optional normalization attributes) matched in segments.
- Parameters:
- segments : list of Segment
List of segments in which to look for matches.
- Returns:
- list of Entity
Entities found in segments (with optional normalization attributes)
- _find_matches_in_segment(segment: medkit.core.text.Segment, spacy_doc: Any | None) → Iterator[medkit.core.text.Entity]#
Return an iterator to the entities matched in a segment.
- static _filter_overlapping_matches(matches: list[_Match]) → list[_Match]#
Find and remove overlapping matches.
Among overlapping matches, keep the one with the best score, using the longest match to break ties.
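One way to read this rule is as a greedy selection, sketched below. The Match dataclass is a hypothetical stand-in for the private _Match type, and the greedy strategy is an assumption; the actual implementation may resolve overlaps differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for the private _Match type."""

    start: int
    end: int  # exclusive
    term: str
    score: float


def filter_overlapping(matches: list[Match]) -> list[Match]:
    # Visit the best candidates first: highest score, then longest span
    ranked = sorted(matches, key=lambda m: (-m.score, -(m.end - m.start)))
    kept: list[Match] = []
    for m in ranked:
        # Keep a match only if it is disjoint from every match already kept
        if all(m.end <= k.start or k.end <= m.start for k in kept):
            kept.append(m)
    # Return the surviving matches in text order
    return sorted(kept, key=lambda m: m.start)
```

For instance, if "diabetes" (score 0.9) and "diabetes mellitus" (score 1.0) both start at offset 0, only the second survives.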
- _build_entity(segment: medkit.core.text.Segment, match: _Match) → medkit.core.text.Entity#
Build an entity from a match in a segment.
- medkit.text.ner._base_simstring_matcher.build_simstring_matcher_databases(simstring_db_file: pathlib.Path, rules_db_file: pathlib.Path, rules: Iterable[BaseSimstringMatcherRule])#
Generate the databases needed by BaseSimstringMatcher.
- Parameters:
- simstring_db_file : Path
Database used by the fuzzy matching simstring library.
- rules_db_file : Path
Shelve database storing the mapping between terms to match and the corresponding BaseSimstringMatcherRule objects (one term to match may correspond to several rules).
- rules : iterable of BaseSimstringMatcherRule
Rules to add to databases.
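The term-to-rules mapping held in the rules database can be illustrated with the standard shelve module. This is only a sketch: Rule and build_rules_db are hypothetical names, rules are stored as plain dicts to keep the example independent of medkit classes, and the simstring database itself is not covered.

```python
import shelve
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Rule:
    """Hypothetical stand-in for BaseSimstringMatcherRule."""

    term: str
    label: str


def build_rules_db(rules_db_file: Path, rules: list[Rule]) -> None:
    # flag="n" always creates a new, empty database
    with shelve.open(str(rules_db_file), flag="n") as db:
        for rule in rules:
            # One term may map to several rules, so each value is a list.
            # Reassign the list: shelve does not track in-place mutation.
            db[rule.term] = db.get(rule.term, []) + [asdict(rule)]
```

At matching time, a term returned by the fuzzy lookup can then be used as a key to retrieve every rule that declares it.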