medkit.text.ner.simstring_matcher#
Classes#
Rule to use with |
|
Descriptor of normalization attributes to attach to entities created from a |
|
Entity matcher relying on string similarity. |
Module Contents#
- class medkit.text.ner.simstring_matcher.SimstringMatcherRule#
Bases:
medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherRule
Rule to use with
SimstringMatcher
.- Attributes:
- term:
Term to match using similarity-based fuzzy matching
- label:
Label to use for the entities created when a match is found
- case_sensitive:
Whether to take case into account when looking for matches.
- unicode_sensitive:
Whether to use ASCII-only versions of the rule term and input texts when looking for matches (non-ASCII chars replaced by closest ASCII chars).
- normalizations:
Optional list of normalization attributes that should be attached to the entities created
- static from_dict(data: dict[str, Any]) SimstringMatcherRule #
Create a SimStringMatcherRule from a dict.
- class medkit.text.ner.simstring_matcher.SimstringMatcherNormalization#
Bases:
medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherNormalization
Descriptor of normalization attributes to attach to entities created from a
SimstringMatcherRule
.- Attributes:
- kb_name:
The name of the knowledge base we are referencing. Ex: “umls”
- kb_version:
The name of the knowledge base we are referencing. Ex: “202AB”
- kb_id:
The id of the entity in the knowledge base, for instance a CUI
- term:
Optional normalized version of the entity text in the knowledge base
- static from_dict(data: dict[str, Any]) SimstringMatcherNormalization #
Create a SimstringMatcherNormalization object from a dict.
- class medkit.text.ner.simstring_matcher.SimstringMatcher(rules: list[SimstringMatcherRule], threshold: float = 0.9, min_length: int = 3, max_length: int = 50, similarity: typing_extensions.Literal[cosine, dice, jaccard, overlap] = 'jaccard', spacy_tokenization_language: str | None = None, blacklist: list[str] | None = None, same_beginning: bool = False, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)#
Bases:
medkit.text.ner._base_simstring_matcher.BaseSimstringMatcher
Entity matcher relying on string similarity.
Uses the simstring fuzzy matching algorithm (http://chokkan.org/software/simstring/).
Note that setting spacy_tokenization_language to True might reduce the number of false positives. This requires the spacy optional dependency, which can be installed with pip install medkit-lib[spacy].
- Parameters:
- rules: list of SimstringMatcherRule
Rules to use for matching entities.
- threshold: float, default=0.9
Minimum similarity (between 0.0 and 1.0) between a rule term and the text of an entity matched on that rule.
- min_length: int, default=3
Minimum number of chars in matched entities.
- max_length: int, default=50
Maximum number of chars in matched entities.
- similarity: str, default=”jaccard”
Similarity metric to use.
- spacy_tokenization_language: str, optional
2-letter code (ex: “fr”, “en”, etc.) designating the language of the spacy model to use for tokenization. If provided, spacy will be used to tokenize input segments and filter out some tokens based on their part-of-speech tags, such as determinants, conjunctions and prepositions. If None, a simple regexp based tokenization will be used, which is faster but might give more false positives.
- blacklist: list of str, optional
Optional list of exact terms to ignore.
- same_beginning: bool, default=False
Ignore all matches that start with a different character than the term of the rule. This can be convenient to get rid of false positives on words that are very similar but have opposite meanings because of a preposition, for instance “activation” and “inactivation”.
- attrs_to_copy: list of str, optional
Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).
- name: str, optional
Name describing the matcher (defaults to the class name).
- uid: str, optional
Identifier of the matcher.
- _temp_dir#
- rules_db_file#
- simstring_db_file#
- static load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) list[SimstringMatcherRule] #
Load all rules stored in a yml file.
- Parameters:
- path_to_rules
The path to a yml file containing a list of mappings with the same structure as
SimstringMatcherRule
- encoding: str, optional
The encoding of the file to open
- Returns:
- List[SimstringMatcherRule]
List of all the rules in path_to_rules, can be used to init a
SimstringMatcher
- static save_rules(rules: list[SimstringMatcherRule], path_to_rules: pathlib.Path, encoding: str | None = None)#
Store rules in a yml file.
- Parameters:
- rules: list of SimstringMatcherRule
The rules to save
- path_to_rules: Path
The path to a yml file that will contain the rules
- encoding: str, optional
The encoding of the yml file