medkit.text.ner.regexp_matcher#
Classes#
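RegexpMatcherRule | Regexp-based rule to use with RegexpMatcher.
RegexpMatcherNormalization | Descriptor of normalization attributes to attach to entities created from a RegexpMatcherRule.
RegexpMetadata | Metadata dict added to entities matched by RegexpMatcher.
RegexpMatcher | Entity annotator relying on regexp-based rules.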
Module Contents#
- class medkit.text.ner.regexp_matcher.RegexpMatcherRule#
Regexp-based rule to use with RegexpMatcher.
- Attributes:
- regexp: str
The regexp pattern used to match entities
- label: str
The label to attribute to entities created based on this rule
- term: str, optional
The optional normalized version of the entity text
- id: str, optional
Unique identifier of the rule to store in the metadata of the entities
- version: str, optional
Version string to store in the metadata of the entities
- index_extract: int, default=0
If the regexp has groups, the index of the group to use to extract the entity
- case_sensitive: bool, default=True
Whether regexp and exclusion_regexp are matched case-sensitively. If False, case is ignored when running both patterns
- unicode_sensitive: bool, default=True
If True, regexp rule matches are searched on the unicode text. If False, regexp rule matches are searched on the closest ASCII text when possible (cf. RegexpMatcher); in that case regexp and exclusion_regexp should not contain non-ASCII chars, because they would never be matched.
- exclusion_regexp: str, optional
An optional exclusion pattern. Note that this exclusion pattern will be executed on the whole input annotation, so when relying on exclusion_regexp make sure the input annotations passed to RegexpMatcher are “local”-enough (sentences or syntagmas) rather than the whole text or paragraphs
- normalizations: list of RegexpMatcherNormalization, optional
Optional list of normalization attributes that should be attached to the entities created
- regexp: str#
- label: str#
- term: str | None = None#
- id: str | None = None#
- version: str | None = None#
- index_extract: int = 0#
- case_sensitive: bool = True#
- unicode_sensitive: bool = True#
- exclusion_regexp: str | None = None#
- normalizations: list[RegexpMatcherNormalization]#
- __post_init__()#
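The interaction between regexp, index_extract, case_sensitive and exclusion_regexp can be illustrated with a minimal sketch built only on the standard `re` module. `SketchRule` and `find_spans` are hypothetical stand-ins written for this illustration and are not part of medkit; the sketch assumes, following the attribute descriptions above, that the exclusion pattern is tested against the whole input text.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SketchRule:
    """Hypothetical stand-in mirroring a few RegexpMatcherRule fields."""

    regexp: str
    label: str
    index_extract: int = 0
    case_sensitive: bool = True
    exclusion_regexp: Optional[str] = None


def find_spans(rule: SketchRule, text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for each match of the rule in text."""
    flags = 0 if rule.case_sensitive else re.IGNORECASE
    # Assumption: the exclusion pattern is tested against the whole input text,
    # so a single hit anywhere suppresses every match of this rule.
    if rule.exclusion_regexp is not None and re.search(
        rule.exclusion_regexp, text, flags
    ):
        return []
    spans = []
    for match in re.finditer(rule.regexp, text, flags):
        # index_extract selects the group that delimits the entity
        # (0 means the whole match).
        start, end = match.span(rule.index_extract)
        spans.append((start, end, text[start:end]))
    return spans


rule = SketchRule(
    regexp=r"\b(diabetes)\s+type\s+2\b",
    label="disease",
    index_extract=1,
    case_sensitive=False,
    exclusion_regexp=r"\bno\s+history\s+of\b",
)

print(find_spans(rule, "Patient has Diabetes type 2."))    # [(12, 20, 'Diabetes')]
print(find_spans(rule, "No history of diabetes type 2."))  # []
```

Because the exclusion test in this sketch looks at the whole input, a long text containing "no history of" anywhere would suppress unrelated matches, which is why short inputs such as sentences are recommended when exclusion_regexp is used.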
- class medkit.text.ner.regexp_matcher.RegexpMatcherNormalization#
Descriptor of normalization attributes to attach to entities created from a RegexpMatcherRule.
- Attributes:
- kb_name: str
The name of the knowledge base we are referencing. Ex: “umls”
- kb_version: str, optional
The version of the knowledge base we are referencing. Ex: “202AB”
- kb_id: Any
The id of the entity in the knowledge base, for instance a CUI
- kb_name: str#
- kb_id: Any#
- kb_version: str | None = None#
- class medkit.text.ner.regexp_matcher.RegexpMetadata#
Bases: typing_extensions.TypedDict
Metadata dict added to entities matched by RegexpMatcher.
- Parameters:
- rule_id: str or int
Identifier of the rule used to match an entity. If the rule has no id, then the index of the rule in the list of rules is used instead.
- version: str, optional
Optional version of the rule used to match an entity
- rule_id: str | int#
- version: str | None#
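The fallback described for rule_id (use the rule's id if it has one, otherwise its index in the list of rules) can be sketched as follows. `SketchMetadata` and `build_metadata` are hypothetical names used only for this illustration, and the sketch uses the standard library `typing.TypedDict` rather than `typing_extensions`.

```python
from typing import Optional, TypedDict, Union


class SketchMetadata(TypedDict):
    """Hypothetical stand-in with the same two fields as RegexpMetadata."""

    rule_id: Union[str, int]
    version: Optional[str]


def build_metadata(
    rule_index: int, rule_id: Optional[str], version: Optional[str]
) -> SketchMetadata:
    """Use the rule's own id when present, else its position in the rule list."""
    effective_id: Union[str, int] = rule_id if rule_id is not None else rule_index
    return SketchMetadata(rule_id=effective_id, version=version)


print(build_metadata(0, "dm2", "1.0"))  # {'rule_id': 'dm2', 'version': '1.0'}
print(build_metadata(2, None, None))    # {'rule_id': 2, 'version': None}
```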
- class medkit.text.ner.regexp_matcher.RegexpMatcher(rules: list[RegexpMatcherRule] | None = None, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)#
Bases: medkit.core.text.NEROperation
Entity annotator relying on regexp-based rules.
To detect entities, the matcher uses rules that may or may not be unicode-sensitive. For a rule that is not unicode-sensitive, the text is first converted to its closest ASCII equivalent. Some characters cannot be converted without changing the text length and need to be pre-processed beforehand (e.g., n° -> number). If the converted text and the original text have different lengths, matching falls back to the original unicode text even for rules that are not unicode-sensitive, and a warning is logged recommending that the data be pre-processed.
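The fallback logic described above can be sketched with the standard library alone. This is an approximation under stated assumptions: `to_closest_ascii` and `text_for_matching` are hypothetical helpers, and the ASCII conversion shown here (NFKD decomposition, then dropping whatever is still non-ASCII) is a stand-in for whatever transliteration the library actually uses. The length check presumably exists so that match offsets stay aligned with the original text.

```python
import logging
import unicodedata

logger = logging.getLogger(__name__)


def to_closest_ascii(text: str) -> str:
    """Approximate ASCII conversion (assumption: not medkit's actual method)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks such as the acute accent left over from "é".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII is dropped, which can change the length.
    return stripped.encode("ascii", "ignore").decode("ascii")


def text_for_matching(text: str, unicode_sensitive: bool) -> str:
    """Pick the text a rule should be run on."""
    if unicode_sensitive:
        return text
    ascii_text = to_closest_ascii(text)
    if len(ascii_text) != len(text):
        # Offsets would no longer line up with the original text,
        # so fall back to the unicode text and warn.
        logger.warning("Lengths differ after ASCII conversion; pre-process the text")
        return text
    return ascii_text


print(text_for_matching("caf\u00e9", unicode_sensitive=False))    # cafe
print(text_for_matching("n\u00b0 12", unicode_sensitive=False))   # n° 12 (fallback)
```

In the second call the degree sign has no same-length ASCII equivalent in this sketch, so the original text is returned and a warning is logged.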
- init_args#
- rules#
- attrs_to_copy#
- _patterns#
- _exclusion_patterns#
- _has_non_unicode_sensitive_rule#
- run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Entity] #
Return entities (with optional normalization attributes) matched in segments.
- Parameters:
- segments: list of Segment
List of segments into which to look for matches
- Returns:
- list of Entity:
Entities found in segments (with optional normalization attributes). Entities have a metadata dict with fields described in RegexpMetadata
- _find_matches_in_segment(segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Entity] #
- _find_matches_in_segment_for_rule(rule_index: int, segment: medkit.core.text.Segment, text_ascii: str | None) -> Iterator[medkit.core.text.Entity] #
- static _create_norm_attr(norm: RegexpMatcherNormalization) -> medkit.core.text.EntityNormAttribute #
- static load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) -> list[RegexpMatcherRule] #
Load all rules stored in a yml file.
- Parameters:
- path_to_rules: Path
Path to a yml file containing a list of mappings with the same structure as RegexpMatcherRule
- encoding: str, optional
Encoding of the file to open
- Returns:
- list of RegexpMatcherRule
List of all the rules in path_to_rules, can be used to init a RegexpMatcher
- static check_rules_sanity(rules: list[RegexpMatcherRule])#
Check consistency of a set of rules.
- static save_rules(rules: list[RegexpMatcherRule], path_to_rules: pathlib.Path, encoding: str | None = None)#
Store rules in a yml file.
- Parameters:
- rules: list of RegexpMatcherRule
The rules to save
- path_to_rules: Path
Path to a .yml file that will contain the rules
- encoding: str, optional
Encoding of the .yml file