medkit.text.context

medkit.text.context#

Submodules#

Classes#

FamilyDetector

Annotator for creating family attributes.

FamilyDetectorRule

Regexp-based rule to use with FamilyDetector.

FamilyMetadata

Metadata dict added to family attributes with True value.

HypothesisDetector

Annotator detecting and creating hypothesis attributes.

HypothesisDetectorRule

Regexp-based rule to use with HypothesisDetector.

HypothesisRuleMetadata

Metadata added to hypothesis attributes with True value detected by a rule.

HypothesisVerbMetadata

Metadata added to hypothesis attributes with True value detected by a verb.

NegationDetector

Annotator creating negation attributes.

NegationDetectorRule

Regexp-based rule to use with NegationDetector.

NegationMetadata

Metadata dict added to negation attributes with True value.

Package Contents#

class medkit.text.context.FamilyDetector(output_label: str, rules: list[FamilyDetectorRule] | None = None, uid: str | None = None)#

Bases: medkit.core.text.ContextOperation

Annotator for creating family attributes.

Annotator creating family attributes with boolean values indicating if a family reference has been detected.

Because family attributes are attached to whole annotations, each input annotation should be “local” enough (i.e. a sentence or a syntagma) rather than a big chunk of text.

To detect family references, the module uses rules that may or may not be unicode-sensitive. When a rule is not unicode-sensitive, unicode chars are converted to the closest ASCII chars. However, some characters need to be pre-processed beforehand (e.g., n° -> number). So, if the converted text and the original text have different lengths, detection falls back on the original unicode text even if the rule is not unicode-sensitive. In that case, a warning is logged recommending that the data be pre-processed.

Note that for better results, family detection should be run at the sentence level (i.e. on sentence segments) rather than at the syntagma level [1].

Parameters:
output_label : str

The label of the created attributes

rules : list of FamilyDetectorRule, optional

The set of rules to use when detecting family references. If none provided, the rules in “family_detector_default_rules.yml” will be used

uid : str, optional

Identifier of the detector
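The ASCII conversion and length-based fallback described above can be sketched as follows. This is only an illustration using the standard library's `unicodedata`; the conversion medkit actually performs may rely on a different library, and the function name `to_ascii_or_fallback` is hypothetical.

```python
from __future__ import annotations

import unicodedata


def to_ascii_or_fallback(text: str) -> tuple[str, bool]:
    """Return (text_to_search, converted).

    The text is converted to its closest ASCII form. If the conversion
    changes the length, character offsets would no longer line up, so the
    original unicode text is returned instead.
    """
    # Decompose accented chars (e.g. "é" becomes "e" + combining accent),
    # then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    if len(ascii_text) != len(text):
        # A real detector would log a warning here recommending pre-processing.
        return text, False
    return ascii_text, True


# Accented letters keep the same length once converted.
converted = to_ascii_or_fallback("p\u00e8re diab\u00e9tique")
# The degree sign has no ASCII equivalent here, so the length changes.
fallback = to_ascii_or_fallback("n\u00b0 3")
```

In this sketch, the first call returns the ASCII text and the second returns the original text unchanged, which is why pre-processing such characters beforehand is recommended.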

References

[1] Garcelon, N., Neuraz, A., Benoit, V., Salomon, R., & Burgun, A. (2017). Improving a full-text search engine: the importance of negation detection and family history context to identify cases in a biomedical data warehouse. Journal of the American Medical Informatics Association, 24(3), 607-613. https://doi.org/10.1093/jamia/ocw144

init_args#
output_label#
rules#
_non_empty_text_pattern#
_patterns#
_exclusion_patterns#
_has_non_unicode_sensitive_rule#
run(segments: list[medkit.core.text.Segment])#

Run the operation.

Add a family attribute to each segment with a boolean value indicating if a family reference has been detected.

Family attributes with a True value have a metadata dict with fields described in FamilyMetadata.

Parameters:
segments : list of Segment

List of segments to detect as being family references or not

_detect_family_ref_in_segment(segment: medkit.core.text.Segment) -> medkit.core.Attribute | None#
_find_matching_rule(text: str) -> str | int | None#
static load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) -> list[FamilyDetectorRule]#

Load all rules stored in a yml file.

Parameters:
path_to_rules : Path

Path to a yml file containing a list of mappings with the same structure as FamilyDetectorRule

encoding : str, optional

Encoding of the file to open

Returns:
list of FamilyDetectorRule

List of all the rules in path_to_rules, can be used to init a FamilyDetector
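As a rough illustration of what “a list of mappings with the same structure as FamilyDetectorRule” means, the sketch below turns already-parsed mappings into rule objects and back. YAML parsing itself is left out; `SketchRule`, `rules_from_mappings` and the example patterns are hypothetical stand-ins, not the library's code.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field


@dataclass
class SketchRule:
    """Stand-in mirroring the documented fields of FamilyDetectorRule."""

    regexp: str
    exclusion_regexps: list[str] = field(default_factory=list)
    id: str | None = None
    case_sensitive: bool = False
    unicode_sensitive: bool = False


# What a parsed rules file could look like: one mapping per rule,
# with optional fields simply omitted.
parsed_mappings = [
    {"id": "mother", "regexp": r"\bmother\b", "exclusion_regexps": [r"\bmother tongue\b"]},
    {"regexp": r"\bfather\b", "case_sensitive": True},
]


def rules_from_mappings(mappings: list[dict]) -> list[SketchRule]:
    """Build one rule per mapping; missing keys fall back to the defaults."""
    return [SketchRule(**mapping) for mapping in mappings]


rules = rules_from_mappings(parsed_mappings)
# The reverse direction (what a save step would write out) is a list of dicts.
mappings_to_save = [asdict(rule) for rule in rules]
```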

static check_rules_sanity(rules: list[FamilyDetectorRule])#

Check consistency of a set of rules.

static save_rules(rules: list[FamilyDetectorRule], path_to_rules: pathlib.Path, encoding: str | None = None)#

Store rules in a YAML file.

Parameters:
rules : list of FamilyDetectorRule

The rules to save

path_to_rules : Path

Path to a .yml file that will contain the rules

encoding : str, optional

Encoding of the .yml file

class medkit.text.context.FamilyDetectorRule#

Regexp-based rule to use with FamilyDetector.

Input text may be converted before detecting rule.

Parameters:
regexp : str

The regexp pattern used to match a family reference

exclusion_regexps : list of str, optional

Optional exclusion patterns

id : str, optional

Unique identifier of the rule to store in the metadata of the entities

case_sensitive : bool, default=False

Whether to consider case when running regexp and exclusion_regexps

unicode_sensitive : bool, default=False

If True, rule matches are searched on the unicode text. If False, rule matches are searched on the closest ASCII text when possible (cf. FamilyDetector), so regexp and exclusion_regexps should not contain non-ASCII chars because they would never be matched.
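To make the interplay between regexp, exclusion_regexps and case_sensitive concrete, here is a minimal sketch using the standard `re` module. It assumes that a rule matches when its regexp is found in the text and none of its exclusion patterns is found anywhere in the same text; `SketchRule` and `rule_matches` are hypothetical names, and the real matching logic may differ.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class SketchRule:
    """Stand-in with a subset of the documented rule fields."""

    regexp: str
    exclusion_regexps: list[str] = field(default_factory=list)
    case_sensitive: bool = False


def rule_matches(rule: SketchRule, text: str) -> bool:
    """True if regexp is found and no exclusion pattern is found."""
    flags = 0 if rule.case_sensitive else re.IGNORECASE
    if re.search(rule.regexp, text, flags) is None:
        return False
    # Assumption: an exclusion pattern found anywhere in the text cancels the match.
    return not any(re.search(pattern, text, flags) for pattern in rule.exclusion_regexps)


rule = SketchRule(regexp=r"\bmother\b", exclusion_regexps=[r"\bmother tongue\b"])
matched = rule_matches(rule, "Her Mother has diabetes")
excluded = rule_matches(rule, "English is her mother tongue")
```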

regexp: str#
exclusion_regexps: list[str]#
id: str | None = None#
case_sensitive: bool = False#
unicode_sensitive: bool = False#
__post_init__()#
class medkit.text.context.FamilyMetadata#

Bases: typing_extensions.TypedDict

Metadata dict added to family attributes with True value.

Parameters:
rule_id : str or int

Identifier of the rule used to detect a family reference. If the rule has no id, then the index of the rule in the list of rules is used instead.

rule_id: str | int#
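The id-or-index behaviour of rule_id can be sketched as below. This assumes the first matching rule wins, which the text above does not state; the `RULES` pairs and `find_matching_rule_id` are hypothetical.

```python
from __future__ import annotations

import re

# Hypothetical (id, regexp) pairs standing in for FamilyDetectorRule objects.
RULES = [
    ("mother", r"\bmother\b"),
    (None, r"\bfather\b"),  # no id: its index in the list (1) is reported instead
]


def find_matching_rule_id(text: str) -> str | int | None:
    """Return the id (or the index, if the rule has no id) of the first matching rule."""
    for index, (rule_id, regexp) in enumerate(RULES):
        if re.search(regexp, text, re.IGNORECASE):
            # Compare against None explicitly so that index 0 is never lost.
            return rule_id if rule_id is not None else index
    return None


# Metadata dict in the shape described by FamilyMetadata.
metadata = {"rule_id": find_matching_rule_id("His father had a stroke")}
```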
class medkit.text.context.HypothesisDetector(output_label: str = 'hypothesis', rules: list[HypothesisDetectorRule] | None = None, verbs: dict[str, dict[str, dict[str, list[str]]]] | None = None, modes_and_tenses: list[tuple[str, str]] | None = None, max_length: int = 150, uid: str | None = None)#

Bases: medkit.core.text.ContextOperation

Annotator detecting and creating hypothesis attributes.

Hypothesis will be considered present either because of the presence of a certain text pattern in a segment, or because of the usage of a certain verb at a specific mode and tense (for instance conditional).

Because hypothesis attributes are attached to whole segments, each input segment should be “local” enough (i.e. a sentence or a syntagma) rather than a big chunk of text.

Parameters:
output_label : str, default="hypothesis"

The label of the created attributes

rules : list of HypothesisDetectorRule, optional

The set of rules to use when detecting hypothesis. If none provided, the rules in “hypothesis_detector_default_rules.yml” will be used

verbs : dict of str to dict, optional

Conjugated verb forms, to be used in association with modes_and_tenses. Conjugated forms of a verb at a specific mode and tense must be provided in nested dicts with the 1st key being the verb’s root, the 2nd key the mode and the 3rd key the tense. For instance, verbs["aller"]["indicatif"]["présent"] would hold the list ["vais", "vas", "va", "allons", "allez", "vont"]. When verbs is provided, modes_and_tenses must also be provided. If none provided, the verbs in “hypothesis_detector_default_verbs.yml” will be used.

modes_and_tenses : list of tuple of str, optional

List of tuples of all modes and tenses associated with hypothesis. Will be used to select conjugated forms in verbs that denote hypothesis.

max_length : int, default=150

Maximum number of characters in a hypothesis segment. Segments longer than this will never be considered as hypothesis

uid : str, optional

Identifier of the detector
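The way verbs and modes_and_tenses combine can be illustrated with a small sketch: only the conjugated forms found under the listed (mode, tense) pairs are kept, and a verb is reported by its root when one of those forms appears in the text. The helper names and the word-boundary matching are assumptions made for this example, not the detector's actual implementation.

```python
from __future__ import annotations

import re

# Nested dicts: verb root -> mode -> tense -> conjugated forms.
verbs = {
    "aller": {
        "indicatif": {"présent": ["vais", "vas", "va", "allons", "allez", "vont"]},
        "conditionnel": {"présent": ["irais", "irait", "irions", "iriez", "iraient"]},
    }
}
# Only the conditional is treated as denoting a hypothesis in this example.
modes_and_tenses = [("conditionnel", "présent")]


def build_patterns_by_verb(verbs, modes_and_tenses):
    """Compile one pattern per verb root from the selected forms."""
    patterns = {}
    for root, forms_by_mode in verbs.items():
        forms = []
        for mode, tense in modes_and_tenses:
            forms.extend(forms_by_mode.get(mode, {}).get(tense, []))
        if forms:
            alternation = "|".join(re.escape(form) for form in forms)
            patterns[root] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def find_matching_verb(patterns, text):
    """Return the root of the first verb with a selected form in text, else None."""
    for root, pattern in patterns.items():
        if pattern.search(text):
            return root
    return None


patterns = build_patterns_by_verb(verbs, modes_and_tenses)
conditional_hit = find_matching_verb(patterns, "il irait mieux")
indicative_miss = find_matching_verb(patterns, "il va mieux")
```

Forms of the indicative present are ignored here because ("indicatif", "présent") is not in modes_and_tenses.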

init_args#
output_label: str#
rules: list[HypothesisDetectorRule]#
verbs: dict[str, dict[str, dict[str, list[str]]]]#
modes_and_tenses: list[tuple[str, str]]#
max_length: int#
_patterns_by_verb#
_non_empty_text_pattern#
_patterns#
_exclusion_patterns#
_has_non_unicode_sensitive_rule#
run(segments: list[medkit.core.text.Segment])#

Run the operation.

Add a hypothesis attribute to each segment with a boolean value indicating if a hypothesis has been detected.

Hypothesis attributes with a True value have a metadata dict with fields described in either HypothesisRuleMetadata or HypothesisVerbMetadata.

Parameters:
segments : list of Segment

List of segments to detect as being hypothesis or not
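The effect of max_length on the per-segment boolean can be sketched as follows. Plain strings stand in for segments, and `is_hypothesis` is a hypothetical callable replacing the rule and verb matching; only the length cut-off is taken from the description above.

```python
def run_sketch(segment_texts, is_hypothesis, max_length=150):
    """Return one boolean per text; texts longer than max_length are always False."""
    values = []
    for text in segment_texts:
        if len(text) > max_length:
            # Too long to be considered a hypothesis, whatever it contains.
            values.append(False)
        else:
            values.append(bool(is_hypothesis(text)))
    return values


def mentions_if(text):
    """Toy matcher used only for this example."""
    return "if" in text.lower().split()


texts = [
    "If symptoms persist, consider surgery",
    "Patient has asthma",
    "word " * 40 + "if needed",  # 209 characters, above the default limit
]
values = run_sketch(texts, mentions_if)
```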

_detect_hypothesis_in_segment(segment: medkit.core.text.Segment) -> medkit.core.Attribute | None#
_find_matching_verb(text: str) -> str | None#
_find_matching_rule(text: str) -> str | int | None#
static load_verbs(path_to_verbs: pathlib.Path, encoding: str | None = None) -> dict[str, dict[str, dict[str, list[str]]]]#

Load all conjugated verb forms stored in a YAML file.

Conjugated verb forms at a specific mode and tense must be stored in nested mappings with the 1st key being the verb root, the 2nd key the mode and the 3rd key the tense.

Parameters:
path_to_verbs : Path

Path to a yml file containing a list of verb forms, arranged by mode and tense.

encoding : str, optional

Encoding of the file to open

Returns:
dict of str to dict

Dict of verb forms in path_to_verbs, can be used to init a HypothesisDetector

static load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) -> list[HypothesisDetectorRule]#

Load all rules stored in a YAML file.

Parameters:
path_to_rules : Path

Path to a yml file containing a list of mappings with the same structure as HypothesisDetectorRule

encoding : str, optional

Encoding of the file to open

Returns:
list of HypothesisDetectorRule

List of all the rules in path_to_rules, can be used to init a HypothesisDetector

classmethod get_example() -> HypothesisDetector#

Instantiate a HypothesisDetector with example rules and verbs, designed for usage with EDS documents.

static check_rules_sanity(rules: list[HypothesisDetectorRule])#

Check consistency of a set of rules.

static save_rules(rules: list[HypothesisDetectorRule], path_to_rules: pathlib.Path, encoding: str | None = None)#

Store rules in a YAML file.

Parameters:
rules : list of HypothesisDetectorRule

The rules to save

path_to_rules : Path

Path to a .yml file that will contain the rules

encoding : str, optional

Encoding of the .yml file

class medkit.text.context.HypothesisDetectorRule#

Regexp-based rule to use with HypothesisDetector.

Attributes:
regexp : str

The regexp pattern used to match a hypothesis

exclusion_regexps : list of str, optional

Optional exclusion patterns

id : str, optional

Unique identifier of the rule to store in the metadata of the entities

case_sensitive : bool, default=False

Whether to consider case when running regexp and exclusion_regexps

unicode_sensitive : bool, default=False

If False, all non-ASCII chars in the input text are replaced by the closest ASCII chars before running regexp and exclusion_regexps; in that case, regexp and exclusion_regexps shouldn’t contain non-ASCII chars because they would never be matched. If True, regexp and exclusion_regexps are run on the original unicode text.

regexp: str#
exclusion_regexps: list[str]#
id: str | None = None#
case_sensitive: bool = False#
unicode_sensitive: bool = False#
__post_init__()#
class medkit.text.context.HypothesisRuleMetadata#

Bases: typing_extensions.TypedDict

Metadata added to hypothesis attributes with True value detected by a rule.

Parameters:
type : str

Metadata type, here “rule” (used to differentiate between rule/verb metadata dicts)

rule_id : str

Identifier of the rule used to detect a hypothesis. If the rule has no id, then the index of the rule in the list of rules is used instead

type: typing_extensions.Literal['rule']#
rule_id: str#
class medkit.text.context.HypothesisVerbMetadata#

Bases: typing_extensions.TypedDict

Metadata added to hypothesis attributes with True value detected by a verb.

Parameters:
type : str

Metadata type, here “verb” (used to differentiate between rule/verb metadata dicts).

matched_verb : str

Root of the verb used to detect a hypothesis.

type: typing_extensions.Literal['verb']#
matched_verb: str#
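Since the type field is what distinguishes the two metadata shapes, consuming code can branch on it. The sketch below re-declares both shapes with the standard `typing` module so that it is self-contained; the class names ending in `Sketch` and the `describe` helper are hypothetical.

```python
from typing import Literal, TypedDict, Union


class RuleMetadataSketch(TypedDict):
    """Same fields as described for HypothesisRuleMetadata."""

    type: Literal["rule"]
    rule_id: str


class VerbMetadataSketch(TypedDict):
    """Same fields as described for HypothesisVerbMetadata."""

    type: Literal["verb"]
    matched_verb: str


def describe(metadata: Union[RuleMetadataSketch, VerbMetadataSketch]) -> str:
    """Explain which kind of detection produced the metadata."""
    if metadata["type"] == "rule":
        return f"matched rule {metadata['rule_id']}"
    return f"matched verb {metadata['matched_verb']}"


rule_description = describe({"type": "rule", "rule_id": "r1"})
verb_description = describe({"type": "verb", "matched_verb": "aller"})
```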
class medkit.text.context.NegationDetector(output_label: str, rules: list[NegationDetectorRule] | None = None, uid: str | None = None)#

Bases: medkit.core.text.ContextOperation

Annotator creating negation attributes.

Because negation attributes are attached to whole annotations, each input annotation should be “local” enough (i.e. a sentence or a syntagma) rather than a big chunk of text.

To detect negation, the module uses rules that may or may not be unicode-sensitive. When a rule is not unicode-sensitive, unicode chars are converted to the closest ASCII chars. However, some characters need to be pre-processed beforehand (e.g., n° -> number). So, if the converted text and the original text have different lengths, detection falls back on the original unicode text even if the rule is not unicode-sensitive. In that case, a warning is logged recommending that the data be pre-processed.

init_args#
output_label#
rules#
_non_empty_text_pattern#
_patterns#
_exclusion_patterns#
_has_non_unicode_sensitive_rule#
run(segments: list[medkit.core.text.Segment])#

Run the operation.

Add a negation attribute to each segment with a boolean value indicating if a negation has been found.

Negation attributes with a True value have a metadata dict with fields described in NegationMetadata.

Parameters:
segments : list of Segment

List of segments to detect as being negated or not
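The shape of what run produces can be pictured with plain dicts: one attribute per segment, carrying the output label, a boolean value, and a metadata dict only when the value is True. Everything in this sketch (the patterns, the dict layout, the `run_sketch` name) is an assumption for illustration; the real operation attaches attribute objects to the segments.

```python
import re

# Hypothetical negation patterns keyed by rule id.
PATTERNS = {
    "no": re.compile(r"\bno\b", re.IGNORECASE),
    "without": re.compile(r"\bwithout\b", re.IGNORECASE),
}


def run_sketch(segment_texts, output_label):
    """Return one attribute-like dict per text."""
    attributes = []
    for text in segment_texts:
        # Id of the first pattern found in the text, or None.
        rule_id = next(
            (rid for rid, pattern in PATTERNS.items() if pattern.search(text)),
            None,
        )
        attribute = {"label": output_label, "value": rule_id is not None}
        if rule_id is not None:
            # Only True attributes carry metadata (cf. NegationMetadata).
            attribute["metadata"] = {"rule_id": rule_id}
        attributes.append(attribute)
    return attributes


attributes = run_sketch(["No fever reported", "Patient has asthma"], "is_negated")
```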

_detect_negation_in_segment(segment: medkit.core.text.Segment) -> medkit.core.Attribute | None#
_find_matching_rule(text: str) -> str | int | None#
static load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) -> list[NegationDetectorRule]#

Load all rules stored in a yml file.

Parameters:
path_to_rules : Path

Path to a yml file containing a list of mappings with the same structure as NegationDetectorRule

encoding : str, optional

Encoding of the file to open

Returns:
list of NegationDetectorRule

List of all the rules in path_to_rules, can be used to init a NegationDetector

static check_rules_sanity(rules: list[NegationDetectorRule])#

Check consistency of a set of rules.

static save_rules(rules: list[NegationDetectorRule], path_to_rules: pathlib.Path, encoding: str | None = None)#

Store rules in a yml file.

Parameters:
rules : list of NegationDetectorRule

The rules to save

path_to_rules : Path

Path to a .yml file that will contain the rules

encoding : str, optional

Encoding of the .yml file

class medkit.text.context.NegationDetectorRule#

Regexp-based rule to use with NegationDetector.

Input text may be converted before detecting rule.

Parameters:
regexp : str

The regexp pattern used to match a negation

exclusion_regexps : list of str, optional

Optional exclusion patterns

id : str, optional

Unique identifier of the rule to store in the metadata of the entities

case_sensitive : bool, default=False

Whether to consider case when running regexp and exclusion_regexps

unicode_sensitive : bool, default=False

If True, rule matches are searched on the unicode text. If False, rule matches are searched on the closest ASCII text when possible (cf. NegationDetector), so regexp and exclusion_regexps should not contain non-ASCII chars because they would never be matched.

regexp: str#
exclusion_regexps: list[str]#
id: str | None = None#
case_sensitive: bool = False#
unicode_sensitive: bool = False#
__post_init__()#
class medkit.text.context.NegationMetadata#

Bases: typing_extensions.TypedDict

Metadata dict added to negation attributes with True value.

Parameters:
rule_id : str or int

Identifier of the rule used to detect a negation. If the rule has no id, then the index of the rule in the list of rules is used instead.

rule_id: str | int#