medkit.text.ner#
Submodules#
- medkit.text.ner._base_simstring_matcher
- medkit.text.ner.adicap_norm_attribute
- medkit.text.ner.date_attribute
- medkit.text.ner.duckling_matcher
- medkit.text.ner.edsnlp_date_matcher
- medkit.text.ner.edsnlp_tnm_matcher
- medkit.text.ner.hf_entity_matcher
- medkit.text.ner.hf_entity_matcher_trainable
- medkit.text.ner.hf_tokenization_utils
- medkit.text.ner.iamsystem_matcher
- medkit.text.ner.nlstruct_entity_matcher
- medkit.text.ner.quick_umls_matcher
- medkit.text.ner.regexp_matcher
- medkit.text.ner.simstring_matcher
- medkit.text.ner.tnm_attribute
- medkit.text.ner.umls_coder_normalizer
- medkit.text.ner.umls_matcher
- medkit.text.ner.umls_utils
Classes#
ADICAPNormAttribute | Attribute describing tissue sample using the ADICAP coding.
DateAttribute | Attribute representing an absolute date or time associated to a segment or entity.
DurationAttribute | Attribute representing a time quantity associated to a segment or entity.
RelativeDateAttribute | Attribute representing a relative date or time associated to a segment or entity.
RelativeDateDirection | Direction of a RelativeDateAttribute.
DucklingMatcher | Entity annotator using Duckling (facebook/duckling).
RegexpMatcher | Entity annotator relying on regexp-based rules.
RegexpMatcherNormalization | Descriptor of normalization attributes to attach to entities created from a RegexpMatcherRule.
RegexpMatcherRule | Regexp-based rule to use with RegexpMatcher.
RegexpMetadata | Metadata dict added to entities matched by RegexpMatcher.
SimstringMatcher | Entity matcher relying on string similarity.
SimstringMatcherNormalization | Descriptor of normalization attributes to attach to entities created from a SimstringMatcherRule.
SimstringMatcherRule | Rule to use with SimstringMatcher.
UMLSMatcher | Entity annotator identifying UMLS concepts using the simstring fuzzy matching algorithm.
Package Contents#
- class medkit.text.ner.ADICAPNormAttribute(code: str, sampling_mode: str | None = None, technic: str | None = None, organ: str | None = None, pathology: str | None = None, pathology_type: str | None = None, behaviour_type: str | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)#
Bases:
medkit.core.text.EntityNormAttribute
Attribute describing tissue sample using the ADICAP coding.
ADICAP: Association pour le Développement de l’Informatique en Cytologie et Anatomo-Pathologie
This class is replicating EDS-NLP’s Adicap class, making it a medkit Attribute.
The code field fully describes the tissue sample. Additional information is derived from code in human-readable fields (sampling_mode, technic, organ, pathology, pathology_type, behaviour_type).
- Attributes:
- uid:
Identifier of the attribute
- label:
The attribute label, always set to EntityNormAttribute.LABEL
- value:
ADICAP code prefixed with “adicap:” (ex: “adicap:BHGS0040”)
- code:
ADICAP code as a string (ex: “BHGS0040”)
- kb_id:
Same as code
- sampling_mode:
Sampling mode (ex: “BIOPSIE CHIRURGICALE”)
- technic:
Sampling technic (ex: “HISTOLOGIE ET CYTOLOGIE PAR INCLUSION”)
- organ:
Organ and regions (ex: “SEIN (ÉGALEMENT UTILISÉ CHEZ L’HOMME)”)
- pathology:
General pathology (ex: “PATHOLOGIE GÉNÉRALE NON TUMORALE”)
- pathology_type:
Pathology type (ex: “ETAT SUBNORMAL - LESION MINEURE”)
- behaviour_type:
Behaviour type (ex: “CARACTERES GENERAUX”)
- metadata:
Metadata of the attribute
- sampling_mode: str | None#
- technic: str | None#
- organ: str | None#
- pathology: str | None#
- pathology_type: str | None#
- behaviour_type: str | None#
- property code: str#
- to_dict() dict[str, Any] #
- classmethod from_dict(adicap_dict: dict[str, Any]) typing_extensions.Self #
Create an Attribute from a dict.
- Parameters:
- adicap_dict: dict of str to Any
A dictionary from a serialized Attribute as generated by to_dict()
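The relationship between code, value and kb_id, and the to_dict()/from_dict() round trip, can be sketched with a small stand-in class. This is only an illustration: AdicapSketch is a hypothetical name, it carries a single optional field, and it does not reproduce medkit's actual serialization format.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class AdicapSketch:
    """Hypothetical stand-in for illustration, not medkit's ADICAPNormAttribute."""

    code: str
    organ: Optional[str] = None

    @property
    def value(self) -> str:
        # The documented value is the code prefixed with "adicap:"
        return f"adicap:{self.code}"

    @property
    def kb_id(self) -> str:
        # The documented kb_id is the same as the code
        return self.code

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, adicap_dict: dict[str, Any]) -> "AdicapSketch":
        return cls(**adicap_dict)


attr = AdicapSketch(code="BHGS0040")
restored = AdicapSketch.from_dict(attr.to_dict())
print(attr.value, restored == attr)
```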
- class medkit.text.ner.DateAttribute(label: str, year: int | None = None, month: int | None = None, day: int | None = None, hour: int | None = None, minute: int | None = None, second: int | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)#
Bases:
medkit.core.Attribute
Attribute representing an absolute date or time associated to a segment or entity.
The date or time can be incomplete: each date/time component is optional but at least one must be provided.
- Attributes:
- uid: str
Identifier of the attribute
- label: str
Label of the attribute
- value: Any, optional
String representation of the date with YYYY-MM-DD format for the date part and HH:MM:SS for the time part, if present. Missing components are replaced with question marks.
- year: int, optional
Year component of the date
- month: int, optional
Month component of the date
- day: int, optional
Day component of the date
- hour: int, optional
Hour component of the time
- minute: int, optional
Minute component of the time
- second: int, optional
Second component of the time
- metadata: dict of str to Any
Metadata of the attribute
- year: int | None#
- month: int | None#
- day: int | None#
- hour: int | None#
- minute: int | None#
- second: int | None#
- value#
- to_brat() str #
Return a value compatible with the brat format.
- to_spacy() str #
Return a value compatible with spaCy.
- to_dict() dict[str, Any] #
- classmethod from_dict(date_dict: dict[str, Any]) typing_extensions.Self #
Create an Attribute from a dict.
- Parameters:
- date_dict: dict of str to Any
A dictionary from a serialized Attribute as generated by to_dict()
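The value format described for DateAttribute (missing components replaced with question marks) can be sketched as follows. This is a minimal illustration rather than medkit's code: format_partial_date is a hypothetical helper, and the single space between the date part and the time part is an assumption.

```python
from typing import Optional


def _fmt(component: Optional[int], width: int) -> str:
    # A missing component is rendered as question marks of the same width
    return "?" * width if component is None else f"{component:0{width}d}"


def format_partial_date(
    year: Optional[int] = None,
    month: Optional[int] = None,
    day: Optional[int] = None,
    hour: Optional[int] = None,
    minute: Optional[int] = None,
    second: Optional[int] = None,
) -> str:
    """Hypothetical helper: render an incomplete date/time as a string."""
    components = (year, month, day, hour, minute, second)
    if all(c is None for c in components):
        raise ValueError("At least one date/time component must be provided")

    date_part = "-".join([_fmt(year, 4), _fmt(month, 2), _fmt(day, 2)])
    if hour is None and minute is None and second is None:
        return date_part
    time_part = ":".join([_fmt(hour, 2), _fmt(minute, 2), _fmt(second, 2)])
    # Assumption: date and time parts are separated by a single space
    return f"{date_part} {time_part}"


print(format_partial_date(year=2012, day=3))
```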
- class medkit.text.ner.DurationAttribute(label: str, years: int = 0, months: int = 0, weeks: int = 0, days: int = 0, hours: int = 0, minutes: int = 0, seconds: int = 0, metadata: dict[str, Any] | None = None, uid: str | None = None)#
Bases:
medkit.core.Attribute
Attribute representing a time quantity associated to a segment or entity.
Each date/time component is optional but at least one must be provided.
- Attributes:
- uid: str
Identifier of the attribute
- label: str
Label of the attribute
- value: Any, optional
String representation of the duration (ex: “1 year 10 months 2 days”)
- years: int
Year component of the date quantity
- months: int
Month component of the date quantity
- weeks: int
Week component of the date quantity
- days: int
Day component of the date quantity
- hours: int
Hour component of the time quantity
- minutes: int
Minute component of the time quantity
- seconds: int
Second component of the time quantity
- metadata: dict of str to Any
Metadata of the attribute
- years: int#
- months: int#
- weeks: int#
- days: int#
- hours: int#
- minutes: int#
- seconds: int#
- value#
- to_brat() str #
Return a value compatible with the brat format.
- to_spacy() str #
Return a value compatible with spaCy.
- to_dict() dict[str, Any] #
- classmethod from_dict(duration_dict: dict[str, Any]) typing_extensions.Self #
Create an Attribute from a dict.
- Parameters:
- duration_dict: dict of str to Any
A dictionary from a serialized Attribute as generated by to_dict()
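A string such as “1 year 10 months 2 days” can be built from the duration components by skipping zero components and pluralizing the unit names. The sketch below assumes that approach; format_duration is a hypothetical helper and may differ from how medkit builds the value.

```python
UNITS = ("year", "month", "week", "day", "hour", "minute", "second")


def format_duration(
    years: int = 0,
    months: int = 0,
    weeks: int = 0,
    days: int = 0,
    hours: int = 0,
    minutes: int = 0,
    seconds: int = 0,
) -> str:
    """Hypothetical helper: render a duration, skipping zero components."""
    amounts = (years, months, weeks, days, hours, minutes, seconds)
    if not any(amounts):
        raise ValueError("At least one component must be non-zero")
    parts = [
        f"{amount} {unit}{'' if amount == 1 else 's'}"
        for amount, unit in zip(amounts, UNITS)
        if amount
    ]
    return " ".join(parts)


print(format_duration(years=1, months=10, days=2))
```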
- class medkit.text.ner.RelativeDateAttribute(label: str, direction: RelativeDateDirection, years: int = 0, months: int = 0, weeks: int = 0, days: int = 0, hours: int = 0, minutes: int = 0, seconds: int = 0, metadata: dict[str, Any] | None = None, uid: str | None = None)#
Bases:
medkit.core.Attribute
Attribute representing a relative date or time associated to a segment or entity.
A date or time offset from an (unknown) reference date time with a direction.
At least one date or time component must be non-zero.
- Attributes:
- uid: str
Identifier of the attribute
- label: str
Label of the attribute
- value: Any, optional
String representation of the relative date (ex: “+ 1 year 10 months 2 days”)
- direction: RelativeDateDirection
Direction of the relative date. Ex: “2 years ago” corresponds to the PAST direction and “in 2 weeks” to the FUTURE direction.
- years: int
Year component of the date offset
- months: int
Month component of the date offset
- weeks: int
Week component of the date offset
- days: int
Day component of the date offset
- hours: int
Hour component of the time offset
- minutes: int
Minute component of the time offset
- seconds: int
Second component of the time offset
- metadata: dict of str to Any
Metadata of the attribute
- direction: RelativeDateDirection#
- years: int#
- months: int#
- weeks: int#
- days: int#
- hours: int#
- minutes: int#
- seconds: int#
- value#
- to_brat() str #
Return a value compatible with the brat format.
- to_spacy() str #
Return a value compatible with spaCy.
- to_dict() dict[str, Any] #
- classmethod from_dict(date_dict: dict[str, Any]) typing_extensions.Self #
Create an Attribute from a dict.
- Parameters:
- date_dict: dict of str to Any
A dictionary from a serialized Attribute as generated by to_dict()
- class medkit.text.ner.RelativeDateDirection(*args, **kwds)#
Bases:
enum.Enum
Direction of a RelativeDateAttribute.
- PAST = 'past'#
- FUTURE = 'future'#
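Once a reference date is known, a relative date can be resolved by applying its offset in the given direction. The sketch below is one possible way to do that outside of medkit: Direction and resolve are hypothetical names, and years and months are left out because datetime.timedelta has no calendar-aware units.

```python
from datetime import datetime, timedelta
from enum import Enum


class Direction(Enum):
    """Hypothetical mirror of the PAST/FUTURE values listed above."""

    PAST = "past"
    FUTURE = "future"


def resolve(
    reference: datetime,
    direction: Direction,
    weeks: int = 0,
    days: int = 0,
    hours: int = 0,
    minutes: int = 0,
    seconds: int = 0,
) -> datetime:
    """Apply a non-zero offset to a reference datetime in the given direction."""
    offset = timedelta(
        weeks=weeks, days=days, hours=hours, minutes=minutes, seconds=seconds
    )
    if not offset:
        raise ValueError("At least one component must be non-zero")
    if direction is Direction.PAST:
        return reference - offset
    return reference + offset


# "in 2 weeks", relative to 15 January 2024
print(resolve(datetime(2024, 1, 15), Direction.FUTURE, weeks=2))
```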
- class medkit.text.ner.DucklingMatcher(output_label: str, version: str, url: str = 'http://localhost:8000', locale: str = 'fr_FR', dims: list[str] | None = None, attrs_to_copy: list[str] | None = None, uid: str | None = None)#
Bases:
medkit.core.text.NEROperation
Entity annotator using Duckling (facebook/duckling).
- This annotator can parse several types of information in multiple languages:
amount of money, credit card numbers, distance, duration, email, numeral, ordinal, phone number, quantity, temperature, time, url, volume.
This annotator currently requires a running Duckling server. The easiest method is to run a docker container:
$ docker run --rm -d -p <PORT>:8000 --name duckling rasa/duckling:<TAG>
This command will start a Duckling server listening on port <PORT>. The version of the server is identified by <TAG>.
- init_args#
- output_label: str#
- version: str#
- url: str#
- locale: str#
- dims: list[str] | None#
- attrs_to_copy: list[str]#
- run(segments: list[medkit.core.text.Segment]) list[medkit.core.text.Entity] #
Return entities for each match in segments.
- Parameters:
- segments: list of Segment
List of segments into which to look for matches
- Returns:
- list of Entity
Entities found in segments
- _find_matches_in_segment(segment: medkit.core.text.Segment) Iterator[medkit.core.text.Entity] #
- _test_connection()#
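To give an idea of what a client sends to a Duckling server, the sketch below builds a request without sending it. It assumes the server exposes a POST /parse endpoint taking form-encoded locale, text and dims fields (dims as a JSON list); build_parse_request is a hypothetical helper, and how DucklingMatcher itself talks to the server is not shown in this page.

```python
import json
from typing import List, Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_parse_request(
    text: str,
    url: str = "http://localhost:8000",
    locale: str = "fr_FR",
    dims: Optional[List[str]] = None,
) -> Request:
    """Hypothetical helper: prepare (but do not send) a Duckling parse request."""
    form = {"locale": locale, "text": text}
    if dims is not None:
        # Assumption: dims is passed as a JSON-encoded list of dimension names
        form["dims"] = json.dumps(dims)
    body = urlencode(form).encode("utf-8")
    return Request(f"{url}/parse", data=body, method="POST")


request = build_parse_request("demain à 8h", dims=["time"])
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it to a running server.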
- class medkit.text.ner.RegexpMatcher(rules: list[RegexpMatcherRule] | None = None, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)#
Bases:
medkit.core.text.NEROperation
Entity annotator relying on regexp-based rules.
Rules can be unicode-sensitive or not. For a rule that is not unicode-sensitive, the matcher tries to convert unicode chars to the closest ASCII chars before matching. However, some characters need to be pre-processed beforehand (e.g., n° -> number). If the ASCII text and the original text have different lengths, the matcher falls back on the original unicode text for detection, even for rules that are not unicode-sensitive. In that case, a warning is logged recommending that the data be pre-processed.
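The length-based fallback can be sketched with the standard library. This is an illustration only: it uses unicodedata normalization as a stand-in for whatever ASCII conversion medkit relies on, and to_closest_ascii and text_for_matching are hypothetical names.

```python
import logging
import unicodedata

logger = logging.getLogger(__name__)


def to_closest_ascii(text: str) -> str:
    # Stand-in conversion: decompose accented chars, then drop non-ASCII chars
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def text_for_matching(text: str, unicode_sensitive: bool) -> str:
    """Return the text a rule should be matched against."""
    if unicode_sensitive:
        return text
    ascii_text = to_closest_ascii(text)
    if len(ascii_text) != len(text):
        # Character offsets would no longer line up with the original text,
        # so fall back on the unicode text and recommend pre-processing
        logger.warning("Lengths differ after ASCII conversion, pre-process the text")
        return text
    return ascii_text


print(text_for_matching("diab\u00e8te", unicode_sensitive=False))
print(text_for_matching("n\u00b0 3", unicode_sensitive=False))
```

Keeping the same length matters because the spans of the created entities refer to character positions in the original text.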
- init_args#
- rules#
- attrs_to_copy#
- _patterns#
- _exclusion_patterns#
- _has_non_unicode_sensitive_rule#
- run(segments: list[medkit.core.text.Segment]) list[medkit.core.text.Entity] #
Return entities (with optional normalization attributes) matched in segments.
- Parameters:
- segments: list of Segment
List of segments into which to look for matches
- Returns:
- list of Entity:
Entities found in segments (with optional normalization attributes). Entities have a metadata dict with fields described in RegexpMetadata
- _find_matches_in_segment(segment: medkit.core.text.Segment) Iterator[medkit.core.text.Entity] #
- _find_matches_in_segment_for_rule(rule_index: int, segment: medkit.core.text.Segment, text_ascii: str | None) Iterator[medkit.core.text.Entity] #
- static _create_norm_attr(norm: RegexpMatcherNormalization) medkit.core.text.EntityNormAttribute #
- static load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) list[RegexpMatcherRule] #
Load all rules stored in a yml file.
- Parameters:
- path_to_rules: Path
Path to a yml file containing a list of mappings with the same structure as RegexpMatcherRule
- encoding: str, optional
Encoding of the file to open
- Returns:
- list of RegexpMatcherRule
List of all the rules in path_to_rules, can be used to init a RegexpMatcher
- static check_rules_sanity(rules: list[RegexpMatcherRule])#
Check consistency of a set of rules.
- static save_rules(rules: list[RegexpMatcherRule], path_to_rules: pathlib.Path, encoding: str | None = None)#
Store rules in a yml file.
- Parameters:
- rules: list of RegexpMatcherRule
The rules to save
- path_to_rules: Path
Path to a .yml file that will contain the rules
- encoding: str, optional
Encoding of the .yml file
- class medkit.text.ner.RegexpMatcherNormalization#
Descriptor of normalization attributes to attach to entities created from a RegexpMatcherRule.
- Attributes:
- kb_name: str
The name of the knowledge base we are referencing. Ex: “umls”
- kb_version: str, optional
The version of the knowledge base we are referencing. Ex: “202AB”
- kb_id: Any
The id of the entity in the knowledge base, for instance a CUI
- kb_name: str#
- kb_id: Any#
- kb_version: str | None = None#
- class medkit.text.ner.RegexpMatcherRule#
Regexp-based rule to use with RegexpMatcher.
- Attributes:
- regexp: str
The regexp pattern used to match entities
- label: str
The label to attribute to entities created based on this rule
- term: str, optional
The optional normalized version of the entity text
- id: str, optional
Unique identifier of the rule to store in the metadata of the entities
- version: str, optional
Version string to store in the metadata of the entities
- index_extract: int, default=0
If the regexp has groups, the index of the group to use to extract the entity
- case_sensitive: bool, default=True
Whether to take case into account when running regexp and exclusion_regexp
- unicode_sensitive: bool, default=True
If True, regexp rule matches are searched on the unicode text. If False, regexp rule matches are searched on the closest ASCII text when possible (cf. RegexpMatcher); in that case, regexp and exclusion_regexp shall not contain non-ASCII chars because they would never be matched.
- exclusion_regexp: str, optional
An optional exclusion pattern. Note that this exclusion pattern will be executed on the whole input annotation, so when relying on exclusion_regexp make sure the input annotations passed to RegexpMatcher are “local”-enough (sentences or syntagmas) rather than the whole text or paragraphs
- normalizations: list of RegexpMatcherNormalization, optional
Optional list of normalization attributes that should be attached to the entities created
- regexp: str#
- label: str#
- term: str | None = None#
- id: str | None = None#
- version: str | None = None#
- index_extract: int = 0#
- case_sensitive: bool = True#
- unicode_sensitive: bool = True#
- exclusion_regexp: str | None = None#
- normalizations: list[RegexpMatcherNormalization]#
- __post_init__()#
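The interplay of index_extract, case_sensitive and exclusion_regexp can be illustrated with the standard re module. The function below is a hypothetical sketch based on the attribute descriptions above (in particular, the exclusion pattern is tested against the whole input text); it is not RegexpMatcher's implementation.

```python
import re
from typing import List, Optional, Tuple


def find_entities(
    text: str,
    regexp: str,
    index_extract: int = 0,
    case_sensitive: bool = True,
    exclusion_regexp: Optional[str] = None,
) -> List[Tuple[int, int, str]]:
    """Hypothetical sketch: return (start, end, text) for each match."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # The exclusion pattern is run on the whole input, so one hit anywhere
    # discards every match found in this text
    if exclusion_regexp is not None and re.search(exclusion_regexp, text, flags):
        return []
    return [
        (m.start(index_extract), m.end(index_extract), m.group(index_extract))
        for m in re.finditer(regexp, text, flags)
    ]


text = "Patient has diabetes type 2"
# Group 1 of the pattern is extracted instead of the whole match
print(find_entities(text, r"diabetes type (\d)", index_extract=1))
```

This is why short, local inputs (sentences or syntagmas) are recommended when exclusion_regexp is used: on a long input, an unrelated exclusion hit would suppress valid matches.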
- class medkit.text.ner.RegexpMetadata#
Bases:
typing_extensions.TypedDict
Metadata dict added to entities matched by RegexpMatcher.
- Parameters:
- rule_id: str or int
Identifier of the rule used to match an entity. If the rule has no id, then the index of the rule in the list of rules is used instead.
- version: str, optional
Optional version of the rule used to match an entity
- rule_id: str | int#
- version: str | None#
- class medkit.text.ner.SimstringMatcher(rules: list[SimstringMatcherRule], threshold: float = 0.9, min_length: int = 3, max_length: int = 50, similarity: typing_extensions.Literal[cosine, dice, jaccard, overlap] = 'jaccard', spacy_tokenization_language: str | None = None, blacklist: list[str] | None = None, same_beginning: bool = False, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)#
Bases:
medkit.text.ner._base_simstring_matcher.BaseSimstringMatcher
Entity matcher relying on string similarity.
Uses the simstring fuzzy matching algorithm (http://chokkan.org/software/simstring/).
Note that providing spacy_tokenization_language might reduce the number of false positives. This requires the spacy optional dependency, which can be installed with pip install medkit-lib[spacy].
- Parameters:
- rules: list of SimstringMatcherRule
Rules to use for matching entities.
- threshold: float, default=0.9
Minimum similarity (between 0.0 and 1.0) between a rule term and the text of an entity matched on that rule.
- min_length: int, default=3
Minimum number of chars in matched entities.
- max_length: int, default=50
Maximum number of chars in matched entities.
- similarity: str, default=“jaccard”
Similarity metric to use.
- spacy_tokenization_language: str, optional
2-letter code (ex: “fr”, “en”, etc.) designating the language of the spacy model to use for tokenization. If provided, spacy will be used to tokenize input segments and filter out some tokens based on their part-of-speech tags, such as determiners, conjunctions and prepositions. If None, a simple regexp-based tokenization will be used, which is faster but might give more false positives.
- blacklist: list of str, optional
Optional list of exact terms to ignore.
- same_beginning: bool, default=False
Ignore all matches that start with a different character than the term of the rule. This can be convenient to get rid of false positives on words that are very similar but have opposite meanings because of a prefix, for instance “activation” and “inactivation”.
- attrs_to_copy: list of str, optional
Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).
- name: str, optional
Name describing the matcher (defaults to the class name).
- uid: str, optional
Identifier of the matcher.
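To make threshold, similarity and same_beginning more concrete, the sketch below computes the four similarity measures over sets of character trigrams. It is a simplified illustration: the simstring library has its own n-gram extraction and indexing, and char_ngrams, similarity and is_match are hypothetical names.

```python
import math
from typing import Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    # Simplification: plain character n-grams, without any padding
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str, measure: str = "jaccard") -> float:
    x, y = char_ngrams(a), char_ngrams(b)
    if not x or not y:
        return 0.0
    common = len(x & y)
    if measure == "jaccard":
        return common / len(x | y)
    if measure == "dice":
        return 2 * common / (len(x) + len(y))
    if measure == "cosine":
        return common / math.sqrt(len(x) * len(y))
    if measure == "overlap":
        return common / min(len(x), len(y))
    raise ValueError(f"Unknown similarity measure: {measure}")


def is_match(
    term: str,
    candidate: str,
    threshold: float = 0.9,
    measure: str = "jaccard",
    same_beginning: bool = False,
) -> bool:
    # same_beginning rejects candidates whose first character differs
    if same_beginning and candidate[:1] != term[:1]:
        return False
    return similarity(term, candidate, measure) >= threshold


print(similarity("activation", "inactivation"))
print(is_match("activation", "inactivation", threshold=0.75, same_beginning=True))
```

With these trigram sets, “activation” and “inactivation” share 8 of 10 distinct trigrams, so a lower threshold would accept the pair unless same_beginning is enabled.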
- _temp_dir#
- rules_db_file#
- simstring_db_file#
- static load_rules(path_to_rules: pathlib.Path, encoding: str | None = None) list[SimstringMatcherRule] #
Load all rules stored in a yml file.
- Parameters:
- path_to_rules
The path to a yml file containing a list of mappings with the same structure as SimstringMatcherRule
- encoding: str, optional
The encoding of the file to open
- Returns:
- List[SimstringMatcherRule]
List of all the rules in path_to_rules, can be used to init a SimstringMatcher
- static save_rules(rules: list[SimstringMatcherRule], path_to_rules: pathlib.Path, encoding: str | None = None)#
Store rules in a yml file.
- Parameters:
- rules: list of SimstringMatcherRule
The rules to save
- path_to_rules: Path
The path to a yml file that will contain the rules
- encoding: str, optional
The encoding of the yml file
- class medkit.text.ner.SimstringMatcherNormalization#
Bases:
medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherNormalization
Descriptor of normalization attributes to attach to entities created from a SimstringMatcherRule.
- Attributes:
- kb_name:
The name of the knowledge base we are referencing. Ex: “umls”
- kb_version:
The version of the knowledge base we are referencing. Ex: “202AB”
- kb_id:
The id of the entity in the knowledge base, for instance a CUI
- term:
Optional normalized version of the entity text in the knowledge base
- static from_dict(data: dict[str, Any]) SimstringMatcherNormalization #
Create a SimstringMatcherNormalization object from a dict.
- class medkit.text.ner.SimstringMatcherRule#
Bases:
medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherRule
Rule to use with SimstringMatcher.
- Attributes:
- term:
Term to match using similarity-based fuzzy matching
- label:
Label to use for the entities created when a match is found
- case_sensitive:
Whether to take case into account when looking for matches.
- unicode_sensitive:
Whether to take non-ASCII chars into account when looking for matches. If False, ASCII-only versions of the rule term and input texts are used (non-ASCII chars replaced by closest ASCII chars).
- normalizations:
Optional list of normalization attributes that should be attached to the entities created
- static from_dict(data: dict[str, Any]) SimstringMatcherRule #
Create a SimstringMatcherRule from a dict.
- class medkit.text.ner.UMLSMatcher(umls_dir: str | pathlib.Path, cache_dir: str | pathlib.Path, language: str, threshold: float = 0.9, min_length: int = 3, max_length: int = 50, similarity: typing_extensions.Literal[cosine, dice, jaccard, overlap] = 'jaccard', lowercase: bool = True, normalize_unicode: bool = False, spacy_tokenization: bool = False, semgroups: Sequence[str] = ('ANAT', 'CHEM', 'DEVI', 'DISO', 'PHYS', 'PROC'), blacklist: list[str] | None = None, same_beginning: bool = False, output_labels_by_semgroup: str | dict[str, str] | None = None, attrs_to_copy: list[str] | None = None, name: str | None = None, uid: str | None = None)#
Bases:
medkit.text.ner._base_simstring_matcher.BaseSimstringMatcher
Entity annotator identifying UMLS concepts using the simstring fuzzy matching algorithm.
This operation is heavily inspired by the QuickUMLS library (Georgetown-IR-Lab/QuickUMLS).
By default, only terms belonging to the ANAT (anatomy), CHEM (Chemicals & Drugs), DEVI (Devices), DISO (Disorders), PHYS (Physiology) and PROC (Procedures) semgroups will be considered. This behavior can be changed with the semgroups parameter.
Note that setting spacy_tokenization to True might reduce the number of false positives. This requires the spacy optional dependency, which can be installed with pip install medkit-lib[spacy].
- Parameters:
- umls_dir: str or Path
Path to the UMLS directory containing the MRCONSO.RRF and MRSTY.RRF files.
- cache_dir: str or Path
Path to the directory into which the umls database will be cached. If it doesn’t exist yet, the database will be automatically generated (it can take a long time) and stored there, ready to be reused on further instantiations. If it already exists, a check will be done to make sure the params used when the database was generated are consistent with the params of the current instance. If you want to rebuild the database with new params using the same cache dir, you will have to manually delete it first.
- language: str
Language to consider as found in the MRCONSO.RRF file. Example: “FRE”. Will trigger a regeneration of the database if changed.
- threshold: float, default=0.9
Minimum similarity threshold (between 0.0 and 1.0) between a UMLS term and the text of a matched entity.
- min_length: int, default=3
Minimum number of chars in matched entities.
- max_length: int, default=50
Maximum number of chars in matched entities.
- similarity: str, default=“jaccard”
Similarity metric to use.
- lowercase: bool, default=True
Whether to use lowercased versions of UMLS terms and input entities (except for acronyms for which the uppercase term is always used). Will trigger a regeneration of the database if changed.
- normalize_unicode: bool, default=False
Whether to use ASCII-only versions of UMLS terms and input entities (non-ASCII chars replaced by closest ASCII chars). Will trigger a regeneration of the database if changed.
- spacy_tokenization: bool, default=False
If True, spacy will be used to tokenize input segments and filter out some tokens based on their part-of-speech tags, such as determiners, conjunctions and prepositions. If False, a simple regexp-based tokenization will be used, which is faster but might give more false positives.
- semgroups: sequence of str, default=(“ANAT”, “CHEM”, “DEVI”, “DISO”, “PHYS”, “PROC”)
Ids of UMLS semantic groups that matched concepts should belong to. See https://lhncbc.nlm.nih.gov/semanticnetwork/download/sg_archive/SemGroups-v04.txt. If set to None, all concepts can be matched. Will trigger a regeneration of the database if changed.
- blacklist: list of str, optional
Optional list of exact terms to ignore.
- same_beginning: bool, default=False
Ignore all matches that start with a different character than the matched UMLS term. This can be convenient to get rid of false positives on words that are very similar but have opposite meanings because of a prefix, for instance “activation” and “inactivation”.
- output_labels_by_semgroup: str or dict, optional
By default, medkit.text.ner.umls.SEMGROUP_LABELS will be used as entity labels. Use this parameter to override them. Example: {“DISO”: “problem”, “PROC”: “test”}. If output_labels_by_semgroup is a string, all entities will use this string as label instead. Will trigger a regeneration of the database if changed.
- attrs_to_copy: list of str, optional
Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).
- name: str, optional
Name describing the matcher (defaults to the class name).
- uid: str, optional
Identifier of the matcher.
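The cache_dir behavior (build once, reuse afterwards, refuse mismatching params) can be sketched as below. This is a hypothetical illustration: prepare_cache, the params.json filename and the JSON format are assumptions and not medkit's actual cache layout.

```python
import json
import tempfile
from pathlib import Path


def prepare_cache(cache_dir: Path, params: dict) -> bool:
    """Return True if the database had to be built, False if it was reused."""
    params_file = cache_dir / "params.json"  # hypothetical file name
    if params_file.exists():
        cached_params = json.loads(params_file.read_text(encoding="utf-8"))
        if cached_params != params:
            # Rebuilding with new params requires deleting the cache dir first
            raise ValueError(
                f"Cache in {cache_dir} was built with {cached_params}, got {params}"
            )
        return False
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The expensive database generation would happen here
    params_file.write_text(json.dumps(params, sort_keys=True), encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "umls_cache"
    params = {"language": "FRE", "lowercase": True, "semgroups": ["DISO", "PROC"]}
    print(prepare_cache(cache, params))  # first call builds the cache
    print(prepare_cache(cache, params))  # second call reuses it
```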
- _SEMGROUP_BY_SEMTYPE = None#
- umls_dir#
- cache_dir#
- labels_by_semgroup#
- cache_params#
- cache_params_file#
- simstring_db_file#
- rules_db_file#
- classmethod _get_labels_by_semgroup(output_labels: str | dict[str, str] | None) dict[str, str] #
Return a mapping giving the label to use for all entries of a given semgroup.
- Parameters:
- output_labels: str or dict of str to str, optional
Optional mapping of labels to use. Can be used to override the default labels. If output_labels is a single string, it will be used as a unique label for all semgroups
- Returns:
- dict of str to str
A mapping with semgroups as keys and corresponding label as values
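The three accepted forms of output_labels (None, a single string, or a partial mapping) can be illustrated as follows. The default labels below are placeholders invented for the example, not the contents of SEMGROUP_LABELS, and labels_by_semgroup is a hypothetical function.

```python
from typing import Dict, Optional, Union

# Placeholder defaults for illustration only (not medkit's SEMGROUP_LABELS)
DEFAULT_LABELS = {
    "ANAT": "anatomy",
    "CHEM": "chemical",
    "DEVI": "device",
    "DISO": "disorder",
    "PHYS": "physiology",
    "PROC": "procedure",
}


def labels_by_semgroup(
    output_labels: Optional[Union[str, Dict[str, str]]] = None,
) -> Dict[str, str]:
    """Return the label to use for each semgroup."""
    if output_labels is None:
        return dict(DEFAULT_LABELS)
    if isinstance(output_labels, str):
        # A single string is used as a unique label for all semgroups
        return {semgroup: output_labels for semgroup in DEFAULT_LABELS}
    # A mapping overrides the defaults only for the semgroups it mentions
    labels = dict(DEFAULT_LABELS)
    labels.update(output_labels)
    return labels


print(labels_by_semgroup({"DISO": "problem", "PROC": "test"}))
```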
- classmethod _build_rules(umls_dir: pathlib.Path, language: str, lowercase: bool, normalize_unicode: bool, semgroups: set[str] | None, labels_by_semgroup: dict[str, str]) Iterator[medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherRule] #
Create rules for all UMLS entries with appropriate labels.
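As a rough idea of the input such rules are built from, the sketch below reads MRCONSO.RRF-style rows and keeps the terms of one language. It assumes the usual pipe-delimited column order (CUI first, language second, term string at index 14). The sample rows are synthetic, with made-up LUI/SUI/AUI identifiers, and iter_terms is a hypothetical helper. Semgroup filtering (which needs MRSTY.RRF) and the special handling of acronyms are left out.

```python
from typing import Iterable, Iterator, Tuple

# Assumed MRCONSO.RRF column positions
CUI_FIELD, LANGUAGE_FIELD, TERM_FIELD = 0, 1, 14

# Synthetic rows for illustration (identifiers other than the CUI are made up)
SAMPLE_ROWS = [
    "C0011849|ENG|P|L0000001|PF|S0000001|Y|A0000001||||MSH|MH|D003920|Diabetes Mellitus|0|N||",
    "C0011849|FRE|P|L0000002|PF|S0000002|Y|A0000002||||MSHFRE|MH|D003920|Diab\u00e8te|0|N||",
]


def iter_terms(
    rows: Iterable[str], language: str, lowercase: bool = True
) -> Iterator[Tuple[str, str]]:
    """Yield (CUI, term) pairs for rows in the requested language."""
    for row in rows:
        fields = row.rstrip("\n").split("|")
        if fields[LANGUAGE_FIELD] != language:
            continue
        term = fields[TERM_FIELD]
        yield fields[CUI_FIELD], term.lower() if lowercase else term


print(list(iter_terms(SAMPLE_ROWS, language="FRE")))
```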