Text Processing#

This page lists and explains all components related to text processing.

For more details about all sub-packages, please refer to medkit.text.

Overview#

Here is a listing of all medkit text operations with a direct link to their corresponding API docs.

Preprocessing#

| Operation | Description |
| --- | --- |
| CharReplacer | Fast replacement of 1-char strings with n-char strings |
| RegexpReplacer | Pattern replacement |
| EDSCleaner | Cleaning of texts extracted from the APHP EDS |
| DuplicateFinder | Detection of duplicated parts across documents, based on duptextfinder |

Segmentation#

| Operation | Description |
| --- | --- |
| SectionTokenizer | Rule-based detection of sections |
| SentenceTokenizer | Rule-based sentence splitting |
| RushSentenceTokenizer | Sentence splitting based on PyRuSH |
| SyntagmaTokenizer | Rule-based sub-sentence splitting |

Context Detection#

| Operation | Description |
| --- | --- |
| NegationDetector | Detection of negation |
| HypothesisDetector | Detection of hypothesis |
| FamilyDetector | Detection of family antecedents |

Named Entity Recognition#

| Operation | Description |
| --- | --- |
| RegexpMatcher | Regexp-based entity matching |
| SimstringMatcher | Fast fuzzy matching based on simstring |
| IAMSystemMatcher | Advanced entity matching based on IAMSystem |
| UMLSMatcher | Matching of UMLS terms based on simstring |
| QuickUMLSMatcher | Matching of UMLS terms based on QuickUMLS |
| HFEntityMatcher | Entity matcher relying on HuggingFace transformers models |
| DucklingMatcher | General matcher (dates, quantities, etc.) relying on Duckling |
| EDSNLPDateMatcher | Date/time matching based on EDS-NLP |
| EDSNLPTNMMatcher | TNM (Tumour/Node/Metastasis) matching based on EDS-NLP |
| UMLSCoderNormalizer | Normalization of pre-existing entities to UMLS CUIs relying on a CODER model |
| NLStructEntityMatcher | Entity matcher relying on NLStruct models |

spaCy#

| Operation | Description |
| --- | --- |
| SpacyPipeline | Operation wrapping a spaCy pipeline to work at the annotation level |
| SpacyDocPipeline | Operation wrapping a spaCy pipeline to work at the document level |
| EDSNLPPipeline | Operation wrapping an EDS-NLP pipeline to work at the annotation level |
| EDSNLPDocPipeline | Operation wrapping an EDS-NLP pipeline to work at the document level |

Miscellaneous#

| Operation | Description |
| --- | --- |
| SyntacticRelationExtractor | Relation detector relying on spaCy’s dependency parser |
| HFTranslator | Translation operation relying on HuggingFace transformers models |
| AttributeDuplicator | Propagation of attributes based on annotation spans |
| DocumentSplitter | Division of text documents into mini-documents, using their segments as references |

Preprocessing#

This section provides some information about how to use preprocessing operations.

medkit provides a set of operations to preprocess text documents, such as substituting subtexts with others (e.g., “n°” with “number”) while preserving span information.

Some rule-based operations yield better results when text documents are preprocessed. For example, consider RegexpMatcher:

When a rule is not unicode-sensitive, RegexpMatcher tries to convert unicode characters to their closest ASCII equivalents. However, some characters need to be preprocessed beforehand: if the conversion changes the text length, the matcher falls back on the original unicode text for detection, even though the rule is not unicode-sensitive, and raises a warning recommending that the data be preprocessed.
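As an illustration, the “œ” ligature is a single character whose closest ASCII form “oe” has a different length, so it is best replaced beforehand. A minimal sketch using CharReplacer (described below) with the pre-defined ligature rules:

from medkit.core.text import TextDocument
from medkit.text.preprocessing import CharReplacer, LIGATURE_RULES

doc = TextDocument(text="Présence d'un œdème")
op = CharReplacer(output_label="preprocessed_text", rules=LIGATURE_RULES)
print(op.run([doc.raw_segment])[0].text)
# "Présence d'un oedème"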

For more details, refer to medkit.text.preprocessing.

CharReplacer#

CharReplacer is a preprocessing operation for replacing single characters with other strings. It is faster than RegexpReplacer but is limited to single-character replacement and does not support pattern matching.

For example, if you want to replace some special characters like +:

from medkit.core.text import TextDocument
from medkit.text.preprocessing import CharReplacer

doc = TextDocument(text="Il suit + ou - son traitement,")

rules = [("+", "plus"), ("-", "moins")]
op = CharReplacer(output_label="preprocessed_text", rules=rules)
new_segment = op.run([doc.raw_segment])[0]
print(new_segment.text)

Results:

  • new_segment.text: “Il suit plus ou moins son traitement,”

  • new_segment.spans: [Span(start=0, end=8), ModifiedSpan(length=4, replaced_spans=[Span(start=8, end=9)]), Span(start=9, end=13), ModifiedSpan(length=5, replaced_spans=[Span(start=13, end=14)]), Span(start=14, end=30)]

medkit also provides some pre-defined rules that you can import (cf. medkit.text.preprocessing.rules) and combine with your own rules.

For example:

from medkit.text.preprocessing import (
    CharReplacer,
    LIGATURE_RULES,
    SIGN_RULES,
    SPACE_RULES,
    DOT_RULES,
    FRACTION_RULES,
    QUOTATION_RULES,
)

rules = (
    LIGATURE_RULES
    + SIGN_RULES
    + SPACE_RULES
    + DOT_RULES
    + FRACTION_RULES
    + QUOTATION_RULES
    + ...  # Custom rules
)

op = CharReplacer(output_label="preprocessed_text", rules=rules)

Note

If you do not provide rules when initializing the CharReplacer operation, all pre-defined rules (i.e., ALL_RULES) are used.
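For instance, a minimal sketch relying on the defaults:

from medkit.text.preprocessing import CharReplacer

# no rules provided: the pre-defined ALL_RULES are used
op = CharReplacer(output_label="preprocessed_text")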

RegexpReplacer#

The RegexpReplacer operation relies on regular expressions to detect patterns in the text and replace them with new text, while preserving span information.

It is useful when you need to replace a subtext, or text appearing in a certain context, with another text. For example, if you want to replace “n°” with “numéro”:

from medkit.core.text import TextDocument
from medkit.text.preprocessing import RegexpReplacer

doc = TextDocument(text="À l'aide d'une canule n ° 3,")

rule = (r"n\s*°", "numéro")
op = RegexpReplacer(output_label="preprocessed_text", rules=[rule])
new_segment = op.run([doc.raw_segment])[0]
print(new_segment.text)

Results:

  • new_segment.text: “À l’aide d’une canule numéro 3,”

  • new_segment.spans: [Span(start=0, end=22), ModifiedSpan(length=6, replaced_spans=[Span(start=22, end=25)]), Span(start=25, end=28)]

Warning

If you have many single characters to replace, RegexpReplacer is not optimal performance-wise. In that case, we recommend using CharReplacer instead.

Text Cleanup#

medkit also provides an operation for cleaning up text. This module was implemented for the specific case of APHP EDS documents.

For more details, please check out this example, which makes use of the EDSCleaner class.

Text Segmentation#

This section lists text segmentation modules. They are part of the medkit.text.segmentation package.

For more details, please refer to medkit.text.segmentation submodules.

SectionTokenizer and SyntagmaTokenizer may rely on a description file containing the set of user-defined rules for splitting the document text into a list of medkit Segments corresponding to sections or syntagmata, respectively.

For SectionTokenizer, here is the YAML schema reference for the file (an illustrative example follows the list):

  • sections : dictionary of key-value pairs, where each key is a section name and each value is a list of keywords to detect as the start of that section

  • rules : list of modification rules whose role is to rename a detected section

    • rules.section_name : name of the detected section to rename

    • rules.new_section_name : new name wanted for the section

    • rules.order : order condition for renaming; possible values: BEFORE, AFTER

    • rules.other_sections : list of other section names (i.e., the context of the section to rename) to use with the order condition
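As a sketch, a description file following this schema might look like the following (section names, keywords, and rules are purely illustrative):

sections:
  antecedent: [antécédents, antécédents médicaux]
  traitement: [traitement, traitement en cours]
  conclusion: [conclusion]

rules:
  - section_name: antecedent
    new_section_name: antecedent_familial
    order: BEFORE
    other_sections: [conclusion]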

Note

You may test the default French rules using section_tokenizer = SectionTokenizer(). The corresponding file content is available here.

For SyntagmaTokenizer, here is the YAML schema reference for the file:

  • syntagma.separators : list of regular expressions that trigger the start of a new syntagma (an illustrative example follows)
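As a sketch, a description file for SyntagmaTokenizer might look like this (the separator patterns are purely illustrative):

syntagma:
  separators:
    - '(?<=\. )'
    - '(?<=;)'
    - '\bmais\b'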

Note

You may test the default French rules using syntagma_tokenizer = SyntagmaTokenizer(). The corresponding file content is available here.

Examples

For a better understanding, you may follow the text segmentation tutorial examples.

Context Detection#

This section lists text annotators for detecting context. They are part of the medkit.text.context package.

Hypothesis#

If you want to test default French rules of HypothesisDetector, you may use:

from medkit.text.context import HypothesisDetector

syntagmata = ...
detector = HypothesisDetector()
detector.run(syntagmata)

For more details, please refer to hypothesis_detector.

Negation#

medkit provides a rule-based negation detector which attaches a negation attribute to a text segment.
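A minimal usage sketch, mirroring the hypothesis example above (the output_label value is illustrative):

from medkit.text.context import NegationDetector

syntagmata = ...
detector = NegationDetector(output_label="is_negated")
detector.run(syntagmata)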

For more details, please refer to negation_detector.

Family reference#

medkit provides a rule-based family detector which attaches a family attribute to a text segment.
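A minimal usage sketch (the output_label value is illustrative):

from medkit.text.context import FamilyDetector

syntagmata = ...
detector = FamilyDetector(output_label="family")
detector.run(syntagmata)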

For more details, please refer to family_detector.

Named Entity Recognition#

This section lists text annotators for detecting entities. They are part of the medkit.text.ner package.

Regular Expression Matcher#

medkit provides a rule-based entity matcher.
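A minimal sketch with a single user-defined rule (the regexp and label are illustrative):

from medkit.text.ner import RegexpMatcher, RegexpMatcherRule

sentences = ...
rule = RegexpMatcherRule(regexp=r"asthme|asthmatique", label="problem")
matcher = RegexpMatcher(rules=[rule])
entities = matcher.run(sentences)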

For an example of RegexpMatcher usage, you can follow this tutorial.

For more details, please refer to regexp_matcher.

IAM system Matcher#

The IAMsystem library is made available through the IAMSystemMatcher medkit operation.

For more details, please refer to iamsystem_matcher.

medkit also provides a custom implementation (MedkitKeyword) of the IAMsystem IEntity, which allows the user:

  • to associate kb_name with kb_id

  • to provide a medkit entity label (e.g., the category) associated with the IAMsystem entity label (i.e., the text to search)
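A minimal sketch of defining such a keyword (all values are illustrative):

from medkit.text.ner.iamsystem_matcher import MedkitKeyword

keyword = MedkitKeyword(
    label="avc",          # text to search, as expected by IAMsystem
    kb_id="I64",          # knowledge base identifier
    kb_name="icd",        # knowledge base name
    ent_label="disease",  # label of the medkit entity to create
)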

For more details, please check out the IAMsystem matcher example.

Simstring Matcher#

medkit provides an entity matcher using the simstring fuzzy matching algorithm.
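A minimal sketch with a single rule (the term, label, and threshold are illustrative):

from medkit.text.ner import SimstringMatcher, SimstringMatcherRule

sentences = ...
rule = SimstringMatcherRule(term="asthme", label="problem")
matcher = SimstringMatcher(rules=[rule], threshold=0.9)
entities = matcher.run(sentences)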

For more details, please refer to simstring_matcher.

Quick UMLS Matcher#

Important

QuickUMLSMatcher requires additional dependencies:

  • Python dependencies

    pip install 'medkit-lib[quick-umls-matcher]'
    
  • QuickUMLS install

    python -m quickumls.install <umls_install_path> <destination_path>
    

    where <umls_install_path> is the path to the UMLS folder containing the MRCONSO.RRF and MRSTY.RRF files.

  • spaCy models

    import spacy
    
    if not spacy.util.is_package("en_core_web_sm"):
        spacy.cli.download("en_core_web_sm")
    if not spacy.util.is_package("fr_core_news_sm"):
        spacy.cli.download("fr_core_news_sm")
    

Given a sentence segment with the text “The patient has asthma”:

from medkit.text.ner.quick_umls_matcher import QuickUMLSMatcher

sentence = ...
umls_matcher = QuickUMLSMatcher(version="2021AB", language="ENG")
entities = umls_matcher.run([sentence])

The entity (entities[0]) will have the following description:

  • entity.text = “asthma”

  • entity.spans = [Span(16, 22)]

  • entity.label = “disorder”

Its normalization attribute (norm = entity.get_norms()[0]) will be:

  • norm is an instance of UMLSNormAttribute

  • norm.cui = _ASTHMA_CUI

  • norm.umls_version = “2021AB”

  • norm.term = “asthma”

  • norm.score = 1.0

  • norm.sem_types = [“T047”]

For more details about public APIs, please refer to quick_umls_matcher.

UMLS Matcher#

As an alternative to QuickUMLSMatcher, medkit also provides an entity matcher dedicated to UMLS terms, that uses the simstring fuzzy matching algorithm and does not rely on QuickUMLS.
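A hedged sketch (paths are illustrative, and the argument names are assumptions to be checked against the umls_matcher API docs):

from medkit.text.ner import UMLSMatcher

sentences = ...
umls_matcher = UMLSMatcher(
    umls_dir="/path/to/umls/2021AB/META",  # folder containing MRCONSO.RRF
    language="FRE",
    cache_dir=".umls_cache",
)
entities = umls_matcher.run(sentences)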

For more details, please refer to umls_matcher.

Duckling Matcher#

medkit provides an entity annotator that uses Duckling (a running Duckling server is required).

For more details, please refer to duckling_matcher.

EDS-NLP Date Matcher#

The EDS-NLP dates pipeline can be used directly within medkit to detect dates and durations in documents.

Important

EDSNLPDateMatcher requires additional dependencies:

pip install 'medkit-lib[edsnlp]'
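A minimal usage sketch (the output_label argument is an assumption for illustration; see the API docs for the exact parameters):

from medkit.text.ner import EDSNLPDateMatcher

sentences = ...
date_matcher = EDSNLPDateMatcher(output_label="date")
entities = date_matcher.run(sentences)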

For more details, please refer to edsnlp_date_matcher.

Hugging Face Entity Matcher#

medkit provides an entity matcher based on Hugging Face models.

Important

HFEntityMatcher requires additional dependencies:

pip install 'medkit-lib[hf-entity-matcher]'
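A minimal usage sketch ("my-org/my-ner-model" is a hypothetical placeholder for a token-classification checkpoint on the Hugging Face hub or a local directory):

from medkit.text.ner import HFEntityMatcher

sentences = ...
matcher = HFEntityMatcher(model="my-org/my-ner-model")  # hypothetical model name
entities = matcher.run(sentences)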

For more details, please refer to hf_entity_matcher.

UMLS Coder Normalizer#

This operation is not an entity matcher per se but a normalizer that adds normalization attributes to pre-existing entities.

Important

UMLSCoderNormalizer requires additional dependencies:

pip install 'medkit-lib[umls-coder-normalizer]'

For more details, please refer to umls_coder_normalizer.

UMLS Normalization#

This module provides a subclass of EntityNormAttribute to facilitate handling of UMLS information.
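For instance, the normalization attribute from the QuickUMLSMatcher example above could be built directly as follows (field values taken from that example; C0004096 is the UMLS CUI for asthma):

from medkit.text.ner import UMLSNormAttribute

norm = UMLSNormAttribute(
    cui="C0004096",
    umls_version="2021AB",
    term="asthma",
    score=1.0,
    sem_types=["T047"],
)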

For more details, please refer to umls_norm_attribute.

NLStruct Entity Matcher#

medkit provides an entity matcher for pretrained NLStruct models, which can detect nested entities.

Important

NLStructEntityMatcher requires additional dependencies:

pip install 'medkit-lib[nlstruct]'

You can load a model directly from a local directory or from the Hugging Face hub.

from medkit.core.text import Segment, Span
from medkit.text.ner.nlstruct_entity_matcher import NLStructEntityMatcher

text = "Je lui prescris du lorazepam."
segment = Segment(text=text, spans=[Span(0, len(text))], label="test")

# define the matcher using a French model from the Hugging Face hub
entity_matcher = NLStructEntityMatcher(model_name_or_dirpath="NesrineBannour/CAS-privacy-preserving-model")
entities = entity_matcher.run([segment])

spaCy modules#

medkit provides operations and utilities for integrating spaCy pipelines. They are part of the medkit.text.spacy package.

Important

This module requires additional dependencies:

pip install 'medkit-lib[spacy]'

spaCy pipelines#

The SpacyPipeline component is an annotation-level operation. It takes medkit segments as inputs, runs a spaCy pipeline, and returns segments by converting spaCy outputs.
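A minimal sketch of the annotation-level wrapper (assuming the fr_core_news_sm spaCy model is installed):

import spacy

from medkit.text.spacy import SpacyPipeline

segments = ...
nlp = spacy.load("fr_core_news_sm")
op = SpacyPipeline(nlp)
new_segments = op.run(segments)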

The SpacyDocPipeline component is a document-level operation, similar to DocPipeline. It takes a medkit document as input, runs a spaCy pipeline, and attaches the spaCy annotations to the document.

Translation#

Please refer to translation.

Hugging Face Translator#

Important

HFTranslator requires additional dependencies:

pip install 'medkit-lib[hf-translator]'
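A minimal usage sketch (assuming the default model, which translates French to English):

from medkit.text.translation import HFTranslator

segments = ...
translator = HFTranslator()
translated_segments = translator.run(segments)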

Extraction of Syntactic Relations#

This module detects syntactic relations between entities using a dependency parser.

For more details, please refer to syntactic_relation_extractor.

Postprocessing#

medkit provides some modules to facilitate postprocessing operations.

For the moment, you can use this module to:

  • align source and target Segments from the same TextDocument

  • duplicate attributes between segments. For example, you can duplicate an attribute from a sentence to its entities.

  • filter overlapping entities: useful when creating named entity recognition (NER) datasets (see the sketch after this list)

  • create mini-documents from a TextDocument.
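For instance, a minimal sketch of the overlapping-entities filter (assuming the filter_overlapping_entities helper exposed by medkit.text.postprocessing):

from medkit.text.postprocessing import filter_overlapping_entities

entities = ...
filtered_entities = filter_overlapping_entities(entities)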

Examples

Creating mini-documents from sections: document splitter

For more details, please refer to postprocessing.

Metrics#

This module provides components to evaluate annotations as well as some implementations of MetricsComputer to monitor the training of components available in medkit.

The components inside metrics are also known as evaluators. An evaluator allows you to assess performance by task.

For more details, please refer to metrics.

Text Classification Evaluation#

medkit provides TextClassificationEvaluator, an evaluator for document attributes. You can compute the following metrics depending on your use case.

Classification Report#

  • compute_classification_report: To compare a list of reference and predicted documents. This method uses scikit-learn as a backend for computing precision, recall, and F1-score.

Inter-Rater Agreement#

  • compute_cohen_kappa: To compare the degree of agreement between lists of documents made by two annotators.

  • compute_krippendorff_alpha: To compare the degree of agreement between lists of documents made by multiple annotators.
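A hedged sketch of the classification-report use case (the import path and the attr_label value are assumptions for illustration; see the API docs):

from medkit.text.metrics.classification import TextClassificationEvaluator

true_docs = ...
predicted_docs = ...
evaluator = TextClassificationEvaluator(attr_label="severity")  # hypothetical attribute label
metrics = evaluator.compute_classification_report(true_docs, predicted_docs)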

For more details, please refer to TextClassificationEvaluator or irr_utils.

NER Evaluation#

medkit uses seqeval as a backend for evaluation.

Important

This module requires additional dependencies:

pip install 'medkit-lib[metrics-ner]'

Entity Detection#

An example of a perfect match:

  • The document has two entities: PER and GPE;

  • An operation has detected both entities.

from medkit.core.text import TextDocument, Entity, Span
from medkit.text.metrics.ner import SeqEvalEvaluator

document = TextDocument(
    "Marie lives in Paris",
    anns=[
        Entity(label="PER", spans=[Span(0, 5)], text="Marie"),
        Entity(label="GPE", spans=[Span(15, 20)], text="Paris"),
    ],
)

pred_ents = [
    Entity(label="PER", spans=[Span(0, 5)], text="Marie"),
    Entity(label="GPE", spans=[Span(15, 20)], text="Paris"),
]

# define an evaluator using `iob2` as the tagging scheme
evaluator = SeqEvalEvaluator(tagging_scheme="iob2")
metrics = evaluator.compute(documents=[document], predicted_entities=[pred_ents])

print(metrics)

Results:

{
    'macro_precision': 1.0,
    'macro_recall': 1.0,
    'macro_f1-score': 1.0,
    'support': 2,
    'accuracy': 1.0,
    'GPE_precision': 1.0,
    'GPE_recall': 1.0,
    'GPE_f1-score': 1.0,
    'GPE_support': 1,
    'PER_precision': 1.0,
    'PER_recall': 1.0,
    'PER_f1-score': 1.0,
    'PER_support': 1,
}

For more details, please refer to SeqEvalEvaluator

Training of NER components#

For example, suppose a trainable component detects PER and GPE entities using iob2 as its tagging scheme. The Trainer may compute metrics during its training/evaluation loop.

from medkit.text.metrics.ner import SeqEvalMetricsComputer
from medkit.training import Trainer

seqeval_mc = SeqEvalMetricsComputer(
    id_to_label={  # mapping from tag ids to IOB2 labels
        0: "O",
        1: "B-PER",
        2: "I-PER",
        3: "B-GPE",
        4: "I-GPE",
    },
    tagging_scheme="iob2",
)

trainer = Trainer(..., metrics_computer=seqeval_mc)

For more details, please refer to SeqEvalMetricsComputer and the training API.

Hint

There is a utility to convert labels to NER tags if required, see hf_tokenization_utils.

See also

You may check out this fine-tuning example.