# First steps
This tutorial will show you how to use medkit to annotate a text document, by successively applying pre-processing, entity matching and context detection operations.
## Loading a text document
For starters, let's load a text file using the `TextDocument` class:

```python
# You can download the file from the medkit source repository:
# !wget https://raw.githubusercontent.com/medkit-lib/medkit/main/docs/data/text/1.txt
from pathlib import Path

from medkit.core.text import TextDocument

doc = TextDocument.from_file(Path("../data/text/1.txt"))
```
The full raw text can be accessed through the `text` attribute:

```python
print(doc.text)
```
A `TextDocument` can store `TextAnnotation` objects. For now, our document is free of annotations.
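We can verify this directly: the document's annotation container behaves like a list, so a quick sketch (assuming `len()` is supported on the container) should report zero annotations at this point:

```python
# Sketch: the annotation container is list-like, so it should be empty here
print(len(doc.anns))  # expected: 0
```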
## Splitting a document into sentences
A common task in natural language processing is to split (or tokenize) text documents into sentences. medkit provides several segmentation operations, including a rule-based `SentenceTokenizer` class that relies on a list of punctuation characters.

```python
from medkit.text.segmentation import SentenceTokenizer

sent_tokenizer = SentenceTokenizer(
    output_label="sentence",
    punct_chars=[".", "?", "!"],
)
```
Like all operations, `SentenceTokenizer` defines a `run()` method. This method accepts a list of `Segment` objects (a `Segment` is a `TextAnnotation` that represents a part of a document's raw text) and returns a list of `Segment` objects. Here, we can pass a special `Segment` containing the full text of the document, which can be retrieved through the `raw_segment` attribute of `TextDocument`:
```python
sentences = sent_tokenizer.run([doc.raw_segment])

for sentence in sentences:
    print(f"uid={sentence.uid}")
    print(f"text={sentence.text!r}")
    print(f"spans={sentence.spans}, label={sentence.label}\n")
```
Each segment features:

- an `uid` attribute, whose unique value is automatically generated;
- a `text` attribute holding the text that the segment refers to;
- a `spans` attribute reflecting the position of this text in the document's raw text (here, there is only one span per segment, but multiple discontinuous spans are supported; see the short sketch after this list);
- a `label` attribute (set to "sentence" in our example), which could be different for other kinds of segments.
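Spans tie a segment back to the raw text it was cut from. As a small illustrative sketch, assuming plain spans with `start`/`end` offsets (as is the case for segments cut directly from the raw text), we can slice the raw document text with a sentence's span boundaries and recover the same text:

```python
# Sketch: recover a sentence's text from the raw document text using
# its span offsets (assumes plain spans with `start`/`end` attributes)
first_sentence = sentences[0]
for span in first_sentence.spans:
    print(doc.text[span.start:span.end])
```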
## Preprocessing a document
If you take a look at the 13th and 14th detected sentences, you will notice something strange:

```python
print(repr(sentences[12].text))
print(repr(sentences[13].text))
```
This is actually a single sentence that was split into two segments, because the sentence tokenizer incorrectly treats the dot in the decimal weight value as the end of a sentence. We could be a little smarter when configuring the tokenizer, but instead, for the sake of learning, let's fix this with a pre-processing step that replaces dots with commas in decimal numbers.
For this, we can use the `RegexpReplacer` class, a regexp-based "search-and-replace" operation. Like other medkit operations, it can be configured with a set of user-defined rules:

```python
from medkit.text.preprocessing import RegexpReplacer

rule = (r"(?<=\d)\.(?=\d)", ",")  # (pattern to replace, replacement text)
regexp_replacer = RegexpReplacer(output_label="clean_text", rules=[rule])
```
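To see what the rule does, here is a quick standalone check of the pattern with Python's `re` module (the sample string is made up for illustration): the lookbehind `(?<=\d)` and lookahead `(?=\d)` ensure that only dots surrounded by digits are replaced.

```python
import re

# Only the dot between digits is replaced; the sentence-ending dot is kept
print(re.sub(r"(?<=\d)\.(?=\d)", ",", "Poids de 84.5 kg. Suite."))
# Poids de 84,5 kg. Suite.
```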
The `run()` method of the replacer takes a list of `Segment` objects and returns a list of new `Segment` objects, one for each input `Segment`. In our case we only want to preprocess the full raw text segment, and we will only receive one preprocessed segment, so we can call it with:

```python
clean_segment = regexp_replacer.run([doc.raw_segment])[0]
print(clean_segment.text)
```
We can now run our previously defined sentence tokenizer again, but this time on the preprocessed text:

```python
sentences = sent_tokenizer.run([clean_segment])

print(sentences[12].text)
```

Problem fixed!
## Finding entities
The medkit library also comes with operations to perform NER (named entity recognition), for instance with the `RegexpMatcher`. Let's instantiate one with a few simple rules:

```python
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule

regexp_rules = [
    RegexpMatcherRule(regexp=r"\ballergies?\b", label="problem"),
    RegexpMatcherRule(regexp=r"\basthme\b", label="problem"),
    RegexpMatcherRule(regexp=r"\ballegra?\b", label="treatment", case_sensitive=False),
    RegexpMatcherRule(regexp=r"\bvaporisateurs?\b", label="treatment"),
    RegexpMatcherRule(regexp=r"\bloratadine?\b", label="treatment", case_sensitive=False),
    RegexpMatcherRule(regexp=r"\bnasonex?\b", label="treatment", case_sensitive=False),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules)
```
As you can see, rules can also be made case-insensitive by setting the `case_sensitive` parameter to `False`. In this example, we do so for the drug names (Allegra, Nasonex and Loratadine).
> **Note**
> When `RegexpMatcher` is instantiated without any rules, it will use a set of default rules that were initially created for French documents from the APHP EDS. These rules are stored in the `regexp_matcher_default_rules.yml` file located in the `medkit.text.ner` module.
>
> You may also define your own rules in a `.yml` file. You can then load them using the `RegexpMatcher.load_rules()` static method and pass them to the `RegexpMatcher` constructor.
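A minimal sketch of that workflow, with `my_rules.yml` as a hypothetical placeholder path:

```python
# Sketch: load custom rules from a YAML file and build a matcher from them
# ("my_rules.yml" is a hypothetical path)
custom_rules = RegexpMatcher.load_rules(Path("my_rules.yml"))
custom_matcher = RegexpMatcher(rules=custom_rules)
```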
Since `RegexpMatcher` is an NER operation, its `run()` method returns a list of `Entity` objects representing the entities that were matched (`Entity` is a subclass of `Segment`). As input, it expects a list of `Segment` objects. Let's give it the sentences returned by the sentence tokenizer:

```python
entities = regexp_matcher.run(sentences)

for entity in entities:
    print(f"uid={entity.uid}")
    print(f"text={entity.text!r}, spans={entity.spans}, label={entity.label}\n")
```
Just like sentences, each entity features `uid`, `text`, `spans` and `label` attributes (in this case determined by the rule that was used to match it).
## Detecting negation
So far, we have detected several entities with `"problem"` or `"treatment"` labels in our document. We might be tempted to use them directly to build a list of problems that the patient faces and treatments that were given, but if we look at how these entities are used in the document, we will see that some of them actually denote the absence of a problem or treatment. To handle this kind of situation, medkit comes with context detectors, such as `NegationDetector`.
`NegationDetector.run()` receives a list of `Segment` objects. It does not return anything, but it appends an `Attribute` object to each segment, with a boolean value indicating whether a negation was detected (`Segment` and `Entity` objects can hold a list of `Attribute` objects, accessible through their `AttributeContainer`). Let's instantiate a `NegationDetector` with a couple of simplistic handcrafted rules and run it on our sentences:
```python
from medkit.text.context import NegationDetector, NegationDetectorRule

neg_rules = [
    NegationDetectorRule(regexp=r"\bpas\s*d[' e]\b"),
    NegationDetectorRule(regexp=r"\bsans\b", exclusion_regexps=[r"\bsans\s*doute\b"]),
    NegationDetectorRule(regexp=r"\bne\s*semble\s*pas"),
]
neg_detector = NegationDetector(output_label="is_negated", rules=neg_rules)
neg_detector.run(sentences)
```
> **Note**
> Similarly to `RegexpMatcher`, `NegationDetector` also comes with a set of default rules designed for documents from the EDS, which are stored in the `negation_detector_default_rules.yml` file located in the `medkit.text.context` module.
And now, let's look at which sentences have been detected as negated:

```python
for sentence in sentences:
    neg_attr = sentence.attrs.get(label="is_negated")[0]
    if neg_attr.value:
        print(sentence.text)
```
Our simple negation detector does not work too badly, but sometimes only one part of a sentence carries a negation while the rest does not, and the whole sentence still ends up being flagged as negated.
To mitigate this, each sentence can be split into finer-grained segments called syntagmas. medkit provides a `SyntagmaTokenizer` for that purpose. Let's instantiate one, apply it to our sentences, and run the negation detector again, but this time on the syntagmas:
> **Note**
> `SyntagmaTokenizer` also has default rules designed for documents from the EDS, which are stored in the `default_syntagma_definition.yml` file located in the `medkit.text.segmentation` module.
```python
from medkit.text.segmentation import SyntagmaTokenizer

synt_tokenizer = SyntagmaTokenizer(
    output_label="syntagma",
    separators=[r"\bmais\b", r"\bet\b"],
)
syntagmas = synt_tokenizer.run(sentences)
neg_detector.run(syntagmas)

for syntagma in syntagmas:
    neg_attr = syntagma.attrs.get(label="is_negated")[0]
    if neg_attr.value:
        print(syntagma.text)
```
We now have negation information attached to syntagmas, but the end goal is really to know, for each entity, whether it should be considered negated or not. In more practical terms, we have negation attributes attached to our syntagmas, but what we would prefer is to have negation attributes attached to entities.
In medkit, the way to do this is to use the `attrs_to_copy` parameter, which is available for all NER operations. This parameter tells the operation which attributes should be copied from the input segments to the newly matched entities (based on their label). In other words, it provides a way to propagate context attributes (such as negation attributes) from segments to entities.
Let's again use a `RegexpMatcher` to find some entities, but this time from syntagmas rather than from sentences, and using `attrs_to_copy` to copy negation attributes:

```python
regexp_matcher = RegexpMatcher(rules=regexp_rules, attrs_to_copy=["is_negated"])
entities = regexp_matcher.run(syntagmas)

for entity in entities:
    neg_attr = entity.attrs.get(label="is_negated")[0]
    print(f"text='{entity.text}', label={entity.label}, is_negated={neg_attr.value}")
```
We now have a negation `Attribute` for each entity!
## Augmenting a document
We now have an interesting set of annotations. We might want to process them directly, for instance to generate table-like data about patient treatments in order to compute some statistics, but we might also want to attach them back to our document in order to save them or export them to some format.
The annotations of a text document can be accessed through `TextDocument.anns`, an instance of `TextAnnotationContainer` that behaves roughly like a list but also offers additional filtering methods. Annotations can be added by calling its `add()` method:

```python
for entity in entities:
    doc.anns.add(entity)
```
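Once the entities have been added, the container's filtering methods can be used to retrieve them by label (a small sketch, assuming the container exposes a `get()` method accepting a `label` keyword, as used earlier on attribute containers):

```python
# Sketch: retrieve only the entities labelled "problem" from the document
problems = doc.anns.get(label="problem")
print([p.text for p in problems])
```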
The document and its corresponding entities can be exported to supported formats such as brat (see `BratOutputConverter`) or Doccano (see `DoccanoOutputConverter`), or serialized to JSON (see `medkit_json`):

```python
from medkit.io import medkit_json

medkit_json.save_text_document(doc, "doc_1.json")
```
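The serialized document can be read back later; a round-trip sketch, assuming a matching load function in `medkit_json`:

```python
# Sketch: reload the document saved above (assumes a matching
# `load_text_document()` function in medkit_json)
doc_reloaded = medkit_json.load_text_document("doc_1.json")
print(len(doc_reloaded.anns))
```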
## Visualizing entities with displacy
Rather than printing entities, we can visualize them with displacy, a visualization tool that is part of the spaCy NLP library. medkit provides helper functions in the `displacy_utils` module to facilitate the use of displacy:

```python
from spacy import displacy

from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy

displacy_data = medkit_doc_to_displacy(doc)
displacy.render(displacy_data, manual=True, style="ent")
```
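Outside a notebook, `displacy.render()` returns the markup as a string when passed `jupyter=False`, which can then be written to a file (a sketch using standard spaCy behaviour; the output file name is a placeholder):

```python
# Sketch: render to an HTML string outside Jupyter and save it to a file
html = displacy.render(displacy_data, manual=True, style="ent", jupyter=False)
Path("doc_1_entities.html").write_text(html, encoding="utf-8")
```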
## Wrapping it up
In this tutorial, we have:

- created a `TextDocument` from an existing text file;
- instantiated several pre-processing, segmentation, context detection and entity matching operations;
- run these operations sequentially over the document and obtained entities;
- attached these entities back to the original document.
The operations used throughout this tutorial are rather basic ones, mostly rule-based, but there are many more available in medkit, including model-based NER operations. You can learn more about them in the API reference.
To dive further into medkit, you might be interested in an overview of the various entity matching methods available in medkit, in context detection, or in how to encapsulate all these operations in a pipeline.