Cleaning text with a predefined operation#
Medkit allows us to transform and clean up text without destroying the original text. We could, for example, implement a set of clean-up steps within the run
method of an operation to pre-process raw text.
In this example, we will use a predefined EDSCleaner
operation to show how a cleaning process works in medkit. This operation is inspired by french documents with formatting problems given previous conversion processes.
Loading a text to clean#
Consider the following document:
# You can download the file available in source code
# !wget https://raw.githubusercontent.com/medkit-lib/medkit/main/docs/examples/input/text/text_to_clean.txt
from pathlib import Path
from medkit.core.text import TextDocument
doc = TextDocument.from_file(Path("./input/text/text_to_clean.txt"))
print(doc.text)
As we note, the text has:
additional spaces;
multiple newlines characters;
long parentheses and numbers in English format.
This complicates text segmentation of the text, it may be a good idea to clean up the text before segmenting or creating annotations.
Using EDSCleaner operation#
As mentioned before, you can create your own custom cleanup operation. In this case, we use the predefined operation for a french document (coming from the EDS) to format the document.
The main idea is to transform the raw_segment
and keep track of the modifications made by the operation. That segment is defined using the span of the text.
A span in medkit
In medkit the span of an annotation is a list of simple spans
Span
or modified spansModifiedSpan
. With this mechanism, we keep track of the modifications and can return to the original version whenever we want.
The EDSCleaner
is configurable, we initialize keep_endlines=True
to facilitate the visualization. Otherwise, the output segment would be a plain text with no newlines (\n)
characters.
from medkit.text.preprocessing import EDSCleaner
eds_cleaner = EDSCleaner(keep_endlines=True)
raw_segment = doc.raw_segment
clean_segment = eds_cleaner.run([raw_segment])[0]
print(clean_segment.text)
The class works on Segments
. In the run
method it performs several operations to delete or change characters of interest. By default, it performs these operations:
Changes points between uppercase letters to spaces
Changes points between numbers to commas
Deletes multiple newline characters.
Deletes multiple whitespaces.
Note
There are two special operations that process parentheses and dots near French keywords such as Dr., Mme. and others. To enable/disable these operations you can use handle_parentheses_eds
and handle_points_eds
.
Extract text from the clean text#
Now that we have a clean segment, we can run an operation on the new segment. We can detect the sentences, for example.
from medkit.text.segmentation import SentenceTokenizer
sentences = SentenceTokenizer().run([clean_segment])
for sent in sentences:
print(f"{sent.text!r}")
A created sentence in detail
The span of each generated sentence contains the modifications made by eds_cleaner object. Letโs look at the second sentence:
sentence = sentences[1]
print(f"text={sentence.text!r}")
print("spans=\n","\n".join(f"{sp}" for sp in sentence.spans))
The sentence starts with the character M
(index 56), followed by a point .
which has been replaced by a space (index 57). Then, the whole text up to the newline character has not been modified, so it corresponds to the original span (index 58 to 110). Each modification is stored by ModifiedSpan
objects, until the end of the sentence, the character index 177.
Displaying in the original text#
Since the sentence contains the information from the original spans, it will always be possible to go back and display the information in the raw text.
To get the original spans, we can use normalize_spans()
. Next, we can extract the raw text using extract()
.
from medkit.core.text.span_utils import normalize_spans, extract
spans_sentence = normalize_spans(sentence.spans)
ranges = [(s.start, s.end) for s in spans_sentence]
extracted_text, spans = extract(raw_segment.text,raw_segment.spans,ranges)
print(f"- Sentence in the ORIGINAL version:\n \"{extracted_text}\"")
Thatโs how an operation transforms text and extracts information without losing the raw text.