Cleaning text with a predefined operation


Medkit allows us to transform and clean up text without destroying the original text. We could, for example, implement a set of clean-up steps within the run method of an operation to pre-process raw text.

In this example, we use the predefined EDSCleaner operation to show how cleaning works in medkit. This operation is inspired by French documents with formatting problems left over from earlier conversion processes.

Loading a text to clean

Consider the following document:

# The file is available in the medkit source repository; uncomment the next line to download it
# !wget -P ./input/text https://raw.githubusercontent.com/medkit-lib/medkit/main/docs/examples/input/text/text_to_clean.txt

from pathlib import Path
from medkit.core.text import TextDocument

doc = TextDocument.from_file(Path("./input/text/text_to_clean.txt"))
print(doc.text)

As we can see, the text has:

additional spaces;
multiple newline characters;
long parentheses and numbers in English format.

This complicates segmentation, so it is a good idea to clean up the text before segmenting it or creating annotations.

Using the EDSCleaner operation

As mentioned before, you can create your own custom cleanup operation. In this case, we use the predefined operation for French documents (coming from the EDS) to format the document.

The main idea is to transform the raw_segment and keep track of the modifications made by the operation. That segment is defined using the span of the text.

A span in medkit

In medkit, the span of an annotation is a list of simple spans (Span) or modified spans (ModifiedSpan). With this mechanism, we keep track of the modifications and can return to the original version whenever we want.
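
To make the idea concrete, here is a minimal sketch of span tracking using only the standard library. The ToySpan and ToyModifiedSpan classes and the replace_range helper are hypothetical stand-ins written for this illustration; they are not medkit's classes, whose fields and behavior may differ.

```python
from dataclasses import dataclass


# Toy stand-ins for illustration only; medkit's own Span and ModifiedSpan
# classes may differ in fields and behavior.
@dataclass(frozen=True)
class ToySpan:
    start: int  # start offset in the original text
    end: int    # end offset (exclusive) in the original text


@dataclass(frozen=True)
class ToyModifiedSpan:
    length: int      # length of the replacement text
    replaced: tuple  # original ToySpan(s) covered by the replacement


def replace_range(text, start, end, new):
    """Replace text[start:end] with `new` and describe the result as spans."""
    new_text = text[:start] + new + text[end:]
    spans = []
    if start > 0:
        spans.append(ToySpan(0, start))
    spans.append(ToyModifiedSpan(len(new), (ToySpan(start, end),)))
    if end < len(text):
        spans.append(ToySpan(end, len(text)))
    return new_text, spans


raw = "M.DUPONT"
clean, spans = replace_range(raw, 1, 2, " ")
print(clean)
for sp in spans:
    print(sp)
```

The unmodified parts still point directly at offsets in the raw text, while the replaced character remembers which original range it stands for. That is what makes it possible to go back to the original version later.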

The EDSCleaner is configurable. Here we initialize it with keep_endlines=True to make the output easier to read; otherwise, the output segment would be plain text with no newline (\n) characters.

from medkit.text.preprocessing import EDSCleaner

eds_cleaner = EDSCleaner(keep_endlines=True)
raw_segment = doc.raw_segment
clean_segment = eds_cleaner.run([raw_segment])[0]
print(clean_segment.text)

The class works on Segments. In its run method, it performs several operations to delete or change characters of interest. By default, it:

Replaces periods between uppercase letters with spaces.
Replaces periods between numbers with commas.
Deletes multiple newline characters.
Deletes multiple whitespaces.
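
As a rough illustration of these four steps, the sketch below applies comparable substitutions with the standard re module. The toy_clean function and its regular expressions are assumptions made for this example, not the rules EDSCleaner actually uses, and unlike medkit this version does not keep track of spans.

```python
import re


def toy_clean(text):
    """Rough approximation of the four default steps; not EDSCleaner's real rules."""
    # Period between two uppercase letters -> space
    text = re.sub(r"(?<=[A-Z])\.(?=[A-Z])", " ", text)
    # Period between two digits -> comma
    text = re.sub(r"(?<=\d)\.(?=\d)", ",", text)
    # Runs of newlines -> a single newline
    text = re.sub(r"\n{2,}", "\n", text)
    # Runs of spaces or tabs -> a single space
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text


sample = "M.DUPONT   poids 70.5 kg\n\n\nSuivi   annuel"
print(toy_clean(sample))
```

Plain substitutions like these lose the link to the raw text, which is exactly what the span mechanism in medkit is meant to avoid.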

Note

There are two special operations that process parentheses and periods near French keywords such as Dr., Mme. and others. To enable or disable these operations, use the handle_parentheses_eds and handle_points_eds parameters.

Extract text from the clean text

Now that we have a clean segment, we can run an operation on the new segment. We can detect the sentences, for example.

from medkit.text.segmentation import SentenceTokenizer

sentences = SentenceTokenizer().run([clean_segment])
for sent in sentences:
    print(repr(sent.text))

A created sentence in detail

The spans of each generated sentence contain the modifications made by the eds_cleaner object. Let’s look at the second sentence:

sentence = sentences[1]
print(f"text={sentence.text!r}")
print("spans=\n" + "\n".join(str(sp) for sp in sentence.spans))

The sentence starts with the character M (index 56), followed by a period (.) that has been replaced by a space (index 57). The text from there up to the newline character has not been modified, so it corresponds to the original span (index 58 to 110). Each later modification is stored as a ModifiedSpan object, up to the end of the sentence at character index 177.

Displaying in the original text

Since the sentence contains the information from the original spans, it will always be possible to go back and display the information in the raw text.

To get the original spans, we can use normalize_spans(). Next, we can extract the raw text using extract().

from medkit.core.text.span_utils import normalize_spans, extract

spans_sentence = normalize_spans(sentence.spans)
ranges = [(s.start, s.end) for s in spans_sentence]
extracted_text, spans = extract(raw_segment.text, raw_segment.spans, ranges)
print(f'- Sentence in the ORIGINAL version:\n "{extracted_text}"')
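
The same round trip can be sketched without medkit. This assumes that normalizing spans amounts to resolving each modified span to the original range it covers and merging ranges that touch. The merge_ranges and extract_text helpers are hypothetical names used only for this illustration.

```python
def merge_ranges(ranges):
    """Sort (start, end) ranges and merge those that touch or overlap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def extract_text(text, ranges):
    """Concatenate the slices of `text` covered by `ranges`."""
    return "".join(text[start:end] for start, end in ranges)


raw = "M.DUPONT  est venu"
# Original ranges behind a cleaned sentence, including the replaced period (1, 2)
ranges = [(0, 1), (1, 2), (2, 8)]
merged = merge_ranges(ranges)
print(merged)
print(extract_text(raw, merged))
```

Because every range refers to offsets in the raw text, the extracted string contains the original period rather than the space introduced by cleaning.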

That’s how an operation transforms text and extracts information without losing the raw text.