Section Tokenizer#
This tutorial shows how to apply the medkit section tokenizer operation to a text document.
Loading a text document#
First, let’s load a text file using the TextDocument class:
# You can download the file used in this tutorial from the medkit repository:
# !wget https://raw.githubusercontent.com/medkit-lib/medkit/main/docs/data/text/1.txt
from pathlib import Path
from medkit.core.text import TextDocument
doc = TextDocument.from_file(Path("../../data/text/1.txt"))
The full raw text can be accessed through the text attribute:
print(doc.text)
Defining section rules#
To split the text document into segments corresponding to each section, we have to define a set of rules. These rules allow the operation to detect keywords triggering a new section.
from medkit.text.segmentation.section_tokenizer import SectionTokenizer
section_dict = {
"patient": ["SUBJECTIF"],
"traitement": ["MÉDICAMENTS", "PLAN"],
"allergies": ["ALLERGIES"],
"examen clinique": ["EXAMEN PHYSIQUE"],
"diagnostique": ["EVALUATION"],
}
tokenizer = SectionTokenizer(section_dict=section_dict)
The section definition is a dictionary whose keys are the section names and whose values are lists of keywords that mark the start of a section.
For example, if the keyword EVALUATION is detected in the text,
a new section named diagnostique begins at this keyword
and ends at the next detected section, or at the end of the text if no further section is found.
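To make this boundary behavior concrete, here is a minimal sketch on an invented two-section text (the toy text and variable names are ours, not part of the tutorial data; it reuses the TextDocument and SectionTokenizer classes imported above):
# Minimal sketch: a toy document with two section keywords.
# "patient" starts at SUBJECTIF and ends where EVALUATION begins;
# "diagnostique" runs from EVALUATION to the end of the text.
toy_doc = TextDocument(text="SUBJECTIF\nLe patient tousse depuis trois jours.\nEVALUATION\nBronchite probable.")
toy_tokenizer = SectionTokenizer(
    section_dict={"patient": ["SUBJECTIF"], "diagnostique": ["EVALUATION"]}
)
for section in toy_tokenizer.run([toy_doc.raw_segment]):
    print(section.metadata, repr(section.text))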
Like all operations, SectionTokenizer defines a run() method.
This method takes a list of Segment objects as input and returns a list of Segment objects
(a Segment is a TextAnnotation that represents a portion of a document's full raw text).
Here, we can pass a special segment containing the whole raw text of the document,
which we can retrieve through the raw_segment attribute of TextDocument:
sections = tokenizer.run([doc.raw_segment])
print(f"Number of detected sections: {len(sections)}\n")
for section in sections:
    print(f"metadata = {section.metadata}")
    print(f"label = {section.label}")
    print(f"spans = {section.spans}")
    print(f"text = {section.text!r}\n")
As you can see, we have detected 6 different sections.
Each section is a segment that features:

- a uid attribute, whose unique value is automatically generated;
- a text attribute holding the text that the segment refers to;
- a spans attribute reflecting the position of this text in the document's full raw text (here we only have one span per segment, but multiple discontinuous spans are supported; see the sketch after this list);
- a label, always equal to "SECTION" in our case, although it could differ for other kinds of segments or if you initialize the operation with your own output label;
- and a metadata attribute, a dictionary containing the section name.
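To see how a span maps back into the raw text, here is a small sketch (it assumes, as in this example, that each section has a single continuous Span with start and end offsets):
# Sketch: a span records where the segment's text sits in the raw text
first = sections[0]
span = first.spans[0]  # a single continuous Span in this example
print(f"Characters {span.start} to {span.end} of the raw text:")
print(doc.text[span.start:span.end])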
Defining section rules with renaming#
SectionTokenizer also allows you to define rules (i.e., SectionModificationRule)
for renaming detected sections based on their context in the text.
Let’s take the same example.
from medkit.text.segmentation.section_tokenizer import SectionTokenizer, SectionModificationRule
section_dict = {
"patient": ["SUBJECTIF"],
"traitement": ["MÉDICAMENTS", "PLAN"],
"allergies": ["ALLERGIES"],
"examen clinique": ["EXAMEN PHYSIQUE"],
"diagnostique": ["EVALUATION"],
}
Now, let's add some rules to handle these cases:

- if a traitement section is detected before the diagnostique section, it is renamed to traitement_entree;
- if a traitement section is detected after the diagnostique section, it is renamed to traitement_sortie.
treatment_rules = [
    SectionModificationRule(
        section_name="traitement",
        new_section_name="traitement_entree",
        other_sections=["diagnostique"],
        order="BEFORE",
    ),
    SectionModificationRule(
        section_name="traitement",
        new_section_name="traitement_sortie",
        other_sections=["diagnostique"],
        order="AFTER",
    ),
]
tokenizer = SectionTokenizer(section_dict=section_dict, section_rules=treatment_rules)
Let's run this new operation on the document's raw text.
sections = tokenizer.run([doc.raw_segment])
print(f"Number of detected sections: {len(sections)}\n")
for section in sections:
    print(f"metadata = {section.metadata}")
    print(f"label = {section.label}")
    print(f"spans = {section.spans}")
    print(f"text = {section.text!r}\n")
There are still 6 sections detected, but 2 have been renamed to traitement_entree and traitement_sortie.
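If you only need the treatment sections, you can filter on the section name stored in each segment's metadata (a sketch; it assumes the name is stored under a "name" key, consistent with the metadata printed above):
# Sketch: keep only the renamed treatment sections (assumed metadata key: "name")
treatment_sections = [
    s for s in sections if s.metadata.get("name", "").startswith("traitement")
]
for s in treatment_sections:
    print(s.metadata["name"])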
Using a YAML definition file#
We have seen how to write rules programmatically.
However, it is also possible to load a YAML file containing all your rules.
First, let’s create the YAML file corresponding to the previous steps.
import pathlib
filepath = pathlib.Path("section.yml")
SectionTokenizer.save_section_definition(
section_dict=section_dict,
section_rules=treatment_rules,
filepath=filepath,
encoding='utf-8')
with open(filepath, 'r', encoding='utf-8') as f:
    print(f.read())
Now, let's see how to initialize the SectionTokenizer operation using this YAML file.
# Initialize the tokenizer from a YAML definition file
from medkit.text.segmentation.section_tokenizer import SectionTokenizer
section_dict, section_rules = SectionTokenizer.load_section_definition(filepath)
print(f"section_dict = {section_dict!r}\n")
print(f"section_rules = {section_rules!r}")
tokenizer = SectionTokenizer(section_dict=section_dict, section_rules=section_rules)
Now, let’s run the operation. We can observe that the results are the same.
sections = tokenizer.run([doc.raw_segment])
print(f"Number of detected sections: {len(sections)}\n")
for section in sections:
    print(f"metadata = {section.metadata}")
    print(f"label = {section.label}")
    print(f"spans = {section.spans}")
    print(f"text = {section.text!r}\n")
filepath.unlink()