Syntagma Tokenizer#
This tutorial shows how to apply the medkit syntagma tokenizer operation to a text document.
Loading a text document#
First, let’s load a text file using the TextDocument
class:
# You can download the file available in source code
# !wget https://raw.githubusercontent.com/medkit-lib/medkit/main/docs/data/text/1.txt
from pathlib import Path
from medkit.core.text import TextDocument
doc = TextDocument.from_file(Path("../../data/text/1.txt"))
The full raw text can be accessed through the text
attribute:
print(doc.text)
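If you don't have the sample file at hand, you can also build a document directly from a string (a quick sketch; the French snippet below is an invented placeholder, not the content of 1.txt):
# Hypothetical in-memory document, handy for quick experiments
demo_doc = TextDocument(text="Patient hospitalisé pour chirurgie. Pas de complication, mais surveillance recommandée.")
print(demo_doc.text)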
Defining syntagma rules#
To split the text document into segments, each corresponding to a part of the text, we have to define a set of rules. These rules tell the operation where to split the text, using regular expressions.
from medkit.text.segmentation.syntagma_tokenizer import SyntagmaTokenizer
separators = (
    r"(?<=\. )[\w\d]+",      # Trigger: starts after a dot and space
    r"(?<=\n)[\w\d]+",       # Trigger: starts after a newline
    r"(?<=: )\w+",           # Trigger: starts after a colon and space
    r"(?<= )mais\s+(?=\w)",  # Trigger: starts with 'mais' if space before and after
    r"(?<= )sans\s+(?=\w)",  # Trigger: starts with 'sans' if space before and after
    r"(?<= )donc\s+(?=\w)",  # Trigger: starts with 'donc' if space before and after
)
tokenizer = SyntagmaTokenizer(separators)
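The segments produced by the tokenizer carry the default label "SYNTAGMA"; if you prefer another label, it can be set at construction time (a short sketch, assuming the output_label parameter; "my_syntagma" is an arbitrary example):
# Hypothetical variant with a custom label instead of the default "SYNTAGMA"
custom_tokenizer = SyntagmaTokenizer(separators, output_label="my_syntagma")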
The syntagma definition passed to the tokenizer is a list of regular expressions, each one triggering the start of a new syntagma.
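To get a feel for how these separators behave, you can test one of them with Python's re module (a minimal sketch on an invented sentence):
import re
# 'mais' only triggers a new syntagma when preceded by a space
# and followed by whitespace then a word character
pattern = re.compile(r"(?<= )mais\s+(?=\w)")
sample = "Le patient va bien mais reste fatigué."
match = pattern.search(sample)
print(match.span(), sample[match.start():])  # the new syntagma would start at 'mais'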
Like other operations, SyntagmaTokenizer defines a run() method. This method returns a list of Segment objects (a Segment is a TextAnnotation that represents a portion of a document's full raw text). As input, it also expects a list of Segment objects. Here, we can pass a special segment containing the whole raw text of the document, which we can retrieve through the raw_segment attribute of TextDocument:
syntagmas = tokenizer.run([doc.raw_segment])
print(f"Number of detected syntagmas: {len(syntagmas)}")
print(f"Syntagmas label: {syntagmas[0].label}\n")
for syntagma in syntagmas:
    print(f"{syntagma.spans}\t{syntagma.text!r}")
As you can see, the text has been split into 39 segments, whose default label is "SYNTAGMA". The corresponding spans reflect the position of each segment in the document's raw text.
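As a sanity check, each syntagma's text can be recovered by slicing the raw text with its span boundaries (a minimal sketch, assuming the spans are plain Span objects with start and end attributes, which is the case for segments cut directly from the raw text):
# The first syntagma's text should match the corresponding slice of the raw text
first = syntagmas[0]
span = first.spans[0]
print(doc.text[span.start:span.end] == first.text)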
Using a YAML definition file#
We have seen how to write rules programmatically.
However, it is also possible to load a YAML file containing all your rules.
First, let’s create the YAML file from the separators defined above.
import pathlib
filepath = pathlib.Path("syntagma.yml")
SyntagmaTokenizer.save_syntagma_definition(
    syntagma_seps=separators,
    filepath=filepath,
    encoding='utf-8',
)
with open(filepath, 'r') as f:
    print(f.read())
Now, we will see how to initialize the SyntagmaTokenizer operation using this YAML file.
# Use tokenizer initialized using a yaml file
from medkit.text.segmentation import SyntagmaTokenizer
separators = SyntagmaTokenizer.load_syntagma_definition(filepath)
print("separators = ")
for sep in separators:
    print(f"- {sep!r}")
tokenizer = SyntagmaTokenizer(separators=separators)
Now let’s run the operation. We can observe that the results are the same.
syntagmas = tokenizer.run([doc.raw_segment])
print(f"Number of detected syntagmas: {len(syntagmas)}\n")
for syntagma in syntagmas:
    print(f"{syntagma.spans}\t{syntagma.text!r}")
filepath.unlink()
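If you want to keep the detected syntagmas alongside the document, you can attach them to its annotation container and query them back by label (a sketch, assuming the anns.add() and anns.get() methods of TextDocument):
# Attach the syntagmas to the document so they can be retrieved later by label
for syntagma in syntagmas:
    doc.anns.add(syntagma)
print(len(doc.anns.get(label="SYNTAGMA")))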