medkit.text.segmentation.rush_sentence_tokenizer

Classes#

RushSentenceTokenizer

Sentence segmentation annotator based on PyRuSH.

Module Contents#

class medkit.text.segmentation.rush_sentence_tokenizer.RushSentenceTokenizer(output_label: str = _DEFAULT_LABEL, path_to_rules: str | pathlib.Path | None = None, keep_newlines: bool = True, attrs_to_copy: list[str] | None = None, uid: str | None = None)#

Bases: medkit.core.text.SegmentationOperation

Sentence segmentation annotator based on PyRuSH.

Parameters:

output_label: str, optional: The output label of the created annotations.
path_to_rules: str or Path, optional: Path to the CSV or TSV rules file to provide to PyRuSH. If none is provided, “rush_tokenizer_default_rules.tsv” is used (it corresponds to “conf/rush_rules.tsv” in the PyRuSH repo).
keep_newlines: bool, default=True: With the default rules, newline chars are not used to split sentences, so a sentence may contain one or more newline chars. If keep_newlines is False, newlines are replaced by spaces.
attrs_to_copy: list of str, optional: Labels of the attributes that should be copied from the input segment to the derived segment. Useful, for example, for propagating the section name.
uid: str, optional: Identifier of the tokenizer.
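
The effect of keep_newlines=False on a sentence's text can be illustrated with a small standalone sketch. It does not use medkit: the helper name replace_newlines is hypothetical, and it assumes each newline char is replaced by exactly one space, so the text length does not change.

```python
import re


def replace_newlines(sentence_text: str) -> str:
    """Hypothetical helper mimicking keep_newlines=False on a sentence's text."""
    # Assumption: each "\n" becomes exactly one space, so the length is preserved.
    return re.sub(r"\n", " ", sentence_text)


# With the default rules a sentence may span several lines.
sentence = "Patient reports chest pain\nradiating to the left arm."
print(replace_newlines(sentence))
```

This only shows the text transformation. How the tokenizer keeps the spans of the resulting segments consistent is not covered by the sketch.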

_DEFAULT_LABEL = 'sentence'#

init_args#

output_label#

path_to_rules#

keep_newlines#

attrs_to_copy#

_rush#

run(segments: list[medkit.core.text.Segment]) → list[medkit.core.text.Segment]#

Return sentences detected in segments.

Parameters:

segments: list of Segment: List of segments in which to look for sentences

Returns:

list of Segment: Sentence segments found in segments
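
A minimal usage sketch for run() follows. It assumes that medkit and PyRuSH are installed, that TextDocument from medkit.core.text exposes the whole text as raw_segment, and that segments have a text attribute. The wrapper name split_sentences is hypothetical and not part of medkit.

```python
def split_sentences(text: str, keep_newlines: bool = True) -> list[str]:
    """Return the text of each sentence found in `text` (hypothetical wrapper)."""
    # Imported lazily: medkit and PyRuSH are third-party packages
    # and are assumed to be installed when this function is called.
    from medkit.core.text import TextDocument
    from medkit.text.segmentation.rush_sentence_tokenizer import (
        RushSentenceTokenizer,
    )

    # Wrap the raw text in a document; raw_segment covers the full text.
    doc = TextDocument(text=text)
    tokenizer = RushSentenceTokenizer(
        output_label="sentence",
        keep_newlines=keep_newlines,
    )
    # run() takes a list of segments and returns the sentence segments found.
    sentences = tokenizer.run([doc.raw_segment])
    return [sentence.text for sentence in sentences]
```

Calling split_sentences with a multi-sentence string returns one string per detected sentence. The exact boundaries depend on the PyRuSH rules file in use.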

_find_sentences_in_segment(segment: medkit.core.text.Segment) → Iterator[medkit.core.text.Segment]#