medkit.text.segmentation.rush_sentence_tokenizer#

Classes#

RushSentenceTokenizer

Sentence segmentation annotator based on PyRuSH.

Module Contents#

class medkit.text.segmentation.rush_sentence_tokenizer.RushSentenceTokenizer(output_label: str = _DEFAULT_LABEL, path_to_rules: str | pathlib.Path | None = None, keep_newlines: bool = True, attrs_to_copy: list[str] | None = None, uid: str | None = None)#

Bases: medkit.core.text.SegmentationOperation

Sentence segmentation annotator based on PyRuSH.

Parameters:
output_label: str, optional

The output label of the created annotations.

path_to_rules: str or Path, optional

Path to the CSV or TSV rule file to provide to PyRuSH. If none is provided, “rush_tokenizer_default_rules.tsv” will be used (this corresponds to “conf/rush_rules.tsv” in the PyRuSH repo).

keep_newlines: bool, default=True

With the default rules, newline chars are not used to split sentences, so a sentence may contain one or more newline chars. If keep_newlines is False, newlines will be replaced by spaces.

attrs_to_copy: list of str, optional

Labels of the attributes that should be copied from the input segment to the derived segment. Useful, for example, for propagating the section name.

uid: str, optional

Identifier of the tokenizer.
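The effect of keep_newlines on a sentence's text can be sketched in plain Python. This is only an illustration of the behaviour described above, not medkit's implementation: the helper name apply_keep_newlines is hypothetical, it assumes each newline character is replaced by exactly one space, and it ignores any span bookkeeping the real tokenizer performs.

```python
def apply_keep_newlines(sentence_text: str, keep_newlines: bool = True) -> str:
    """Hypothetical helper mirroring the documented keep_newlines behaviour."""
    if keep_newlines:
        # Default: the sentence text is left untouched, newlines included.
        return sentence_text
    # Assumption: each newline character becomes exactly one space.
    return sentence_text.replace("\n", " ")


wrapped = "The patient was admitted\nfor chest pain."
kept = apply_keep_newlines(wrapped, keep_newlines=True)
flattened = apply_keep_newlines(wrapped, keep_newlines=False)
```

With keep_newlines=True the wrapped sentence is returned as-is; with keep_newlines=False the line break inside it becomes a space.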

_DEFAULT_LABEL = 'sentence'#
init_args#
output_label#
path_to_rules#
keep_newlines#
attrs_to_copy#
_rush#
run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]#

Return sentences detected in segments.

Parameters:
segments: list of Segment

List of segments in which to look for sentences.

Returns:
list of Segment:

Sentence segments found in segments.
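A typical call to run() might look like the sketch below. It assumes medkit is installed together with its optional PyRuSH dependency, and it relies on TextDocument and its raw_segment attribute from medkit.core.text, which are not described on this page. The wrapper function split_into_sentences is a hypothetical name introduced only for this example.

```python
def split_into_sentences(text: str, keep_newlines: bool = False):
    """Return (text, spans) pairs for the sentences found in a raw text."""
    # Imports are deferred so this snippet can be loaded even when the
    # optional medkit / PyRuSH dependencies are not installed.
    from medkit.core.text import TextDocument
    from medkit.text.segmentation.rush_sentence_tokenizer import (
        RushSentenceTokenizer,
    )

    # Assumption: the document's raw segment covers the whole text.
    doc = TextDocument(text=text)
    tokenizer = RushSentenceTokenizer(
        output_label="sentence",
        keep_newlines=keep_newlines,
    )
    # run() takes a list of segments and returns the sentence segments.
    sentences = tokenizer.run([doc.raw_segment])
    return [(sentence.text, sentence.spans) for sentence in sentences]
```

Each returned segment carries the label given as output_label, so downstream operations can select sentences by that label. Passing attrs_to_copy to the constructor would additionally propagate the named attributes from the input segment to each sentence.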

_find_sentences_in_segment(segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Segment]#