medkit.text.segmentation.rush_sentence_tokenizer#
Classes#
RushSentenceTokenizer | Sentence segmentation annotator based on PyRuSH.
Module Contents#
- class medkit.text.segmentation.rush_sentence_tokenizer.RushSentenceTokenizer(output_label: str = _DEFAULT_LABEL, path_to_rules: str | pathlib.Path | None = None, keep_newlines: bool = True, attrs_to_copy: list[str] | None = None, uid: str | None = None)#
Bases: medkit.core.text.SegmentationOperation
Sentence segmentation annotator based on PyRuSH.
- Parameters:
- output_label: str, optional
The output label of the created annotations.
- path_to_rules: str or Path, optional
Path to the CSV or TSV rules file to provide to PyRuSH. If none is provided, “rush_tokenizer_default_rules.tsv” is used (it corresponds to “conf/rush_rules.tsv” in the PyRuSH repo).
- keep_newlines: bool, default=True
With the default rules, newline chars are not used to split sentences, so a sentence may contain one or more newline chars. If keep_newlines is False, newlines are replaced by spaces.
- attrs_to_copy: list of str, optional
Labels of the attributes that should be copied from the input segment to each derived segment. This is useful, for example, for propagating the section name.
- uid: str, optional
Identifier of the tokenizer.
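The effect of keep_newlines and attrs_to_copy can be illustrated with a small standard-library sketch. This is not medkit or PyRuSH code: `ToySegment` and `toy_split` are hypothetical names, the regular expression is a naive stand-in for the PyRuSH rules, and attributes are simplified to a plain dict keyed by label.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToySegment:
    """Hypothetical stand-in for a segment: a label, a text and attributes."""

    label: str
    text: str
    attrs: dict = field(default_factory=dict)


def toy_split(segment, output_label="sentence", keep_newlines=True, attrs_to_copy=None):
    """Naive splitter: a sentence ends at '.', '!' or '?'; newlines never split."""
    wanted = set(attrs_to_copy or [])
    sentences = []
    for match in re.finditer(r"\S[^.!?]*[.!?]", segment.text):
        text = match.group()
        if not keep_newlines:
            # Mirror the documented behaviour: newlines become spaces.
            text = text.replace("\n", " ")
        # Copy only the requested attributes from the input segment.
        copied = {k: v for k, v in segment.attrs.items() if k in wanted}
        sentences.append(ToySegment(label=output_label, text=text, attrs=copied))
    return sentences


note = ToySegment(
    label="raw",
    text="The patient reports\nmild chest pain. No fever.",
    attrs={"section": "history", "lang": "en"},
)
for sentence in toy_split(note, keep_newlines=False, attrs_to_copy=["section"]):
    print(repr(sentence.text), sentence.attrs)
```

With keep_newlines left at True, the first sentence would keep its embedded newline. With attrs_to_copy left at None, the derived segments would carry no attributes from the input segment.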
- _DEFAULT_LABEL = 'sentence'#
- init_args#
- output_label#
- path_to_rules#
- keep_newlines#
- attrs_to_copy#
- _rush#
- run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]#
Return sentences detected in segments.
- Parameters:
- segments: list of Segment
List of segments in which to look for sentences.
- Returns:
- list of Segment:
Sentence segments found in segments.
- _find_sentences_in_segment(segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Segment]#