medkit.text.segmentation.rush_sentence_tokenizer#
Classes#
RushSentenceTokenizer | Sentence segmentation annotator based on PyRuSH.
Module Contents#
- class medkit.text.segmentation.rush_sentence_tokenizer.RushSentenceTokenizer(output_label: str = _DEFAULT_LABEL, path_to_rules: str | pathlib.Path | None = None, keep_newlines: bool = True, attrs_to_copy: list[str] | None = None, uid: str | None = None)#
 Bases: medkit.core.text.SegmentationOperation
 Sentence segmentation annotator based on PyRuSH.
- Parameters:
 - output_label: str, optional
 The output label of the created annotations.
- path_to_rules: str or Path, optional
 Path to the CSV or TSV rules file to provide to PyRuSH. If none is provided, “rush_tokenizer_default_rules.tsv” will be used (it corresponds to “conf/rush_rules.tsv” in the PyRuSH repo).
- keep_newlines: bool, default=True
 With the default rules, newline characters are not used to split sentences, so a sentence may contain one or more newline characters. If keep_newlines is False, newlines are replaced by spaces.
- attrs_to_copy: list of str, optional
 Labels of the attributes that should be copied from the input segment to the derived segment. This is useful, for example, for propagating the section name.
- uid: str, optional
 Identifier of the tokenizer.
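The effect of keep_newlines described above can be illustrated with a small pure-Python sketch. It is not medkit's implementation: the helper name `sentence_text` and the hard-coded character offsets are hypothetical and only show what the option does to the text of a detected sentence.

```python
def sentence_text(text: str, start: int, end: int, keep_newlines: bool = True) -> str:
    """Return the text of a sentence located at text[start:end].

    Hypothetical helper: it mimics the documented effect of keep_newlines,
    not the actual code path used by RushSentenceTokenizer.
    """
    sentence = text[start:end]
    if not keep_newlines:
        # Newlines inside the sentence are replaced by single spaces
        sentence = sentence.replace("\n", " ")
    return sentence


# A sentence that spans a line break (offsets chosen by hand for the example)
text = "The patient has\nasthma. No fever."
with_newlines = sentence_text(text, 0, 23, keep_newlines=True)
without_newlines = sentence_text(text, 0, 23, keep_newlines=False)
```

Here `with_newlines` still contains the line break, while `without_newlines` reads as a single line.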
- _DEFAULT_LABEL = 'sentence'#
 
- init_args#
 
- output_label#
 
- path_to_rules#
 
- keep_newlines#
 
- attrs_to_copy#
 
- _rush#
 
- run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]#
 Return sentences detected in segments.
- Parameters:
 - segments: list of Segment
 List of segments in which to look for sentences.
- Returns:
 - list of Segment:
 Sentence segments found in segments.
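A minimal usage sketch of run, assuming medkit and PyRuSH are installed. The wrapper function `split_sentences` is hypothetical and exists only for illustration. It also assumes that a document's full text is available as a segment through `TextDocument.raw_segment`, and that each returned segment exposes its text through a `text` attribute.

```python
def split_sentences(text: str, keep_newlines: bool = True) -> list[str]:
    """Split raw text into sentences with RushSentenceTokenizer.

    Hypothetical wrapper. The imports are local so that this module can be
    loaded even when medkit is not installed.
    """
    from medkit.core.text import TextDocument
    from medkit.text.segmentation import RushSentenceTokenizer

    # Wrap the raw text in a document; its raw segment covers the full text
    doc = TextDocument(text=text)

    # Use the default PyRuSH rules; only the label and newline handling are set
    tokenizer = RushSentenceTokenizer(
        output_label="sentence",
        keep_newlines=keep_newlines,
    )

    # run() takes a list of segments and returns the sentence segments found
    sentences = tokenizer.run([doc.raw_segment])
    return [sentence.text for sentence in sentences]
```

Calling `split_sentences("The patient has\nasthma. No fever.", keep_newlines=False)` should return one string per detected sentence. The exact split depends on the PyRuSH rules in use.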
- _find_sentences_in_segment(segment: medkit.core.text.Segment) -> Iterator[medkit.core.text.Segment]#