medkit.text.preprocessing.eds_cleaner#
Classes#
EDSCleaner: EDS pre-processing annotation module.
Module Contents#
- class medkit.text.preprocessing.eds_cleaner.EDSCleaner(output_label: str = _DEFAULT_LABEL, keep_endlines: bool = False, handle_parentheses_eds: bool = True, handle_points_eds: bool = True, uid: str | None = None)#
Bases: medkit.core.Operation
EDS pre-processing annotation module.
This module is non-destructive: it removes or cleans selected points and newline characters without losing the link to the input text. Each modification is recorded by creating a new text-bound annotation whose spans describe how the cleaned text maps back to the input text.
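To make "non-destructive" concrete, here is a minimal standard-library sketch of the idea: apply a replacement and keep a record of which original character range each change came from. The names `Edit` and `clean_with_edits` are hypothetical and are not part of medkit; the actual module keeps this information in the spans of the segments it returns.

```python
import re
from typing import NamedTuple


class Edit(NamedTuple):
    """One recorded change: the original range [start, end) and its replacement."""

    start: int
    end: int
    replacement: str


def clean_with_edits(text: str, pattern: str, replacement: str) -> tuple[str, list[Edit]]:
    """Replace every match of `pattern` and record where each change happened."""
    edits: list[Edit] = []
    pieces: list[str] = []
    cursor = 0
    for match in re.finditer(pattern, text):
        pieces.append(text[cursor:match.start()])  # untouched text is copied as-is
        pieces.append(replacement)
        edits.append(Edit(match.start(), match.end(), replacement))
        cursor = match.end()
    pieces.append(text[cursor:])  # tail after the last match
    return "".join(pieces), edits


original = "Poids   stable  depuis 2 mois"
# Illustrative rule only: collapse runs of two or more spaces into one.
cleaned, edits = clean_with_edits(original, r" {2,}", " ")

print(cleaned)
for edit in edits:
    print(edit, repr(original[edit.start:edit.end]))
```

Because every edit keeps its original offsets, anything found later in the cleaned text can still be traced back to a position in the raw document.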
- Parameters:
- output_label : str, optional
The output label of the created annotations (defaults to _DEFAULT_LABEL).
- keep_endlines : bool, default=False
If True, replace multiple endlines with .\n. If False (default), replace multiple endlines with whitespace (.\s).
- handle_parentheses_eds : bool, default=True
If True (default), modify the text near parentheses or keywords according to predefined rules for French documents. If False, the text near parentheses or keywords is not modified.
- handle_points_eds : bool, default=True
If True (default), modify the points near predefined keywords for French documents. If False, the points near keywords are not modified.
- uid : str, optional
Identifier of the pre-processing module.
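The effect of keep_endlines can be pictured with a small standard-library sketch. It assumes that "multiple endlines" means a run of two or more newline characters and that `.\s` means a period followed by a single space; both are readings of the description above, not confirmed details of the medkit rules. The function name `replace_multiple_endlines` is hypothetical.

```python
import re


def replace_multiple_endlines(text: str, keep_endlines: bool = False) -> str:
    """Collapse each run of newlines into a sentence break (assumed rule)."""
    # keep_endlines=True keeps one line break; False flattens to a space.
    replacement = ".\n" if keep_endlines else ". "
    return re.sub(r"\n{2,}", replacement, text)


raw = "Examen clinique normal\n\n\nRetour au domicile"
print(repr(replace_multiple_endlines(raw, keep_endlines=True)))
print(repr(replace_multiple_endlines(raw, keep_endlines=False)))
```

Keeping the line break is useful when downstream components rely on line structure; flattening to a space suits components that expect running text.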
- _DEFAULT_LABEL = 'clean_text'#
- init_args#
- output_label#
- keep_endlines#
- handle_parentheses_eds#
- handle_points_eds#
- run(segments: list[medkit.core.text.Segment]) → list[medkit.core.text.Segment]#
Run the module on a list of input segments and return a new list of segments.
- Parameters:
- segments : list of Segment
List of segments to clean.
- Returns:
- list of Segment
List of cleaned segments.
- _clean_segment_text(segment: medkit.core.text.Segment)#
Clean up a segment non-destructively. First, remove points between numbers and upper case letters.
Then, remove multiple whitespaces or newline characters. Finally, modify parentheses or points after keywords if necessary.
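The order of these steps can be sketched as a chain of plain regex passes. This is only an illustration of the sequence, not the medkit implementation: the helper names, the keyword list, the replacement characters, and the reading of the first step as "a point between two digits or between two upper case letters" are all assumptions, and span tracking is left out for brevity.

```python
import re

# Hypothetical keyword list; the real predefined keywords are not given here.
KEYWORDS = ("Dr", "Pr", "Mme")


def replace_points_in_acronyms_and_numbers(text: str) -> str:
    # Step 1 (assumed rules): drop points inside acronyms, use a comma in numbers.
    text = re.sub(r"(?<=[A-Z])\.(?=[A-Z])", "", text)
    return re.sub(r"(?<=\d)\.(?=\d)", ",", text)


def collapse_whitespace(text: str) -> str:
    # Step 2 (assumed rules): flatten newline runs, then squeeze repeated spaces.
    text = re.sub(r"\n{2,}", ". ", text)
    return re.sub(r"[ \t]{2,}", " ", text)


def remove_point_after_keywords(text: str) -> str:
    # Step 3 (assumed rule): a point right after a keyword is not a sentence end.
    return re.sub(rf"\b({'|'.join(KEYWORDS)})\.", r"\1", text)


def clean(text: str, handle_points: bool = True) -> str:
    """Apply the three steps in the order given in the description."""
    text = replace_points_in_acronyms_and_numbers(text)
    text = collapse_whitespace(text)
    if handle_points:
        text = remove_point_after_keywords(text)
    return text


raw = "Patient vu par le Dr. Martin\n\nI.R.M  normale, poids 72.5 kg"
print(clean(raw))
```

Running the point handling last means that the sentence breaks inserted by the whitespace step are already in place, so only the points attached to keywords are touched.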