medkit.text.preprocessing.eds_cleaner

Classes

EDSCleaner

EDS pre-processing annotation module.

Module Contents

class medkit.text.preprocessing.eds_cleaner.EDSCleaner(output_label: str = _DEFAULT_LABEL, keep_endlines: bool = False, handle_parentheses_eds: bool = True, handle_points_eds: bool = True, uid: str | None = None)

Bases: medkit.core.Operation

EDS pre-processing annotation module.

This module non-destructively removes or cleans selected points (periods) and newline characters. Rather than overwriting the input, it creates a new text-bound annotation whose spans record how the cleaned text maps back to the input text.
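The idea of a non-destructive replacement can be illustrated with a minimal, standard-library sketch. It is not medkit's implementation: the function name `replace_with_mapping` and the tuple layout of the mapping are hypothetical, chosen only to show how a cleaned text can keep track of the original character ranges it came from.

```python
import re


def replace_with_mapping(text, pattern, repl):
    """Replace every match of `pattern` with `repl` and record, for each
    chunk of the cleaned text, the range of the original text it covers.

    Hypothetical helper for illustration; each mapping entry is
    (clean_start, clean_end, orig_start, orig_end, modified).
    """
    parts = []
    mapping = []
    orig_pos = 0
    clean_pos = 0
    for match in re.finditer(pattern, text):
        # Unchanged text before the match maps one-to-one.
        if match.start() > orig_pos:
            chunk = text[orig_pos:match.start()]
            parts.append(chunk)
            mapping.append(
                (clean_pos, clean_pos + len(chunk), orig_pos, match.start(), False)
            )
            clean_pos += len(chunk)
        # The replacement remembers the original range it stands for.
        parts.append(repl)
        mapping.append(
            (clean_pos, clean_pos + len(repl), match.start(), match.end(), True)
        )
        clean_pos += len(repl)
        orig_pos = match.end()
    # Trailing unchanged text.
    if orig_pos < len(text):
        chunk = text[orig_pos:]
        parts.append(chunk)
        mapping.append(
            (clean_pos, clean_pos + len(chunk), orig_pos, len(text), False)
        )
    return "".join(parts), mapping


clean, mapping = replace_with_mapping("Nom patient\n\n\nMme Dupont", r"\n{2,}", ". ")
```

Because every chunk of the cleaned text points back to a range of the input, an annotation found in the cleaned text can still be located in the original document.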

Parameters:
output_label : str, optional

The output label of the created annotations.

keep_endlines : bool, default=False

If True, replace multiple endlines with .\n. If False (default), replace multiple endlines with a point followed by whitespace (.\s).

handle_parentheses_eds : bool, default=True

If True (default), modify the text near parentheses or keywords according to predefined rules for French documents. If False, the text near parentheses or keywords is not modified.

handle_points_eds : bool, default=True

Whether to modify points near predefined keywords for French documents. If True (default), the points near keywords are modified. If False, the points near keywords are not modified.

uid : str, optional

Identifier of the pre-processing module.
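As a rough illustration of the keep_endlines option, the sketch below collapses runs of newlines in plain Python. It assumes that "multiple endlines" means two or more consecutive "\n" characters and that the whitespace replacement is a single space; the function `collapse_endlines` is hypothetical and the library's actual rules may differ.

```python
import re


def collapse_endlines(text: str, keep_endlines: bool = False) -> str:
    """Replace runs of newlines with a point plus a newline or a space.

    Hypothetical helper: the pattern (two or more "\n") is an assumption,
    not the rule used by EDSCleaner.
    """
    replacement = ".\n" if keep_endlines else ". "
    return re.sub(r"\n{2,}", replacement, text)


as_space = collapse_endlines("Motif\n\nDouleur", keep_endlines=False)
as_newline = collapse_endlines("Motif\n\nDouleur", keep_endlines=True)
```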

_DEFAULT_LABEL = 'clean_text'
init_args
output_label
keep_endlines
handle_parentheses_eds
handle_points_eds
run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment]

Run the module on a list of input segments and return a new list of segments.

Parameters:
segments : list of Segment

List of segments to normalize.

Returns:
list of Segment

List of cleaned segments.

_clean_segment_text(segment: medkit.core.text.Segment)

Clean up a segment non-destructively. First, remove points between numbers and uppercase letters.

Then remove multiple whitespace or newline characters. Finally, modify parentheses or points after keywords if necessary.
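The order of the steps above can be sketched on a plain string. This is only an approximation under stated assumptions: the regular expressions are guesses, a point is taken to be "between" characters when it is flanked on both sides by digits or by uppercase ASCII letters, the point is deleted rather than substituted, and the keyword and parentheses rules are left out because they are not described on this page. The name `clean_text_sketch` is hypothetical, and unlike the real method this sketch does not track spans.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Apply the first two cleaning steps to a plain string (illustration only)."""
    # Step 1 (assumed rules): drop a point flanked by digits or by
    # uppercase letters. Lookarounds leave the neighbours untouched.
    text = re.sub(r"(?<=\d)\.(?=\d)", "", text)
    text = re.sub(r"(?<=[A-Z])\.(?=[A-Z])", "", text)
    # Step 2 (assumed rules): collapse runs of spaces or tabs, then
    # replace runs of newlines with a point and a space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = re.sub(r"\n{2,}", ". ", text)
    # Step 3 (parentheses and keyword handling) is intentionally omitted.
    return text


cleaned = clean_text_sketch("Poids 1.500 g  stable\n\nI.R.M normale")
```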