medkit.text.preprocessing.eds_cleaner

Classes#

EDSCleaner: EDS pre-processing annotation module.

Module Contents#

class medkit.text.preprocessing.eds_cleaner.EDSCleaner(output_label: str = _DEFAULT_LABEL, keep_endlines: bool = False, handle_parentheses_eds: bool = True, handle_points_eds: bool = True, uid: str | None = None)#

Bases: medkit.core.Operation

EDS pre-processing annotation module.

This module non-destructively removes and cleans selected periods (points) and newline characters. It keeps track of span modifications by creating a new text-bound annotation that records how the cleaned text maps back to the input text.
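The idea of a non-destructive replacement can be sketched in plain Python. This is a minimal illustration and not medkit's implementation: the `Piece` class and the `replace_tracked` function are hypothetical names introduced here, and medkit's own span classes are not shown. Each piece of the cleaned text remembers the character range it came from in the input, so later annotations can be mapped back to the original.

```python
import re
from dataclasses import dataclass


@dataclass
class Piece:
    """Hypothetical record: a chunk of cleaned text and its source range."""

    text: str
    source: tuple[int, int]  # (start, end) offsets in the input text
    modified: bool  # True when the chunk replaces the source characters


def replace_tracked(text: str, pattern: str, repl: str) -> list[Piece]:
    """Replace every match of `pattern` while recording where each piece came from."""
    pieces = []
    cursor = 0
    for match in re.finditer(pattern, text):
        if match.start() > cursor:
            # Text before the match is copied unchanged.
            pieces.append(Piece(text[cursor:match.start()], (cursor, match.start()), False))
        # The replacement remembers the range of characters it stands for.
        pieces.append(Piece(repl, (match.start(), match.end()), True))
        cursor = match.end()
    if cursor < len(text):
        pieces.append(Piece(text[cursor:], (cursor, len(text)), False))
    return pieces


raw = "Poids   72 kg"
pieces = replace_tracked(raw, r" {2,}", " ")
cleaned = "".join(piece.text for piece in pieces)
print(cleaned)
for piece in pieces:
    print(repr(piece.text), piece.source, piece.modified)
```

Joining the pieces gives the cleaned text, while the `source` ranges still point into `raw`, so nothing about the original offsets is lost.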

Parameters:

output_label (str, optional): The label of the created annotations.
keep_endlines (bool, default=False): If True, replace multiple endlines with .\n. If False (default), replace multiple endlines with a point followed by whitespace (.\s).
handle_parentheses_eds (bool, default=True): If True (default), modify the text near parentheses or keywords according to predefined rules for French documents. If False, the text near parentheses or keywords is not modified.
handle_points_eds (bool, default=True): If True (default), modify the points (periods) near predefined keywords for French documents. If False, the points near keywords are not modified.
uid (str, optional): Identifier of the pre-processing module.

_DEFAULT_LABEL = 'clean_text'#

init_args#

output_label#

keep_endlines#

handle_parentheses_eds#

handle_points_eds#

run(segments: list[medkit.core.text.Segment]) → list[medkit.core.text.Segment]#

Run the module on a list of input segments and return a new list of segments.

Parameters:

segments (list of Segment): List of segments to normalize.

Returns:

list of Segment: List of cleaned segments.
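A typical call might look like the sketch below. It relies on names that do not appear on this page and are taken from medkit's public API as understood here: `TextDocument`, its `raw_segment` attribute, and the `medkit.text.preprocessing` import path for `EDSCleaner`. The helper `clean_note` is hypothetical, and it assumes `run()` yields one cleaned segment for the single input segment.

```python
def clean_note(text: str, keep_endlines: bool = False) -> str:
    """Clean raw text with EDSCleaner and return the cleaned text (hypothetical helper)."""
    # Imported lazily so the helper can be defined even where medkit is not installed.
    from medkit.core.text import TextDocument
    from medkit.text.preprocessing import EDSCleaner

    doc = TextDocument(text=text)
    cleaner = EDSCleaner(keep_endlines=keep_endlines)
    # run() takes a list of segments and returns a new list of segments;
    # the document's raw segment covers its whole text.
    cleaned_segment = cleaner.run([doc.raw_segment])[0]
    return cleaned_segment.text
```

With medkit installed, `clean_note("...")` returns the cleaned text. The segment returned by `run()` also carries the span information needed to map later annotations back to the original document.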

_clean_segment_text(segment: medkit.core.text.Segment)#

Clean up a segment non-destructively. First, remove points (periods) that sit between numbers or between uppercase letters.

Then collapse repeated whitespace or newline characters. Finally, modify parentheses or points after keywords if necessary.
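The order of the first two steps can be sketched on plain strings. This is a rough illustration under stated assumptions and not the library's code: the function names and regular expressions are hypothetical. The description above says the points are removed; this sketch substitutes a comma inside numbers and a space between uppercase letters so the sample stays readable, which is an illustrative choice. The final step is left out because its keyword and parenthesis rules are predefined and not described here.

```python
import re


def neutralize_points(text: str) -> str:
    """Step 1 (assumed behavior): drop points that are not sentence ends."""
    # Assumption: a point between two digits becomes a comma.
    text = re.sub(r"(?<=\d)\.(?=\d)", ",", text)
    # Assumption: a point between two uppercase letters becomes a space.
    return re.sub(r"(?<=[A-Z])\.(?=[A-Z])", " ", text)


def normalize_whitespace(text: str, keep_endlines: bool = False) -> str:
    """Step 2 (assumed behavior): collapse repeated spaces and newlines."""
    # Runs of spaces or tabs shrink to a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Runs of newlines become a sentence break, keeping one newline if requested.
    return re.sub(r"\n{2,}", ".\n" if keep_endlines else ". ", text)


def clean_text(text: str, keep_endlines: bool = False) -> str:
    """Apply the two sketched steps in order."""
    return normalize_whitespace(neutralize_points(text), keep_endlines)


sample = "Dose 2.5 mg  ING.DRT\n\nSuite"
print(clean_text(sample))                      # Dose 2,5 mg ING DRT. Suite
print(clean_text(sample, keep_endlines=True))  # Dose 2,5 mg ING DRT. + newline + Suite
```

Handling the points first means the only points left in the text are the ones that step 2 inserts as sentence breaks or that genuinely end a sentence.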