medkit.text.preprocessing#
Submodules#
Attributes#
Classes#
- CharReplacer: Generic character replacer to be used as pre-processing module.
- DuplicateFinder: Detect duplicated chunks of text across a collection of text documents.
- DuplicationAttribute: Attribute indicating if some text is a duplicate of some other text in another document.
- EDSCleaner: EDS pre-processing annotation module.
- RegexpReplacer: Generic pattern replacer to be used as pre-processing module.
Package Contents#
- class medkit.text.preprocessing.CharReplacer(output_label: str, rules: list[tuple[str, str]] | None = None, name: str | None = None, uid: str | None = None)#
Bases:
medkit.core.operation.Operation
Generic character replacer to be used as pre-processing module.
This module is non-destructive: it replaces selected single-character strings with the desired n-character strings. It preserves span information by creating a new text-bound annotation that records how the spans were modified relative to the input text.
- Parameters:
- output_label : str
The output label of the created annotations
- rules : list of tuple, optional
The list of replacement rules. Default: ALL_CHAR_RULES
- name : str, optional
Name describing the pre-processing module (defaults to the class name)
- uid : str, optional
Identifier of the pre-processing module
- init_args#
- output_label#
- rules#
- run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment] #
Run the module on a list of segments provided as input and return a new list of segments.
- Parameters:
- segments : list of Segment
List of segments to process
- Returns:
- list of Segment
List of new segments
- _process_segment_text(segment: medkit.core.text.Segment)#
- medkit.text.preprocessing.ALL_CHAR_RULES#
- medkit.text.preprocessing.DOT_RULES = [('…', '...'), ('⋯', '...')]#
- medkit.text.preprocessing.FRACTION_RULES = [('¼', '1/4'), ('½', '1/2'), ('¾', '3/4'), ('⅐', '1/7'), ('⅑', '1/9'), ('⅒', '1/10'), ('⅓',...#
- medkit.text.preprocessing.LIGATURE_RULES = [('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'), ('œ', 'oe')]#
- medkit.text.preprocessing.QUOTATION_RULES = [('»', '"'), ('«', '"'), ('“', '"'), ('”', '"'), ('„', '"'), ('‟', '"'), ('‹', '"'), ('›', '"'),...#
- medkit.text.preprocessing.SIGN_RULES = [('©', ''), ('®', ''), ('™', '')]#
- medkit.text.preprocessing.SPACE_RULES = [('\xa0', ' '), ('\u1680', ' '), ('\u2002', ' '), ('\u2003', ' '), ('\u2004', ' '), ('\u2005', '...#
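To illustrate what non-destructive replacement means, here is a minimal sketch in plain Python. It does not use medkit: `replace_chars` is a hypothetical helper, and only the rule values are copied from the listings above. It returns the rewritten text together with the original span of every replaced character, which is the kind of information CharReplacer is described as keeping in its output annotations.

```python
# Rule values copied from the listings above; everything else is illustrative.
DOT_RULES = [("…", "..."), ("⋯", "...")]
LIGATURE_RULES = [("Æ", "AE"), ("æ", "ae"), ("Œ", "OE"), ("œ", "oe")]
SIGN_RULES = [("©", ""), ("®", ""), ("™", "")]


def replace_chars(text, rules):
    """Hypothetical helper: apply 1-char -> n-char rules and keep original spans.

    Returns the new text and a list of (replacement, (orig_start, orig_end))
    entries, one per replaced character.
    """
    mapping = dict(rules)
    pieces = []
    modified = []
    for index, char in enumerate(text):
        replacement = mapping.get(char, char)
        pieces.append(replacement)
        if char in mapping:
            modified.append((replacement, (index, index + 1)))
    return "".join(pieces), modified


text = "Cœur normal… ©"
new_text, modified = replace_chars(text, DOT_RULES + LIGATURE_RULES + SIGN_RULES)
print(new_text)
print(modified)
```

Because each replacement is stored with its original offsets, the input text is never lost and downstream annotations could be mapped back to it.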
- class medkit.text.preprocessing.DuplicateFinder(output_label: str, segments_to_output: typing_extensions.Literal[dup, nondup, both] = 'dup', min_duplicate_length: int = 5, fingerprint_type: typing_extensions.Literal[char, word] = 'word', fingerprint_length: int = 2, date_metadata_key: str | None = None, case_sensitive: bool = True, allow_multiline: bool = True, orf: int = 1)#
Bases:
medkit.core.Operation
Detect duplicated chunks of text across a collection of text documents.
When a duplicated chunk of text is found, a segment is created on the newest document covering the span that is duplicated. A DuplicationAttribute having "is_duplicate" as label and True as value is attached to the segment. It can later be propagated to the entities created from the duplicate segments. The attribute also holds the id of the source document from which the text was copied, the spans of the text in the source document, and optionally the date of the source document if provided.
Optionally, segments can also be created for non-duplicate zones to make it easier to process only those parts of the documents. For these segments, the attribute value is False and the source, spans and date fields are None.
NB: better performance may be achieved by installing the ncls Python package, which the duptextfinder library will then use.
- Parameters:
- output_label : str
Label of created segments
- segments_to_output : str, default="dup"
Type of segments to create: only duplicate segments ("dup"), only non-duplicate segments ("nondup"), or both ("both")
- min_duplicate_length : int, default=5
Minimum length of duplicated segments, in characters (shorter segments will be discarded)
- fingerprint_type : str, default="word"
Base unit to use for fingerprinting (either "char" or "word")
- fingerprint_length : int, default=2
Number of chars or words in each fingerprint. If fingerprint_type is set to "char", this should be the same value as min_duplicate_length. If fingerprint_type is set to "word", this should be around the average word size multiplied by min_duplicate_length
- date_metadata_key : str, optional
Key used to retrieve the date of each document from its metadata dict. When provided, the date determines which document is the source of a duplicate (the older one) and which is the recipient (the newer one). If None, the order of the documents in the collection is used.
- case_sensitive : bool, default=True
Whether duplication detection should be case-sensitive or not
- allow_multiline : bool, default=True
Whether detected duplicates can span multiple lines, or each line should be handled separately
- orf : int, default=1
Step size when building fingerprints, cf. the duptextfinder documentation
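The role of fingerprint_length and min_duplicate_length can be sketched with a simplified, character-based detector. This is not the duptextfinder algorithm: `find_duplicates` is a hypothetical helper that indexes every character n-gram of an older text, looks up the n-grams of a newer text, and extends each hit for as long as both texts agree. Matches shorter than the minimum length are discarded.

```python
def find_duplicates(source, target, fingerprint_length=5, min_duplicate_length=5):
    """Hypothetical helper: return (target_start, target_end, source_start) tuples."""
    # Index every character n-gram of the source text by its start offsets.
    index = {}
    for i in range(len(source) - fingerprint_length + 1):
        index.setdefault(source[i:i + fingerprint_length], []).append(i)

    duplicates = []
    t = 0
    while t <= len(target) - fingerprint_length:
        best_len, best_src = 0, None
        for s in index.get(target[t:t + fingerprint_length], []):
            # Extend the match to the right while both texts agree.
            n = fingerprint_length
            while (
                s + n < len(source)
                and t + n < len(target)
                and source[s + n] == target[t + n]
            ):
                n += 1
            if n > best_len:
                best_len, best_src = n, s
        if best_len >= min_duplicate_length:
            duplicates.append((t, t + best_len, best_src))
            t += best_len  # skip the chunk that was just reported
        else:
            t += 1
    return duplicates


older = "Patient admitted for chest pain. No fever."
newer = "Follow-up visit. Patient admitted for chest pain. Stable."
dups = find_duplicates(older, newer, fingerprint_length=5, min_duplicate_length=10)
for start, end, src in dups:
    print(repr(newer[start:end]), "copied from offset", src)
```

With fingerprint_type set to "word", the same idea would apply to word n-grams rather than character n-grams.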
- _NON_EMPTY_REGEXP#
- init_args#
- output_label#
- _output_duplicate#
- _output_nondup#
- date_metadata_key#
- fingerprint_type#
- fingerprint_length#
- min_duplicate_length#
- orf#
- case_sensitive#
- allow_multiline#
- run(collections: list[medkit.core.Collection])#
Find duplicates in each collection of documents.
For each duplicate found, a Segment object with a DuplicationAttribute will be created and attached to the document that is the recipient of the duplication (i.e. not the source document).
- _find_duplicate_in_docs(docs: list[medkit.core.text.TextDocument])#
Find duplicates among a set of documents.
- _find_duplicates_in_doc(doc: medkit.core.text.TextDocument, duplicate_finder: duptextfinder.DuplicateFinder, docs_by_id: dict[str, medkit.core.text.TextDocument])#
Find duplicates between a document and previously processed documents.
- Parameters:
- doc : TextDocument
Document in which to look for duplicates
- duplicate_finder : DuplicateFinder
Duplicate finder to use, that has already processed previous documents if any
- docs_by_id : dict of str to TextDocument
Previously processed documents, by id
- _create_nondup_segment(target_segment, range_)#
Create a segment representing a non-duplicated zone.
- _create_duplicate_segment(target_segment, target_range, source_doc, source_range)#
Create a segment representing a duplicated zone.
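The three values of segments_to_output can be pictured as a partition of a document into duplicated and non-duplicated ranges. The sketch below is an assumption about the general idea, not medkit's code: `split_ranges` is a hypothetical helper that takes the duplicate ranges found in a text and derives the complementary non-duplicate ranges, tagging each range with the boolean that the DuplicationAttribute value would carry.

```python
def split_ranges(text_length, dup_ranges, segments_to_output="dup"):
    """Hypothetical helper: return ((start, end), is_duplicate) pairs sorted by start."""
    if segments_to_output not in ("dup", "nondup", "both"):
        raise ValueError(f"unexpected segments_to_output: {segments_to_output!r}")

    dup_ranges = sorted(dup_ranges)

    # Everything not covered by a duplicate range is a non-duplicate zone.
    nondup_ranges = []
    cursor = 0
    for start, end in dup_ranges:
        if start > cursor:
            nondup_ranges.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < text_length:
        nondup_ranges.append((cursor, text_length))

    tagged = []
    if segments_to_output in ("dup", "both"):
        tagged += [(r, True) for r in dup_ranges]
    if segments_to_output in ("nondup", "both"):
        tagged += [(r, False) for r in nondup_ranges]
    return sorted(tagged)


print(split_ranges(57, [(17, 50)], "both"))
```

Requesting "nondup" makes it easy to run later operations only on text that was not copied from an earlier document.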
- class medkit.text.preprocessing.DuplicationAttribute(value: bool, source_doc_id: str | None = None, source_spans: list[medkit.core.text.AnySpan] | None = None, source_doc_date: Any | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)#
Bases:
medkit.core.Attribute
Attribute indicating if some text is a duplicate of some other text in another document.
- Attributes:
- uid : str
Identifier of the attribute
- label : str
The attribute label, always set to DuplicationAttribute.LABEL
- value : Any, optional
True if the segment or entity to which the attribute belongs is a duplicate of part of another document, False otherwise.
- source_doc_id : str, optional
Identifier of the document from which the text was copied
- source_spans : list of AnySpan, optional
Spans of the duplicated text in the source document
- source_doc_date : Any, optional
Date of the source document, if known
- source_doc_id: str | None#
- source_spans: list[medkit.core.text.AnySpan] | None#
- source_doc_date: Any | None#
- LABEL: ClassVar[str] = 'is_duplicate'#
Label used for all duplication attributes
- to_dict() -> dict[str, Any] #
- classmethod from_dict(attr_dict: dict[str, Any]) -> typing_extensions.Self #
Create an Attribute from a dict.
- Parameters:
- attr_dict : dict of str to Any
A dictionary from a serialized Attribute as generated by to_dict()
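The to_dict() / from_dict() pair describes a serialization round trip. The sketch below shows that pattern with a hypothetical stand-in class, `DuplicationInfo`, built on a standard dataclass. The field names and the LABEL value come from this reference, but the dict layout (including the "label" key) is an assumption and may differ from what medkit actually produces.

```python
from dataclasses import asdict, dataclass
from typing import Any, ClassVar, Dict, List, Optional, Tuple


@dataclass
class DuplicationInfo:
    """Hypothetical stand-in for DuplicationAttribute, for illustration only."""

    LABEL: ClassVar[str] = "is_duplicate"

    value: bool
    source_doc_id: Optional[str] = None
    # Spans are simplified to (start, end) tuples in this sketch.
    source_spans: Optional[List[Tuple[int, int]]] = None
    source_doc_date: Optional[Any] = None

    def to_dict(self) -> Dict[str, Any]:
        data = asdict(self)
        data["label"] = self.LABEL  # assumed key, not confirmed by the reference
        return data

    @classmethod
    def from_dict(cls, attr_dict: Dict[str, Any]) -> "DuplicationInfo":
        # Drop the label, which is a class constant rather than a field.
        fields = {k: v for k, v in attr_dict.items() if k != "label"}
        return cls(**fields)


attr = DuplicationInfo(value=True, source_doc_id="doc-1", source_spans=[(0, 33)])
restored = DuplicationInfo.from_dict(attr.to_dict())
print(restored == attr)
```

For a non-duplicate zone, only `value=False` would be set and the source fields would stay None, as described for DuplicateFinder.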
- class medkit.text.preprocessing.EDSCleaner(output_label: str = _DEFAULT_LABEL, keep_endlines: bool = False, handle_parentheses_eds: bool = True, handle_points_eds: bool = True, uid: str | None = None)#
Bases:
medkit.core.Operation
EDS pre-processing annotation module.
This module is non-destructive: it removes and cleans selected points and newline characters. It preserves span information by creating a new text-bound annotation that records how the spans were modified relative to the input text.
- Parameters:
- output_label : str, optional
The output label of the created annotations.
- keep_endlines : bool, default=False
If True, replace multiple endlines with ".\n". If False (default), replace multiple endlines with a point followed by whitespace (".\s").
- handle_parentheses_eds : bool, default=True
If True (default), modify the text near parentheses or keywords according to predefined rules for French documents. If False, the text near parentheses or keywords is not modified.
- handle_points_eds : bool, default=True
Whether to modify points near predefined keywords for French documents. If True (default), points near keywords are modified. If False, points near keywords are not modified.
- uid : str, optional
Identifier of the pre-processing module
- _DEFAULT_LABEL = 'clean_text'#
- init_args#
- output_label#
- keep_endlines#
- handle_parentheses_eds#
- handle_points_eds#
- run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment] #
Run the module on a list of segments provided as input and return a new list of segments.
- Parameters:
- segments : list of Segment
List of segments to normalize
- Returns:
- list of Segment
List of cleaned segments.
- _clean_segment_text(segment: medkit.core.text.Segment)#
Clean up a segment non-destructively. First, remove points between numbers and upper-case letters.
Then remove multiple whitespace or newline characters. Finally, modify parentheses or points after keywords if necessary.
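The effect of keep_endlines can be approximated with two regular expressions. This is only a sketch of the behaviour described for that parameter, not the EDSCleaner implementation: `collapse_endlines` is a hypothetical helper, the exact patterns medkit uses are not given in this reference, and the keyword and parenthesis rules are left out entirely.

```python
import re


def collapse_endlines(text, keep_endlines=False):
    """Hypothetical helper: collapse repeated line breaks and repeated blanks."""
    replacement = ".\n" if keep_endlines else ". "
    # Two or more line breaks (possibly separated by blanks) become one sentence break.
    text = re.sub(r"[ \t]*\n(?:[ \t]*\n)+[ \t]*", replacement, text)
    # Runs of spaces or tabs become a single space.
    return re.sub(r"[ \t]{2,}", " ", text)


raw = "Motif  de consultation\n\n\nDouleur thoracique"
print(repr(collapse_endlines(raw)))
print(repr(collapse_endlines(raw, keep_endlines=True)))
```

A real cleaner would also need to record the original span of each change to stay non-destructive; that bookkeeping is omitted here to keep the sketch short.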
- class medkit.text.preprocessing.RegexpReplacer(output_label: str, rules: list[tuple[str, str]] | None = None, name: str | None = None, uid: str | None = None)#
Bases:
medkit.core.operation.Operation
Generic pattern replacer to be used as pre-processing module.
This module is non-destructive: it replaces text matching a regex pattern with new text. It preserves span information by creating a new text-bound annotation that records how the spans were modified relative to the input text.
- Parameters:
- output_label : str
The output label of the created annotations
- rules : list of tuple, optional
The list of replacement rules [(pattern_to_replace, new_text)]
- name : str, optional
Name describing the pre-processing module (defaults to the class name)
- uid : str, optional
Identifier of the pre-processing module
- init_args#
- output_label#
- rules#
- regex_rules#
- regex_rule#
- _pattern#
- run(segments: list[medkit.core.text.Segment]) -> list[medkit.core.text.Segment] #
Run the module on a list of segments provided as input and return a new list of segments.
- Parameters:
- segments : list of Segment
List of segments to normalize
- Returns:
- list of Segment
List of normalized segments
- _normalize_segment_text(segment: medkit.core.text.Segment)#
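As a rough illustration of rules in the form [(pattern_to_replace, new_text)], the sketch below combines all patterns into one alternation and records the original span of each match. It relies only on Python's `re` module. `apply_rules` and the sample rules are hypothetical, and whether RegexpReplacer builds its pattern this way is an assumption.

```python
import re


def apply_rules(text, rules):
    """Hypothetical helper: apply (pattern, new_text) rules and keep original spans."""
    # One named group per rule, so each match can be traced back to its rule.
    combined = re.compile(
        "|".join(f"(?P<r{i}>{pattern})" for i, (pattern, _) in enumerate(rules))
    )
    pieces, replaced, cursor = [], [], 0
    for match in combined.finditer(text):
        pieces.append(text[cursor:match.start()])  # unchanged text before the match
        new_text = rules[int(match.lastgroup[1:])][1]
        pieces.append(new_text)
        replaced.append((match.span(), new_text))
        cursor = match.end()
    pieces.append(text[cursor:])
    return "".join(pieces), replaced


# Sample rules invented for this example.
rules = [(r"\bttt\b", "traitement"), (r"\s{2,}", " ")]
text = "Poursuite du ttt  habituel"
new_text, replaced = apply_rules(text, rules)
print(new_text)
print(replaced)
```

Scanning the text once with a combined pattern means the recorded spans always refer to the original text, which would not hold if the rules were applied one after another.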