medkit.text.preprocessing.duplicate_finder#

Classes#

DuplicationAttribute

Attribute indicating if some text is a duplicate of some other text in another document.

DuplicateFinder

Detect duplicated chunks of text across a collection of text documents.

Module Contents#

class medkit.text.preprocessing.duplicate_finder.DuplicationAttribute(value: bool, source_doc_id: str | None = None, source_spans: list[medkit.core.text.AnySpan] | None = None, source_doc_date: Any | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)#

Bases: medkit.core.Attribute

Attribute indicating if some text is a duplicate of some other text in another document.

Attributes:
uid : str

Identifier of the attribute

label : str

The attribute label, always set to DuplicationAttribute.LABEL

value : bool

True if the segment or entity to which the attribute belongs is a duplicate of part of another document, False otherwise.

source_doc_id : str, optional

Identifier of the document from which the text was copied

source_spans : list of AnySpan, optional

Spans of the duplicated text in the source document

source_doc_date : Any, optional

Date of the source document, if known

source_doc_id: str | None#
source_spans: list[medkit.core.text.AnySpan] | None#
source_doc_date: Any | None#
LABEL: ClassVar[str] = 'is_duplicate'#

Label used for all duplication attributes

to_dict() -> dict[str, Any]#
classmethod from_dict(attr_dict: dict[str, Any]) -> typing_extensions.Self#

Create an Attribute from a dict.

Parameters:
attr_dict : dict of str to Any

A dictionary from a serialized Attribute, as generated by to_dict()

class medkit.text.preprocessing.duplicate_finder.DuplicateFinder(output_label: str, segments_to_output: typing_extensions.Literal['dup', 'nondup', 'both'] = 'dup', min_duplicate_length: int = 5, fingerprint_type: typing_extensions.Literal['char', 'word'] = 'word', fingerprint_length: int = 2, date_metadata_key: str | None = None, case_sensitive: bool = True, allow_multiline: bool = True, orf: int = 1)#

Bases: medkit.core.Operation

Detect duplicated chunks of text across a collection of text documents.

When a duplicated chunk of text is found, a segment is created on the newest document covering the span that is duplicated. A DuplicationAttribute having “is_duplicate” as label and True as value is attached to the segment. It can later be propagated to the entities created from the duplicate segments.

The attribute also holds the id of the source document from which the text was copied, the spans of the text in the source document, and optionally the date of the source document if provided.

Optionally, segments can also be created for non-duplicate zones to make it easier to process only those parts of the documents. For these segments, the attribute value is False and the source, spans and date fields are None.
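The relationship between duplicate and non-duplicate zones can be sketched in plain Python. This is only an illustration of the idea, not medkit's implementation: the helper `non_duplicate_ranges` and the character ranges are hypothetical, and it assumes duplicate zones are given as `(start, end)` character offsets.

```python
def non_duplicate_ranges(text_length, duplicate_ranges):
    """Return the (start, end) ranges not covered by any duplicate range.

    Hypothetical helper for illustration only; medkit builds Segment
    objects rather than plain tuples.
    """
    nondup = []
    cursor = 0
    for start, end in sorted(duplicate_ranges):
        if start > cursor:
            # Gap between the previous duplicate zone and this one
            nondup.append((cursor, start))
        # Overlapping duplicate zones are merged implicitly
        cursor = max(cursor, end)
    if cursor < text_length:
        # Trailing text after the last duplicate zone
        nondup.append((cursor, text_length))
    return nondup


text = "Patient stable. History: asthma since 2010. Plan: follow-up."
# Assume the "History" sentence was copied from an older document
dup_start = text.index("History")
dup_end = text.index(" Plan")
for start, end in non_duplicate_ranges(len(text), [(dup_start, dup_end)]):
    print(repr(text[start:end]))
```

With segments_to_output set to “both”, the duplicate zone and the zones printed here would each be represented by a segment, the former with a True attribute value and the latter with False.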

NB: better performance may be achieved by installing the ncls Python package, which will then be used by the duptextfinder library.

Parameters:
output_label : str

Label of created segments

segments_to_output : str, default=“dup”

Type of segments to create: only duplicate segments (“dup”), only non-duplicate segments (“nondup”), or both (“both”)

min_duplicate_length : int, default=5

Minimum length of duplicated segments, in characters (shorter segments will be discarded)

fingerprint_type : str, default=“word”

Base unit to use for fingerprinting (either “char” or “word”)

fingerprint_length : int, default=2

Number of chars or words in each fingerprint. If fingerprint_type is set to “char”, this should be the same value as min_duplicate_length. If fingerprint_type is set to “word”, this should be around min_duplicate_length divided by the average word size (in characters)

date_metadata_key : str, optional

Key to use to retrieve the date of each document from its metadata dict. When provided, this is used to determine which document should be the source of a duplicate (the older) and which document should be the recipient (the newer). If None, the order of the documents in the collection will be used.

case_sensitive : bool, default=True

Whether duplication detection should be case-sensitive or not

allow_multiline : bool, default=True

Whether detected duplicates can span multiple lines, or each line should be handled separately

orf : int, default=1

Step size when building fingerprints, see the duptextfinder documentation
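To make the fingerprinting parameters more concrete, here is a simplified sketch of word-based fingerprints. It is not the duptextfinder algorithm: the function `word_fingerprints` is hypothetical, it splits on whitespace only, and it interprets orf literally as a stride between consecutive fingerprints, based solely on the "step size" description above.

```python
def word_fingerprints(text, fingerprint_length=2, orf=1, case_sensitive=True):
    """Build overlapping word n-grams from a text (illustrative only)."""
    if not case_sensitive:
        # Case-insensitive matching: normalize before fingerprinting
        text = text.lower()
    words = text.split()
    last_start = len(words) - fingerprint_length
    return [
        " ".join(words[i:i + fingerprint_length])
        # Advance by `orf` words between fingerprints (assumed meaning)
        for i in range(0, last_start + 1, orf)
    ]


print(word_fingerprints("no known drug allergies"))
print(word_fingerprints("no known drug allergies", orf=2))
```

Two documents that share a fingerprint are candidates for containing a duplicated chunk. In this sketch, a larger step yields fewer fingerprints to store and compare.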

_NON_EMPTY_REGEXP#
init_args#
output_label#
_output_duplicate#
_output_nondup#
date_metadata_key#
fingerprint_type#
fingerprint_length#
min_duplicate_length#
orf#
case_sensitive#
allow_multiline#
run(collections: list[medkit.core.Collection])#

Find duplicates in each collection of documents.

For each duplicate found, a Segment object with a DuplicationAttribute will be created and attached to the document that is the recipient of the duplication (i.e. not the source document).
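The choice between source and recipient described for date_metadata_key can be sketched as an ordering step. This is a minimal illustration under stated assumptions, not medkit's code: documents are represented by plain dicts instead of TextDocument objects, the helper `order_by_age` and the key "creation_date" are hypothetical, and dates are assumed to be directly comparable (here ISO 8601 strings).

```python
def order_by_age(docs, date_metadata_key=None):
    """Return documents ordered from oldest to newest (illustrative only)."""
    if date_metadata_key is None:
        # No date key: keep the order of the documents in the collection
        return list(docs)
    return sorted(docs, key=lambda doc: doc["metadata"][date_metadata_key])


# Stand-in documents; real code would use medkit TextDocument objects
docs = [
    {"uid": "note_b", "metadata": {"creation_date": "2021-03-02"}},
    {"uid": "note_a", "metadata": {"creation_date": "2020-11-15"}},
]
ordered = order_by_age(docs, date_metadata_key="creation_date")
print([doc["uid"] for doc in ordered])
```

Processing documents in this order means that each document is compared only against older ones, so the older document is always the source and the newer one receives the duplicate segment.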

_find_duplicate_in_docs(docs: list[medkit.core.text.TextDocument])#

Find duplicates among a set of documents.

_find_duplicates_in_doc(doc: medkit.core.text.TextDocument, duplicate_finder: duptextfinder.DuplicateFinder, docs_by_id: dict[str, medkit.core.text.TextDocument])#

Find duplicates between a document and previously processed documents.

Parameters:
doc : TextDocument

Document in which to look for duplicates

duplicate_finder : duptextfinder.DuplicateFinder

Duplicate finder to use, that has already processed previous documents if any

docs_by_id : dict of str to TextDocument

Previously processed documents, by id

_create_nondup_segment(target_segment, range_)#

Create a segment representing a non-duplicated zone.

_create_duplicate_segment(target_segment, target_range, source_doc, source_range)#

Create a segment representing a duplicated zone.