Detecting text duplicates

Medkit provides support for detecting duplicates (zones of identical text) across a set of documents through the DuplicateFinder operation, which itself relies on the duptextfinder library developed at the HEGP.

No optional dependencies are required to use DuplicateFinder, but it may run faster if the ncls package is installed:

pip install ncls

Using collections to group documents

Unlike most other operations, DuplicateFinder takes as input a list of Collection objects containing text documents, rather than simply a list of documents. This is because, when dealing with large document bases, looking for duplicates across all pairs of documents would be very expensive. Instead, we group together documents that are likely to share duplicates, for instance documents related to the same patient, and only look for duplicates across documents belonging to the same group.

For the purpose of this tutorial, we have created 2 folders, each containing 2 text files about the same patient. The contents of one of the first patient’s documents were copy-pasted into the other.
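Here is the layout on disk (the names of the second patient’s folder and files are assumed to follow the same scheme):

data/duplicate_detection/
├── patient_1/
│   ├── a10320aa-2008_04_13.txt
│   └── f1d3e530-2008_04_14.txt
└── patient_2/
    └── ...

Let’s look at the contents of the two files of the first patient: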

from pathlib import Path

main_dir = Path("data/duplicate_detection")
file_1 = main_dir / "patient_1/a10320aa-2008_04_13.txt"
print(file_1.read_text())
file_2 = main_dir / "patient_1/f1d3e530-2008_04_14.txt"
print(file_2.read_text())

Let’s create a list of collections, with one collection per patient:

from medkit.core import Collection
from medkit.core.text import TextDocument

# iterate over each subdirectory containing patient files
collections = []
for patient_subdir in sorted(main_dir.glob("*")):
    # create one TextDocument per .txt file
    docs = TextDocument.from_dir(patient_subdir)
    # group them in a Collection
    collection = Collection(text_docs=docs)
    collections.append(collection)
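As a quick sanity check, we can verify that we obtained one collection per patient, each holding 2 documents:

# we expect one collection per patient, each holding 2 documents
for i, collection in enumerate(collections):
    print(f"collection {i}: {len(collection.text_docs)} document(s)")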

Identifying duplicated zones

Let’s now instantiate a duplicate finder and run it on our collections:

from medkit.text.preprocessing import DuplicateFinder

dup_finder = DuplicateFinder(output_label="duplicate")
dup_finder.run(collections)

for collection in collections:
    for doc in collection.text_docs:
        dup_segs = doc.anns.get(label="duplicate")
        for dup_seg in dup_segs:
            print(repr(dup_seg.text))
            attr = dup_seg.attrs.get(label="is_duplicate")[0]
            print(f"{attr.label}={attr.value}")
            print(f"source_doc_id={attr.source_doc_id}")
            print(f"source_spans={attr.source_spans}")

As you can see, one duplicated zone has been detected and a segment has been created to identify the zone. A DuplicationAttribute is attached to it, with information about the source document and spans from which the text was duplicated.
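Since the attribute stores the source document’s identifier and spans, we can recover the original text it points to. Here is a minimal sketch, assuming source_spans holds regular medkit spans with start and end character offsets into the source document’s raw text:

# build a lookup from document uid to document, for the first patient
docs_by_uid = {doc.uid: doc for doc in collections[0].text_docs}

for doc in collections[0].text_docs:
    for dup_seg in doc.anns.get(label="duplicate"):
        attr = dup_seg.attrs.get(label="is_duplicate")[0]
        # retrieve the document the text was duplicated from
        source_doc = docs_by_uid[attr.source_doc_id]
        # extract the original text using each span's start/end offsets
        for span in attr.source_spans:
            print(repr(source_doc.text[span.start : span.end]))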

Using dates to differentiate sources and duplicates

When the DuplicateFinder encounters 2 identical pieces of text in 2 different documents, it has to decide which one is the original and which one is the duplicate. Naturally, we want to consider the text from the oldest document as the source and the text from the newest as the duplicate.

The default behavior of DuplicateFinder is to assume that the text documents are sorted from oldest to newest in the collection. However, that is not necessarily the case. This is why DuplicateFinder also supports retrieving the date from the document metadata.

Let’s rebuild our collection of text documents, adding a "creation_date" entry to the metadata of each doc (that we extract from the filename for the purpose of the example):

collections = []
for patient_subdir in sorted(main_dir.glob("*")):
    docs = []
    for file in patient_subdir.glob("*.txt"):
        # example file name: 02e0b400-2012_01_29.txt
        # we extract the date from the 2nd part of the stem
        # (file.stem drops the ".txt" extension)
        date = file.stem.split("-")[1]
        # add the date to the document metadata under the "creation_date" key
        doc = TextDocument(text=file.read_text(), metadata={"creation_date": date})
        docs.append(doc)
    collection = Collection(text_docs=docs)
    collections.append(collection)

and let’s use that metadata when finding duplicates:

# tell DuplicateFinder to use the "creation_date" metadata to order documents
dup_finder = DuplicateFinder(output_label="duplicate", date_metadata_key="creation_date")
dup_finder.run(collections)

Note that the date metadata values must be sortable in a chronologically meaningful way. For instance, date strings in “YYYY-MM-DD” format sort correctly as plain strings, whereas “DD-MM-YYYY” strings do not.
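To illustrate with plain Python string sorting:

# "YYYY_MM_DD" strings sort chronologically when compared as plain strings
print(sorted(["2008_04_14", "2008_04_13"]))
# -> ['2008_04_13', '2008_04_14'] (correct chronological order)

# "DD_MM_YYYY" strings do not: 31/12/2007 predates 01/01/2008
print(sorted(["31_12_2007", "01_01_2008"]))
# -> ['01_01_2008', '31_12_2007'] (wrong chronological order)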

Ignoring duplicated zones

Most of the time, we will probably want to ignore duplicated zones, i.e. to work on segments identifying the non-duplicate zones of our documents. This can be achieved with the segments_to_output init parameter of DuplicateFinder. By default, it is set to "dup", which means that segments for duplicate zones will be added to documents. It can instead be set to "nondup", in which case segments for non-duplicate zones will be added to documents.

Let’s see an example of how to run a minimalistic NER pipeline on the non-duplicate zones of our documents:

from medkit.core import DocPipeline, Pipeline, PipelineStep
from medkit.text.segmentation import SentenceTokenizer
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule

# create segments for non-duplicate zones with label "nonduplicate"
dup_finder = DuplicateFinder(
    output_label="nonduplicate",
    segments_to_output="nondup",
    date_metadata_key="creation_date",
)

# create a minimalistic NER pipeline
sentence_tok = SentenceTokenizer(output_label="sentence")
matcher = RegexpMatcher(
    rules=[RegexpMatcherRule(regexp=r"\binsuffisance\s*rénale\b", label="problem")]
)
pipeline = Pipeline(
    steps=[
        PipelineStep(sentence_tok, input_keys=["raw_text"], output_keys=["sentences"]),
        PipelineStep(matcher, input_keys=["sentences"], output_keys=["entities"]),
    ],
    input_keys=["raw_text"],
    output_keys=["entities"],
)

# use "nonduplicate" segments as input to the NER pipeline
doc_pipeline = DocPipeline(
    pipeline=pipeline,
    labels_by_input_key={"raw_text": ["nonduplicate"]},
)

# run everything
dup_finder.run(collections)
for collection in collections:
    doc_pipeline.run(collection.text_docs)

Let’s now visualize the annotations of the 2 documents of the first patient:

from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy

doc_1 = collections[0].text_docs[0]
displacy_data = medkit_doc_to_displacy(doc_1)
displacy.render(displacy_data, manual=True, style="ent")
doc_2 = collections[0].text_docs[1]
displacy_data = medkit_doc_to_displacy(doc_2)
displacy.render(displacy_data, manual=True, style="ent")

As expected, the “insuffisance rénale” entity was found only in the original report, and properly ignored in the more recent report into which it was copy-pasted.
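If you prefer a programmatic check over visual inspection, you can simply count the matched entities in each document:

# count the "problem" entities found in each of the first patient's documents
for doc in collections[0].text_docs:
    entities = doc.anns.get(label="problem")
    print(f"{doc.metadata['creation_date']}: {len(entities)} entity(ies)")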