Comparative evaluation of drug recognition methods#

This pipeline runs two tools (NER1 and NER2) that recognize drug names in clinical texts, compares their performance, and outputs two texts annotated by whichever tool performed best.
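As a rough illustration of what comparing the two tools can mean, the sketch below scores predicted entity spans against manual annotations with exact-match precision, recall and F1. The metric choice, the function name `span_prf` and the span offsets are assumptions made for this example; this section does not state which metric is actually used.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up spans, for illustration only
gold = {(10, 22, "drug"), (40, 51, "drug")}
ner1_spans = {(10, 22, "drug")}
ner2_spans = {(10, 22, "drug"), (40, 51, "drug"), (60, 65, "drug")}

scores = {"NER1": span_prf(gold, ner1_spans), "NER2": span_prf(gold, ner2_spans)}
# Keep the tool with the highest F1 (index 2 of each score tuple)
best = max(scores, key=lambda name: scores[name][2])
print(best, scores[best])
```

Exact matching is strict: a prediction that overlaps a gold span without sharing its boundaries counts as both a false positive and a false negative, so a more lenient overlap criterion may be preferable in practice.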

Overview of the pipeline#

[Figure: overview of the pipeline]

Data preparation#

Extract two clinical texts, with drug entities manually annotated, from the local data archive.

import os
import tarfile
import tempfile
from pathlib import Path

# path to local data
extract_to = Path(tempfile.mkdtemp())
data_tarfile = Path.cwd() / "data.tar.gz"

# extract the local archive (nothing is downloaded in this cell)
with tarfile.open(name=data_tarfile, mode="r|gz") as archive:
    archive.extractall(extract_to)
data_dir = extract_to / "data"

print(f"Data dir: {data_dir}")
Data dir: /tmp/tmpxoxyf_1e/data

Read text documents with medkit

from medkit.core.text import TextDocument

doc_dir = data_dir / "mtsamplesen" / "annotated_doc"
docs = TextDocument.from_dir(path=doc_dir, pattern='[A-Z0-9].txt', encoding='utf-8')

print(docs[0].text)
DISCHARGE DIAGNOSES:
1. Gram-negative rod bacteremia, final identification and susceptibilities still pending.
2. History of congenital genitourinary abnormalities with multiple surgeries before the 5th grade.
3. History of urinary tract infections of pyelonephritis.

OPERATIONS PERFORMED: Chest x-ray July 24, 2007, that was normal. Transesophageal echocardiogram July 27, 2007, that was normal. No evidence of vegetations. CT scan of the abdomen and pelvis July 27, 2007, that revealed multiple small cysts in the liver, the largest measuring 9 mm. There were 2-3 additional tiny cysts in the right lobe. The remainder of the CT scan was normal.

HISTORY OF PRESENT ILLNESS: Briefly, the patient is a 26-year-old white female with a history of fevers. For further details of the admission, please see the previously dictated history and physical.

HOSPITAL COURSE: Gram-negative rod bacteremia. The patient was admitted to the hospital with suspicion of endocarditis given the fact that she had fever, septicemia, and Osler nodes on her fingers. The patient had a transthoracic echocardiogram as an outpatient, which was equivocal, but a transesophageal echocardiogram here in the hospital was normal with no evidence of vegetations. The microbiology laboratory stated that the Gram-negative rod appeared to be anaerobic, thus raising the possibility of organisms like bacteroides. The patient does have a history of congenital genitourinary abnormalities which were surgically corrected before the fifth grade. We did a CT scan of the abdomen and pelvis, which only showed some benign appearing cysts in the liver. There was nothing remarkable as far as her kidneys, ureters, or bladder were concerned. I spoke with Dr. Leclerc of infectious diseases, and Dr. Leclerc asked me to talk to the patient about any contact with animals, given the fact that we have had a recent outbreak of tularemia here in Utah. Much to my surprise, the patient told me that she had multiple pet rats at home, which she was constantly in contact with. I ordered tularemia and leptospirosis serologies on the advice of Dr. Leclerc, and as of the day after discharge, the results of the microbiology still are not back yet. The patient, however, appeared to be responding well to levofloxacin. I gave her a 2-week course of 750 mg a day of levofloxacin, and I have instructed her to follow up with Dr. Leclerc in the meantime. Hopefully by then we will have a final identification and susceptibility on the organism and the tularemia and leptospirosis serologies will return. A thought of ours was to add doxycycline, but again the patient clinically appeared to be responding to the levofloxacin. In addition, I told the patient that it would be my recommendation to get rid of the rats. 
I told her that if indeed the rats were carriers of infection and she received a zoonotic infection from exposure to the rats, that she could be in ongoing continuing danger and her children could also potentially be exposed to a potentially lethal infection. I told her very clearly that she should, indeed, get rid of the animals. The patient seemed reluctant to do so at first, but I believe with some coercion from her family, that she finally came to the realization that this was a recommendation worth following.

DISPOSITION

DISCHARGE INSTRUCTIONS: Activity is as tolerated. Diet is as tolerated.

MEDICATIONS: Levaquin 750 mg daily x14 days.

Followup is with Dr. Leclerc of infectious diseases. I gave the patient the phone number to call on Monday for an appointment. Additional followup is also with Dr. Leclerc, her primary care physician. Please note that 40 minutes was spent in the discharge.

Pipeline definition#

Create and run a four-step doc pipeline that:

  1. Splits the texts into sentences

  2. Runs PII detection for deidentification

  3. Recognizes drug entities with NER1: a dictionary-based approach named UMLSMatcher

  4. Recognizes drug entities with NER2: a Transformer-based approach, see https://huggingface.co/samrawal/bert-large-uncased_med-ner
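The way these four steps chain together can be sketched in plain Python: each step reads its input under one key and stores its output under another, so both recognizers consume the same sentences. Everything below (the function names, the toy rules, the `run_steps` helper) is a hypothetical stand-in written for this illustration and is not medkit code.

```python
import re


def toy_split(text):
    # Step 1 stand-in: one sentence per non-empty line
    return [line.strip() for line in text.splitlines() if line.strip()]


def toy_pii(sentences):
    # Step 2 stand-in: a title followed by a capitalized name
    return [m for s in sentences for m in re.findall(r"Dr\. [A-Z]\w+", s)]


def toy_ner1(sentences, drug_terms=("levofloxacin", "levaquin")):
    # Step 3 stand-in: lookup of lowercased words in a tiny term list
    return [w for s in sentences for w in re.findall(r"\w+", s.lower()) if w in drug_terms]


def toy_ner2(sentences):
    # Step 4 stand-in: a suffix rule in place of a learned model
    pattern = r"\b\w+(?:floxacin|cycline)\b"
    return [m.lower() for s in sentences for m in re.findall(pattern, s)]


def run_steps(steps, text):
    """Run (operation, input_key, output_key) steps in order over shared data."""
    data = {"full_text": text}
    for operation, input_key, output_key in steps:
        data[output_key] = operation(data[input_key])
    return data


STEPS = [
    (toy_split, "full_text", "sentences"),
    (toy_pii, "sentences", "pii"),
    (toy_ner1, "sentences", "ner1"),
    (toy_ner2, "sentences", "ner2"),
]

text = "I spoke with Dr. Leclerc.\nShe responded to levofloxacin.\nWe considered doxycycline."
result = run_steps(STEPS, text)
print(result["ner1"], result["ner2"])
```

Keeping the outputs of both recognizers under separate keys is what makes a later side-by-side comparison possible.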

Sentence tokenizer#

from medkit.text.segmentation import SentenceTokenizer

# By default, SentenceTokenizer will use a list of punctuation chars to detect sentences.
sentence_tokenizer = SentenceTokenizer(
    # Label of the segments created and returned by the operation
    output_label="sentence",
    # Keep the punctuation character inside the sentence segments
    keep_punct=True,
    # Also split on newline chars, not just punctuation characters
    split_on_newlines=True,
)
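To make the effect of `keep_punct` and `split_on_newlines` more concrete, here is a small standard-library sketch of punctuation-based splitting. It is a simplified approximation written for this page, not medkit's implementation; the function name, the default punctuation set and the returned tuple layout are assumptions.

```python
import re


def split_sentences(text, punct_chars=".;?!", keep_punct=True, split_on_newlines=True):
    """Return (start, end, sentence) tuples; offsets index into the original text."""
    # A boundary is a run of punctuation chars, and optionally a run of newlines
    boundary = "[" + re.escape(punct_chars) + "]+"
    if split_on_newlines:
        boundary += r"|[\r\n]+"
    spans = []
    start = 0
    for match in re.finditer(boundary, text):
        is_punct = match.group()[0] in punct_chars
        # Keep the punctuation inside the sentence only when requested
        end = match.end() if (keep_punct and is_punct) else match.start()
        _add_span(spans, text, start, end)
        start = match.end()
    _add_span(spans, text, start, len(text))
    return spans


def _add_span(spans, text, start, end):
    # Trim surrounding whitespace while keeping character offsets consistent
    chunk = text[start:end]
    stripped = chunk.strip()
    if stripped:
        offset = start + chunk.index(stripped)
        spans.append((offset, offset + len(stripped), stripped))


sample = "Activity is as tolerated. Diet is as tolerated.\nDISPOSITION\nDone"
for start, end, sentence in split_sentences(sample):
    print(start, end, repr(sentence))
```

A purely punctuation-based splitter like this one also cuts after abbreviations such as "Dr.", which is worth keeping in mind when reading the sentence segments of clinical notes.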

PII detector#

from medkit.text.deid import PIIDetector

pii_detector = PIIDetector(name="deid")
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.

Dictionary-based drug recognizer#

import shutil
from medkit.text.ner import UMLSMatcher

umls_data_dir = data_dir / "UMLS" / "2023AB" / "META"
umls_cache_dir = Path.cwd() / ".umls_cache"
shutil.rmtree(umls_cache_dir, ignore_errors=True)

umls_matcher = UMLSMatcher(
    # Directory containing the UMLS files with terms and concepts
    umls_dir=umls_data_dir,
    # Language to use (English)
    language="ENG",
    # Where to store the temp term database of the matcher
    cache_dir=umls_cache_dir,
    # Semantic groups to consider
    semgroups=["CHEM"],
    # Don't be case-sensitive
    lowercase=True,
    # Convert special chars to ASCII before matching
    normalize_unicode=True,
    name="NER1"
)
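The general idea behind dictionary-based matching with the `lowercase` and `normalize_unicode` options can be sketched with the standard library alone. This is only an illustration under simplifying assumptions: the term list and its identifiers (`ID-1`, ...) are invented placeholders rather than UMLS content, and the lookup strategy (longest exact match over word tokens) is not necessarily what UMLSMatcher does internally.

```python
import re
import unicodedata


def normalize(text, lowercase=True, normalize_unicode=True):
    if normalize_unicode:
        # Decompose accented characters, then drop the combining marks
        text = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in text if not unicodedata.combining(c))
    if lowercase:
        text = text.lower()
    return text


def find_terms(text, terms, max_words=3):
    """Return (start, end, surface_form, term_id) for the longest matches found."""
    index = {normalize(term): term_id for term, term_id in terms.items()}
    # Offsets come from the original text; only the token copies are normalized
    tokens = [(m.start(), m.end(), normalize(m.group())) for m in re.finditer(r"\w+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        matched_len = 0
        # Try the longest candidate first so multi-word terms win
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tok for _, _, tok in tokens[i:i + n])
            if candidate in index:
                start, end = tokens[i][0], tokens[i + n - 1][1]
                matches.append((start, end, text[start:end], index[candidate]))
                matched_len = n
                break
        i += matched_len or 1
    return matches


# Placeholder dictionary, not real UMLS identifiers
terms = {"levofloxacin": "ID-1", "doxycycline": "ID-2", "folic acid": "ID-3"}
note = "Responding to Levofloxacin; consider adding doxycycline."
print(find_terms(note, terms))
```

Normalizing token by token, rather than the whole text at once, keeps the reported offsets valid even when normalization changes the length of a string.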

Transformer-based drug recognizer#

from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

# an alternate model: "Clinical-AI-Apollo/Medical-NER"
bert_matcher = HFEntityMatcher(
    model="samrawal/bert-large-uncased_med-ner", name="NER2"
)

Pipeline assembly#

from medkit.core import DocPipeline, Pipeline, PipelineStep

pipeline = Pipeline(
    steps=[
        PipelineStep(sentence_tokenizer, input_keys=["full_text"], output_keys=["sentence"]),
        PipelineStep(pii_detector, input_keys=["sentence"], output_keys=["sentence_"]),
        PipelineStep(umls_matcher, input_keys=["sentence_"], output_keys=["ner1_drug"]),
        PipelineStep(bert_matcher, input_keys=["sentence_"], output_keys=["ner2_drug"]),
    ],
    input_keys=["full_text"],
    output_keys=["sentence_", "ner1_drug", "ner2_drug"],
)

doc_pipeline = DocPipeline(pipeline=pipeline)
doc_pipeline.run(docs)
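The way `input_keys` and `output_keys` connect the steps can be pictured with a toy runner. This is only a sketch of the idea, assuming each step reads its inputs from a shared dictionary and writes its result back under its output key; `run_steps` and the three toy functions are hypothetical and are not part of medkit.

```python
import re


def run_steps(steps, inputs):
    """Run (function, input_keys, output_keys) steps over a shared key/value store."""
    data = dict(inputs)
    for func, input_keys, output_keys in steps:
        args = [data[key] for key in input_keys]
        result = func(*args)
        # Each toy step produces a single output, stored under its first output key
        data[output_keys[0]] = result
    return data


def split_sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]


def mask_digits(sentences):
    return [re.sub(r"\d", "X", s) for s in sentences]


def find_drug(sentences):
    return [s for s in sentences if "levofloxacin" in s.lower()]


steps = [
    (split_sentences, ["full_text"], ["sentence"]),
    (mask_digits, ["sentence"], ["sentence_"]),
    (find_drug, ["sentence_"], ["ner_drug"]),
]
data = run_steps(steps, {"full_text": "Seen on 24 July. Responding to levofloxacin."})
print(data["sentence_"])
print(data["ner_drug"])
```

In the pipeline above, both matchers read the same `sentence_` key, so they process identical input and their outputs can be compared fairly.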

Performance evaluation#

from medkit.io.brat import BratInputConverter

# Load text with annotations in medkit (our ground truth)
brat_converter = BratInputConverter()
ref_docs = brat_converter.load(doc_dir)

# Display selected drug annotations
for ann in ref_docs[0].anns.get(label="Drug"):
    print(f"{ann.text} in {ann.spans}")
## Compute some stats
print(f"Number of documents: {len(docs)}")
    
for i, doc in enumerate(docs):
    print(f"Document {doc.uid}:")

    # On annotations made by NER1 and NER2
    sentence_nb = len(doc.anns.get(label="sentence"))
    print(f"\t{sentence_nb} sentences,")
    ner1_drug_nb = len(doc.anns.get(label="chemical"))
    print(f"\t{ner1_drug_nb} drugs found with NER1,")  
    ner2_drug_nb = len(doc.anns.get(label="m"))
    print(f"\t{ner2_drug_nb} drugs found with NER2,")

    # On the manual annotation (our ground truth)
    gt_nb = len(ref_docs[i].anns.get(label="Drug"))
    print(f"\t{gt_nb} drugs manually annotated.")
## Evaluate performance metrics of the NER1 and NER2 tools
from medkit.text.metrics.ner import SeqEvalEvaluator
import pandas as pd

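def results_to_df(_results, _title):
    # Relies on the order of the metrics in `_results`: the item at index 4 is
    # read as the accuracy, and per-label metrics follow in groups of four
    results_list = list(_results.items())
    arranged_results = {"Entities": ['P', 'R', 'F1']}
    accuracy = round(results_list[4][1], 2)

    for i in range(5, len(results_list), 4):
        # Drop the 10-character metric suffix of the key to recover the label name
        key = results_list[i][0][:-10]
        # Keep the first three values of the group (P, R, F1) and skip the fourth
        arranged_results[key] = [round(results_list[n][1], 2) for n in [i, i + 1, i + 2]]

    df = pd.DataFrame(arranged_results, index=[f"{_title} (acc={accuracy})", '', '']).T
    return df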

predicted_entities1=[]
predicted_entities2=[]
dfs = []

for doc in docs:
    predicted_entities1.append(doc.anns.get(label="chemical"))
    predicted_entities2.append(doc.anns.get(label="m"))

# Annotations of NER1 are labelled 'chemical' and those of NER2 'm', but the ground truth uses 'Drug'
# The following dict remaps these labels so that entities of the same type are compared
remapping = {"chemical": "Drug", "m": "Drug"}
evaluator = SeqEvalEvaluator(return_metrics_by_label=True, average='weighted', labels_remapping=remapping)
# eval of NER1
results1 = evaluator.compute(ref_docs, predicted_entities1)
dfs.append(results_to_df(_results=results1, _title="NER1"))
# eval of NER2
results2 = evaluator.compute(ref_docs, predicted_entities2)
dfs.append(results_to_df(_results=results2, _title="NER2"))

print(pd.concat(dfs, axis=1))
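For intuition about the P, R and F1 columns, the sketch below computes entity-level precision, recall and F1 with exact span matching after label remapping. It is a simplified stand-in, not the computation performed by `SeqEvalEvaluator`; the function `prf` and the toy spans are hypothetical.

```python
def prf(reference, predicted, remapping=None):
    """Precision, recall and F1 over (label, start, end) entities, exact match only."""
    remapping = remapping or {}
    # Map tool-specific labels (e.g. 'chemical', 'm') onto the reference label
    ref = {(remapping.get(label, label), start, end) for label, start, end in reference}
    pred = {(remapping.get(label, label), start, end) for label, start, end in predicted}
    true_positives = len(ref & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: two reference drugs, one of which is found by the tool
reference = [("Drug", 10, 22), ("Drug", 40, 51)]
predicted = [("chemical", 10, 22)]
p, r, f1 = prf(reference, predicted, remapping={"chemical": "Drug", "m": "Drug"})
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Without the remapping, the single prediction would carry the label `chemical` and would not match any `Drug` reference, which is why the evaluator is given `labels_remapping`.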
## Read new unannotated documents
## Write annotations of tool NER2 in the brat format
from medkit.io.brat import BratOutputConverter

in_path = data_dir / "mtsamplesen" / "unannotated_doc"
# reload raw documents
final_docs = TextDocument.from_dir(
    path=Path(in_path),
    pattern='[A-Z0-9].txt',
    encoding='utf-8',
)
# simplified pipeline, with only the best NER tool (NER2)
pipeline2 = Pipeline(
    steps=[
        PipelineStep(
            sentence_tokenizer,
            input_keys=["full_text"],
            output_keys=["sentence"],
        ),
        PipelineStep(
            pii_detector,
            input_keys=["sentence"],
            output_keys=["sentence_"],
        ),
        PipelineStep(
            bert_matcher,
            input_keys=["sentence_"],
            output_keys=["ner2_drug"],
        ),
    ],
    input_keys=["full_text"],
    output_keys=["ner2_drug"],
)

doc_pipeline2 = DocPipeline(pipeline=pipeline2)
doc_pipeline2.run(final_docs)

# filter annotations to keep only drug annotations
# sensitive information can also be removed here
output_docs = [
    TextDocument(text=doc.text, anns=doc.anns.get(label="m"))
    for doc in final_docs
]

# Define Output Converter with default params,
# transfer all annotations and attributes
brat_output_converter = BratOutputConverter()
out_path = data_dir / "mtsamplesen" / "ner2_out"

# save the annotations produced by the best tool (judged on F1 only) in `out_path`
brat_output_converter.save(
    output_docs, 
    dir_path=out_path,
    doc_names=["ner2_6", "ner2_7"],
)
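The brat standoff format stores entities as text-bound annotations in an `.ann` file next to each `.txt` file, one tab-separated line per entity. The sketch below builds such lines by hand to show what the saved files are expected to contain; `to_brat_ann` is a hypothetical helper, not the code used by `BratOutputConverter`.

```python
def to_brat_ann(entities):
    """Format (label, start, end, text) tuples as brat text-bound annotation lines."""
    lines = []
    for index, (label, start, end, text) in enumerate(entities, start=1):
        # ID, then "label start end", then the covered text, separated by tabs
        lines.append(f"T{index}\t{label} {start} {end}\t{text}")
    return "\n".join(lines)


snippet = "responding well to levofloxacin"
start = snippet.index("levofloxacin")
end = start + len("levofloxacin")
# 'm' is the label carried by the NER2 annotations kept in `output_docs`
ann_content = to_brat_ann([("m", start, end, snippet[start:end])])
print(ann_content)
```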

Annotations of the discharge summary 6.txt, displayed with Brat

Annotations of the discharge summary 7.txt (partial view), displayed with Brat