Creating a custom text operation#

If you want to create a custom text operation from a simple user-defined function, the following examples show how to do it.

Note

For more details about public APIs, refer to create_text_operation().
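Before diving into the full examples, here is a minimal sketch of the pattern they all follow: a plain Python function is turned into a medkit operation by passing it to create_text_operation() together with its CustomTextOpType. The predicate below is made up purely for illustration.

from medkit.core.text import CustomTextOpType, create_text_operation

# Hypothetical predicate, for illustration only: keep segments shorter than 100 characters
def keep_short_segments(segment):
    return len(segment.text) < 100

short_segment_filter = create_text_operation(
    function=keep_short_segments,
    function_type=CustomTextOpType.FILTER,
)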

Filtering annotations#

In this example, Jane wants to detect some entities (problems) in a raw text.

1. Create medkit document#

from medkit.core.text import TextDocument

text = "The patient has asthma and is using ventoline. The patient has diabetes"
doc = TextDocument(text=text)

2. Init medkit operations#

Jane would like to reuse a colleague’s file containing a list of regular expression rules for detecting entities. To do so, she first has to split the text into sentences before using the RegexpMatcher component.

from medkit.text.segmentation import SentenceTokenizer

sentence_tokenizer = SentenceTokenizer()

In a real scenario, Jane would load the rules from a file path with this instruction:

regexp_rules = RegexpMatcher.load_rules(path_to_rules_file)

For this example, however, it is simpler to define the set of rules manually.

from medkit.text.ner import RegexpMatcher, RegexpMatcherRule

regexp_rules = [
    RegexpMatcherRule(regexp=r"\basthma\b", label="problem"),
    RegexpMatcherRule(regexp=r"\bventoline\b", label="treatment"),
    RegexpMatcherRule(regexp=r"\bdiabetes\b", label="problem"),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules)

3. Define filter operation#

As RegexpMatcher is configured from her colleague’s file, Jane would like to add a filter operation so that only entities labeled as problems are returned.

To do that, she defines her own filter function and uses medkit tools to instantiate the custom operation.

from medkit.core.text import Entity

def keep_entities_with_label_problem(entity: Entity) -> bool:
    return entity.label == "problem"

from medkit.core.text import CustomTextOpType, create_text_operation

filter_operation = create_text_operation(
    function=keep_entities_with_label_problem,
    function_type=CustomTextOpType.FILTER,
)

# Same behavior as
# filter_operation = create_text_operation(
#   name="keep_entities_with_label_problem",
#   function=keep_entities_with_label_problem,
#   function_type=CustomTextOpType.FILTER)

4. Construct and run the pipeline#

from medkit.core import Pipeline, PipelineStep

steps = [
    PipelineStep(input_keys=["raw_text"], output_keys=["sentences"], operation=sentence_tokenizer),
    PipelineStep(input_keys=["sentences"], output_keys=["entities"], operation=regexp_matcher),
    PipelineStep(input_keys=["entities"], output_keys=["problems"], operation=filter_operation)
]

pipeline = Pipeline(
    steps=steps,
    input_keys=["raw_text"],
    output_keys=["problems"],
)

entities = pipeline.run([doc.raw_segment])

for entity in entities:
    print(entity)

In this scenario, 2 entities with the problem label are returned.
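As a quick sanity check (illustrative only, assuming the example text and rules above), we can verify that the filter kept nothing but problem entities:

# Both remaining entities ("asthma" and "diabetes") carry the "problem" label
assert len(entities) == 2
assert all(entity.label == "problem" for entity in entities)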

To compare with the intermediate results generated by RegexpMatcher, we can output the entities intermediate key instead. There are 3 results.

IMPORTANT: the following code is only for demonstration purposes; all pipeline steps are still executed, we simply select which key the pipeline outputs.

pipeline = Pipeline(
    steps=steps,
    input_keys=["raw_text"],
    output_keys=["entities"]
)

entities = pipeline.run([doc.raw_segment])

for entity in entities:
    print(entity)
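For comparison (again assuming the example text above), the unfiltered output also contains the treatment entity that the filter removed:

# The unfiltered entities include "ventoline", labeled as a treatment
assert len(entities) == 3
assert sorted(entity.label for entity in entities) == ["problem", "problem", "treatment"]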

Creating new annotations#

In this example, Jane wants to pre-process the text before detecting entities.

1. Create medkit document#

from medkit.core.text import TextDocument

text = """IRM : Lésion de la CPMI périphérique,
aspect distendu du LCA, kyste poplité."""

doc = TextDocument(text=text)

2. Define custom function#

Jane wants to use a dictionary to convert all abbreviations into their long form. To do so, she can define a custom function and use medkit span_utils to preserve spans during text modifications.

import re
from typing import Dict

from medkit.core.text import Segment, span_utils


# Dictionary mapping each abbreviation to its long form
abbrv_mapping = {
    "IRM": "Imagerie par Résonance Magnétique",
    "CPMI": "Corne Postérieure du Ménisque Interne",
    "LCA": "Ligament Croisé Antérieur",
}

# Defining the custom function
def translate_abbreviations(segment: Segment, abbrv_mapping: Dict[str, str]) -> Segment:
    ranges = []
    replacement_texts = []

    # Build a regexp matching any of the abbreviations
    regexp = "|".join(re.escape(abbrv) for abbrv in abbrv_mapping)

    # Detect abbreviations
    for mo in re.finditer(regexp, segment.text):
        ranges.append((mo.start(), mo.end()))
        replacement_texts.append(abbrv_mapping[mo.group()])

    # Replace abbreviations by their long form (while preserving spans)
    text, spans = span_utils.replace(
        text=segment.text,
        spans=segment.spans,
        ranges=ranges,
        replacement_texts=replacement_texts,
    )

    return Segment(label="long_text", text=text, spans=spans)


from medkit.core.text import CustomTextOpType, create_text_operation

# Create the medkit operation from our custom function
preprocessing_operation = create_text_operation(
    function=translate_abbreviations,
    function_type=CustomTextOpType.CREATE_ONE_TO_N,
    name="translate_abbreviations",
    args={"abbrv_mapping":abbrv_mapping}
)

3. Run the operation#

After executing the operation on the document raw text, we can observe that the output segment is composed of:

  • a text in which abbreviations are replaced by their long form,

  • spans that are a mix of modified spans (for the replaced parts of the text) and original spans (for the untouched text).

segments = preprocessing_operation.run([doc.raw_segment])

for segment in segments:
    print(f"Text: {segment.text}\n")
    print(f"Spans:")
    for span in segment.spans:
        print(f"- {span}")

Extracting annotations#

In this example, Jane wants to count the UMLS CUIs detected on a set of documents.

1. Loading text documents#

In this example, we use translated mtsamples documents. For more information, you may refer to medkit.tools.mtsamples.

from medkit.tools.mtsamples import load_mtsamples

docs = load_mtsamples(nb_max=10)

print(docs[0].text)

2. Init our operations#

Let’s initialize the same operations as above (i.e., a sentence tokenizer, then a regexp matcher with its default rules), without the filter operation.

from medkit.text.segmentation import SentenceTokenizer
from medkit.text.ner import RegexpMatcher

sentence_tokenizer = SentenceTokenizer()
regexp_matcher = RegexpMatcher()

3. Defining an extraction function#

The extraction function is defined with a label parameter for filtering entities. Our custom operation retrieves only the attributes of entities carrying the disorder label.

from typing import List

from medkit.core.text import Entity, UMLSNormAttribute

# Defining a custom function extracting UMLS normalization attributes from an entity
def extract_umls_attributes_from_entity(entity: Entity, label: str) -> List[UMLSNormAttribute]:
    return [
        attr
        for attr in entity.attrs.get_norms()
        if entity.label == label and isinstance(attr, UMLSNormAttribute)
    ]


from medkit.core.text import CustomTextOpType, create_text_operation

attr_extraction_operation = create_text_operation(
    function=extract_umls_attributes_from_entity,
    function_type=CustomTextOpType.EXTRACT_ONE_TO_N,
    args={"label": "disorder"},
)
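Note that the resulting operation can also be run directly, outside a pipeline, following the same .run() pattern used in the previous example. A sketch under that assumption:

# Illustrative direct calls on the first document (same chain as the pipeline below)
sentences = sentence_tokenizer.run([docs[0].raw_segment])
entities = regexp_matcher.run(sentences)
umls_attrs = attr_extraction_operation.run(entities)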

4. Defining and running our pipeline#

When running the pipeline on the set of documents, the output is a list of UMLS normalization attributes.

from medkit.core import Pipeline, PipelineStep

steps = [
    PipelineStep(input_keys=["raw_text"], output_keys=["sentences"], operation=sentence_tokenizer),
    PipelineStep(input_keys=["sentences"], output_keys=["entities"], operation=regexp_matcher),
    PipelineStep(input_keys=["entities"], output_keys=["umls_attributes"], operation=attr_extraction_operation),
]

pipeline = Pipeline(
    steps=steps,
    input_keys=["raw_text"],
    output_keys=["umls_attributes"],
)
attrs = pipeline.run([doc.raw_segment for doc in docs])
attrs[:5]

5. Analyzing data#

Now, Jane can analyze the number of CUIs detected in her set of documents.

import pandas as pd

df = pd.DataFrame.from_records(
    [attr.to_dict() for attr in attrs], columns=["cui", "umls_version"]
)
print(df)
df.value_counts(subset="cui")