Using EDS-NLP with medkit
EDS-NLP provides a set of spaCy components that are used to extract information from clinical notes written in French. Because medkit is spaCy-compatible, using EDS-NLP within medkit is supported, as we will see.
To follow this tutorial, you will need to install medkit's spaCy support and EDS-NLP with:
pip install 'medkit-lib[edsnlp]'
Running an EDS-NLP spaCy pipeline on entire documents
We will need a sample text document to annotate:
from medkit.core.text import TextDocument
text = """COMPTE RENDU D'HOSPITALISATION
Monsieur Jean Dupont a été hospitalisé du 11/08/2019 au 17/08/2019 pour attaque d'asthme
ANTÉCÉDENTS
Peut-être atteint de Covid19 en aout 2020"""
doc = TextDocument(text)
and a spaCy pipeline with a few EDS-NLP components:
import spacy
nlp = spacy.blank("eds")
# General-purpose components
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")
# Entity extraction
nlp.add_pipe("eds.covid")
nlp.add_pipe("eds.dates")
# Context detection
nlp.add_pipe("eds.negation")
nlp.add_pipe("eds.hypothesis")
The eds.normalizer and eds.sentences components do some pre-processing, eds.covid and eds.dates perform entity matching and create spaCy entities and spans, and eds.negation and eds.hypothesis attach context attributes to these entities and spans.
To be used within medkit, the pipeline could be wrapped in a generic SpacyDocPipeline operation. But medkit also provides a dedicated EDSNLPDocPipeline operation, with additional support for specific EDS-NLP components:
from medkit.text.spacy.edsnlp import EDSNLPDocPipeline
eds_nlp_pipeline = EDSNLPDocPipeline(nlp)
The operation is executed by calling its run() method on a list of documents:
eds_nlp_pipeline.run([doc])
Let’s look at the entities and segments that were found:
for entity in doc.anns.entities:
    print(f"{entity.label}: {entity.text!r}")
for segment in doc.anns.segments:
    print(f"{segment.label}: {segment.text!r}")
Here are the attributes attached to the "covid" entity:
entity = doc.anns.get_entities(label="covid")[0]
for attr in entity.attrs:
    print(f"{attr.label}={attr.value}")
and the attributes of the first "dates" segment:
date_seg = doc.anns.get_segments(label="dates")[0]
for attr in date_seg.attrs:
    print(f"{attr.label}={attr.value}")
Let’s now take a closer look at the "date" attribute:
date_seg = doc.anns.get_segments(label="dates")[0]
date_attr = date_seg.attrs.get(label="date")[0]
date_attr
This attribute is an instance of DateAttribute, a subclass of Attribute. It has year, month, day (etc.) fields containing the different parts of the date that was detected, as well as a normalized string representation in its value field:
date_attr.value
One of the benefits of using EDSNLPDocPipeline instead of SpacyDocPipeline is that some special EDS-NLP attributes are automatically converted to a corresponding Attribute subclass.
Here are the supported EDS-NLP attribute values and the corresponding medkit classes:
AdicapCode (created by eds.adicap): medkit.text.ner.ADICAPNormAttribute
TNM (created by eds.tnm): medkit.text.ner.tnm_attribute.TNMAttribute
AbsoluteDate (created by eds.dates): medkit.text.ner.DateAttribute
RelativeDate (created by eds.dates): medkit.text.ner.RelativeDateAttribute
Duration (created by eds.dates): medkit.text.ner.DurationAttribute
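For quick reference, the correspondence listed above can be written down as a small plain-Python lookup table. This is only a transcription of the list; the names EDSNLP_TO_MEDKIT and medkit_classes_for are hypothetical helpers introduced for illustration and are not part of medkit or EDS-NLP.

```python
# Lookup table transcribed from the list above:
# EDS-NLP value type -> (EDS-NLP component creating it, medkit class path).
# EDSNLP_TO_MEDKIT is a hypothetical name, not a medkit object.
EDSNLP_TO_MEDKIT = {
    "AdicapCode": ("eds.adicap", "medkit.text.ner.ADICAPNormAttribute"),
    "TNM": ("eds.tnm", "medkit.text.ner.tnm_attribute.TNMAttribute"),
    "AbsoluteDate": ("eds.dates", "medkit.text.ner.DateAttribute"),
    "RelativeDate": ("eds.dates", "medkit.text.ner.RelativeDateAttribute"),
    "Duration": ("eds.dates", "medkit.text.ner.DurationAttribute"),
}


def medkit_classes_for(component):
    """Return the sorted medkit class paths listed for one EDS-NLP component."""
    return sorted(
        medkit_cls
        for created_by, medkit_cls in EDSNLP_TO_MEDKIT.values()
        if created_by == component
    )


# eds.dates can yield three different medkit attribute classes.
for cls_path in medkit_classes_for("eds.dates"):
    print(cls_path)
```

The table makes it visible that a single component such as eds.dates can produce several value types, each converted to its own medkit class.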
Note
The transformations performed by EDSNLPDocPipeline can be overridden or extended with the medkit_attribute_factories init parameter. For a list of all the default transformations, see DEFAULT_ATTRIBUTE_FACTORIES and the corresponding functions in medkit.text.spacy.edsnlp.
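To make "overridden or extended" more concrete, here is a minimal plain-Python sketch of the general pattern: a mapping from labels to factory functions, in which user-supplied entries replace or add to the defaults. It does not call medkit. All names (DEFAULT_FACTORIES, resolve_factories and the factory functions) are hypothetical stand-ins, and the merge behaviour shown is an assumption about how user factories combine with the defaults; the API doc of medkit.text.spacy.edsnlp remains the reference for the actual factory signature and semantics.

```python
# Stand-in factories: each converts a raw value into an attribute-like dict.
# These are hypothetical illustrations, not medkit functions.
def default_date_factory(raw):
    return {"kind": "date", "value": raw}


def default_tnm_factory(raw):
    return {"kind": "tnm", "value": raw}


def custom_date_factory(raw):
    # Overrides the default conversion for the "date" label.
    return {"kind": "date", "value": raw, "source": "custom"}


def custom_score_factory(raw):
    # Extends the defaults with a label they do not cover.
    return {"kind": "score", "value": float(raw)}


DEFAULT_FACTORIES = {"date": default_date_factory, "tnm": default_tnm_factory}


def resolve_factories(user_factories=None):
    """Merge user factories over the defaults; user entries take precedence."""
    merged = dict(DEFAULT_FACTORIES)
    merged.update(user_factories or {})
    return merged


factories = resolve_factories(
    {"date": custom_date_factory, "score": custom_score_factory}
)
print(sorted(factories))
print(factories["date"]("2019-08-11"))
```

In this model, passing a factory under an existing label overrides the default conversion, while passing one under a new label extends the set of handled attributes.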
Running an EDS-NLP spaCy pipeline at the annotation level
So far, we have wrapped a spaCy pipeline and executed it on an entire document with EDSNLPDocPipeline. But it is also possible to run the spaCy pipeline on text annotations instead of a document with EDSNLPPipeline. To illustrate this, let’s create a medkit pipeline using pure medkit operations for sentence tokenization and regexp-based entity matching, and EDS-NLP spaCy components for covid entity matching:
from medkit.core import Pipeline, PipelineStep
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.text.segmentation import SentenceTokenizer
from medkit.text.spacy.edsnlp import EDSNLPPipeline
sentence_tokenizer = SentenceTokenizer()
matcher = RegexpMatcher(rules=[RegexpMatcherRule(regexp=r"\basthme\b", label="asthme")])
nlp = spacy.blank("eds")
nlp.add_pipe("eds.covid")
eds_nlp_pipeline = EDSNLPPipeline(nlp)
pipeline = Pipeline(
    steps=[
        PipelineStep(operation=sentence_tokenizer, input_keys=["full_text"], output_keys=["sentences"]),
        PipelineStep(operation=matcher, input_keys=["sentences"], output_keys=["entities"]),
        PipelineStep(operation=eds_nlp_pipeline, input_keys=["sentences"], output_keys=["entities"]),
    ],
    input_keys=["full_text"],
    output_keys=["entities"],
)
doc = TextDocument(text)
entities = pipeline.run([doc.raw_segment])
for entity in entities:
    print(f"{entity.label}: {entity.text!r}")
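In the pipeline above, both the regexp matcher and the EDS-NLP step read from the "sentences" key and write to the "entities" key. The following toy model, written in plain Python, sketches how such key-based routing can work. It is a simplified illustration under stated assumptions, not medkit's implementation: run_steps and the three stand-in operations are hypothetical, each operation has a single output, and outputs written to the same key are assumed to be concatenated.

```python
from collections import defaultdict


def run_steps(steps, inputs):
    """Toy model: run (operation, input_keys, output_key) steps in order."""
    data = defaultdict(list)
    for key, items in inputs.items():
        data[key].extend(items)
    for operation, input_keys, output_key in steps:
        args = [data[key] for key in input_keys]
        # Outputs sharing a key are accumulated rather than replaced.
        data[output_key].extend(operation(*args))
    return data


# Stand-in operations (hypothetical, much cruder than the real ones).
def split_sentences(texts):
    return [s.strip() for t in texts for s in t.split("\n") if s.strip()]


def match_asthme(sentences):
    return [("asthme", s) for s in sentences if "asthme" in s]


def match_covid(sentences):
    return [("covid", s) for s in sentences if "covid" in s.lower()]


steps = [
    (split_sentences, ["full_text"], "sentences"),
    (match_asthme, ["sentences"], "entities"),
    (match_covid, ["sentences"], "entities"),
]
result = run_steps(steps, {"full_text": ["crise d'asthme\nCovid19 en 2020"]})
for label, sentence in result["entities"]:
    print(f"{label}: {sentence!r}")
```

Under these assumptions, the "entities" key ends up holding the matches from both steps, which is why a single output key is enough to collect the results of the two matchers.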
For more information about advanced usage of EDSNLPDocPipeline and EDSNLPPipeline, you may refer to the API doc of medkit.text.spacy.edsnlp.