Conversions to and from spaCy
medkit can load spaCy documents with entities, attributes (custom extensions) and groups of spans, and can easily convert documents back to spaCy.
In this example, we will show how to import spaCy documents into medkit and how to convert medkit documents into spaCy documents.
We use some spaCy concepts; more information can be found in the official spaCy documentation.
Note
For this example, you should download the French spaCy model.
You can download it using:
import spacy.cli
spacy.cli.download("fr_core_news_sm")
Consider the following spaCy document:
import spacy
from spacy.tokens import Span as SpacySpan
# Load French tokenizer, tagger, parser and NER
nlp = spacy.load("fr_core_news_sm")
# Create a spacy document
text = """Parcours patient:
Marie habite à Brest. Elle a été transférée."""
spacy_doc = nlp(text)
# Spacy adds entities, here we add a span 'SECTION' as an example
spacy_doc.spans["SECTION"] = [SpacySpan(spacy_doc, 0, 2, "header")]
# Adding a custom attribute
# We need to define the extension before setting its value on an entity.
# Let's define an attribute called 'country'
if not SpacySpan.has_extension("country"):
    SpacySpan.set_extension("country", default=None)
# Now, we can set the country in the 'LOC' entity
for e in spacy_doc.ents:
    if e.label_ == 'LOC':
        e._.set("country", 'France')
Description of the spaCy document:
Entities
from spacy import displacy
displacy.render(spacy_doc, style="ent")
Spans
displacy.render(spacy_doc, style="span", options={"spans_key": "SECTION"})
The spaCy document has 2 entities and 1 span group called SECTION.
The entity ‘LOC’ has 1 attribute called country.
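Outside a notebook, the same description can be obtained as plain data. The following is a minimal sketch, not part of the original tutorial: it assumes a tokenizer-only pipeline (`spacy.blank("fr")`) with entities set by hand, so that no model download is needed, and the helpers `find_span` and `describe` are hypothetical names introduced for illustration.

```python
import spacy
from spacy.tokens import Span

# Assumption: tokenizer-only pipeline, entities are set by hand below
nlp = spacy.blank("fr")
text = "Parcours patient:\nMarie habite à Brest. Elle a été transférée."
doc = nlp(text)


def find_span(doc, substring, label):
    """Hypothetical helper: span covering the first occurrence of `substring`."""
    start = doc.text.index(substring)
    end = start + len(substring)
    return doc.char_span(start, end, label=label, alignment_mode="expand")


doc.ents = [find_span(doc, "Marie", "PER"), find_span(doc, "Brest", "LOC")]
doc.spans["SECTION"] = [find_span(doc, "Parcours patient", "header")]

# The extension must be declared before a value can be set on a span
if not Span.has_extension("country"):
    Span.set_extension("country", default=None)
for ent in doc.ents:
    if ent.label_ == "LOC":
        ent._.set("country", "France")


def describe(doc):
    """Hypothetical helper: collect entities, span groups and the 'country' attribute."""
    return {
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        "span_groups": {
            key: [(span.text, span.label_) for span in group]
            for key, group in doc.spans.items()
        },
        "country": {ent.text: ent._.get("country") for ent in doc.ents},
    }


summary = describe(doc)
print(summary)
```

Each span group is listed under its key, with the label of each span next to its text, and entities that never received a `country` value report the extension default.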
Let’s see how to convert this spaCy doc into a TextDocument with annotations.
Load a spaCy Doc into a list of TextDocuments
The SpacyInputConverter class is in charge of converting spaCy Docs into a list of TextDocuments.
By default, it loads all entities, span groups and extension attributes of each spaCy Doc object, but you can use the entities, span_groups and attrs parameters to specify which items should be converted, based on their labels.
Tip
You can enable provenance tracing by assigning a ProvTracer object to the SpacyInputConverter with the set_prov_tracer method.
Note
Span groups in medkit
In spaCy, spans are grouped under a key, and each span can have its own label.
To remain compatible, medkit uses the key as the span label, and the spaCy label is stored as name in its metadata.
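The mapping described in this note can be pictured with a small library-free sketch. `SegmentSketch` and `span_groups_to_segments` are hypothetical stand-ins written for illustration only; they are not medkit classes or functions, and the real converter does more than this.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentSketch:
    """Hypothetical stand-in for a medkit segment (not the real class)."""

    label: str
    text: str
    metadata: dict = field(default_factory=dict)


def span_groups_to_segments(text, span_groups):
    """Map {group key: [(start_char, end_char, spaCy label), ...]} to segments.

    The group key becomes the segment label, and the spaCy label of each
    span is kept under the "name" key of the segment metadata.
    """
    segments = []
    for key, spans in span_groups.items():
        for start, end, spacy_label in spans:
            segments.append(
                SegmentSketch(
                    label=key,
                    text=text[start:end],
                    metadata={"name": spacy_label},
                )
            )
    return segments


text = "Parcours patient:\nMarie habite à Brest."
span_groups = {"SECTION": [(0, 16, "header")]}
segments = span_groups_to_segments(text, span_groups)
print(segments[0])
```

With this convention, filtering segments by the label SECTION retrieves the whole span group, while the original spaCy label (here, header) remains available in the metadata.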
from medkit.io.spacy import SpacyInputConverter
# Define default Input Converter
spacy_input_converter = SpacyInputConverter()
# Load spacy doc into a list of documents
docs = spacy_input_converter.load([spacy_doc])
medkit_doc = docs[0]
Description of the resulting Text document
print(f"The medkit doc has {len(medkit_doc.anns)} annotations.")
print(f"The medkit doc has {len(medkit_doc.anns.get_entities())} entities.")
print(f"The medkit doc has {len(medkit_doc.anns.get_segments())} segment.")
What about the ‘LOC’ entity?
entity = medkit_doc.anns.get(label="LOC")[0]
attributes = entity.attrs.get(label="country")
print(f"Entity label={entity.label}, Entity text={entity.text}")
print("Attributes loaded from spacy")
print(attributes)
Visualizing Medkit annotations
As explained in other tutorials, we can display medkit entities using displacy, a visualizer developed by spaCy.
You can use the medkit_doc_to_displacy() function to format medkit entities.
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy
# getting entities in displacy format (default config)
entities_data = medkit_doc_to_displacy(medkit_doc)
displacy.render(entities_data, style="ent", manual=True)
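The manual=True flag tells displacy that it receives pre-formatted data rather than a spaCy Doc. As a sketch of what such data looks like, the dict below follows the manual entity format documented by spaCy (text, ents, title). That medkit_doc_to_displacy() returns exactly this shape is an assumption here, and the entity offsets are computed by hand for illustration.

```python
from spacy import displacy

text = "Marie habite à Brest."
start = text.index("Brest")

# displacy's manual entity format: raw text plus character offsets and labels
manual_data = {
    "text": text,
    "ents": [{"start": start, "end": start + len("Brest"), "label": "LOC"}],
    "title": None,
}

# jupyter=False returns the HTML markup as a string instead of displaying it
html = displacy.render(manual_data, style="ent", manual=True, jupyter=False)
print(html[:200])
```

Returning the markup as a string is convenient when the visualization has to be saved to a file or embedded in a report instead of shown in a notebook.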
Convert TextDocuments to a spaCy Doc
Likewise, it is possible to convert a list of TextDocuments to spaCy Docs using SpacyOutputConverter.
You will need to provide an nlp object that tokenizes and generates the document, using the raw text as reference. By default, it converts all medkit annotations and attributes to spaCy, but you can use the anns_labels and attrs parameters to specify which items should be converted.
from medkit.io.spacy import SpacyOutputConverter
# define Output Converter with default params
spacy_output_converter = SpacyOutputConverter(nlp=nlp)
# Convert a list of TextDocument
spacy_docs = spacy_output_converter.convert([medkit_doc])
spacy_doc = spacy_docs[0]
# Explore new spacy doc
print("Text of spacy doc from TextDocument:\n", spacy_doc.text)
Description of the resulting spaCy document
Entities imported from medkit
displacy.render(spacy_doc, style="ent")
Spans imported from medkit
displacy.render(spacy_doc, style="span", options={"spans_key": "SECTION"})
What about the ‘LOC’ entity?
entity = [e for e in spacy_doc.ents if e.label_ == 'LOC'][0]
attribute = entity._.get('country')
print(f"Entity label={entity.label_}, Entity text={entity.text}")
print("Attribute imported from medkit")
print(f"The attr `country` was imported? : {attribute is not None}, value={attribute}")