medkit.text.spacy.spacy_utils

Functions

extract_anns_and_attrs_from_spacy_doc(...)

Given a spacy document, convert selected entities or spans into Segments.

build_spacy_doc_from_medkit_doc(→ spacy.tokens.Doc)

Create a Spacy Doc from a TextDocument.

build_spacy_doc_from_medkit_segment(→ spacy.tokens.Doc)

Create a Spacy Doc from a Segment.

Module Contents

medkit.text.spacy.spacy_utils.extract_anns_and_attrs_from_spacy_doc(spacy_doc: spacy.tokens.Doc, medkit_source_ann: medkit.core.text.Segment | None = None, entities: list[str] | None = None, span_groups: list[str] | None = None, attrs: list[str] | None = None, attribute_factories: dict[str, Callable[[spacy.tokens.Span, str], medkit.core.Attribute]] | None = None, rebuild_medkit_anns_and_attrs: bool = False) → tuple[list[medkit.core.text.Segment], dict[str, list[medkit.core.Attribute]]]

Given a spacy document, convert selected entities or spans into Segments.

Extract attributes for each annotation in the document.

Parameters:
spacy_doc: Doc

A Spacy Doc with spans to be converted

medkit_source_ann: Segment, optional

Segment used to rebuild spans referencing the original text

entities: list of str, optional

Labels of entities to be extracted. If None (default), all new entities will be extracted as annotations.

span_groups: list of str, optional

Names of span groups to be extracted. If None (default), all new spans will be extracted as annotations.

attrs: list of str, optional

Names of custom attributes to extract from the annotations that will be included. If None (default), all custom attributes will be extracted.

attribute_factories: dict of str to Callable, optional

Mapping of factories in charge of converting spacy attributes to medkit attributes. Factories will receive a spacy span and an attribute label when called. The key in the mapping is the attribute label.

rebuild_medkit_anns_and_attrs: bool, default=False

If True, annotations and attributes that carry medkit ids are rebuilt as new annotations/attributes with new ids. If False (default), they are not rebuilt, and only new annotations and attributes are returned.

Returns:
annotations: list of Segment

Segments extracted from the spacy Doc object

attributes_by_ann: dict of str to list of Attribute

Attributes extracted for each annotation. Each key is a medkit uid.

Raises:
ValueError

Raised when the given medkit source and the spacy doc do not have the same medkit uid.
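
The call described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the `is_negated` extension, the `problem` label, the sample sentence, and the `extract_demo` and `negation_factory` helpers are all hypothetical names chosen for the example, and a blank English pipeline stands in for a real one. The factory follows the contract stated in the parameter list: it receives a spacy span and an attribute label and returns a medkit Attribute.

```python
def extract_demo():
    # Imported lazily: spaCy and medkit's spaCy support are optional dependencies.
    import spacy
    from spacy.tokens import Span
    from medkit.core import Attribute
    from medkit.text.spacy.spacy_utils import extract_anns_and_attrs_from_spacy_doc

    # Hypothetical custom attribute; force=True avoids an error if it is already registered.
    Span.set_extension("is_negated", default=None, force=True)

    nlp = spacy.blank("en")  # tokenizer only, no trained model required
    spacy_doc = nlp("The patient has asthma.")

    # Characters 16..22 cover "asthma"; mark that span as an entity.
    ent = spacy_doc.char_span(16, 22, label="problem")
    ent._.is_negated = False
    spacy_doc.ents = [ent]

    def negation_factory(span, label):
        # Called with the spaCy span and the attribute label (hypothetical factory).
        return Attribute(label=label, value=bool(span._.get(label)))

    # No medkit_source_ann is passed, so the default (None) is used.
    return extract_anns_and_attrs_from_spacy_doc(
        spacy_doc=spacy_doc,
        entities=["problem"],
        attrs=["is_negated"],
        attribute_factories={"is_negated": negation_factory},
    )
```

Per the Returns section, `anns, attrs_by_ann = extract_demo()` should give a list of segments and a dict of attribute lists keyed by medkit uid.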

medkit.text.spacy.spacy_utils.build_spacy_doc_from_medkit_doc(nlp: spacy.Language, medkit_doc: medkit.core.text.TextDocument, labels_anns: list[str] | None = None, attrs: list[str] | None = None, include_medkit_info: bool = True) → spacy.tokens.Doc

Create a Spacy Doc from a TextDocument.

Parameters:
nlp:

Language object with the loaded pipeline from Spacy

medkit_doc:

TextDocument to convert

labels_anns:

Labels of annotations to include in the spacy document. If None (default) all the annotations will be included.

attrs:

Labels of attributes to add in the annotations that will be included. If None (default) all the attributes will be added as custom attributes in each annotation included.

include_medkit_info:

If True, medkitID is included as an extension in the Doc object to identify the medkit source annotation. If False, no information about IDs is included.

Returns:
Doc:

A Spacy Doc with the selected annotations included.
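
A minimal sketch of this conversion, assuming a blank English pipeline and a single hand-made entity. The `medkit_doc_to_spacy` helper and its arguments are hypothetical and only serve the illustration.

```python
def medkit_doc_to_spacy(text, entity_label, start, end):
    # Imported lazily: spaCy and medkit's spaCy support are optional dependencies.
    import spacy
    from medkit.core.text import Entity, Span, TextDocument
    from medkit.text.spacy.spacy_utils import build_spacy_doc_from_medkit_doc

    # Build a medkit document holding one entity over text[start:end].
    medkit_doc = TextDocument(text=text)
    entity = Entity(label=entity_label, text=text[start:end], spans=[Span(start, end)])
    medkit_doc.anns.add(entity)

    nlp = spacy.blank("en")  # tokenizer only, no trained model required
    # Only annotations whose label is listed in labels_anns are included.
    return build_spacy_doc_from_medkit_doc(
        nlp=nlp,
        medkit_doc=medkit_doc,
        labels_anns=[entity_label],
        include_medkit_info=True,
    )
```

For instance, `medkit_doc_to_spacy("The patient has asthma.", "problem", 16, 22)` returns a spacy Doc whose text is the original document text.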

medkit.text.spacy.spacy_utils.build_spacy_doc_from_medkit_segment(nlp: spacy.Language, segment: medkit.core.text.Segment, annotations: list[medkit.core.text.Segment] | None = None, attrs: list[str] | None = None, include_medkit_info: bool = True) → spacy.tokens.Doc

Create a Spacy Doc from a Segment.

Parameters:
nlp:

Language object with the loaded pipeline from Spacy

segment:

Segment to convert. This annotation contains the text used to create the spacy doc.

annotations:

List of annotations in segment to include

attrs:

Labels of attributes to add in the annotations that will be included. If None (default) all the attributes will be added as custom attributes in each annotation included.

include_medkit_info:

If True, medkitID is included as an extension in the Doc object to identify the medkit source annotation. If False, no information about IDs is included.

Returns:
Doc:

A Spacy Doc with the selected annotations included.
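
A minimal sketch of the segment variant, assuming a blank English pipeline and a segment that spans the whole text. The `medkit_segment_to_spacy` helper and the `sentence` label are hypothetical. An empty `annotations` list is passed so that only the segment text is converted, and `include_medkit_info=False` leaves ID information out of the Doc.

```python
def medkit_segment_to_spacy(text):
    # Imported lazily: spaCy and medkit's spaCy support are optional dependencies.
    import spacy
    from medkit.core.text import Segment, Span
    from medkit.text.spacy.spacy_utils import build_spacy_doc_from_medkit_segment

    # A segment covering the full text, starting at offset 0.
    segment = Segment(label="sentence", text=text, spans=[Span(0, len(text))])

    nlp = spacy.blank("en")  # tokenizer only, no trained model required
    return build_spacy_doc_from_medkit_segment(
        nlp=nlp,
        segment=segment,
        annotations=[],             # no annotations to carry over
        include_medkit_info=False,  # do not attach medkit IDs
    )
```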