medkit.core.text

Classes

Entity

Text entity referencing part of a TextDocument.

Relation

Relation between two text entities.

Segment

Text segment referencing part of a TextDocument.

TextAnnotation

Base abstract class for all text annotations.

TextAnnotationContainer

Manage a list of text annotations belonging to a text document.

TextDocument

Document holding text annotations.

EntityAttributeContainer

Manage a list of attributes attached to a text entity.

EntityNormAttribute

Normalization attribute linking an entity to an ID in a knowledge base.

ContextOperation

Abstract operation for context detection.

CustomTextOpType

Supported function types for creating custom text operations.

NEROperation

Abstract operation for detecting entities.

SegmentationOperation

Abstract operation for segmenting text.

AnySpan

Helper class that provides a standard way to create an ABC using inheritance.

ModifiedSpan

Slice of text not present in the original text.

Span

Slice of text extracted from the original text.

UMLSNormAttribute

Normalization attribute linking an entity to a CUI in the UMLS knowledge base.

Functions

create_text_operation(...) → _CustomTextOperation

Instantiate a custom text operation from a user-defined function.

Package Contents

class medkit.core.text.Entity(label: str, text: str, spans: list[medkit.core.text.span.AnySpan], attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, store: medkit.core.store.Store | None = None, attr_container_class: type[medkit.core.text.entity_attribute_container.EntityAttributeContainer] = EntityAttributeContainer)

Bases: Segment

Text entity referencing part of a TextDocument.

Attributes:
uid : str

The entity identifier.

label : str

The label for this entity (e.g., DISEASE).

text : str

Text of the entity.

spans : list of AnySpan

List of spans indicating which parts of the entity text correspond to which part of the document’s full text.

attrs : EntityAttributeContainer

Attributes of the entity. Stored in an EntityAttributeContainer but can be passed as a list at init.

metadata : dict of str to Any

The metadata of the entity.

keys : set of str

Pipeline output keys to which the entity belongs.

attrs: medkit.core.text.entity_attribute_container.EntityAttributeContainer
class medkit.core.text.Relation(label: str, source_id: str, target_id: str, attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, store: medkit.core.store.Store | None = None, attr_container_class: type[medkit.core.attribute_container.AttributeContainer] = AttributeContainer)

Bases: TextAnnotation

Relation between two text entities.

Attributes:
uid : str

The identifier of the relation.

label : str

The relation label.

source_id : str

The identifier of the entity from which the relation is defined.

target_id : str

The identifier of the entity to which the relation is defined.

attrs : AttributeContainer

The attributes of the relation.

metadata : dict of str to Any

The metadata of the relation.

keys : set of str

Pipeline output keys to which the relation belongs.

source_id: str
target_id: str
to_dict() → dict[str, Any]
classmethod from_dict(relation_dict: dict[str, Any]) → typing_extensions.Self

Create a Relation from a dict.

Parameters:
relation_dict : dict of str to Any

A dictionary from a serialized relation as generated by to_dict().

class medkit.core.text.Segment(label: str, text: str, spans: list[medkit.core.text.span.AnySpan], attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, store: medkit.core.store.Store | None = None, attr_container_class: type[medkit.core.attribute_container.AttributeContainer] = AttributeContainer)

Bases: TextAnnotation

Text segment referencing part of a TextDocument.

Attributes:
uid : str

The segment identifier.

label : str

The label for this segment (e.g., SENTENCE).

text : str

Text of the segment.

spans : list of AnySpan

List of spans indicating which parts of the segment text correspond to which part of the document’s full text.

attrs : AttributeContainer

Attributes of the segment. Stored in an AttributeContainer but can be passed as a list at init.

metadata : dict of str to Any

The metadata of the segment.

keys : set of str

Pipeline output keys to which the segment belongs.

spans: list[medkit.core.text.span.AnySpan]
text: str
length
to_dict() → dict[str, Any]
classmethod from_dict(segment_dict: dict[str, Any]) → typing_extensions.Self

Create a Segment from a dict.

Parameters:
segment_dict : dict of str to Any

A dictionary from a serialized segment as generated by to_dict().

class medkit.core.text.TextAnnotation(label: str, attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, attr_container_class: type[medkit.core.attribute_container.AttributeContainer] = AttributeContainer)

Bases: abc.ABC, medkit.core.dict_conv.SubclassMapping

Base abstract class for all text annotations.

Attributes:
uid : str

Unique identifier of the annotation.

label : str

The label for this annotation (e.g., SENTENCE).

attrs : AttributeContainer

Attributes of the annotation. Stored in an AttributeContainer but can be passed as a list at init.

metadata : dict of str to Any

The metadata of the annotation.

keys : set of str

Pipeline output keys to which the annotation belongs.

uid: str
label: str
attrs: medkit.core.attribute_container.AttributeContainer
metadata: dict[str, Any]
keys: set[str]
classmethod __init_subclass__()
classmethod from_dict(ann_dict: dict[str, Any]) → typing_extensions.Self
abstract to_dict() → dict[str, Any]
class medkit.core.text.TextAnnotationContainer(doc_id: str, raw_segment: medkit.core.text.annotation.Segment)

Bases: medkit.core.annotation_container.AnnotationContainer[medkit.core.text.annotation.TextAnnotation]

Manage a list of text annotations belonging to a text document.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

Also provides retrieval of entities, segments, relations, and handling of raw segment.

raw_segment
_segment_ids: list[str] = []
_entity_ids: list[str] = []
_relation_ids: list[str] = []
_relation_ids_by_source_id: dict[str, list[str]]
property segments: list[medkit.core.text.annotation.Segment]

Return the list of segments.

property entities: list[medkit.core.text.annotation.Entity]

Return the list of entities.

property relations: list[medkit.core.text.annotation.Relation]

Return the list of relations.

add(ann: medkit.core.text.annotation.TextAnnotation)

Attach an annotation to the document.

Parameters:
ann : AnnotationType

Annotation to add.

Raises:
ValueError

If the annotation is already attached to the document (based on annotation.uid).

get(*, label: str | None = None, key: str | None = None) → list[medkit.core.text.annotation.TextAnnotation]

Return a list of the annotations of the document.

Parameters:
label : str, optional

Label to use to filter annotations.

key : str, optional

Key to use to filter annotations.

get_by_id(uid) → medkit.core.text.annotation.TextAnnotation

Return the annotation corresponding to a specific identifier.

Parameters:
uid : str

Identifier of the annotation to return.

get_segments(*, label: str | None = None, key: str | None = None) → list[medkit.core.text.annotation.Segment]

Return a list of the segments of the document (not including entities).

Parameters:
label : str, optional

Label to use to filter segments.

key : str, optional

Key to use to filter segments.

get_entities(*, label: str | None = None, key: str | None = None) → list[medkit.core.text.annotation.Entity]

Return a list of the entities of the document.

Parameters:
label : str, optional

Label to use to filter entities.

key : str, optional

Key to use to filter entities.

get_relations(*, label: str | None = None, key: str | None = None, source_id: str | None = None) → list[medkit.core.text.annotation.Relation]

Return a list of the relations of the document.

Parameters:
label : str, optional

Label to use to filter relations.

key : str, optional

Key to use to filter relations.

source_id : str, optional

Identifier of the source entity to use to filter relations.
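
The `source_id` filter and the `_relation_ids_by_source_id` attribute suggest that relations are indexed by their source entity. The following is a minimal standalone sketch of that idea, assuming a plain dictionary index. `ToyRelationIndex` and its storage layout are hypothetical stand-ins for illustration, not medkit's implementation, and the `key` filter is left out for brevity.

```python
from collections import defaultdict


class ToyRelationIndex:
    """Toy stand-in (hypothetical), not medkit's TextAnnotationContainer."""

    def __init__(self):
        self._relations = {}  # uid -> (label, source_id, target_id)
        self._ids_by_source_id = defaultdict(list)  # source_id -> [uid, ...]

    def add(self, uid, label, source_id, target_id):
        # Refuse duplicates, mirroring the documented ValueError on add().
        if uid in self._relations:
            raise ValueError(f"Relation {uid} is already attached")
        self._relations[uid] = (label, source_id, target_id)
        self._ids_by_source_id[source_id].append(uid)

    def get_relations(self, *, label=None, source_id=None):
        # Start from the per-source index when a source is given,
        # otherwise from every known relation.
        if source_id is not None:
            uids = self._ids_by_source_id.get(source_id, [])
        else:
            uids = list(self._relations)
        return [
            uid for uid in uids
            if label is None or self._relations[uid][0] == label
        ]


index = ToyRelationIndex()
index.add("r1", "TREATS", source_id="e1", target_id="e2")
index.add("r2", "CAUSES", source_id="e1", target_id="e3")
index.add("r3", "TREATS", source_id="e4", target_id="e2")
```

With this data, filtering on `source_id="e1"` narrows the search to the two relations leaving entity `e1` before the label filter is applied.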

class medkit.core.text.TextDocument(text: str, anns: Sequence[medkit.core.text.annotation.TextAnnotation] | None = None, attrs: Sequence[medkit.core.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)

Bases: medkit.core.dict_conv.SubclassMapping

Document holding text annotations.

Annotations must be subclasses of TextAnnotation.

Attributes:
uid : str

Unique identifier of the document.

text : str

Full document text.

anns : TextAnnotationContainer

Annotations of the document. Stored in a TextAnnotationContainer but can be passed as a list at init.

attrs : AttributeContainer

Attributes of the document. Stored in an AttributeContainer but can be passed as a list at init.

metadata : dict of str to Any

Document metadata.

raw_segment : Segment

Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:

Examples

>>> from medkit.core.text import TextDocument
>>> doc = TextDocument(text="hello")
>>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]

RAW_LABEL: ClassVar[str] = 'RAW_TEXT'
uid: str
anns: medkit.core.text.annotation_container.TextAnnotationContainer
attrs: medkit.core.AttributeContainer
metadata: dict[str, Any]
raw_segment: medkit.core.text.annotation.Segment
classmethod _generate_raw_segment(text: str, doc_id: str) → medkit.core.text.annotation.Segment
property text: str
classmethod __init_subclass__()
to_dict(with_anns: bool = True) → dict[str, Any]
classmethod from_dict(doc_dict: dict[str, Any]) → typing_extensions.Self

Create a TextDocument from a dict.

Parameters:
doc_dict : dict of str to Any

A dictionary from a serialized TextDocument as generated by to_dict().

classmethod from_file(path: os.PathLike, encoding: str = 'utf-8') → typing_extensions.Self

Create a document from a text file.

Parameters:
path : Path

Path of the text file.

encoding : str, default="utf-8"

Text encoding to use

Returns:
TextDocument

Text document with contents of path as text. The file path is included in the document metadata.

classmethod from_dir(path: os.PathLike, pattern: str = '*.txt', encoding: str = 'utf-8') → list[typing_extensions.Self]

Create documents from text files in a directory.

Parameters:
path : Path

Path of the directory containing text files.

pattern : str

Glob pattern to match text files in path.

encoding : str

Text encoding to use

Returns:
list of TextDocument

Text documents with contents of each file as text
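
To illustrate how the `pattern` and `encoding` parameters interact, here is a standard-library sketch of the directory scan that `from_dir` describes. The helper `read_texts_from_dir` is hypothetical and returns plain strings rather than TextDocument objects; sorting the matches is an assumption of this sketch, not documented behaviour.

```python
import tempfile
from pathlib import Path


def read_texts_from_dir(path, pattern="*.txt", encoding="utf-8"):
    """Return (file path, text) pairs for files in `path` matching `pattern`."""
    directory = Path(path)
    # Sorted so the result does not depend on filesystem ordering
    # (an assumption of this sketch).
    return [
        (file_path, file_path.read_text(encoding=encoding))
        for file_path in sorted(directory.glob(pattern))
    ]


# Usage: two matching files and one that the default glob pattern skips.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("first report", encoding="utf-8")
    (Path(tmp) / "b.txt").write_text("second report", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("ignored", encoding="utf-8")
    loaded = read_texts_from_dir(tmp)
    texts = [text for _, text in loaded]
```

Each returned text would correspond to the `text` of one document, with the file path available for the document metadata.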

get_snippet(segment: medkit.core.text.annotation.Segment, max_extend_length: int) → str

Return a portion of the original text containing the annotation.

Parameters:
segment : Segment

The annotation.

max_extend_length : int

Maximum number of characters to use around the annotation

Returns:
str

A portion of the text around the annotation

class medkit.core.text.EntityAttributeContainer(owner_id: str)

Bases: medkit.core.attribute_container.AttributeContainer

Manage a list of attributes attached to a text entity.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

Also provides retrieval of normalization attributes.

_norm_ids: list[str] = []
property norms: list[medkit.core.text.entity_norm_attribute.EntityNormAttribute]

Return the list of normalization attributes.

add(attr: medkit.core.attribute.Attribute)

Attach an attribute to the annotation.

Parameters:
attr : Attribute

Attribute to add.

Raises:
ValueError

If the attribute is already attached to the annotation (based on attr.uid).

get_norms() → list[medkit.core.text.entity_norm_attribute.EntityNormAttribute]

Return a list of the normalization attributes of the annotation.

class medkit.core.text.EntityNormAttribute(kb_name: str | None, kb_id: Any | None, kb_version: str | None = None, term: str | None = None, score: float | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)

Bases: medkit.core.attribute.Attribute

Normalization attribute linking an entity to an ID in a knowledge base.

Attributes:
uid : str

Identifier of the attribute.

label : str

The attribute label, always set to EntityNormAttribute.LABEL.

value : Any

String representation of the normalization, containing kb_id, along with kb_name if available (ex: “umls:C0011849”). For special cases where only term is available, it is used as value.

kb_name : str, optional

Name of the knowledge base (ex: “icd”). Should always be provided except in special cases when we just want to store a normalized term.

kb_id : Any, optional

ID in the knowledge base to which the annotation should be linked. Should always be provided except in special cases when we just want to store a normalized term.

kb_version : str, optional

Optional version of the knowledge base.

term : str, optional

Optional normalized version of the entity text.

score : float, optional

Optional score reflecting confidence of this link.

metadata : dict of str to Any

Metadata of the attribute
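
The rule given for `value` above can be sketched as a small standalone function. `format_norm_value` is a hypothetical helper written for illustration, not part of medkit; raising an error when neither `kb_id` nor `term` is given is an assumption of this sketch.

```python
from typing import Any, Optional


def format_norm_value(
    kb_name: Optional[str], kb_id: Optional[Any], term: Optional[str] = None
) -> str:
    """Build a value string following the rule described for `value`."""
    if kb_id is not None:
        # Prefix with the knowledge base name when it is available.
        return f"{kb_name}:{kb_id}" if kb_name is not None else str(kb_id)
    if term is not None:
        # Special case: only a normalized term is known.
        return term
    # Assumption of this sketch: at least one of kb_id or term is required.
    raise ValueError("Either kb_id or term must be provided")


with_kb = format_norm_value("umls", "C0011849")
without_kb_name = format_norm_value(None, "C0011849")
term_only = format_norm_value(None, None, term="diabetes")
```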

kb_name: str | None
kb_id: Any | None
kb_version: str | None
term: str | None
score: float | None
LABEL: ClassVar[str] = 'NORMALIZATION'

Label used for all normalization attributes.

to_brat() → str

Return a value compatible with the brat format.

to_spacy() → str

Return a value compatible with spaCy.

to_dict() → dict[str, Any]
classmethod from_dict(data_dict: dict[str, Any]) → typing_extensions.Self

Create an Attribute from a dict.

Parameters:
data_dict : dict of str to Any

A dictionary from a serialized Attribute as generated by to_dict()

class medkit.core.text.ContextOperation(uid: str | None = None, name: str | None = None, **kwargs)

Bases: medkit.core.operation.Operation

Abstract operation for context detection.

It uses a list of segments as input for running the operation and creates attributes that are directly appended to these segments.

abstract run(segments: list[medkit.core.text.annotation.Segment]) → None
class medkit.core.text.CustomTextOpType

Bases: enum.IntEnum

Supported function types for creating custom text operations.

CREATE_ONE_TO_N = 1

Take 1 data item, return N new data items.

EXTRACT_ONE_TO_N = 2

Take 1 data item, return N existing data items.

FILTER = 3

Take 1 data item, return True or False.
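
The three function types differ only in what the user-defined function returns. The sketch below shows one function of each shape on plain Python data. `OpType` is a local stand-in mirroring the documented values, and `apply` is a hypothetical dispatcher that guesses how results might be combined; neither is medkit code.

```python
from enum import IntEnum


class OpType(IntEnum):
    """Local stand-in mirroring the values documented for CustomTextOpType."""

    CREATE_ONE_TO_N = 1
    EXTRACT_ONE_TO_N = 2
    FILTER = 3


def split_words(text):
    # CREATE_ONE_TO_N shape: one item in, N newly created items out.
    return text.split()


def list_children(record):
    # EXTRACT_ONE_TO_N shape: one item in, N already existing items out.
    return record["children"]


def is_long(text):
    # FILTER shape: one item in, True or False out.
    return len(text) > 5


def apply(function, function_type, items):
    """Hypothetical dispatcher: keep items for FILTER, flatten otherwise."""
    if function_type is OpType.FILTER:
        return [item for item in items if function(item)]
    results = []
    for item in items:
        results.extend(function(item))
    return results


created = apply(split_words, OpType.CREATE_ONE_TO_N, ["chest pain", "fever"])
child = {"name": "sentence"}
extracted = apply(list_children, OpType.EXTRACT_ONE_TO_N, [{"children": [child]}])
kept = apply(is_long, OpType.FILTER, ["fever", "chest pain"])
```

Note that the extracted item is the same object that was stored in the record, whereas the created items are new strings.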

class medkit.core.text.NEROperation(uid: str | None = None, name: str | None = None, **kwargs)

Bases: medkit.core.operation.Operation

Abstract operation for detecting entities.

It uses a list of segments as input and produces a list of detected entities.

abstract run(segments: list[medkit.core.text.annotation.Segment]) → list[medkit.core.text.annotation.Entity]
class medkit.core.text.SegmentationOperation(uid: str | None = None, name: str | None = None, **kwargs)

Bases: medkit.core.operation.Operation

Abstract operation for segmenting text.

It uses a list of segments as input and produces a list of new segments.

abstract run(segments: list[medkit.core.text.annotation.Segment]) → list[medkit.core.text.annotation.Segment]
medkit.core.text.create_text_operation(function: Callable, function_type: CustomTextOpType, name: str | None = None, args: dict | None = None) → _CustomTextOperation

Instantiate a custom text operation from a user-defined function.

Parameters:
function : Callable

User-defined function.

function_type : CustomTextOpType

Type of function. Supported values are defined in CustomTextOpType.

name : str, optional

Name of the operation used for provenance info (default: function name).

args : dict, optional

Dictionary containing the arguments of the function if any.

Returns:
_CustomTextOperation

An instance of a custom text operation

class medkit.core.text.AnySpan

Bases: abc.ABC, medkit.core.dict_conv.SubclassMapping

Helper class that provides a standard way to create an ABC using inheritance.

length: int
classmethod __init_subclass__()
classmethod from_dict(ann_dict: dict[str, Any]) → typing_extensions.Self
abstract to_dict() → dict[str, Any]
class medkit.core.text.ModifiedSpan

Bases: AnySpan

Slice of text not present in the original text.

Parameters:
length : int

Number of characters.

replaced_spans : list of Span

Slices of the original text that this span is replacing
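
As an illustration of how the two span types can describe a segment whose text differs from the document text, the sketch below replaces a newline with a space. `SketchSpan` and `SketchModifiedSpan` are hypothetical stand-ins that keep only the documented fields, and the assumption that span lengths add up to the segment text length is this sketch's reading of the attribute descriptions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SketchSpan:
    """Stand-in for Span: start inclusive, end exclusive."""

    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


@dataclass
class SketchModifiedSpan:
    """Stand-in for ModifiedSpan: text absent from the original document."""

    length: int
    replaced_spans: List[SketchSpan]


doc_text = "heart\nattack"      # original document text
segment_text = "heart attack"   # newline replaced by a space

spans: List[Union[SketchSpan, SketchModifiedSpan]] = [
    SketchSpan(0, 5),                             # "heart", unchanged
    SketchModifiedSpan(1, [SketchSpan(5, 6)]),    # " " replaces "\n"
    SketchSpan(6, 12),                            # "attack", unchanged
]


def original_slices(text, span_list):
    """Return the original-text slices that each span points back to."""
    slices = []
    for span in span_list:
        targets = [span] if isinstance(span, SketchSpan) else span.replaced_spans
        slices.extend(text[t.start:t.end] for t in targets)
    return slices


pieces = original_slices(doc_text, spans)
total_length = sum(span.length for span in spans)
```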

length: int
replaced_spans: list[Span]
to_dict() → dict[str, Any]
classmethod from_dict(modified_span_dict: dict[str, Any]) → typing_extensions.Self

Create a ModifiedSpan from a dict.

Parameters:
modified_span_dict : dict of str to Any

A dictionary from a serialized ModifiedSpan as generated by to_dict()

class medkit.core.text.Span

Bases: AnySpan

Slice of text extracted from the original text.

Parameters:
start : int

Index of the first character in the original text.

end : int

Index of the last character in the original text, plus one.

start: int
end: int
property length
to_dict() → dict[str, Any]
overlaps(other: Span)

Test if 2 spans reference at least one character in common.
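
Because `end` is the index of the last character plus one, spans are half-open intervals, and two spans that merely touch share no character. A minimal sketch of such an overlap test, using a hypothetical `SketchSpan` stand-in rather than medkit's Span:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSpan:
    """Stand-in for Span: start inclusive, end exclusive."""

    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start

    def overlaps(self, other: "SketchSpan") -> bool:
        # Two half-open intervals share a character when each one
        # starts before the other one ends.
        return self.start < other.end and other.start < self.end


first = SketchSpan(0, 5)
crossing = SketchSpan(4, 8)   # shares the character at index 4
adjacent = SketchSpan(5, 8)   # starts exactly where `first` ends
```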

classmethod from_dict(span_dict: dict[str, Any]) → typing_extensions.Self

Create a Span from a dict.

Parameters:
span_dict : dict

A dictionary from a serialized span as generated by to_dict().

class medkit.core.text.UMLSNormAttribute(cui: str, umls_version: str, term: str | None = None, score: float | None = None, sem_types: list[str] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)

Bases: medkit.core.text.entity_norm_attribute.EntityNormAttribute

Normalization attribute linking an entity to a CUI in the UMLS knowledge base.

Attributes:
uid : str

Identifier of the attribute.

label : str

The attribute label, always set to EntityNormAttribute.LABEL.

value : Any

CUI prefixed with “umls:” (ex: “umls:C0011849”).

kb_name : str, optional

Name of the knowledge base. Always “umls”.

kb_id : Any, optional

CUI (Concept Unique Identifier) to which the annotation should be linked.

cui : str

Convenience alias of kb_id.

kb_version : str, optional

Version of the UMLS database (ex: “202AB”).

umls_version : str

Convenience alias of kb_version.

term : str, optional

Optional normalized version of the entity text.

score : float, optional

Optional score reflecting confidence of this link.

sem_types : list of str, optional

Optional IDs of semantic types of the CUI (ex: [“T047”]).

metadata : dict of str to Any

Metadata of the attribute.

sem_types: list[str] | None = None
property cui
property umls_version
to_dict() → dict[str, Any]
classmethod from_dict(data: dict[str, Any]) → typing_extensions.Self

Create an Attribute from a dict.

Parameters:
data : dict of str to Any

A dictionary from a serialized Attribute as generated by to_dict()