medkit.core.text#

Classes#

`Entity`	Text entity referencing part of a `TextDocument`.
`Relation`	Relation between two text entities.
`Segment`	Text segment referencing part of a `TextDocument`.
`TextAnnotation`	Base abstract class for all text annotations.
`TextAnnotationContainer`	Manage a list of text annotations belonging to a text document.
`TextDocument`	Document holding text annotations.
`EntityAttributeContainer`	Manage a list of attributes attached to a text entity.
`EntityNormAttribute`	Normalization attribute linking an entity to an ID in a knowledge base.
`ContextOperation`	Abstract operation for context detection.
`CustomTextOpType`	Supported function types for creating custom text operations.
`NEROperation`	Abstract operation for detecting entities.
`SegmentationOperation`	Abstract operation for segmenting text.
`AnySpan`	Helper class that provides a standard way to create an ABC using inheritance.
`ModifiedSpan`	Slice of text not present in the original text.
`Span`	Slice of text extracted from the original text.
`UMLSNormAttribute`	Normalization attribute linking an entity to a CUI in the UMLS knowledge base.

Functions#

create_text_operation(→ _CustomTextOperation)

Instantiate a custom text operation from a user-defined function.

Package Contents#

class medkit.core.text.Entity(label: str, text: str, spans: list[medkit.core.text.span.AnySpan], attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, store: medkit.core.store.Store | None = None, attr_container_class: type[medkit.core.text.entity_attribute_container.EntityAttributeContainer] = EntityAttributeContainer)#

Bases: Segment

Text entity referencing part of a TextDocument.

Attributes:

uid (str): The entity identifier.
label (str): The label for this entity (e.g., DISEASE).
text (str): Text of the entity.
spans (list of AnySpan): List of spans indicating which parts of the entity text correspond to which part of the document’s full text.
attrs (EntityAttributeContainer): Attributes of the entity. Stored in an EntityAttributeContainer but can be passed as a list at init.
metadata (dict of str to Any): The metadata of the entity.
keys (set of str): Pipeline output keys to which the entity belongs.

attrs: medkit.core.text.entity_attribute_container.EntityAttributeContainer#

class medkit.core.text.Relation(label: str, source_id: str, target_id: str, attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, store: medkit.core.store.Store | None = None, attr_container_class: type[medkit.core.attribute_container.AttributeContainer] = AttributeContainer)#

Bases: TextAnnotation

Relation between two text entities.

Attributes:

uid (str): The identifier of the relation.
label (str): The relation label.
source_id (str): The identifier of the entity from which the relation is defined.
target_id (str): The identifier of the entity to which the relation is defined.
attrs (AttributeContainer): The attributes of the relation.
metadata (dict of str to Any): The metadata of the relation.
keys (set of str): Pipeline output keys to which the relation belongs.

source_id: str#

target_id: str#

to_dict() → dict[str, Any]#

classmethod from_dict(relation_dict: dict[str, Any]) → typing_extensions.Self#

Create a Relation from a dict.

Parameters:

relation_dict (dict of str to Any): A dictionary from a serialized relation as generated by to_dict()

class medkit.core.text.Segment(label: str, text: str, spans: list[medkit.core.text.span.AnySpan], attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, store: medkit.core.store.Store | None = None, attr_container_class: type[medkit.core.attribute_container.AttributeContainer] = AttributeContainer)#

Bases: TextAnnotation

Text segment referencing part of a TextDocument.

Attributes:

uid (str): The segment identifier.
label (str): The label for this segment (e.g., SENTENCE).
text (str): Text of the segment.
spans (list of AnySpan): List of spans indicating which parts of the segment text correspond to which part of the document’s full text.
attrs (AttributeContainer): Attributes of the segment. Stored in an AttributeContainer but can be passed as a list at init.
metadata (dict of str to Any): The metadata of the segment.
keys (set of str): Pipeline output keys to which the segment belongs.

spans: list[medkit.core.text.span.AnySpan]#

text: str#

length#

to_dict() → dict[str, Any]#

classmethod from_dict(segment_dict: dict[str, Any]) → typing_extensions.Self#

Create a Segment from a dict.

Parameters:

segment_dict (dict of str to Any): A dictionary from a serialized segment as generated by to_dict()

class medkit.core.text.TextAnnotation(label: str, attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, attr_container_class: type[medkit.core.attribute_container.AttributeContainer] = AttributeContainer)#

Bases: abc.ABC, medkit.core.dict_conv.SubclassMapping

Base abstract class for all text annotations.

Attributes:

uid (str): Unique identifier of the annotation.
label (str): The label for this annotation (e.g., SENTENCE).
attrs (AttributeContainer): Attributes of the annotation. Stored in an AttributeContainer but can be passed as a list at init.
metadata (dict of str to Any): The metadata of the annotation.
keys (set of str): Pipeline output keys to which the annotation belongs.

uid: str#

label: str#

attrs: medkit.core.attribute_container.AttributeContainer#

metadata: dict[str, Any]#

keys: set[str]#

classmethod __init_subclass__()#

classmethod from_dict(ann_dict: dict[str, Any]) → typing_extensions.Self#

abstract to_dict() → dict[str, Any]#

class medkit.core.text.TextAnnotationContainer(doc_id: str, raw_segment: medkit.core.text.annotation.Segment)#

Bases: medkit.core.annotation_container.AnnotationContainer[medkit.core.text.annotation.TextAnnotation]

Manage a list of text annotations belonging to a text document.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

Also provides retrieval of entities, segments, relations, and handling of raw segment.

raw_segment#

_segment_ids: list[str] = []#

_entity_ids: list[str] = []#

_relation_ids: list[str] = []#

_relation_ids_by_source_id: dict[str, list[str]]#

property segments: list[medkit.core.text.annotation.Segment]#: Return the list of segments.

property entities: list[medkit.core.text.annotation.Entity]#: Return the list of entities.

property relations: list[medkit.core.text.annotation.Relation]#: Return the list of relations.

add(ann: medkit.core.text.annotation.TextAnnotation)#

Attach an annotation to the document.

Parameters:

ann (AnnotationType): Annotation to add.

Raises:

ValueError: If the annotation is already attached to the document (based on annotation.uid)

get(*, label: str | None = None, key: str | None = None) → list[medkit.core.text.annotation.TextAnnotation]#

Return a list of the annotations of the document.

Parameters:

label (str, optional): Label to use to filter annotations.
key (str, optional): Key to use to filter annotations.
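The list-like behaviour, the duplicate check in add(), and the optional filters of get() can be illustrated with a minimal, self-contained sketch. `Ann` and `AnnContainer` are hypothetical stand-ins written for this illustration, not medkit classes, and the assumption that both filters must match when both are given is ours.

```python
from dataclasses import dataclass, field


@dataclass
class Ann:
    """Hypothetical stand-in for a text annotation (not medkit's class)."""
    uid: str
    label: str
    keys: set = field(default_factory=set)


class AnnContainer:
    """Hypothetical list-like container indexed by annotation uid."""

    def __init__(self):
        self._anns = {}  # uid -> annotation, kept in insertion order

    def __len__(self):
        return len(self._anns)

    def __iter__(self):
        return iter(self._anns.values())

    def add(self, ann):
        # Reject an annotation whose uid is already registered.
        if ann.uid in self._anns:
            raise ValueError(f"Annotation {ann.uid} is already attached")
        self._anns[ann.uid] = ann

    def get(self, *, label=None, key=None):
        # Each filter applies only when provided (assumption: both must match).
        return [
            a for a in self
            if (label is None or a.label == label)
            and (key is None or key in a.keys)
        ]


container = AnnContainer()
container.add(Ann("a1", "SENTENCE", {"sentences"}))
container.add(Ann("a2", "DISEASE", {"entities"}))
container.add(Ann("a3", "DISEASE"))
```

With this sketch, `container.get(label="DISEASE")` returns the annotations `a2` and `a3`, and adding a second annotation with uid `a1` raises `ValueError`.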

get_by_id(uid) → medkit.core.text.annotation.TextAnnotation#

Return the annotation corresponding to a specific identifier.

Parameters:

uid (str): Identifier of the annotation to return.

get_segments(*, label: str | None = None, key: str | None = None) → list[medkit.core.text.annotation.Segment]#

Return a list of the segments of the document (not including entities).

Parameters:

label (str, optional): Label to use to filter segments.
key (str, optional): Key to use to filter segments.

get_entities(*, label: str | None = None, key: str | None = None) → list[medkit.core.text.annotation.Entity]#

Return a list of the entities of the document.

Parameters:

label (str, optional): Label to use to filter entities.
key (str, optional): Key to use to filter entities.

get_relations(*, label: str | None = None, key: str | None = None, source_id: str | None = None) → list[medkit.core.text.annotation.Relation]#

Return a list of the relations of the document.

Parameters:

label (str, optional): Label to use to filter relations.
key (str, optional): Key to use to filter relations.
source_id (str, optional): Identifier of the source entity to use to filter relations.

class medkit.core.text.TextDocument(text: str, anns: Sequence[medkit.core.text.annotation.TextAnnotation] | None = None, attrs: Sequence[medkit.core.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)#

Bases: medkit.core.dict_conv.SubclassMapping

Document holding text annotations.

Annotations must be subclasses of TextAnnotation.

Attributes:

uid (str): Unique identifier of the document.
text (str): Full document text.
anns (TextAnnotationContainer): Annotations of the document. Stored in a TextAnnotationContainer but can be passed as a list at init.
attrs (AttributeContainer): Attributes of the document. Stored in an AttributeContainer but can be passed as a list at init.
metadata (dict of str to Any): Document metadata.
raw_segment (Segment): Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:

Examples

>>> from medkit.core.text import TextDocument
>>> doc = TextDocument(text="hello")
>>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]

RAW_LABEL: ClassVar[str] = 'RAW_TEXT'#

uid: str#

anns: medkit.core.text.annotation_container.TextAnnotationContainer#

attrs: medkit.core.AttributeContainer#

metadata: dict[str, Any]#

raw_segment: medkit.core.text.annotation.Segment#

classmethod _generate_raw_segment(text: str, doc_id: str) → medkit.core.text.annotation.Segment#

property text: str#

classmethod __init_subclass__()#

to_dict(with_anns: bool = True) → dict[str, Any]#

classmethod from_dict(doc_dict: dict[str, Any]) → typing_extensions.Self#

Create a TextDocument from a dict.

Parameters:

doc_dict (dict of str to Any): A dictionary from a serialized TextDocument as generated by to_dict()

classmethod from_file(path: os.PathLike, encoding: str = 'utf-8') → typing_extensions.Self#

Create a document from a text file.

Parameters:

path (Path): Path of the text file
encoding (str, default="utf-8"): Text encoding to use

Returns:

TextDocument: Text document with contents of path as text. The file path is included in the document metadata.

classmethod from_dir(path: os.PathLike, pattern: str = '*.txt', encoding: str = 'utf-8') → list[typing_extensions.Self]#

Create documents from text files in a directory.

Parameters:

path (Path): Path of the directory containing text files
pattern (str): Glob pattern to match text files in path
encoding (str): Text encoding to use

Returns:

list of TextDocument: Text documents with contents of each file as text
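The directory-loading behaviour described for from_dir() (glob pattern, explicit encoding, one document per file) can be sketched with the standard library alone. `load_texts` and the `source_path` metadata key are hypothetical names chosen for this illustration; the sorted file order is also an assumption, made here only to get a deterministic result.

```python
import tempfile
from pathlib import Path


def load_texts(path, pattern="*.txt", encoding="utf-8"):
    """Read each file in `path` matching `pattern` into a plain dict.

    Hypothetical helper: it mimics the documented parameters of from_dir()
    but returns dicts instead of TextDocument instances.
    """
    docs = []
    # Sorting is an assumption, used for a deterministic order.
    for file in sorted(Path(path).glob(pattern)):
        docs.append({
            "text": file.read_text(encoding=encoding),
            "metadata": {"source_path": str(file)},  # hypothetical key name
        })
    return docs


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first note", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second note", encoding="utf-8")
    Path(tmp, "skip.md").write_text("ignored", encoding="utf-8")
    docs = load_texts(tmp)
```

Only the two `.txt` files are loaded; `skip.md` does not match the default pattern.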

get_snippet(segment: medkit.core.text.annotation.Segment, max_extend_length: int) → str#

Return a portion of the original text containing the annotation.

Parameters:

segment (Segment): The annotation
max_extend_length (int): Maximum number of characters to use around the annotation

Returns:

str: A portion of the text around the annotation

class medkit.core.text.EntityAttributeContainer(owner_id: str)#

Bases: medkit.core.attribute_container.AttributeContainer

Manage a list of attributes attached to a text entity.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

Also provides retrieval of normalization attributes.

_norm_ids: list[str] = []#

property norms: list[medkit.core.text.entity_norm_attribute.EntityNormAttribute]#: Return the list of normalization attributes.

add(attr: medkit.core.attribute.Attribute)#

Attach an attribute to the annotation.

Parameters:

attr (Attribute): Attribute to add.

Raises:

ValueError: If the attribute is already attached to the annotation (based on attr.uid).

get_norms() → list[medkit.core.text.entity_norm_attribute.EntityNormAttribute]#: Return a list of the normalization attributes of the annotation.

class medkit.core.text.EntityNormAttribute#

Bases: medkit.core.attribute.Attribute

Normalization attribute linking an entity to an ID in a knowledge base.

Attributes:

uid (str): Identifier of the attribute.
label (str): The attribute label, always set to EntityNormAttribute.LABEL.
value (Any): String representation of the normalization, containing kb_id, along with kb_name if available (e.g., “umls:C0011849”). For special cases where only term is available, it is used as value.
kb_name (str, optional): Name of the knowledge base (e.g., “icd”). Should always be provided except in special cases when we just want to store a normalized term.
kb_id (Any, optional): ID in the knowledge base to which the annotation should be linked. Should always be provided except in special cases when we just want to store a normalized term.
kb_version (str, optional): Optional version of the knowledge base.
term (str, optional): Optional normalized version of the entity text.
score (float, optional): Optional score reflecting confidence of this link.
metadata (dict of str to Any): Metadata of the attribute.
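The rule given for value (kb_id, prefixed by kb_name when available, otherwise the term) can be written out as a small function. This is one plausible reading of the description above, not the library's code; `norm_value` is a hypothetical name and the exact formatting medkit uses may differ.

```python
def norm_value(kb_id=None, kb_name=None, term=None):
    """Build a normalization value following the documented rule (sketch)."""
    if kb_id is not None:
        # Prefix the ID with the knowledge base name when it is known.
        return f"{kb_name}:{kb_id}" if kb_name is not None else str(kb_id)
    # Special case: only a normalized term is available.
    return term


examples = [
    norm_value(kb_id="C0011849", kb_name="umls"),
    norm_value(kb_id="E11"),
    norm_value(term="diabetes mellitus"),
]
```

The first call reproduces the documented example, `"umls:C0011849"`.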

kb_name: str | None#

kb_id: Any | None#

kb_version: str | None#

term: str | None#

score: float | None#

LABEL: ClassVar[str] = 'NORMALIZATION'#: Label used for all normalization attributes

to_brat() → str#: Return a value compatible with the brat format.

to_spacy() → str#: Return a value compatible with spaCy.

to_dict() → dict[str, Any]#

classmethod from_dict(data_dict: dict[str, Any]) → typing_extensions.Self#

Create an Attribute from a dict.

Parameters:

data_dict (dict of str to Any): A dictionary from a serialized Attribute as generated by to_dict()

class medkit.core.text.ContextOperation(uid: str | None = None, name: str | None = None, **kwargs)#

Bases: medkit.core.operation.Operation

Abstract operation for context detection.

It uses a list of segments as input for running the operation and creates attributes that are directly appended to these segments.

abstract run(segments: list[medkit.core.text.annotation.Segment]) → None#

class medkit.core.text.CustomTextOpType#

Bases: enum.IntEnum

Supported function types for creating custom text operations.

CREATE_ONE_TO_N = 1#: Take 1 data item, return N new data items.

EXTRACT_ONE_TO_N = 2#: Take 1 data item, return N existing data items

FILTER = 3#: Take 1 data item, return True or False.

class medkit.core.text.NEROperation(uid: str | None = None, name: str | None = None, **kwargs)#

Bases: medkit.core.operation.Operation

Abstract operation for detecting entities.

It uses a list of segments as input and produces a list of detected entities.

abstract run(segments: list[medkit.core.text.annotation.Segment]) → list[medkit.core.text.annotation.Entity]#

class medkit.core.text.SegmentationOperation(uid: str | None = None, name: str | None = None, **kwargs)#

Bases: medkit.core.operation.Operation

Abstract operation for segmenting text.

It uses a list of segments as input and produces a list of new segments.

abstract run(segments: list[medkit.core.text.annotation.Segment]) → list[medkit.core.text.annotation.Segment]#
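The contract of a segmentation operation (segments in, new segments out) can be sketched without medkit. `Seg`, `SegmentationOp`, and `SentenceSplitter` are hypothetical stand-ins: a real Segment carries a list of spans, which is simplified here to a single (start, end) pair of document offsets, and the regular expression is a deliberately naive sentence rule.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Seg:
    """Hypothetical stand-in for a segment with document-level offsets."""
    label: str
    text: str
    start: int
    end: int


class SegmentationOp(ABC):
    """Hypothetical abstract base mirroring the documented run() contract."""

    @abstractmethod
    def run(self, segments):
        """Return new segments derived from the input segments."""


class SentenceSplitter(SegmentationOp):
    def run(self, segments):
        sentences = []
        for seg in segments:
            # A sentence starts at a non-space character and runs up to
            # (and including) the next '.', '!' or '?', if any.
            for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", seg.text):
                sentences.append(Seg(
                    label="SENTENCE",
                    text=m.group(),
                    # Shift offsets so they stay relative to the document.
                    start=seg.start + m.start(),
                    end=seg.start + m.end(),
                ))
        return sentences


raw = Seg("RAW_TEXT", "No fever. Mild cough today.", 0, 27)
sentences = SentenceSplitter().run([raw])
```

Because offsets are shifted by the start of the input segment, the output stays aligned with the document text even when the input is not the raw segment.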

medkit.core.text.create_text_operation(function: Callable, function_type: CustomTextOpType, name: str | None = None, args: dict | None = None) → _CustomTextOperation#

Instantiate a custom text operation from a user-defined function.

Parameters:

function (Callable): User-defined function
function_type (CustomTextOpType): Type of function. Supported values are defined in CustomTextOpType
name (str, optional): Name of the operation used for provenance info (default: function name)
args (dict, optional): Dictionary containing the arguments of the function, if any.

Returns:

_CustomTextOperation: An instance of a custom text operation
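To make the three function types concrete, here is a conceptual sketch of how a user function of each type could be applied to a list of items. `OpType` and `apply` are hypothetical names; this reference does not show how medkit dispatches internally or how args is forwarded, so passing the extra arguments as keyword arguments is an assumption.

```python
from enum import IntEnum


class OpType(IntEnum):
    """Hypothetical mirror of the three documented function types."""
    CREATE_ONE_TO_N = 1
    EXTRACT_ONE_TO_N = 2
    FILTER = 3


def apply(function, function_type, items, **args):
    """Apply a user function to each item according to its declared type."""
    if function_type is OpType.FILTER:
        # The function returns True or False; keep the items it accepts.
        return [item for item in items if function(item, **args)]
    # Both one-to-N types return a list per item; flatten the results.
    return [out for item in items for out in function(item, **args)]


def split_words(text):
    # One item in, N new items out.
    return text.split()


def longer_than(text, min_len):
    # One item in, True or False out.
    return len(text) > min_len


words = apply(split_words, OpType.CREATE_ONE_TO_N, ["chest pain", "no fever"])
long_words = apply(longer_than, OpType.FILTER, words, min_len=4)
```

In this sketch the two one-to-N types are handled identically; the documented difference is whether the returned items are newly created or already existing, which presumably matters for provenance rather than for the data flow.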

class medkit.core.text.AnySpan#

Bases: abc.ABC, medkit.core.dict_conv.SubclassMapping

Helper class that provides a standard way to create an ABC using inheritance.

length: int#

classmethod __init_subclass__()#

classmethod from_dict(ann_dict: dict[str, Any]) → typing_extensions.Self#

abstract to_dict() → dict[str, Any]#

class medkit.core.text.ModifiedSpan#

Bases: AnySpan

Slice of text not present in the original text.

Parameters:

length (int): Number of characters
replaced_spans (list of Span): Slices of the original text that this span is replacing

length: int#

replaced_spans: list[Span]#

to_dict() → dict[str, Any]#

classmethod from_dict(modified_span_dict: dict[str, Any]) → typing_extensions.Self#

Create a ModifiedSpan from a dict.

Parameters:

modified_span_dict (dict of str to Any): A dictionary from a serialized ModifiedSpan as generated by to_dict()
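A short sketch may help show how a ModifiedSpan sits next to regular spans once text has been rewritten. The two dataclasses below are hypothetical stand-ins that carry only the documented fields; the abbreviation-expansion scenario is an invented example.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Stand-in: slice [start, end) of the original text."""
    start: int
    end: int

    @property
    def length(self):
        return self.end - self.start


@dataclass
class ModifiedSpan:
    """Stand-in: text absent from the original, replacing some spans."""
    length: int
    replaced_spans: list = field(default_factory=list)


original = "pt. has diabetes"
modified = "patient has diabetes"  # "pt." expanded to "patient"

# Spans describing the modified text in terms of the original one.
spans = [
    ModifiedSpan(length=len("patient"), replaced_spans=[Span(0, 3)]),
    Span(3, len(original)),  # " has diabetes", unchanged
]
total_length = sum(s.length for s in spans)
```

The span lengths add up to the length of the modified text, while replaced_spans keeps the link back to the characters of the original text that were rewritten.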

class medkit.core.text.Span#

Bases: AnySpan

Slice of text extracted from the original text.

Parameters:

start (int): Index of the first character in the original text
end (int): Index of the last character in the original text, plus one

start: int#

end: int#

property length#

to_dict() → dict[str, Any]#

overlaps(other: Span)#: Test if 2 spans reference at least one character in common.

classmethod from_dict(span_dict: dict[str, Any]) → typing_extensions.Self#

Create a Span from a dict.

Parameters:

span_dict (dict): A dictionary from a serialized span as generated by to_dict()
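Since end is the index of the last character plus one, spans are half-open intervals, and the overlap test described for overlaps() reduces to two comparisons. The class below is a stand-in written for illustration, under the assumption that "at least one character in common" is meant literally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for a half-open slice [start, end) of the original text."""
    start: int
    end: int

    @property
    def length(self):
        return self.end - self.start

    def overlaps(self, other):
        # Two half-open intervals share a character iff each one
        # starts before the other ends.
        return self.start < other.end and other.start < self.end


a, b, c = Span(0, 5), Span(4, 8), Span(5, 9)
results = (a.overlaps(b), a.overlaps(c), b.overlaps(c))
# results == (True, False, True): a and c only touch at index 5
```

Adjacent spans such as `Span(0, 5)` and `Span(5, 9)` do not overlap, because index 5 belongs only to the second one.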

class medkit.core.text.UMLSNormAttribute#

Bases: medkit.core.text.entity_norm_attribute.EntityNormAttribute

Normalization attribute linking an entity to a CUI in the UMLS knowledge base.

Attributes:

uid (str): Identifier of the attribute.
label (str): The attribute label, always set to EntityNormAttribute.LABEL.
value (Any): CUI prefixed with “umls:” (e.g., “umls:C0011849”).
kb_name (str, optional): Name of the knowledge base. Always “umls”.
kb_id (Any, optional): CUI (Concept Unique Identifier) to which the annotation should be linked.
cui (str): Convenience alias of kb_id.
kb_version (str, optional): Version of the UMLS database (e.g., “202AB”).
umls_version (str): Convenience alias of kb_version.
term (str, optional): Optional normalized version of the entity text.
score (float, optional): Optional score reflecting confidence of this link.
sem_types (list of str, optional): Optional IDs of semantic types of the CUI (e.g., [“T047”]).
metadata (dict of str to Any): Metadata of the attribute.

sem_types: list[str] | None = None#

property cui#

property umls_version#

to_dict() → dict[str, Any]#

classmethod from_dict(data: dict[str, Any]) → typing_extensions.Self#

Create an Attribute from a dict.

Parameters:

data (dict of str to Any): A dictionary from a serialized Attribute as generated by to_dict()