medkit.core.text#
Submodules#
- medkit.core.text.annotation
- medkit.core.text.annotation_container
- medkit.core.text.document
- medkit.core.text.entity_attribute_container
- medkit.core.text.entity_norm_attribute
- medkit.core.text.operation
- medkit.core.text.span
- medkit.core.text.span_utils
- medkit.core.text.umls_norm_attribute
- medkit.core.text.utils
Classes#
Entity | Text entity referencing part of a TextDocument.
Relation | Relation between two text entities.
Segment | Text segment referencing part of a TextDocument.
TextAnnotation | Base abstract class for all text annotations.
TextAnnotationContainer | Manage a list of text annotations belonging to a text document.
TextDocument | Document holding text annotations.
EntityAttributeContainer | Manage a list of attributes attached to a text entity.
EntityNormAttribute | Normalization attribute linking an entity to an ID in a knowledge base.
ContextOperation | Abstract operation for context detection.
CustomTextOpType | Supported function types for creating custom text operations.
NEROperation | Abstract operation for detecting entities.
SegmentationOperation | Abstract operation for segmenting text.
AnySpan | Helper class that provides a standard way to create an ABC using inheritance.
ModifiedSpan | Slice of text not present in the original text.
Span | Slice of text extracted from the original text.
UMLSNormAttribute | Normalization attribute linking an entity to a CUI in the UMLS knowledge base.
Functions#
create_text_operation | Instantiate a custom text operation from a user-defined function.
Package Contents#
- class medkit.core.text.Entity(label: str, text: str, spans: list[medkit.core.text.span.AnySpan], attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, store: medkit.core.store.Store | None = None, attr_container_class: type[medkit.core.text.entity_attribute_container.EntityAttributeContainer] = EntityAttributeContainer)#
Bases:
Segment
Text entity referencing part of a TextDocument.
- Attributes:
- uid: str
The entity identifier.
- label: str
The label for this entity (e.g., DISEASE)
- text: str
Text of the entity.
- spans: list of AnySpan
List of spans indicating which parts of the entity text correspond to which part of the document’s full text.
- attrs: EntityAttributeContainer
Attributes of the entity. Stored in an EntityAttributeContainer but can be passed as a list at init.
- metadata: dict of str to Any
The metadata of the entity
- keys: set of str
Pipeline output keys to which the entity belongs.
- class medkit.core.text.Relation(label: str, source_id: str, target_id: str, attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, store: medkit.core.store.Store | None = None, attr_container_class: type[medkit.core.attribute_container.AttributeContainer] = AttributeContainer)#
Bases:
TextAnnotation
Relation between two text entities.
- Attributes:
- uid: str
The identifier of the relation
- label: str
The relation label
- source_id: str
The identifier of the entity from which the relation is defined
- target_id: str
The identifier of the entity to which the relation is defined
- attrs: AttributeContainer
The attributes of the relation
- metadata: dict of str to Any
The metadata of the relation
- keys: set of str
Pipeline output keys to which the relation belongs
- source_id: str#
- target_id: str#
- to_dict() dict[str, Any] #
- classmethod from_dict(relation_dict: dict[str, Any]) typing_extensions.Self #
Create a Relation from a dict.
- Parameters:
- relation_dict: dict of str to Any
A dictionary from a serialized relation as generated by to_dict()
- class medkit.core.text.Segment(label: str, text: str, spans: list[medkit.core.text.span.AnySpan], attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, store: medkit.core.store.Store | None = None, attr_container_class: type[medkit.core.attribute_container.AttributeContainer] = AttributeContainer)#
Bases:
TextAnnotation
Text segment referencing part of a TextDocument.
- Attributes:
- uid: str
The segment identifier.
- label: str
The label for this segment (e.g., SENTENCE)
- text: str
Text of the segment.
- spans: list of AnySpan
List of spans indicating which parts of the segment text correspond to which part of the document’s full text.
- attrs: AttributeContainer
Attributes of the segment. Stored in an AttributeContainer but can be passed as a list at init.
- metadata: dict of str to Any
The metadata of the segment
- keys: set of str
Pipeline output keys to which the segment belongs.
- spans: list[medkit.core.text.span.AnySpan]#
- text: str#
- length#
- to_dict() dict[str, Any] #
- classmethod from_dict(segment_dict: dict[str, Any]) typing_extensions.Self #
Create a Segment from a dict.
- Parameters:
- segment_dict: dict of str to Any
A dictionary from a serialized segment as generated by to_dict()
- class medkit.core.text.TextAnnotation(label: str, attrs: list[medkit.core.attribute.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None, attr_container_class: type[medkit.core.attribute_container.AttributeContainer] = AttributeContainer)#
Bases:
abc.ABC, medkit.core.dict_conv.SubclassMapping
Base abstract class for all text annotations.
- Attributes:
- uid: str
Unique identifier of the annotation.
- label: str
The label for this annotation (e.g., SENTENCE)
- attrs: AttributeContainer
Attributes of the annotation. Stored in an AttributeContainer but can be passed as a list at init.
- metadata: dict of str to Any
The metadata of the annotation
- keys: set of str
Pipeline output keys to which the annotation belongs.
- uid: str#
- label: str#
- metadata: dict[str, Any]#
- keys: set[str]#
- classmethod __init_subclass__()#
- classmethod from_dict(ann_dict: dict[str, Any]) typing_extensions.Self #
- abstract to_dict() dict[str, Any] #
- class medkit.core.text.TextAnnotationContainer(doc_id: str, raw_segment: medkit.core.text.annotation.Segment)#
Bases:
medkit.core.annotation_container.AnnotationContainer[medkit.core.text.annotation.TextAnnotation]
Manage a list of text annotations belonging to a text document.
This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.
Also provides retrieval of entities, segments, relations, and handling of raw segment.
- raw_segment#
- _segment_ids: list[str] = []#
- _entity_ids: list[str] = []#
- _relation_ids: list[str] = []#
- _relation_ids_by_source_id: dict[str, list[str]]#
- property segments: list[medkit.core.text.annotation.Segment]#
Return the list of segments.
- property entities: list[medkit.core.text.annotation.Entity]#
Return the list of entities.
- property relations: list[medkit.core.text.annotation.Relation]#
Return the list of relations.
- add(ann: medkit.core.text.annotation.TextAnnotation)#
Attach an annotation to the document.
- Parameters:
- ann: TextAnnotation
Annotation to add.
- Raises:
- ValueError
If the annotation is already attached to the document (based on annotation.uid)
- get(*, label: str | None = None, key: str | None = None) list[medkit.core.text.annotation.TextAnnotation] #
Return a list of the annotations of the document.
- Parameters:
- label: str, optional
Label to use to filter annotations.
- key: str, optional
Key to use to filter annotations.
- get_by_id(uid) medkit.core.text.annotation.TextAnnotation #
Return the annotation corresponding to a specific identifier.
- Parameters:
- uid: str
Identifier of the annotation to return.
- get_segments(*, label: str | None = None, key: str | None = None) list[medkit.core.text.annotation.Segment] #
Return a list of the segments of the document (not including entities).
- Parameters:
- label: str, optional
Label to use to filter segments.
- key: str, optional
Key to use to filter segments.
- get_entities(*, label: str | None = None, key: str | None = None) list[medkit.core.text.annotation.Entity] #
Return a list of the entities of the document.
- Parameters:
- label: str, optional
Label to use to filter entities.
- key: str, optional
Key to use to filter entities.
- get_relations(*, label: str | None = None, key: str | None = None, source_id: str | None = None) list[medkit.core.text.annotation.Relation] #
Return a list of the relations of the document.
- Parameters:
- label: str, optional
Label to use to filter relations.
- key: str, optional
Key to use to filter relations.
- source_id: str, optional
Identifier of the source entity to use to filter relations.
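The source_id filter and the _relation_ids_by_source_id field listed above suggest that relations are indexed by their source entity. The following is a minimal sketch of that idea only, not medkit's implementation: SketchRelation and SketchRelationIndex are hypothetical stand-ins written in plain Python, and the key filter is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SketchRelation:
    """Hypothetical stand-in carrying only the fields used for lookup."""

    uid: str
    label: str
    source_id: str
    target_id: str


class SketchRelationIndex:
    """Toy container (an assumption, not the library's code)."""

    def __init__(self):
        self._relations_by_id = {}
        # Plays the role of _relation_ids_by_source_id in the listing above.
        self._relation_ids_by_source_id = defaultdict(list)

    def add(self, relation):
        # Reject duplicates based on uid, as add() is documented to do.
        if relation.uid in self._relations_by_id:
            raise ValueError(f"Relation {relation.uid} is already attached")
        self._relations_by_id[relation.uid] = relation
        self._relation_ids_by_source_id[relation.source_id].append(relation.uid)

    def get_relations(self, *, label=None, source_id=None):
        # Narrow by source entity first, then filter on the label if given.
        if source_id is not None:
            uids = self._relation_ids_by_source_id.get(source_id, [])
        else:
            uids = list(self._relations_by_id)
        relations = [self._relations_by_id[uid] for uid in uids]
        if label is not None:
            relations = [r for r in relations if r.label == label]
        return relations


index = SketchRelationIndex()
index.add(SketchRelation("r1", "TREATS", source_id="e1", target_id="e2"))
index.add(SketchRelation("r2", "CAUSES", source_id="e1", target_id="e3"))
index.add(SketchRelation("r3", "TREATS", source_id="e4", target_id="e2"))

treats_from_e1 = index.get_relations(label="TREATS", source_id="e1")
```

Keeping a per-source list of identifiers means a lookup by source entity does not have to scan every relation of the document.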
- class medkit.core.text.TextDocument(text: str, anns: Sequence[medkit.core.text.annotation.TextAnnotation] | None = None, attrs: Sequence[medkit.core.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)#
Bases:
medkit.core.dict_conv.SubclassMapping
Document holding text annotations.
Annotations must be subclasses of TextAnnotation.
- Attributes:
- uid: str
Unique identifier of the document.
- text: str
Full document text.
- anns: TextAnnotationContainer
Annotations of the document. Stored in a TextAnnotationContainer but can be passed as a list at init.
- attrs: AttributeContainer
Attributes of the document. Stored in an AttributeContainer but can be passed as a list at init.
- metadata: dict of str to Any
Document metadata.
- raw_segment: Segment
Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:
Examples
>>> doc = TextDocument(text="hello")
>>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]
- RAW_LABEL: ClassVar[str] = 'RAW_TEXT'#
- uid: str#
- metadata: dict[str, Any]#
- raw_segment: medkit.core.text.annotation.Segment#
- classmethod _generate_raw_segment(text: str, doc_id: str) medkit.core.text.annotation.Segment #
- property text: str#
- classmethod __init_subclass__()#
- to_dict(with_anns: bool = True) dict[str, Any] #
- classmethod from_dict(doc_dict: dict[str, Any]) typing_extensions.Self #
Create a TextDocument from a dict.
- Parameters:
- doc_dict: dict of str to Any
A dictionary from a serialized TextDocument as generated by to_dict()
- classmethod from_file(path: os.PathLike, encoding: str = 'utf-8') typing_extensions.Self #
Create a document from a text file.
- Parameters:
- path: Path
Path of the text file
- encoding: str, default="utf-8"
Text encoding to use
- Returns:
- TextDocument
Text document with contents of path as text. The file path is included in the document metadata.
- classmethod from_dir(path: os.PathLike, pattern: str = '*.txt', encoding: str = 'utf-8') list[typing_extensions.Self] #
Create documents from text files in a directory.
- Parameters:
- path: Path
Path of the directory containing text files
- pattern: str
Glob pattern to match text files in path
- encoding: str
Text encoding to use
- Returns:
- list of TextDocument
Text documents with contents of each file as text
- get_snippet(segment: medkit.core.text.annotation.Segment, max_extend_length: int) str #
Return a portion of the original text containing the annotation.
- Parameters:
- segment: Segment
The annotation
- max_extend_length: int
Maximum number of characters to use around the annotation
- Returns:
- str
A portion of the text around the annotation
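To make the idea of a snippet concrete, here is a rough plain-Python sketch of extracting a window of text around an annotation. It is not medkit's get_snippet: sketch_snippet is a hypothetical helper, it takes character offsets instead of a Segment, and it assumes for illustration that max_extend_length applies to each side independently, which the library may handle differently.

```python
def sketch_snippet(text, start, end, max_extend_length):
    """Return text[start:end] widened by up to max_extend_length characters per side."""
    # Clamp both bounds so the window never leaves the document text.
    snippet_start = max(start - max_extend_length, 0)
    snippet_end = min(end + max_extend_length, len(text))
    return text[snippet_start:snippet_end]


text = "The patient has type 2 diabetes and hypertension."
start = text.index("diabetes")
end = start + len("diabetes")
snippet = sketch_snippet(text, start, end, max_extend_length=7)
```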
- class medkit.core.text.EntityAttributeContainer(owner_id: str)#
Bases:
medkit.core.attribute_container.AttributeContainer
Manage a list of attributes attached to a text entity.
This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.
Also provides retrieval of normalization attributes.
- _norm_ids: list[str] = []#
- property norms: list[medkit.core.text.entity_norm_attribute.EntityNormAttribute]#
Return the list of normalization attributes.
- add(attr: medkit.core.attribute.Attribute)#
Attach an attribute to the annotation.
- Parameters:
- attr: Attribute
Attribute to add.
- Raises:
- ValueError
If the attribute is already attached to the annotation (based on attr.uid).
- get_norms() list[medkit.core.text.entity_norm_attribute.EntityNormAttribute] #
Return a list of the normalization attributes of the annotation.
- class medkit.core.text.EntityNormAttribute(kb_name: str | None, kb_id: Any | None, kb_version: str | None = None, term: str | None = None, score: float | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)#
Bases:
medkit.core.attribute.Attribute
Normalization attribute linking an entity to an ID in a knowledge base.
- Attributes:
- uid: str
Identifier of the attribute
- label: str
The attribute label, always set to EntityNormAttribute.LABEL
- value: Any
String representation of the normalization, containing kb_id, along with kb_name if available (ex: “umls:C0011849”). For special cases where only term is available, it is used as value.
- kb_name: str, optional
Name of the knowledge base (ex: “icd”). Should always be provided except in special cases when we just want to store a normalized term.
- kb_id: Any, optional
ID in the knowledge base to which the annotation should be linked. Should always be provided except in special cases when we just want to store a normalized term.
- kb_version: str, optional
Optional version of the knowledge base.
- term: str, optional
Optional normalized version of the entity text.
- score: float, optional
Optional score reflecting confidence of this link.
- metadata: dict of str to Any
Metadata of the attribute
- kb_name: str | None#
- kb_id: Any | None#
- kb_version: str | None#
- term: str | None#
- score: float | None#
- LABEL: ClassVar[str] = 'NORMALIZATION'#
Label used for all normalization attributes
- to_brat() str #
Return a value compatible with the brat format.
- to_spacy() str #
Return a value compatible with spaCy.
- to_dict() dict[str, Any] #
- classmethod from_dict(data_dict: dict[str, Any]) typing_extensions.Self #
Create an Attribute from a dict.
- Parameters:
- data_dict: dict of str to Any
A dictionary from a serialized Attribute as generated by to_dict()
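The rules given above for the value attribute can be sketched as a small function. This is an illustration under stated assumptions rather than the library's code: sketch_norm_value is a hypothetical helper, and raising ValueError when neither kb_id nor term is given is an assumption not stated in the description.

```python
def sketch_norm_value(kb_name=None, kb_id=None, term=None):
    """Build a display value following the rules described for `value`."""
    if kb_id is not None:
        # Prefix with the knowledge base name when it is known.
        return f"{kb_name}:{kb_id}" if kb_name is not None else str(kb_id)
    if term is not None:
        # Special case: only a normalized term is available.
        return term
    # Assumption: at least one of kb_id or term is required.
    raise ValueError("Either kb_id or term must be provided")


with_kb = sketch_norm_value(kb_name="umls", kb_id="C0011849")
term_only = sketch_norm_value(term="diabetes")
```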
- class medkit.core.text.ContextOperation(uid: str | None = None, name: str | None = None, **kwargs)#
Bases:
medkit.core.operation.Operation
Abstract operation for context detection.
It uses a list of segments as input for running the operation and creates attributes that are directly appended to these segments.
- abstract run(segments: list[medkit.core.text.annotation.Segment]) None #
- class medkit.core.text.CustomTextOpType#
Bases:
enum.IntEnum
Supported function types for creating custom text operations.
- CREATE_ONE_TO_N = 1#
Take 1 data item, return N new data items.
- EXTRACT_ONE_TO_N = 2#
Take 1 data item, return N existing data items
- FILTER = 3#
Take 1 data item, return True or False.
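The three function types differ in what the user-defined function returns for each input item. The sketch below illustrates that difference in plain Python; SketchOpType and sketch_run are hypothetical stand-ins, and the real operation presumably also handles details such as provenance, which are omitted here.

```python
from enum import IntEnum


class SketchOpType(IntEnum):
    """Local copy of the three members listed above, for illustration."""

    CREATE_ONE_TO_N = 1
    EXTRACT_ONE_TO_N = 2
    FILTER = 3


def sketch_run(function, function_type, items):
    """Apply a user function to each item according to its declared type."""
    if function_type is SketchOpType.FILTER:
        # FILTER: the function returns True or False for each item.
        return [item for item in items if function(item)]
    # CREATE_ONE_TO_N and EXTRACT_ONE_TO_N: the function returns a list per item.
    results = []
    for item in items:
        results.extend(function(item))
    return results


sentences = ["no fever", "cough and fatigue"]
words = sketch_run(str.split, SketchOpType.CREATE_ONE_TO_N, sentences)
long_sentences = sketch_run(lambda s: len(s) > 10, SketchOpType.FILTER, sentences)
```

In this sketch the create and extract types are applied identically; according to the descriptions above, they differ in whether the returned items are new or already exist.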
- class medkit.core.text.NEROperation(uid: str | None = None, name: str | None = None, **kwargs)#
Bases:
medkit.core.operation.Operation
Abstract operation for detecting entities.
It uses a list of segments as input and produces a list of detected entities.
- abstract run(segments: list[medkit.core.text.annotation.Segment]) list[medkit.core.text.annotation.Entity] #
- class medkit.core.text.SegmentationOperation(uid: str | None = None, name: str | None = None, **kwargs)#
Bases:
medkit.core.operation.Operation
Abstract operation for segmenting text.
It uses a list of segments as input and produces a list of new segments.
- abstract run(segments: list[medkit.core.text.annotation.Segment]) list[medkit.core.text.annotation.Segment] #
- medkit.core.text.create_text_operation(function: Callable, function_type: CustomTextOpType, name: str | None = None, args: dict | None = None) _CustomTextOperation #
Instantiate a custom text operation from a user-defined function.
- Parameters:
- function: Callable
User-defined function
- function_type: CustomTextOpType
Type of function. Supported values are defined in CustomTextOpType
- name: str, optional
Name of the operation used for provenance info (default: function name)
- args: dict, optional
Dictionary containing the arguments of the function if any.
- Returns:
- _CustomTextOperation
An instance of a custom text operation
- class medkit.core.text.AnySpan#
Bases:
abc.ABC, medkit.core.dict_conv.SubclassMapping
Helper class that provides a standard way to create an ABC using inheritance.
- length: int#
- classmethod __init_subclass__()#
- classmethod from_dict(ann_dict: dict[str, Any]) typing_extensions.Self #
- abstract to_dict() dict[str, Any] #
- class medkit.core.text.ModifiedSpan#
Bases:
AnySpan
Slice of text not present in the original text.
- Parameters:
- length: int
Number of characters
- replaced_spans: list of Span
Slices of the original text that this span is replacing
- length: int#
- to_dict() dict[str, Any] #
- classmethod from_dict(modified_span_dict: dict[str, Any]) typing_extensions.Self #
Create a ModifiedSpan from a dict.
- Parameters:
- modified_span_dict: dict of str to Any
A dictionary from a serialized ModifiedSpan as generated by to_dict()
- class medkit.core.text.Span#
Bases:
AnySpan
Slice of text extracted from the original text.
- Parameters:
- start: int
Index of the first character in the original text
- end: int
Index of the last character in the original text, plus one
- start: int#
- end: int#
- property length#
- to_dict() dict[str, Any] #
- classmethod from_dict(span_dict: dict[str, Any]) typing_extensions.Self #
Create a Span from a dict.
- Parameters:
- span_dict: dict
A dictionary from a serialized span as generated by to_dict()
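The two span types can be pictured with a small example: replacing "&" with "and" yields text made of two untouched slices around one inserted piece. The sketch below uses hypothetical stand-in dataclasses (SketchSpan, SketchModifiedSpan) that mirror only the documented fields; it does not import medkit and is not its implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSpan:
    start: int  # index of the first character in the original text
    end: int  # index of the last character in the original text, plus one

    @property
    def length(self):
        # Half-open interval, so the length is a plain difference.
        return self.end - self.start


@dataclass(frozen=True)
class SketchModifiedSpan:
    length: int  # number of characters in the modified text
    replaced_spans: tuple  # original-text spans that this text replaces


original = "pain & fever"
modified = "pain and fever"

amp = SketchSpan(start=5, end=6)
# Spans describing `modified` in terms of `original`, in reading order.
spans = [
    SketchSpan(0, 5),
    SketchModifiedSpan(length=3, replaced_spans=(amp,)),
    SketchSpan(6, 12),
]
```

Because end is exclusive, original[span.start:span.end] gives the covered text directly, and the span lengths add up to the length of the modified text.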
- class medkit.core.text.UMLSNormAttribute(cui: str, umls_version: str, term: str | None = None, score: float | None = None, sem_types: list[str] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)#
Bases:
medkit.core.text.entity_norm_attribute.EntityNormAttribute
Normalization attribute linking an entity to a CUI in the UMLS knowledge base.
- Attributes:
- uid: str
Identifier of the attribute
- label: str
The attribute label, always set to EntityNormAttribute.LABEL
- value: Any
CUI prefixed with “umls:” (ex: “umls:C0011849”)
- kb_name: str, optional
Name of the knowledge base. Always “umls”
- kb_id: Any, optional
CUI (Concept Unique Identifier) to which the annotation should be linked
- cui: str
Convenience alias of kb_id
- kb_version: str, optional
Version of the UMLS database (ex: “202AB”)
- umls_version: str
Convenience alias of kb_version
- term: str, optional
Optional normalized version of the entity text
- score: float, optional
Optional score reflecting confidence of this link
- sem_types: list of str, optional
Optional IDs of semantic types of the CUI (ex: [“T047”])
- metadata: dict of str to Any
Metadata of the attribute
- sem_types: list[str] | None = None#
- property cui#
- property umls_version#
- to_dict() dict[str, Any] #
- classmethod from_dict(data: dict[str, Any]) typing_extensions.Self #
Create an Attribute from a dict.
- Parameters:
- data: dict of str to Any
A dictionary from a serialized Attribute as generated by to_dict()