medkit.core.text.document

Classes

TextDocument

Document holding text annotations.

Module Contents

class medkit.core.text.document.TextDocument(text: str, anns: Sequence[medkit.core.text.annotation.TextAnnotation] | None = None, attrs: Sequence[medkit.core.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)

Bases: medkit.core.dict_conv.SubclassMapping

Document holding text annotations.

Annotations must be subclasses of TextAnnotation.

Attributes:
uid : str

Unique identifier of the document.

text : str

Full document text.

anns : TextAnnotationContainer

Annotations of the document. Stored in a TextAnnotationContainer but can be passed as a list at init.

attrs : AttributeContainer

Attributes of the document. Stored in an AttributeContainer but can be passed as a list at init.

metadata : dict of str to Any

Document metadata.

raw_segment : Segment

Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:

Examples

>>> doc = TextDocument(text="hello")
>>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]
RAW_LABEL: ClassVar[str] = 'RAW_TEXT'
uid: str
anns: medkit.core.text.annotation_container.TextAnnotationContainer
attrs: medkit.core.AttributeContainer
metadata: dict[str, Any]
raw_segment: medkit.core.text.annotation.Segment
classmethod _generate_raw_segment(text: str, doc_id: str) -> medkit.core.text.annotation.Segment
property text: str
classmethod __init_subclass__()
to_dict(with_anns: bool = True) -> dict[str, Any]
classmethod from_dict(doc_dict: dict[str, Any]) -> typing_extensions.Self

Create a TextDocument from a dict.

Parameters:
doc_dict : dict of str to Any

A dictionary from a serialized TextDocument as generated by to_dict()
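The to_dict() / from_dict() pair describes a serialization round trip. The sketch below illustrates that pattern with a hypothetical stand-in class, `MiniDocument`, which is not part of medkit; its dictionary keys are assumptions chosen for the example and may not match what TextDocument.to_dict() actually emits.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MiniDocument:
    """Hypothetical stand-in used only to show the round trip; not the medkit class."""

    text: str
    metadata: dict[str, Any] = field(default_factory=dict)
    uid: str = "doc-0"

    def to_dict(self) -> dict[str, Any]:
        # Key names are illustrative assumptions, not medkit's serialized format.
        return {"uid": self.uid, "text": self.text, "metadata": dict(self.metadata)}

    @classmethod
    def from_dict(cls, doc_dict: dict[str, Any]) -> "MiniDocument":
        # Rebuild an equivalent object from the dictionary produced by to_dict().
        return cls(
            text=doc_dict["text"],
            metadata=dict(doc_dict["metadata"]),
            uid=doc_dict["uid"],
        )


original = MiniDocument(text="hello", metadata={"lang": "en"})
restored = MiniDocument.from_dict(original.to_dict())
```

Here `restored` compares equal to `original` but is a distinct object, which is the property a to_dict()/from_dict() pair is generally expected to provide.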

classmethod from_file(path: os.PathLike, encoding: str = 'utf-8') -> typing_extensions.Self

Create a document from a text file.

Parameters:
path : Path

Path of the text file

encoding : str, default="utf-8"

Text encoding to use

Returns:
TextDocument

Text document with contents of path as text. The file path is included in the document metadata.

classmethod from_dir(path: os.PathLike, pattern: str = '*.txt', encoding: str = 'utf-8') -> list[typing_extensions.Self]

Create documents from text files in a directory.

Parameters:
path : Path

Path of the directory containing text files

pattern : str

Glob pattern to match text files in path

encoding : str

Text encoding to use

Returns:
list of TextDocument

Text documents with contents of each file as text
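As a rough illustration of what the `pattern` and `encoding` parameters do, the following standard-library sketch collects files matching a glob pattern and reads each one as text. The helper `load_texts`, the plain-dict return value, and the `source_path` metadata key are all hypothetical; whether from_dir sorts the matched files is also an assumption made here for determinism.

```python
import tempfile
from pathlib import Path


def load_texts(path, pattern="*.txt", encoding="utf-8"):
    """Hypothetical stand-in for from_dir: returns plain dicts, not TextDocument objects."""
    docs = []
    # Sorting gives a deterministic order; the library's actual ordering is not stated.
    for file in sorted(Path(path).glob(pattern)):
        docs.append(
            {
                "text": file.read_text(encoding=encoding),
                # "source_path" is an illustrative key name, not medkit's metadata key.
                "metadata": {"source_path": str(file)},
            }
        )
    return docs


# Build a throwaway directory with two matching files and one non-matching file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first note", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second note", encoding="utf-8")
    Path(tmp, "ignored.csv").write_text("x,y", encoding="utf-8")
    loaded = load_texts(tmp)
```

Only the two `.txt` files are picked up by the default pattern; passing a different glob such as `"*.csv"` would select the other file instead.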

get_snippet(segment: medkit.core.text.annotation.Segment, max_extend_length: int) -> str

Return a portion of the original text containing the annotation.

Parameters:
segment : Segment

The annotation around which to extract the snippet

max_extend_length : int

Maximum number of characters to use around the annotation

Returns:
str

A portion of the text around the annotation
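The windowing idea behind get_snippet can be sketched on a plain string. The function `snippet_around` below is hypothetical and works on character offsets rather than a Segment. It assumes that `max_extend_length` is a total budget of extra characters, with up to half spent before the annotation and the remainder after it; the library may split the budget differently.

```python
def snippet_around(text: str, start: int, end: int, max_extend_length: int) -> str:
    """Return text[start:end] plus at most max_extend_length surrounding characters."""
    # Spend up to half of the budget on the left, clamped at the start of the text.
    left = max(start - max_extend_length // 2, 0)
    # Whatever was not used on the left remains available on the right.
    remaining = max_extend_length - (start - left)
    # Clamp the right edge at the end of the text.
    right = min(end + remaining, len(text))
    return text[left:right]


text = "The patient has severe asthma and a mild cough."
start = text.index("asthma")
end = start + len("asthma")
snippet = snippet_around(text, start, end, max_extend_length=14)
```

With a budget of 14 characters, the snippet contains "asthma" plus 7 characters of context on each side; near the beginning of the text, the unused left budget is carried over to the right.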