medkit.core.text.document
Classes
TextDocument | Document holding text annotations.
Module Contents
- class medkit.core.text.document.TextDocument(text: str, anns: Sequence[medkit.core.text.annotation.TextAnnotation] | None = None, attrs: Sequence[medkit.core.Attribute] | None = None, metadata: dict[str, Any] | None = None, uid: str | None = None)
Bases: medkit.core.dict_conv.SubclassMapping
Document holding text annotations.
Annotations must be subclasses of TextAnnotation.
- Attributes:
- uid : str
Unique identifier of the document.
- text : str
Full document text.
- anns : TextAnnotationContainer
Annotations of the document. Stored in a TextAnnotationContainer, but can be passed as a list at init.
- attrs : AttributeContainer
Attributes of the document. Stored in an AttributeContainer, but can be passed as a list at init.
- metadata : dict of str to Any
Document metadata.
- raw_segment : Segment
Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:
Examples
>>> doc = TextDocument(text="hello")
>>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]
- RAW_LABEL: ClassVar[str] = 'RAW_TEXT'
- uid: str
- metadata: dict[str, Any]
- raw_segment: medkit.core.text.annotation.Segment
- classmethod _generate_raw_segment(text: str, doc_id: str) -> medkit.core.text.annotation.Segment
- property text: str
- classmethod __init_subclass__()
- to_dict(with_anns: bool = True) -> dict[str, Any]
- classmethod from_dict(doc_dict: dict[str, Any]) -> typing_extensions.Self
Create a TextDocument from a dict.
- Parameters:
- doc_dict : dict of str to Any
A dictionary from a serialized TextDocument, as generated by to_dict().
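The to_dict() / from_dict() pairing can be sketched without medkit as follows. MiniDoc is a hypothetical stand-in class, and its dict keys are assumptions made for illustration; they are not the layout medkit actually produces.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MiniDoc:
    """Hypothetical stand-in for a document class; not medkit's TextDocument."""

    text: str
    anns: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)

    def to_dict(self, with_anns: bool = True) -> dict[str, Any]:
        # The key names used here are assumptions, chosen for readability.
        doc_dict: dict[str, Any] = {"text": self.text, "metadata": dict(self.metadata)}
        if with_anns:
            doc_dict["anns"] = list(self.anns)
        return doc_dict

    @classmethod
    def from_dict(cls, doc_dict: dict[str, Any]) -> "MiniDoc":
        # Annotations may be absent if the dict was produced with with_anns=False.
        return cls(
            text=doc_dict["text"],
            anns=list(doc_dict.get("anns", [])),
            metadata=dict(doc_dict["metadata"]),
        )


doc = MiniDoc(text="hello", anns=["greeting"], metadata={"lang": "en"})
restored = MiniDoc.from_dict(doc.to_dict())
without_anns = MiniDoc.from_dict(doc.to_dict(with_anns=False))
```

In this sketch a full round trip restores an equal object, while serializing with with_anns=False yields a document with the same text and metadata but no annotations.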
- classmethod from_file(path: os.PathLike, encoding: str = 'utf-8') -> typing_extensions.Self
Create a document from a text file.
- Parameters:
- path : Path
Path of the text file
- encoding : str, default="utf-8"
Text encoding to use
- Returns:
- TextDocument
Text document with contents of path as text. The file path is included in the document metadata.
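The behaviour described above (read the file with the given encoding, keep its path in the metadata) can be approximated with the standard library alone. This is a minimal sketch, not medkit's implementation: load_text_file and the "source_path" metadata key are hypothetical names.

```python
import tempfile
from pathlib import Path


def load_text_file(path, encoding="utf-8"):
    """Read a text file and keep track of where it came from."""
    path = Path(path)
    text = path.read_text(encoding=encoding)
    # "source_path" is a hypothetical key; the key medkit uses may differ.
    return {"text": text, "metadata": {"source_path": str(path)}}


# Write a small file to a temporary directory, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    note_path = Path(tmp_dir) / "note.txt"
    note_path.write_text("Patient seen at the café.", encoding="utf-8")
    record = load_text_file(note_path)
```

Passing the encoding explicitly matters for non-ASCII text such as the accented character above, since the platform default encoding is not always UTF-8.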
- classmethod from_dir(path: os.PathLike, pattern: str = '*.txt', encoding: str = 'utf-8') -> list[typing_extensions.Self]
Create documents from text files in a directory.
- Parameters:
- path : Path
Path of the directory containing text files
- pattern : str
Glob pattern to match text files in path
- encoding : str
Text encoding to use
- Returns:
- list of TextDocument
Text documents with contents of each file as text
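To illustrate how a glob pattern selects files in a directory, here is a minimal standard-library sketch. load_text_files is a hypothetical helper, and sorting the matches is an assumption made here for a reproducible order; the source does not state which order from_dir uses.

```python
import tempfile
from pathlib import Path


def load_text_files(dir_path, pattern="*.txt", encoding="utf-8"):
    """Return the contents of every file in dir_path that matches pattern."""
    # Sorting is an assumption of this sketch, made for a deterministic order.
    matching_paths = sorted(Path(dir_path).glob(pattern))
    return [p.read_text(encoding=encoding) for p in matching_paths]


with tempfile.TemporaryDirectory() as tmp_dir:
    files = [("a.txt", "first note"), ("b.txt", "second note"), ("c.csv", "x,y")]
    for name, content in files:
        (Path(tmp_dir) / name).write_text(content, encoding="utf-8")
    # Only the two .txt files match the default pattern; c.csv is skipped.
    texts = load_text_files(tmp_dir)
```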
- get_snippet(segment: medkit.core.text.annotation.Segment, max_extend_length: int) -> str
Return a portion of the original text containing the annotation.
- Parameters:
- segment : Segment
The annotation
- max_extend_length : int
Maximum number of characters to use around the annotation
- Returns:
- str
A portion of the text around the annotation
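The idea of a snippet bounded by max_extend_length can be sketched on a plain string and a character range. snippet_around is a hypothetical helper; how the context budget is split between the left and right sides is an assumption of this sketch and may not match what get_snippet does.

```python
def snippet_around(text: str, start: int, end: int, max_extend_length: int) -> str:
    """Return text[start:end] plus at most max_extend_length characters of context."""
    # Assumption: spend up to half of the budget on the left side...
    left = max(start - max_extend_length // 2, 0)
    # ...and whatever is left of the budget on the right side.
    remaining = max_extend_length - (start - left)
    right = min(end + remaining, len(text))
    return text[left:right]


text = "The patient has asthma and takes salbutamol daily."
start = text.index("asthma")
end = start + len("asthma")
snippet = snippet_around(text, start, end, max_extend_length=10)
print(repr(snippet))  # ' has asthma and '
```

Clamping to 0 and len(text) keeps the slice valid when the annotation sits near the beginning or end of the document.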