medkit.text.postprocessing.document_splitter

Classes

DocumentSplitter

Split text documents using their segments as a reference.

Module Contents

class medkit.text.postprocessing.document_splitter.DocumentSplitter(segment_label: str, entity_labels: list[str] | None = None, attr_labels: list[str] | None = None, relation_labels: list[str] | None = None, name: str | None = None, uid: str | None = None)

Bases: medkit.core.Operation

Split text documents using their segments as a reference.

The resulting ‘mini-documents’ contain the entities belonging to each segment along with their attributes.

This operation can be used to create datasets from medkit text documents.

Parameters:

segment_label (str): Label of the segments to use as references for the splitter.
entity_labels (list of str, optional): Labels of the entities to include in the mini-documents. If None, all entities from the document are included.
attr_labels (list of str, optional): Labels of the attributes to include in the new annotations. If None, all attributes are included.
relation_labels (list of str, optional): Labels of the relations to include in the mini-documents. If None, all relations are included.
name (str, optional): Name describing the splitter (defaults to the class name).
uid (str, optional): Identifier of the operation.
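
The "If None, all ... are included" rule shared by entity_labels, attr_labels and relation_labels can be illustrated with a small standalone sketch. It does not import medkit: the Ann class and the filter_by_label helper are hypothetical stand-ins, and the treatment of an empty list (nothing kept) is an assumption of this sketch rather than documented behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Ann:
    """Hypothetical stand-in for an annotation; not a medkit class."""

    label: str
    text: str


def filter_by_label(anns: list[Ann], labels: list[str] | None) -> list[Ann]:
    """Keep every annotation when labels is None, else only matching labels."""
    if labels is None:
        # No filter requested: everything is included.
        return list(anns)
    allowed = set(labels)
    return [ann for ann in anns if ann.label in allowed]


anns = [Ann("disease", "asthma"), Ann("drug", "salbutamol")]
kept_all = filter_by_label(anns, None)        # no filter
kept_drugs = filter_by_label(anns, ["drug"])  # only "drug" annotations
```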

init_args

segment_label

entity_labels

attr_labels

relation_labels

run(docs: list[medkit.core.text.TextDocument]) → list[medkit.core.text.TextDocument]

Split docs into mini documents.

Parameters:

docs (list of TextDocument): List of text documents to split.

Returns:

list of TextDocument: List of documents created from the selected segments
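
As a rough illustration of what splitting by segment means, the following standalone sketch builds one mini-document per segment with the requested label. It is not medkit's implementation: Ann, Doc and split are hypothetical names, every annotation is assumed to have a single contiguous span, and shifting entity offsets so that they are relative to the start of the segment is an assumption of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Ann:
    """Hypothetical annotation with a single contiguous character span."""

    label: str
    start: int
    end: int


@dataclass
class Doc:
    """Hypothetical document; not medkit's TextDocument."""

    text: str
    segments: list[Ann] = field(default_factory=list)
    entities: list[Ann] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def split(docs: list[Doc], segment_label: str) -> list[Doc]:
    """Create one mini-document per segment labelled segment_label."""
    mini_docs = []
    for doc in docs:
        for seg in doc.segments:
            if seg.label != segment_label:
                continue
            # Keep entities fully inside the segment, with offsets shifted
            # so they index into the mini-document text (assumption).
            inside = [
                Ann(e.label, e.start - seg.start, e.end - seg.start)
                for e in doc.entities
                if seg.start <= e.start and e.end <= seg.end
            ]
            mini_docs.append(
                Doc(
                    text=doc.text[seg.start:seg.end],
                    entities=inside,
                    metadata={"span": (seg.start, seg.end)},
                )
            )
    return mini_docs


text = "Patient has asthma. Treated with salbutamol."
first_end = text.index(".") + 1
drug_start = text.index("salbutamol")
doc = Doc(
    text=text,
    segments=[
        Ann("sentence", 0, first_end),
        Ann("sentence", first_end + 1, len(text)),
    ],
    entities=[Ann("drug", drug_start, drug_start + len("salbutamol"))],
)
mini_docs = split([doc], segment_label="sentence")
```

Each mini-document carries only the text of its segment, so the entity offsets remain valid indexes into that shorter text.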

_create_segment_doc(segment: medkit.core.text.Segment, entities: list[medkit.core.text.Entity], relations: list[medkit.core.text.Relation], doc_source: medkit.core.text.TextDocument) → medkit.core.text.TextDocument

Create a TextDocument from a segment and its entities.

The original zone of the segment becomes the text of the document.

Parameters:

segment (Segment): Segment to use as reference for the new document.
entities (list of Entity): Entities inside the segment.
relations (list of Relation): Relations inside the segment.
doc_source (TextDocument): Initial document from which the annotations were extracted.

Returns:

TextDocument: A new document with entities. Its metadata includes the original span and metadata.
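
One plausible reading of "relations inside the segment" is that a relation is kept only when both of its endpoints are entities of that segment. The standalone sketch below shows that rule. It is an assumption made for illustration, not a statement of how medkit selects relations, and Rel and relations_inside are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    """Hypothetical relation linking two entities by identifier."""

    label: str
    source_id: str
    target_id: str


def relations_inside(relations: list[Rel], entity_ids: set[str]) -> list[Rel]:
    """Keep relations whose source and target are both in entity_ids."""
    return [
        rel
        for rel in relations
        if rel.source_id in entity_ids and rel.target_id in entity_ids
    ]


relations = [Rel("treats", "e1", "e2"), Rel("treats", "e1", "e9")]
kept = relations_inside(relations, entity_ids={"e1", "e2"})
```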

_filter_attrs_from_ann(ann: medkit.core.text.TextAnnotation) → list[medkit.core.Attribute]

Filter attributes from an annotation using ‘attr_labels’.