medkit.text.postprocessing.document_splitter

Classes

DocumentSplitter

Split text documents using their segments as a reference.

Module Contents

class medkit.text.postprocessing.document_splitter.DocumentSplitter(segment_label: str, entity_labels: list[str] | None = None, attr_labels: list[str] | None = None, relation_labels: list[str] | None = None, name: str | None = None, uid: str | None = None)

Bases: medkit.core.Operation

Split text documents using their segments as a reference.

The resulting ‘mini-documents’ contain the entities belonging to each segment along with their attributes.

This operation can be used to create datasets from medkit text documents.

Parameters:
segment_label : str

Label of the segments to use as references for the splitter.

entity_labels : list of str, optional

Labels of entities to be included in the mini-documents. If None, all entities from the document will be included.

attr_labels : list of str, optional

Labels of the attributes to be included in the new annotations. If None, all attributes will be included.

relation_labels : list of str, optional

Labels of relations to be included in the mini-documents. If None, all relations will be included.

name : str, optional

Name describing the splitter (defaults to the class name).

uid : str, optional

Identifier of the operation.

init_args
segment_label
entity_labels
attr_labels
relation_labels
run(docs: list[medkit.core.text.TextDocument]) -> list[medkit.core.text.TextDocument]

Split docs into mini-documents.

Parameters:
docs : list of TextDocument

List of text documents to split.

Returns:
list of TextDocument

List of documents created from the selected segments.
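
The selection step described above can be illustrated without medkit. The following is a minimal sketch, not the library's implementation: `Ann` and `group_entities_by_segment` are hypothetical names, annotations are reduced to a label plus a single character span, and an entity is assumed to belong to a segment when it is fully contained in it.

```python
from dataclasses import dataclass


@dataclass
class Ann:
    """Hypothetical stand-in for a segment or entity: a label plus a character span."""

    label: str
    start: int
    end: int


def group_entities_by_segment(segments, entities, segment_label, entity_labels=None):
    """Return (segment, entities inside it) pairs for segments with the given label."""
    groups = []
    for seg in segments:
        if seg.label != segment_label:
            continue
        inside = [
            ent
            for ent in entities
            # None means "keep every entity label", as in the parameter description
            if (entity_labels is None or ent.label in entity_labels)
            # assumption: an entity belongs to a segment when fully contained in it
            and seg.start <= ent.start
            and ent.end <= seg.end
        ]
        groups.append((seg, inside))
    return groups


segments = [Ann("sentence", 0, 20), Ann("sentence", 21, 40), Ann("title", 0, 5)]
entities = [Ann("disease", 3, 9), Ann("drug", 25, 32), Ann("date", 33, 38)]

groups = group_entities_by_segment(
    segments, entities, "sentence", entity_labels=["disease", "drug"]
)
```

Each pair in `groups` would then become one mini-document; the "title" segment is ignored because its label does not match, and the "date" entity is dropped by the label filter.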

_create_segment_doc(segment: medkit.core.text.Segment, entities: list[medkit.core.text.Entity], relations: list[medkit.core.text.Relation], doc_source: medkit.core.text.TextDocument) -> medkit.core.text.TextDocument

Create a TextDocument from a segment and its entities.

The original zone of the segment becomes the text of the document.

Parameters:
segment : Segment

Segment to use as reference for the new document.

entities : list of Entity

Entities inside the segment.

relations : list of Relation

Relations inside the segment.

doc_source : TextDocument

Initial document from which annotations were extracted.

Returns:
TextDocument

A new document containing the entities. Its metadata includes the original span and the original metadata.
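
To make the idea concrete, here is a small plain-Python sketch of turning a segment into a standalone document. It is an illustration under assumptions, not medkit's code: `make_segment_doc` and the `original_span` metadata key are hypothetical, documents are plain dicts, and entity offsets are assumed to be shifted so that they are relative to the new text.

```python
def make_segment_doc(text, seg_start, seg_end, entity_spans, source_metadata):
    """Build a plain-dict 'mini-document' from one segment of `text`."""
    # the zone covered by the segment becomes the text of the new document
    mini_text = text[seg_start:seg_end]
    # assumption: entity offsets are shifted to be relative to the new text
    shifted = [(start - seg_start, end - seg_start) for start, end in entity_spans]
    # copy the source metadata and record where the segment came from
    metadata = dict(source_metadata)
    metadata["original_span"] = (seg_start, seg_end)  # hypothetical key name
    return {"text": mini_text, "entities": shifted, "metadata": metadata}


text = "History: asthma. Treatment: salbutamol."
# segment covering "Treatment: salbutamol." and one entity on "salbutamol"
mini = make_segment_doc(text, 17, 39, [(28, 38)], {"doc_id": "A1"})
```

Keeping the original span in the metadata lets annotations produced on the mini-document be mapped back to the source text later.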

_filter_attrs_from_ann(ann: medkit.core.text.TextAnnotation) -> list[medkit.core.Attribute]

Filter attributes from an annotation using ‘attr_labels’.
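
The filtering rule can be sketched in a few lines of plain Python. This is only an illustration of the documented behavior (a `None` value for `attr_labels` keeps every attribute); `filter_attrs` is a hypothetical helper and attributes are represented as dicts rather than medkit `Attribute` objects.

```python
def filter_attrs(attrs, attr_labels=None):
    """Keep only the attributes whose label is listed in `attr_labels`."""
    if attr_labels is None:
        # None means "no filtering": every attribute is kept
        return list(attrs)
    return [attr for attr in attrs if attr["label"] in attr_labels]


attrs = [
    {"label": "negation", "value": True},
    {"label": "family", "value": False},
]
kept = filter_attrs(attrs, ["negation"])
```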