medkit.text.postprocessing.document_splitter
Classes
DocumentSplitter | Split text documents using their segments as a reference.
Module Contents
- class medkit.text.postprocessing.document_splitter.DocumentSplitter(segment_label: str, entity_labels: list[str] | None = None, attr_labels: list[str] | None = None, relation_labels: list[str] | None = None, name: str | None = None, uid: str | None = None)
Bases: medkit.core.Operation
Split text documents using their segments as a reference.
The resulting ‘mini-documents’ contain the entities belonging to each segment along with their attributes.
This operation can be used to create datasets from medkit text documents.
- Parameters:
- segment_label : str
Label of the segments to use as references for the splitter.
- entity_labels : list of str, optional
Labels of the entities to include in the mini-documents. If None, all entities from the document are included.
- attr_labels : list of str, optional
Labels of the attributes to include in the new annotations. If None, all attributes are included.
- relation_labels : list of str, optional
Labels of the relations to include in the mini-documents. If None, all relations are included.
- name : str, optional
Name describing the splitter (defaults to the class name).
- uid : str, optional
Identifier of the operation.
- init_args
- segment_label
- entity_labels
- attr_labels
- relation_labels
- run(docs: list[medkit.core.text.TextDocument]) → list[medkit.core.text.TextDocument]
Split docs into mini documents.
- Parameters:
- docs : list of TextDocument
List of text documents to split
- Returns:
- list of TextDocument
List of documents created from the selected segments
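The splitting idea described above can be illustrated with a minimal, library-free sketch. It is not medkit's implementation: `Ann`, `Doc` and `split_docs` are hypothetical stand-ins for medkit's annotation and document classes, and the rule that an entity belongs to a segment when its span lies entirely inside it is an assumption. Entity offsets are left in source-document coordinates for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Ann:
    """Hypothetical stand-in for a medkit segment or entity."""
    label: str
    start: int
    end: int


@dataclass
class Doc:
    """Hypothetical stand-in for a TextDocument."""
    text: str
    segments: list = field(default_factory=list)
    entities: list = field(default_factory=list)


def split_docs(docs, segment_label, entity_labels=None):
    """Return one mini-document per segment labelled `segment_label`."""
    mini_docs = []
    for doc in docs:
        for seg in doc.segments:
            if seg.label != segment_label:
                continue
            # Assumption: an entity belongs to a segment when its span lies
            # entirely inside it. `entity_labels=None` means "keep all labels".
            kept = [
                ent
                for ent in doc.entities
                if seg.start <= ent.start
                and ent.end <= seg.end
                and (entity_labels is None or ent.label in entity_labels)
            ]
            mini_docs.append(Doc(text=doc.text[seg.start:seg.end], entities=kept))
    return mini_docs


doc = Doc(
    text="Fever since Monday. No known allergy.",
    segments=[Ann("sentence", 0, 19), Ann("sentence", 20, 37)],
    entities=[Ann("symptom", 0, 5), Ann("allergy", 29, 36)],
)
mini_docs = split_docs([doc], segment_label="sentence")
```

Here each "sentence" segment yields one mini-document, and passing `entity_labels=["allergy"]` would drop the "symptom" entity from the first one.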
- _create_segment_doc(segment: medkit.core.text.Segment, entities: list[medkit.core.text.Entity], relations: list[medkit.core.text.Relation], doc_source: medkit.core.text.TextDocument) → medkit.core.text.TextDocument
Create a TextDocument from a segment and its entities.
The original zone of the segment becomes the text of the document.
- Parameters:
- segment : Segment
Segment to use as reference for the new document
- entities : list of Entity
Entities inside the segment
- relations : list of Relation
Relations inside the segment
- doc_source : TextDocument
Initial document from which the annotations were extracted
- Returns:
- TextDocument
A new document containing the entities. Its metadata includes the original span and the original metadata.
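Because the segment's zone becomes the whole text of the new document, entity offsets presumably have to be expressed relative to the segment start. The sketch below shows that re-basing with plain tuples and dicts. It is an assumption-laden illustration, not medkit's code: `create_segment_doc` and the `"original_span"` metadata key are hypothetical names.

```python
def create_segment_doc(source_text, source_metadata, seg_span, entities):
    """Build a mini-document (as a plain dict) from one segment.

    `seg_span` is a (start, end) pair in source-text coordinates and
    `entities` is a list of (label, start, end) tuples inside that span.
    """
    seg_start, seg_end = seg_span
    # Shift every entity so that offset 0 is the start of the segment.
    shifted = [
        (label, start - seg_start, end - seg_start)
        for label, start, end in entities
    ]
    # Keep the source metadata and record where the segment came from.
    # The key name "original_span" is hypothetical.
    metadata = {**source_metadata, "original_span": (seg_start, seg_end)}
    return {
        "text": source_text[seg_start:seg_end],
        "entities": shifted,
        "metadata": metadata,
    }


mini = create_segment_doc(
    source_text="Fever since Monday. No known allergy.",
    source_metadata={"patient": "p1"},
    seg_span=(20, 37),
    entities=[("allergy", 29, 36)],
)
```

After re-basing, slicing the mini-document's text with an entity's new offsets returns the same characters the entity covered in the source document.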
- _filter_attrs_from_ann(ann: medkit.core.text.TextAnnotation) → list[medkit.core.Attribute]
Filter attributes from an annotation using ‘attr_labels’.
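A minimal sketch of this kind of label filter, assuming attributes are represented as hypothetical `(label, value)` tuples rather than medkit `Attribute` objects. `filter_attrs` is an illustrative name, not part of the library.

```python
def filter_attrs(attrs, attr_labels=None):
    """Keep the attributes whose label is listed in `attr_labels`.

    `attr_labels=None` keeps every attribute, mirroring the documented
    behaviour of the `attr_labels` parameter.
    """
    if attr_labels is None:
        return list(attrs)
    return [(label, value) for label, value in attrs if label in attr_labels]


attrs = [("negation", True), ("family", False)]
only_negation = filter_attrs(attrs, attr_labels=["negation"])
everything = filter_attrs(attrs)
```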