medkit.text.postprocessing#

Classes#

AttributeDuplicator

Annotator to copy attributes from a source segment to its nested segments.

DocumentSplitter

Split text documents using their segments as a reference.

Functions#

compute_nested_segments(...)

Return source segments aligned with their nested segments.

filter_overlapping_entities(...)

Filter a list of entities and remove overlaps.

Package Contents#

medkit.text.postprocessing.compute_nested_segments(source_segments: list[medkit.core.text.Segment], target_segments: list[medkit.core.text.Segment]) → list[tuple[medkit.core.text.Segment, list[medkit.core.text.Segment]]]#

Return source segments aligned with their nested segments.

Only target segments that are fully contained in a source segment are returned as its nested segments.

Parameters:
source_segments: list of Segment

List of source segments

target_segments: list of Segment

List of segments to align

Returns:
list of tuple

List of aligned segments, one tuple per source segment, pairing it with the list of its nested segments
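The containment rule described above can be illustrated with a minimal sketch in plain Python. It does not use medkit: `Seg` and `nested_segments` are hypothetical stand-ins, each segment is reduced to a single `(start, end)` character span, and keeping source segments that have no nested segment in the result is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seg:
    """Hypothetical stand-in for a segment: a label and one character span."""

    label: str
    start: int
    end: int


def nested_segments(sources, targets):
    """Pair each source with the targets whose span lies fully inside it."""
    aligned = []
    for src in sources:
        inside = [t for t in targets if src.start <= t.start and t.end <= src.end]
        aligned.append((src, inside))
    return aligned


sentence = Seg("sentence", 0, 30)
targets = [Seg("entity", 4, 10), Seg("entity", 25, 35)]
result = nested_segments([sentence], targets)
# The second entity ends at 35, past the end of the sentence, so it is not nested.
```

The return shape mirrors the documented one: a list of `(source, nested)` tuples.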

class medkit.text.postprocessing.AttributeDuplicator(attr_labels: list[str], uid: str | None = None)#

Bases: medkit.core.Operation

Annotator to copy attributes from a source segment to its nested segments.

For each attribute to be duplicated, a new attribute is created in the nested segment.

Parameters:
attr_labels: list of str

Labels of the attributes to copy

uid: str, optional

Identifier of the annotator

attr_labels#
init_args#
run(source_segments: list[medkit.core.text.Segment], target_segments: list[medkit.core.text.Segment])#

Add attributes from source segments to all nested segments.

The nested segments are chosen among the target_segments based on their spans.

Parameters:
source_segments: list of Segment

List of segments with attributes to copy

target_segments: list of Segment

List of target segments
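The behaviour of `run` can be pictured with the following sketch, which is not medkit's implementation. `Seg` and `duplicate_attrs` are hypothetical names, attributes are simplified to a plain dict of label to value, and nesting is decided by full span containment.

```python
from dataclasses import dataclass, field


@dataclass
class Seg:
    """Hypothetical stand-in for a segment with one span and its attributes."""

    start: int
    end: int
    attrs: dict = field(default_factory=dict)


def duplicate_attrs(sources, targets, attr_labels):
    """Copy the selected attributes of each source onto its nested targets."""
    for src in sources:
        for tgt in targets:
            # A target is nested when its span lies fully inside the source span.
            if src.start <= tgt.start and tgt.end <= src.end:
                for label in attr_labels:
                    if label in src.attrs:
                        tgt.attrs[label] = src.attrs[label]


sentence = Seg(0, 40, {"is_negated": True, "section": "history"})
inside = Seg(5, 12)
outside = Seg(38, 50)
duplicate_attrs([sentence], [inside, outside], ["is_negated"])
```

Only the labels listed in `attr_labels` are copied, and only onto `inside`; `outside` extends past the source span and is left untouched.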

_duplicate_attr(attr: medkit.core.Attribute, target: medkit.core.text.Segment)#
class medkit.text.postprocessing.DocumentSplitter(segment_label: str, entity_labels: list[str] | None = None, attr_labels: list[str] | None = None, relation_labels: list[str] | None = None, name: str | None = None, uid: str | None = None)#

Bases: medkit.core.Operation

Split text documents using their segments as a reference.

The resulting ‘mini-documents’ contain the entities belonging to each segment along with their attributes.

This operation can be used to create datasets from medkit text documents.

Parameters:
segment_label: str

Label of the segments to use as references for the splitter

entity_labels: list of str, optional

Labels of entities to be included in the mini-documents. If None, all entities from the document will be included.

attr_labels: list of str, optional

Labels of the attributes to be included in the new annotations. If None, all attributes will be included.

relation_labels: list of str, optional

Labels of relations to be included in the mini-documents. If None, all relations will be included.

name: str, optional

Name describing the splitter (defaults to the class name).

uid: str, optional

Identifier of the operation

init_args#
segment_label#
entity_labels#
attr_labels#
relation_labels#
run(docs: list[medkit.core.text.TextDocument]) → list[medkit.core.text.TextDocument]#

Split docs into mini-documents.

Parameters:
docs: list of TextDocument

List of text documents to split

Returns:
list of TextDocument

List of documents created from the selected segments
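The splitting idea can be sketched without medkit as follows. `split_by_segments` is a hypothetical helper, segments and entities are plain `(label, start, end)` tuples, and shifting entity offsets so that they are relative to the new text is an assumption of the sketch, not a documented behaviour.

```python
def split_by_segments(text, segments, entities):
    """Build one mini-document per segment.

    `segments` and `entities` are (label, start, end) tuples over `text`.
    """
    mini_docs = []
    for _, seg_start, seg_end in segments:
        # Keep entities fully inside the segment, with offsets relative to it.
        nested = [
            (label, start - seg_start, end - seg_start)
            for label, start, end in entities
            if seg_start <= start and end <= seg_end
        ]
        mini_docs.append(
            {
                "text": text[seg_start:seg_end],
                "entities": nested,
                "metadata": {"original_span": (seg_start, seg_end)},
            }
        )
    return mini_docs


text = "No fever. Mild cough."
segments = [("sentence", 0, 9), ("sentence", 10, 21)]
entities = [("symptom", 3, 8), ("symptom", 15, 20)]
docs = split_by_segments(text, segments, entities)
```

Each resulting dict plays the role of a mini-document: the segment text, the entities it contains, and the original span kept as metadata.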

_create_segment_doc(segment: medkit.core.text.Segment, entities: list[medkit.core.text.Entity], relations: list[medkit.core.text.Relation], doc_source: medkit.core.text.TextDocument) → medkit.core.text.TextDocument#

Create a TextDocument from a segment and its entities.

The text covered by the segment becomes the text of the new document.

Parameters:
segment: Segment

Segment to use as reference for the new document

entities: list of Entity

Entities inside the segment

relations: list of Relation

Relations inside the segment

doc_source: TextDocument

Initial document from which the annotations were extracted

Returns:
TextDocument

A new document containing the entities. Its metadata includes the original span and the original metadata.

_filter_attrs_from_ann(ann: medkit.core.text.TextAnnotation) → list[medkit.core.Attribute]#

Filter attributes from an annotation using ‘attr_labels’.

medkit.text.postprocessing.filter_overlapping_entities(entities: list[medkit.core.text.Entity]) → list[medkit.core.text.Entity]#

Filter a list of entities and remove overlaps.

This function may be useful when creating data for named entity recognition, where each ‘word’ of the text can belong to at most one entity. When an overlap is detected, the longest entity is kept.

Parameters:
entities: list of Entity

Entities to filter

Returns:
list of Entity

Filtered entities
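The "longest entity is preferred" rule can be sketched as a greedy filter in plain Python. This is an illustration under assumptions, not medkit's implementation: `filter_overlaps` is a hypothetical name, entities are `(label, start, end)` tuples with an exclusive end, and ties between entities of equal length are broken by earliest start.

```python
def filter_overlaps(entities):
    """Keep the longest entities first and drop any that overlap a kept one.

    `entities` are (label, start, end) tuples with `end` exclusive.
    """
    kept = []
    # Longest first; ties broken by earliest start (an assumption of this sketch).
    for ent in sorted(entities, key=lambda e: (-(e[2] - e[1]), e[1])):
        _, start, end = ent
        # Two spans are disjoint when one ends before the other starts.
        if all(end <= k[1] or start >= k[2] for k in kept):
            kept.append(ent)
    # Return the survivors in reading order.
    return sorted(kept, key=lambda e: e[1])


entities = [("drug", 0, 11), ("dose", 6, 11), ("symptom", 20, 25)]
filtered = filter_overlaps(entities)
```

Here `("dose", 6, 11)` overlaps the longer `("drug", 0, 11)` and is dropped, while `("symptom", 20, 25)` overlaps nothing and is kept.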