medkit.text.postprocessing#

Classes#

`AttributeDuplicator`	Annotator to copy attributes from a source segment to its nested segments.
`DocumentSplitter`	Split text documents using their segments as a reference.

Functions#

`compute_nested_segments`(...)	Return source segments aligned with their nested segments.
`filter_overlapping_entities`(...)	Filter a list of entities and remove overlaps.

Package Contents#

medkit.text.postprocessing.compute_nested_segments(source_segments: list[medkit.core.text.Segment], target_segments: list[medkit.core.text.Segment]) → list[tuple[medkit.core.text.Segment, list[medkit.core.text.Segment]]]#

Return source segments aligned with their nested segments.

Only target segments fully contained in a source segment are returned as its nested segments.

Parameters:

source_segments: list of Segment: List of source segments
target_segments: list of Segment: List of segments to align

Returns:

list of tuple: List of (source segment, nested segments) pairs
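
The containment rule described above can be illustrated with a minimal, self-contained sketch. It does not use medkit: `Seg` and `align_nested` are hypothetical stand-ins, each segment is assumed to have a single `[start, end)` span, and returning a source with an empty list when nothing is nested in it is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seg:
    """Hypothetical stand-in for a segment: a label and one [start, end) span."""

    label: str
    start: int
    end: int


def align_nested(sources: list[Seg], targets: list[Seg]) -> list[tuple[Seg, list[Seg]]]:
    """Pair each source with the targets whose span lies fully inside it."""
    aligned = []
    for src in sources:
        # A target is nested only if both of its bounds fall within the source.
        nested = [t for t in targets if src.start <= t.start and t.end <= src.end]
        aligned.append((src, nested))
    return aligned


sentence = Seg("sentence", 0, 20)
inside = Seg("entity", 5, 10)
straddling = Seg("entity", 15, 25)  # crosses the sentence boundary

result = align_nested([sentence], [inside, straddling])
```

Here `straddling` is left out because it is only partially covered by `sentence`.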

class medkit.text.postprocessing.AttributeDuplicator(attr_labels: list[str], uid: str | None = None)#

Bases: medkit.core.Operation

Annotator to copy attributes from a source segment to its nested segments.

For each attribute to be duplicated, a new attribute is created in the nested segment.

Parameters:

attr_labels: list of str: Labels of the attributes to copy
uid: str, optional: Identifier of the annotator

attr_labels#

init_args#

run(source_segments: list[medkit.core.text.Segment], target_segments: list[medkit.core.text.Segment])#

Add attributes from source segments to all nested segments.

The nested segments are chosen among the target_segments based on their spans.

Parameters:

source_segments: list of Segment: List of segments with attributes to copy
target_segments: list of Segment: List of target segments
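
As a rough illustration of this behavior, the sketch below copies selected attributes onto nested targets. It is not medkit's implementation: `Seg`, `duplicate_attrs`, and the plain-dict attribute storage are hypothetical simplifications, and spans are assumed to be single `[start, end)` ranges.

```python
from dataclasses import dataclass, field


@dataclass
class Seg:
    """Hypothetical stand-in for a segment with one span and its attributes."""

    label: str
    start: int
    end: int
    attrs: dict[str, object] = field(default_factory=dict)


def duplicate_attrs(sources: list[Seg], targets: list[Seg], attr_labels: list[str]) -> None:
    """Copy the listed attributes from each source to the targets nested in it."""
    for src in sources:
        for tgt in targets:
            # Nested targets are selected purely from their spans.
            if src.start <= tgt.start and tgt.end <= src.end:
                for label in attr_labels:
                    if label in src.attrs:
                        tgt.attrs[label] = src.attrs[label]


sentence = Seg("sentence", 0, 30, attrs={"is_negated": True, "section": "history"})
entity = Seg("disease", 12, 20)
outside = Seg("disease", 40, 48)

duplicate_attrs([sentence], [entity, outside], attr_labels=["is_negated"])
```

Only `is_negated` reaches `entity`; `section` is not in `attr_labels`, and `outside` is untouched because it is not nested in `sentence`.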

_duplicate_attr(attr: medkit.core.Attribute, target: medkit.core.text.Segment)#

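class medkit.text.postprocessing.DocumentSplitter(segment_label: str, entity_labels: list[str] | None = None, attr_labels: list[str] | None = None, relation_labels: list[str] | None = None, name: str | None = None, uid: str | None = None)#

Bases: medkit.core.Operation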

Split text documents using their segments as a reference.

The resulting ‘mini-documents’ contain the entities belonging to each segment along with their attributes.

This operation can be used to create datasets from medkit text documents.

Parameters:

segment_label: str: Label of the segments to use as references for the splitter
entity_labels: list of str, optional: Labels of entities to be included in the mini documents. If None, all entities from the document will be included.
attr_labels: list of str, optional: Labels of the attributes to be included in the new annotations. If None, all attributes will be included.
relation_labels: list of str, optional: Labels of relations to be included in the mini documents. If None, all relations will be included.
name: str, optional: Name describing the splitter (defaults to the class name).
uid: str, optional: Identifier of the operation

init_args#

segment_label#

entity_labels#

attr_labels#

relation_labels#

run(docs: list[medkit.core.text.TextDocument]) → list[medkit.core.text.TextDocument]#

Split docs into mini documents.

Parameters:

docs: list of TextDocument: List of text documents to split

Returns:

list of TextDocument: List of documents created from the selected segments

_create_segment_doc(segment: medkit.core.text.Segment, entities: list[medkit.core.text.Entity], relations: list[medkit.core.text.Relation], doc_source: medkit.core.text.TextDocument) → medkit.core.text.TextDocument#

Create a TextDocument from a segment and its entities.

The original zone of the segment becomes the text of the document.

Parameters:

segment: Segment: Segment to use as reference for the new document
entities: list of Entity: Entities inside the segment
relations: list of Relation: Relations inside the segment
doc_source: TextDocument: Initial document from which annotations were extracted

Returns:

TextDocument: A new document with entities; its metadata includes the original span and the original metadata
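
The idea of turning a segment's zone into a standalone document can be sketched without medkit as follows. Everything here is hypothetical: `Ent`, `Doc`, `make_segment_doc`, and the `original_span` metadata key are illustrative names, and shifting entity spans so that they index into the new text is an assumption about how the annotations are relocated.

```python
from dataclasses import dataclass, field


@dataclass
class Ent:
    """Hypothetical stand-in for an entity with a single [start, end) span."""

    label: str
    start: int
    end: int


@dataclass
class Doc:
    """Hypothetical stand-in for a text document."""

    text: str
    entities: list[Ent] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def make_segment_doc(source: Doc, seg_start: int, seg_end: int) -> Doc:
    """Build a mini-document from the [seg_start, seg_end) zone of `source`."""
    mini = Doc(
        text=source.text[seg_start:seg_end],
        # Assumed layout: keep the source metadata and record the original zone.
        metadata={**source.metadata, "original_span": (seg_start, seg_end)},
    )
    for ent in source.entities:
        if seg_start <= ent.start and ent.end <= seg_end:
            # Shift the span so that it indexes into the mini-document text.
            mini.entities.append(Ent(ent.label, ent.start - seg_start, ent.end - seg_start))
    return mini


text = "No fever. Patient has asthma."
doc = Doc(
    text,
    entities=[Ent("symptom", 3, 8), Ent("disease", 22, 28)],
    metadata={"source": "demo"},
)
mini = make_segment_doc(doc, 10, 29)  # zone covering the second sentence
```

After the shift, `mini.text[e.start:e.end]` still yields the entity's surface form, which is what makes the mini-document usable on its own.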

_filter_attrs_from_ann(ann: medkit.core.text.TextAnnotation) → list[medkit.core.Attribute]#

Filter attributes from an annotation using ‘attr_labels’.
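
A label-based filter of this kind might look like the sketch below. It is a simplified illustration rather than the library's code: attributes are modelled as hypothetical `(label, value)` tuples, and `filter_attrs` is an invented name. The `None` case follows the parameter description, where no label list means every attribute is kept.

```python
from typing import Optional


def filter_attrs(
    attrs: list[tuple[str, object]], attr_labels: Optional[list[str]]
) -> list[tuple[str, object]]:
    """Keep (label, value) pairs whose label is listed; None keeps everything."""
    if attr_labels is None:
        return list(attrs)
    return [(label, value) for label, value in attrs if label in attr_labels]


attrs = [("is_negated", True), ("family", False), ("normalization", "C0004096")]
kept = filter_attrs(attrs, ["is_negated", "family"])
everything = filter_attrs(attrs, None)
```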

medkit.text.postprocessing.filter_overlapping_entities(entities: list[medkit.core.text.Entity]) → list[medkit.core.text.Entity]#

Filter a list of entities and remove overlaps.

This function may be useful when creating data for named entity recognition, where each ‘word’ of the text can belong to at most one entity. When an overlap is detected, the longest entity is preferred.

Parameters:

entities: list of Entity: Entities to filter

Returns:

list of Entity: Filtered entities
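
One way to realize the "longest entity wins" rule is a greedy pass over the entities sorted by decreasing length. The sketch below is an assumption about how such a filter could work, not a copy of medkit's code: `Ent` and `drop_overlaps` are hypothetical names, spans are single `[start, end)` ranges, and returning the result in text order is an illustrative choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ent:
    """Hypothetical stand-in for an entity with a single [start, end) span."""

    label: str
    start: int
    end: int


def drop_overlaps(entities: list[Ent]) -> list[Ent]:
    """Visit the longest entities first and keep those not overlapping a kept one."""
    kept: list[Ent] = []
    for ent in sorted(entities, key=lambda e: e.end - e.start, reverse=True):
        # Two half-open spans are disjoint if one ends before the other starts.
        if all(ent.end <= k.start or k.end <= ent.start for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e.start)


entities = [
    Ent("anatomy", 0, 4),  # overlaps the longer entity below
    Ent("disease", 0, 11),
    Ent("drug", 20, 27),
]
filtered = drop_overlaps(entities)
```

The shorter `anatomy` entity is dropped because it shares characters with the longer `disease` entity, while `drug` is kept since it overlaps nothing.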