medkit.text.postprocessing#
Submodules#
Classes#
AttributeDuplicator | Annotator to copy attributes from a source segment to its nested segments.
DocumentSplitter | Split text documents using their segments as a reference.
Functions#
compute_nested_segments | Return source segments aligned with their nested segments.
filter_overlapping_entities | Filter a list of entities and remove overlaps.
Package Contents#
- medkit.text.postprocessing.compute_nested_segments(source_segments: list[medkit.core.text.Segment], target_segments: list[medkit.core.text.Segment]) → list[tuple[medkit.core.text.Segment, list[medkit.core.text.Segment]]] #
Return source segments aligned with their nested segments.
Only target segments fully contained in a source segment are returned as nested segments.
- Parameters:
- source_segments : list of Segment
List of source segments
- target_segments : list of Segment
List of segments to align
- Returns:
- list of tuple
List of aligned segments
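The containment rule described above can be illustrated with a minimal sketch. It does not use medkit: `Seg` and `nested_pairs` are hypothetical stand-ins, each segment is assumed to have a single contiguous span, and sources without any nested segment are kept with an empty list (this reference does not say how the real function treats them).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seg:
    """Hypothetical stand-in for a segment: a label plus one (start, end) span."""
    label: str
    start: int
    end: int


def nested_pairs(sources, targets):
    """Pair each source with the targets whose span lies fully inside it."""
    return [
        (src, [t for t in targets if src.start <= t.start and t.end <= src.end])
        for src in sources
    ]


sentence = Seg("sentence", 0, 30)
inside = Seg("disease", 5, 12)
straddling = Seg("drug", 25, 40)  # crosses the sentence end, so it is not nested

pairs = nested_pairs([sentence], [inside, straddling])
print(pairs[0][1])
```

Only `inside` is reported as nested, because `straddling` extends past the end of the source span.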
- class medkit.text.postprocessing.AttributeDuplicator(attr_labels: list[str], uid: str | None = None)#
Bases:
medkit.core.Operation
Annotator to copy attributes from a source segment to its nested segments.
For each attribute to be duplicated, a new attribute is created in the nested segment.
- Parameters:
- attr_labels : list of str
Labels of the attributes to copy
- uid : str, optional
Identifier of the annotator
- attr_labels#
- init_args#
- run(source_segments: list[medkit.core.text.Segment], target_segments: list[medkit.core.text.Segment])#
Add attributes from source segments to all nested segments.
The nested segments are chosen among the target_segments based on their spans.
- Parameters:
- source_segments : list of Segment
List of segments with attributes to copy
- target_segments : list of Segment
List of target segments among which the nested segments are selected
- _duplicate_attr(attr: medkit.core.Attribute, target: medkit.core.text.Segment)#
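As a rough illustration of what the duplication amounts to, the sketch below copies selected attributes onto nested targets. It is not medkit's implementation: `Seg` and `duplicate_attrs` are hypothetical names, attributes are reduced to a plain label-to-value dictionary, and each segment is assumed to have a single span.

```python
from dataclasses import dataclass, field


@dataclass
class Seg:
    """Hypothetical stand-in for a segment: one span plus a label -> value attribute map."""
    start: int
    end: int
    attrs: dict = field(default_factory=dict)


def duplicate_attrs(attr_labels, source_segments, target_segments):
    """Copy the selected attributes of each source onto the targets nested inside it."""
    for src in source_segments:
        for tgt in target_segments:
            # A target is nested when its span is fully contained in the source span.
            if src.start <= tgt.start and tgt.end <= src.end:
                for label in attr_labels:
                    if label in src.attrs:
                        tgt.attrs[label] = src.attrs[label]


sentence = Seg(0, 30, {"is_negated": True, "section": "history"})
entity = Seg(5, 12)
outside = Seg(35, 40)

duplicate_attrs(["is_negated"], [sentence], [entity, outside])
print(entity.attrs)   # {'is_negated': True}
print(outside.attrs)  # {}
```

Only the attribute listed in `attr_labels` reaches the nested segment, and the segment outside the source span is left untouched.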
- class medkit.text.postprocessing.DocumentSplitter(segment_label: str, entity_labels: list[str] | None = None, attr_labels: list[str] | None = None, relation_labels: list[str] | None = None, name: str | None = None, uid: str | None = None)#
Bases:
medkit.core.Operation
Split text documents using their segments as a reference.
The resulting ‘mini-documents’ contain the entities belonging to each segment along with their attributes.
This operation can be used to create datasets from medkit text documents.
- Parameters:
- segment_label : str
Label of the segments to use as references for the splitter
- entity_labels : list of str, optional
Labels of entities to be included in the mini-documents. If None, all entities from the document will be included.
- attr_labels : list of str, optional
Labels of the attributes to be included in the new annotations. If None, all attributes will be included.
- relation_labels : list of str, optional
Labels of relations to be included in the mini-documents. If None, all relations will be included.
- name : str, optional
Name describing the splitter (defaults to the class name).
- uid : str, optional
Identifier of the operation
- init_args#
- segment_label#
- entity_labels#
- attr_labels#
- relation_labels#
- run(docs: list[medkit.core.text.TextDocument]) → list[medkit.core.text.TextDocument] #
Split docs into mini documents.
- Parameters:
- docs : list of TextDocument
List of text documents to split
- Returns:
- list of TextDocument
List of documents created from the selected segments
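The splitting step can be pictured with a small, self-contained sketch. It does not call medkit: `Ann` and `split_text` are hypothetical, documents are reduced to plain strings, and entity offsets are assumed to be re-expressed relative to the start of each mini-document, which this reference does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ann:
    """Hypothetical stand-in for a segment or entity with a single span."""
    label: str
    start: int
    end: int


def split_text(text, segments, entities, entity_labels=None):
    """Build one (mini_text, relocated_entities) pair per reference segment."""
    minis = []
    for seg in segments:
        kept = [
            # Shift offsets so they are relative to the start of the mini-document.
            Ann(e.label, e.start - seg.start, e.end - seg.start)
            for e in entities
            if seg.start <= e.start and e.end <= seg.end
            and (entity_labels is None or e.label in entity_labels)
        ]
        minis.append((text[seg.start:seg.end], kept))
    return minis


text = "No fever. Cough for 3 days."
sentences = [Ann("sentence", 0, 9), Ann("sentence", 10, 27)]
entities = [Ann("symptom", 3, 8), Ann("symptom", 10, 15)]

minis = split_text(text, sentences, entities)
for mini_text, mini_entities in minis:
    print(repr(mini_text), mini_entities)
```

Passing `entity_labels=None` keeps every entity, mirroring the behaviour documented for the `entity_labels` parameter.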
- _create_segment_doc(segment: medkit.core.text.Segment, entities: list[medkit.core.text.Entity], relations: list[medkit.core.text.Relation], doc_source: medkit.core.text.TextDocument) → medkit.core.text.TextDocument #
Create a TextDocument from a segment and its entities.
The text covered by the segment becomes the text of the new document.
- Parameters:
- segment : Segment
Segment to use as reference for the new document
- entities : list of Entity
Entities inside the segment
- relations : list of Relation
Relations inside the segment
- doc_source : TextDocument
Initial document from which the annotations were extracted
- Returns:
- TextDocument
A new document containing the entities. Its metadata includes the original span of the segment and the metadata of the source document.
- _filter_attrs_from_ann(ann: medkit.core.text.TextAnnotation) → list[medkit.core.Attribute] #
Filter attributes from an annotation using ‘attr_labels’.
- medkit.text.postprocessing.filter_overlapping_entities(entities: list[medkit.core.text.Entity]) → list[medkit.core.text.Entity] #
Filter a list of entities and remove overlaps.
This function may be useful when creating data for named entity recognition, where each ‘word’ of the text can belong to at most one entity. When an overlap is detected, the longest entity is preferred.
- Parameters:
- entities : list of Entity
Entities to filter
- Returns:
- list of Entity
Filtered entities
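The "longest entity is preferred" rule can be sketched with a simple greedy pass over plain `(start, end)` spans. This is an illustration only: `drop_overlaps` is a hypothetical helper, and the exact algorithm and tie-breaking used by medkit are not described in this reference.

```python
def drop_overlaps(spans):
    """Keep the longest spans first, discarding any span that overlaps a kept one."""
    kept = []
    for start, end in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        # Two half-open spans are disjoint when one ends before the other starts.
        if all(end <= k_start or k_end <= start for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


spans = [(0, 5), (3, 12), (12, 15), (13, 14)]
print(drop_overlaps(spans))  # [(3, 12), (12, 15)]
```

Here `(0, 5)` loses to the longer `(3, 12)`, and `(13, 14)` loses to `(12, 15)`; spans that merely touch at a boundary are not treated as overlapping in this sketch.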