Text Components#

This page presents the core text concepts of medkit.

For more details about public APIs, please refer to medkit.core.text.

Data Structures#

The TextDocument class implements the Document protocol. It stores instances of TextAnnotation subclasses; TextAnnotation itself implements the Annotation protocol.

classDiagram
    direction TB
    class Document~Annotation~{
        <<protocol>>
    }
    class Annotation{
        <<protocol>>
    }
    class TextDocument{
        uid: str
        anns: TextAnnotationContainer
    }
    class TextAnnotation{
        <<abstract>>
        uid: str
        label: str
        attrs: AttributeContainer
    }
    Document <|.. TextDocument: implements
    Annotation <|.. TextAnnotation: implements
    TextDocument *-- TextAnnotation: contains \n(TextAnnotationContainer)

Text document and text annotation#
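The "implements" arrows in the diagram can be pictured with Python's structural typing. The following is only a minimal sketch: the Toy* classes are hypothetical stand-ins rather than medkit's classes, a plain list replaces TextAnnotationContainer, and the text field is an assumption that does not appear in the diagram.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class Annotation(Protocol):
    """Stand-in protocol: anything exposing a uid and a label."""

    uid: str
    label: str


@dataclass
class ToyTextAnnotation:
    """Toy counterpart of TextAnnotation (not abstract here, for brevity)."""

    uid: str
    label: str
    attrs: list = field(default_factory=list)


@dataclass
class ToyTextDocument:
    """Toy counterpart of TextDocument; a plain list stands in for the container."""

    uid: str
    text: str  # assumed field, not shown in the diagram
    anns: List[ToyTextAnnotation] = field(default_factory=list)


doc = ToyTextDocument(uid="doc-1", text="The patient has asthma.")
doc.anns.append(ToyTextAnnotation(uid="ann-1", label="disorder"))

# Structural typing: ToyTextAnnotation never inherits from Annotation,
# yet it satisfies the protocol because it has the required attributes
print(isinstance(doc.anns[0], Annotation))  # → True
```

With a protocol, a class "implements" an interface by having the right members, not by subclassing it, which is what the dotted arrows in the diagram express.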


TextDocument relies on TextAnnotationContainer to manage the annotations.

Given a text document named doc, one can:

  • browse segments, entities, and relations:

for entity in doc.anns.entities:
    ...

for segment in doc.anns.segments:
    ...

for relation in doc.anns.relations:
    ...

  • get and filter segments, entities and relations:

sentences_segments = doc.get_segments(label="sentences")
disorder_entities = doc.get_entities(label="disorder")

entity = ...
relations = doc.get_relations(label="before", source_id=entity.uid)
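To make the filtering semantics concrete, here is a small self-contained sketch of a container that filters relations by label and source_id. ToyRelation and ToyRelationContainer are hypothetical names for illustration; this is not medkit's TextAnnotationContainer, only one plausible way such filtering could behave.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToyRelation:
    uid: str
    label: str
    source_id: str  # uid of the entity the relation starts from
    target_id: str  # uid of the entity the relation points to


class ToyRelationContainer:
    """Toy stand-in for the relation part of an annotation container."""

    def __init__(self) -> None:
        self._relations: List[ToyRelation] = []

    def add(self, relation: ToyRelation) -> None:
        self._relations.append(relation)

    def get_relations(
        self, label: Optional[str] = None, source_id: Optional[str] = None
    ) -> List[ToyRelation]:
        # A criterion left to None does not filter anything out
        return [
            r
            for r in self._relations
            if (label is None or r.label == label)
            and (source_id is None or r.source_id == source_id)
        ]


container = ToyRelationContainer()
container.add(ToyRelation("r1", "before", source_id="e1", target_id="e2"))
container.add(ToyRelation("r2", "before", source_id="e2", target_id="e3"))
container.add(ToyRelation("r3", "causes", source_id="e1", target_id="e3"))

matches = container.get_relations(label="before", source_id="e1")
print([r.uid for r in matches])  # → ['r1']
```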

For more details on common interfaces provided by core components, please refer to Document.


For the text modality, a TextDocument can only contain TextAnnotation instances.

Three subclasses are defined: Segment, Entity and Relation.

classDiagram
    direction TB
    class Annotation{
        <<protocol>>
    }
    class TextAnnotation{
        <<abstract>>
    }
    Annotation <|.. TextAnnotation: implements
    TextAnnotation <|-- Segment
    TextAnnotation <|-- Relation
    Segment <|-- Entity

Text annotation hierarchy#
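The practical consequence of this hierarchy is that an entity can be used wherever a segment is expected, while a relation cannot. The sketch below mirrors only the inheritance arrows; the Toy* classes and their constructor parameters (text, source, target) are assumptions made for illustration, not medkit's actual signatures.

```python
class ToyTextAnnotation:
    """Common base: every text annotation carries a label."""

    def __init__(self, label: str) -> None:
        self.label = label


class ToySegment(ToyTextAnnotation):
    """An annotation anchored on a piece of text."""

    def __init__(self, label: str, text: str) -> None:
        super().__init__(label)
        self.text = text


class ToyEntity(ToySegment):
    """An entity is a specialised segment."""


class ToyRelation(ToyTextAnnotation):
    """A relation links two entities instead of covering text."""

    def __init__(self, label: str, source: ToyEntity, target: ToyEntity) -> None:
        super().__init__(label)
        self.source = source
        self.target = target


sentence = ToySegment("sentence", "The fever followed the cough.")
fever = ToyEntity("symptom", "fever")
cough = ToyEntity("symptom", "cough")
after = ToyRelation("after", source=fever, target=cough)

for ann in (sentence, fever, after):
    # Prints: ToySegment True, ToyEntity True, ToyRelation False
    print(type(ann).__name__, isinstance(ann, ToySegment))
```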


Each text annotation class inherits from the common interfaces provided by the core component (cf. Annotation).

For more details about public APIs, please refer to medkit.core.text.annotation.


Text annotations can receive attributes, which will be instances of the core Attribute class.

Among attributes, medkit.core.text provides EntityNormAttribute for normalization attributes. It gives normalization information a common structure, independently of the operation that created it.
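The benefit of a common structure can be sketched as follows: two unrelated normalization strategies return the same kind of object, so downstream code does not need to know which one ran. Everything here is hypothetical (ToyNormAttribute, its fields, the toy knowledge base and its made-up identifiers); it does not reproduce EntityNormAttribute's actual fields.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical knowledge base with made-up identifiers
TOY_KB = {"asthma": "KB-0001", "fever": "KB-0002"}


@dataclass(frozen=True)
class ToyNormAttribute:
    kb_name: str
    kb_id: str
    score: Optional[float] = None  # None when the producer has no confidence value


def normalize_exact(text: str) -> Optional[ToyNormAttribute]:
    """Exact dictionary lookup: there is no score to report."""
    kb_id = TOY_KB.get(text.lower())
    return ToyNormAttribute("toy-kb", kb_id) if kb_id else None


def normalize_fuzzy(text: str, threshold: float = 0.8) -> Optional[ToyNormAttribute]:
    """Approximate matching: keep the most similar term if it is similar enough."""

    def similarity(term: str) -> float:
        return SequenceMatcher(None, text.lower(), term).ratio()

    best_term = max(TOY_KB, key=similarity)
    score = similarity(best_term)
    if score < threshold:
        return None
    return ToyNormAttribute("toy-kb", TOY_KB[best_term], score)


# Downstream code reads kb_id the same way, whichever function produced it
for attr in (normalize_exact("Asthma"), normalize_fuzzy("astma")):
    print(attr.kb_name, attr.kb_id)
```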


medkit relies on the concept of spans to keep track of all text modifications made by the different operations.

medkit also provides a set of utilities for manipulating these spans when implementing new operations.

For more details about public APIs, please refer to medkit.core.text.span and medkit.core.text.span_utils.
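The idea of tracking text modifications with spans can be illustrated with a simplified model: after a replacement, the new text is described as a sequence of spans, each either copied unchanged from the original or standing for a rewritten range. The sketch below is an assumption-laden toy (ToySpan, ToyModifiedSpan, replace_tracked and original_range are hypothetical names); it is not the API of medkit.core.text.span or medkit.core.text.span_utils.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class ToySpan:
    """Characters [start, end) copied unchanged from the original text."""

    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


@dataclass(frozen=True)
class ToyModifiedSpan:
    """`length` new characters that replaced the original range [start, end)."""

    length: int
    start: int
    end: int


AnySpan = Union[ToySpan, ToyModifiedSpan]


def replace_tracked(text: str, start: int, end: int, new: str) -> Tuple[str, List[AnySpan]]:
    """Replace text[start:end] with `new` and describe the result as spans."""
    new_text = text[:start] + new + text[end:]
    spans = [ToySpan(0, start), ToyModifiedSpan(len(new), start, end), ToySpan(end, len(text))]
    # Spans covering no character of the new text are dropped
    return new_text, [s for s in spans if s.length > 0]


def original_range(spans: List[AnySpan], start: int, end: int) -> Tuple[int, int]:
    """Map the range [start, end) of the modified text back to the original text."""
    covered = []
    offset = 0  # position of the current span in the modified text
    for span in spans:
        span_start, span_end = offset, offset + span.length
        lo, hi = max(start, span_start), min(end, span_end)
        if lo < hi:  # the requested range overlaps this span
            if isinstance(span, ToySpan):
                # Unchanged text: positions shift by a constant inside the span
                covered.append((span.start + lo - span_start, span.start + hi - span_start))
            else:
                # Rewritten text: the finest possible answer is the whole replaced range
                covered.append((span.start, span.end))
        offset = span_end
    if not covered:
        raise ValueError("empty or out-of-bounds range")
    return covered[0][0], covered[-1][1]


original = "patient w/ asthma"
modified, spans = replace_tracked(original, 8, 10, "with")
print(modified)  # → patient with asthma

start = modified.index("asthma")
o_start, o_end = original_range(spans, start, start + len("asthma"))
print(original[o_start:o_end])  # → asthma
```

An annotation found in the cleaned text can thus still be located in the raw text, which is the reason for keeping spans rather than overwriting the original.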

Text Utilities#

These utilities provide preconfigured patterns for preprocessing text documents non-destructively. They are not meant to be used directly, but rather inside a cleaning operation.

For more details about public APIs, please refer to medkit.core.text.utils.

Abstract subclasses of Operation are defined for text to ease the development of text operations, depending on the kind of run method they implement.

classDiagram
    Operation <|-- ContextOperation
    Operation <|-- DocOperation
    Operation <|-- NEROperation
    Operation <|-- SegmentationOperation
    Operation <|-- _CustomTextOperation

Operation hierarchy#

The internal class _CustomTextOperation exists so that users can call create_text_operation() to instantiate custom text operations more easily.
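The general pattern, a factory function that hides an internal wrapper class, can be sketched in a few lines. This is only an illustration of the design: _ToyCustomOperation and create_toy_operation are hypothetical, operate on plain strings, and do not reflect the real signature of create_text_operation().

```python
from typing import Callable, List, Optional


class _ToyCustomOperation:
    """Hypothetical internal class: wraps a plain function behind a run() method."""

    def __init__(self, function: Callable[[str], str], name: str) -> None:
        self.function = function
        self.name = name

    def run(self, texts: List[str]) -> List[str]:
        # Apply the user function to every input, like any other operation would
        return [self.function(text) for text in texts]


def create_toy_operation(
    function: Callable[[str], str], name: Optional[str] = None
) -> _ToyCustomOperation:
    """Hypothetical factory: users never instantiate the internal class themselves."""
    return _ToyCustomOperation(function, name or function.__name__)


uppercase = create_toy_operation(str.upper)
print(uppercase.name)  # → upper
print(uppercase.run(["asthma", "fever"]))  # → ['ASTHMA', 'FEVER']
```

Keeping the wrapper class private lets the factory remain the only supported entry point, so the wrapper can change without affecting user code.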

For more details about public APIs, please refer to medkit.core.text.operation.

