Text Components#

This page presents the core text concepts of medkit.

For more details about public APIs, please refer to medkit.core.text.

Data Structures#

The TextDocument class implements the Document protocol. It stores instances of TextAnnotation subclasses; TextAnnotation itself implements the Annotation protocol.

classDiagram
    direction TB
    class Document~Annotation~{
        <<protocol>>
    }
    class Annotation{
        <<protocol>>
    }
    class TextDocument{
        uid: str
        anns: TextAnnotationContainer
    }
    class TextAnnotation{
        <<abstract>>
        uid: str
        label: str
        attrs: AttributeContainer
    }
    Document <|.. TextDocument: implements
    Annotation <|.. TextAnnotation: implements
    TextDocument *-- TextAnnotation: contains \n(TextAnnotationContainer)

Text document and text annotation#

Document#

TextDocument relies on TextAnnotationContainer to manage the annotations.

Given a text document named doc, one can:

  • browse segments, entities, and relations:

for entity in doc.anns.entities:
    ...

for segment in doc.anns.segments:
    ...

for relation in doc.anns.relations:
    ...

  • get and filter segments, entities, and relations:

sentences_segments = doc.anns.get_segments(label="sentences")
disorder_entities = doc.anns.get_entities(label="disorder")

entity = ...
relations = doc.anns.get_relations(label="before", source_id=entity.uid)
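The filtering calls above can be pictured with a small stand-in container. This is only a sketch: the `Toy*` classes are hypothetical and are not medkit's TextAnnotationContainer; they just show label and source-based filtering over stored annotations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyEntity:
    # Hypothetical stand-in for an entity annotation
    uid: str
    label: str


@dataclass
class ToyRelation:
    # Hypothetical stand-in for a relation between two annotations
    uid: str
    label: str
    source_id: str
    target_id: str


@dataclass
class ToyAnnotationContainer:
    # Hypothetical container; medkit's real container is richer than this
    entities: List[ToyEntity] = field(default_factory=list)
    relations: List[ToyRelation] = field(default_factory=list)

    def get_entities(self, label: Optional[str] = None) -> List[ToyEntity]:
        # Keep every entity when no label is given, else only matching ones
        return [e for e in self.entities if label is None or e.label == label]

    def get_relations(
        self, label: Optional[str] = None, source_id: Optional[str] = None
    ) -> List[ToyRelation]:
        # Each filter is applied only if it was provided
        return [
            r
            for r in self.relations
            if (label is None or r.label == label)
            and (source_id is None or r.source_id == source_id)
        ]


anns = ToyAnnotationContainer(
    entities=[ToyEntity("e1", "disorder"), ToyEntity("e2", "date")],
    relations=[ToyRelation("r1", "before", source_id="e1", target_id="e2")],
)
disorders = anns.get_entities(label="disorder")
before = anns.get_relations(label="before", source_id=disorders[0].uid)
```
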

For more details on common interfaces provided by core components, please refer to Document.

Annotations#

For the text modality, a TextDocument can only contain TextAnnotation instances.

Three subclasses are defined: Segment, Entity, and Relation.

classDiagram
    direction TB
    class Annotation{
        <<protocol>>
    }
    class TextAnnotation{
        <<abstract>>
    }
    Annotation <|.. TextAnnotation: implements
    TextAnnotation <|-- Segment
    TextAnnotation <|-- Relation
    Segment <|-- Entity

Text annotation hierarchy#

Note

Each text annotation class inherits from the common interfaces provided by the core component (cf. Annotation).

For more details about public APIs, please refer to medkit.core.text.annotation.
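The inheritance relationships in the hierarchy can be mirrored in a few lines of Python. This is a simplified sketch with hypothetical `Toy*` classes, not medkit's actual definitions; the `text`, `source_id`, and `target_id` fields are assumptions chosen for illustration.

```python
from abc import ABC
from dataclasses import dataclass


@dataclass
class ToyTextAnnotation(ABC):
    # Common base: every text annotation carries a label
    label: str


@dataclass
class ToySegment(ToyTextAnnotation):
    # Assumed field: the text covered by the segment
    text: str


@dataclass
class ToyEntity(ToySegment):
    # An entity is a specialised segment, so it inherits label and text
    pass


@dataclass
class ToyRelation(ToyTextAnnotation):
    # Assumed fields: a relation links a source annotation to a target one
    source_id: str
    target_id: str


entity = ToyEntity(label="disorder", text="asthma")
relation = ToyRelation(label="before", source_id="e1", target_id="e2")
```

Because the entity class derives from the segment class, code written to process segments also accepts entities, whereas relations only share the common annotation base.
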

Attributes#

Text annotations can receive attributes, which will be instances of the core Attribute class.

Among these attributes, medkit.core.text provides EntityNormAttribute for normalization attributes, so that normalization information has a common structure regardless of the operation that created it.
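To illustrate why a shared structure helps, here is a minimal sketch of a normalization attribute. `ToyNormAttribute` and its fields (`kb_name`, `kb_id`, `term`, `score`) are assumptions made for this example and should not be read as the exact EntityNormAttribute API; the identifier is only an example value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyNormAttribute:
    # Hypothetical fields: which knowledge base, and which entry in it
    kb_name: str
    kb_id: str
    # Optional details that different operations may or may not provide
    term: Optional[str] = None
    score: Optional[float] = None

    @property
    def value(self) -> str:
        # One canonical representation, whatever operation produced it
        return f"{self.kb_name}:{self.kb_id}"


# Two hypothetical operations producing the same normalization differently
from_dictionary = ToyNormAttribute(kb_name="umls", kb_id="C0004096", term="Asthma")
from_model = ToyNormAttribute(kb_name="umls", kb_id="C0004096", score=0.93)
```

Downstream code can then compare or group normalizations through the shared fields without knowing which operation created them.
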

Spans#

medkit relies on the concept of spans to track all text modifications made by the different operations.

medkit also provides a set of utilities for manipulating these spans when implementing new operations.

For more details about public APIs, please refer to medkit.core.text.span and medkit.core.text.span_utils.
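The idea of tracking modifications with spans can be sketched as follows. This is a simplified illustration, not medkit's implementation: the `Span` and `ModifiedSpan` classes below are stand-ins defined for the example, and `replace_range` is a hypothetical helper that only handles a single replacement on unmodified text.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Span:
    # Character range [start, end) in the original text
    start: int
    end: int


@dataclass(frozen=True)
class ModifiedSpan:
    # Length of the new text, plus the original ranges it replaced
    length: int
    replaced_spans: Tuple[Span, ...]


AnySpan = Union[Span, ModifiedSpan]


def replace_range(text: str, start: int, end: int, new_text: str) -> Tuple[str, List[AnySpan]]:
    """Replace text[start:end] and describe the result in terms of the original text."""
    modified = text[:start] + new_text + text[end:]
    spans: List[AnySpan] = []
    if start > 0:
        spans.append(Span(0, start))  # untouched prefix
    spans.append(ModifiedSpan(len(new_text), (Span(start, end),)))
    if end < len(text):
        spans.append(Span(end, len(text)))  # untouched suffix
    return modified, spans


new_text, spans = replace_range("BP: 12/8", 0, 2, "blood pressure")
```

The returned spans make it possible to map any part of the modified text back to the characters of the original text it came from.
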

See also

You may also take a look at the spans examples.

Text Utilities#

These utilities provide preconfigured patterns for preprocessing text documents non-destructively. They are not designed to be used directly, but rather inside a cleaning operation.

For more details about public APIs, please refer to medkit.core.text.utils.

See also

medkit provides an EDSCleaner class, which combines all these utilities to clean French documents (specifically, EDS documents extracted from PDF files).

Operations#

Abstract subclasses of Operation have been defined for text. They ease the development of text operations by grouping them according to what their run method does.

classDiagram
    Operation <|-- ContextOperation
    Operation <|-- DocOperation
    Operation <|-- NEROperation
    Operation <|-- SegmentationOperation
    Operation <|-- _CustomTextOperation

Operation hierarchy#

The internal class _CustomTextOperation has been implemented so that users can call create_text_operation() to instantiate custom text operations more easily.

For more details about public APIs, please refer to medkit.core.text.operation.
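As a rough illustration of what a segmentation-style operation looks like, the sketch below takes segments as input and returns new segments from its `run` method. It assumes that shape for `run`; `ToySegment` and `ToySentenceSplitter` are hypothetical classes that do not subclass medkit's Operation classes, and the sentence-splitting rule is deliberately naive.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ToySegment:
    # Hypothetical stand-in for a text segment
    label: str
    text: str


class ToySentenceSplitter:
    """Stand-in for a segmentation-style operation: segments in, new segments out."""

    def __init__(self, output_label: str = "sentence"):
        self.output_label = output_label

    def run(self, segments: List[ToySegment]) -> List[ToySegment]:
        sentences = []
        for segment in segments:
            # Naive rule: split on whitespace that follows ., ! or ?
            for part in re.split(r"(?<=[.!?])\s+", segment.text.strip()):
                if part:
                    sentences.append(ToySegment(self.output_label, part))
        return sentences


raw = ToySegment("raw_text", "Patient has asthma. No fever today.")
sentences = ToySentenceSplitter().run([raw])
```
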

See also

Please refer to this example to see how custom operations are written.